I trained a recurrent model to generate Twitter bios. Some of them are pretty good - perhaps not unreasonably good, but pretty good.
We back the United States to NLP pathology for Canada!
------------------------------
Sr. Co-Queer | Activist - jesus #MachineLearning
------------------------------
Artist. Champion. I am championing the USA Blue secrets and teaching software practice. Family owned #TalentRussian POTUS.
------------------------------
Reformed Groap March proud #DVV #Slanfini Badger, My #Agent #WithCamp #InternetMusic #IPS #Travel
#Running #Comics
"con@writersn35510)
------------------------------
Achieve miles.
Rose clarke on @mega. Forever Random SVE-based and related for https://t.co/TYB02V84x5
------------------------------
Silicon Valley editor, former research reporter @nywaternotwick
The code is here. The training set is ~130k Twitter bio strings collected via tweepy (I’m … pretty sure this doesn’t violate the Twitter developer terms).
Twitter bios include lots of emojis and other “nonstandard” unicode characters, so I decided to treat the text as strings of bytes, leading to a vocabulary size of about 200.[^1]
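To make the byte-level setup concrete, here's a rough sketch of what the preprocessing might look like. The `STOP_CODE` sentinel and the helper bodies below are illustrative guesses, not the actual `bc` module used in the sampling snippet further down:

```python
# Sketch of byte-level preprocessing (hypothetical helpers, not the post's bc module).
STOP_CODE = 256  # assumed sentinel value outside the 0-255 byte range

def to_bytes(bio):
    """Encode a bio as a list of byte values (0-255) plus a stop symbol."""
    return list(bio.encode("utf-8")) + [STOP_CODE]

def to_string(byte_values):
    """Decode sampled byte values back to text, dropping the stop symbol.
    Invalid UTF-8 subsequences are replaced with whitespace (see the footnote)."""
    data = bytes(b for b in byte_values if b < 256)
    return data.decode("utf-8", errors="replace").replace("\ufffd", " ")

# Emojis round-trip as multi-byte UTF-8 sequences:
seq = to_bytes("🚀 ML person")
print(seq[:6], "...", to_string(seq))
```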
Training was actually very easy. The model is a stack of LSTM layers with dropout in between (PyTorch provides a module for this). For the results shown above I trained for 30 epochs over the full dataset; the individual bio strings are relatively short, so truncated backprop-through-time wasn't necessary (that's not to say it couldn't have helped - I haven't explored it, I just checked that the gradients weren't blowing up).
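As a rough sketch of that architecture (the layer sizes, dropout rate, and names here are illustrative, not the settings actually used):

```python
import torch.nn as nn

class ByteLSTM(nn.Module):
    """Stacked LSTM over byte values - a sketch, not the exact model from the post."""

    def __init__(self, vocab_size=257,  # 256 byte values + stop code; the post notes only ~200 actually occur
                 embed_dim=64, hidden_dim=512, num_layers=3, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # nn.LSTM applies dropout between stacked layers when num_layers > 1.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, byte_ids, state=None):
        # byte_ids: (batch, time) integer tensor of byte values / stop code
        x = self.embed(byte_ids)
        out, state = self.lstm(x, state)
        return self.head(out), state  # logits of shape (batch, time, vocab_size)
```

Training such a model is then, presumably, the standard next-byte cross-entropy over each bio string.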
Sampling is done autoregressively: at each timestep a byte value is drawn from the model's conditional output distribution, conditioned on the previous samples, and the resulting byte sequence is decoded back to text. The sampling distribution involves a ‘temperature’ parameter $T$:
$$ p(i) \propto \exp \left( z_i / T \right) $$
where $z_i$ are the logit values at the top of the model. The samples above were generated with a temperature of 1.0. Lowering the temperature produces highly likely strings (in the limit $T \to 0$, sampling reduces to greedy decoding, taking the most likely byte at each step):
```python
for __ in range(30):
    bytestring, prob, entropy = model.sample(bc.STOP_CODE, maxlen=200, temperature=.05)
    string = bc.to_string(bytestring)
    print(string)
    print('-'*30)
```
Co-founder @ https://t.co/VVV05zV0zx | Previously @Stanford @Columbia @Columbia @Columbia @Stanford @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA
------------------------------
Co-founder @ https://t.co/5VxVxVzVz5 | Previously @Stanford @Columbia @Stanford @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA
------------------------------
Co-founder @ https://t.co/qVzVq0VVVV | Co-Founder @ https://t.co/VV0V0000VV | Previously @Stanford @Stanford @UCLA @UCLA @Columbia @Stanford @UCLA @UCLA | @UCLA | @USC | @USC @UCLA
------------------------------
Co-founder @ https://t.co/qVVV0VVVVQ | Previously @Stanford @Columbia @Columbia @Stanford @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA @UCLA
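For reference, the temperature-scaled sampling loop might look roughly like this (a plausible sketch assuming a model with the interface sketched earlier; the real `model.sample` also returns the sequence probability and entropy, which is skipped here):

```python
import torch

@torch.no_grad()
def sample(model, stop_code, maxlen=200, temperature=1.0):
    """Draw one byte string autoregressively; a sketch, not the post's implementation."""
    ids, state = [stop_code], None
    for _ in range(maxlen):
        inp = torch.tensor([[ids[-1]]])               # (batch=1, time=1)
        logits, state = model(inp, state)
        # p(i) ∝ exp(z_i / T), as in the equation above
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        if nxt == stop_code:                          # end of bio
            break
        ids.append(nxt)
    return ids[1:]                                    # drop the leading stop code
```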
[^1]: It could be interesting to see a comparison of language models operating at different levels of resolution on the same dataset - I’m sure someone’s done this, I just haven’t read it. It was a pretty easy choice for me, because the relevant strings are so short, and full of hard-to-tokenize patterns. Operating on byte values also means you don’t have to worry about rare emojis missing from the training set. Invalid byte strings are possible - in such cases I just replace the offending sub-sequence with whitespace.