
I trained a recurrent model to generate Twitter bios. Some of them are pretty good - perhaps not unreasonably good, but pretty good.

We back the United States to NLP pathology for Canada!
Sr. Co-Queer | Activist - jesus      #MachineLearning
Artist. Champion. I am championing the USA Blue secrets and teaching software practice. Family owned #TalentRussian POTUS.
Reformed Groap March proud #DVV #Slanfini Badger, My  #Agent #WithCamp #InternetMusic #IPS #Travel
#Running #Comics
Achieve miles. 
Rose clarke on @mega. Forever Random SVE-based and related for https://t.co/TYB02V84x5
Silicon Valley editor, former research reporter @nywaternotwick

The code is here. The training set is ~130k Twitter bio strings collected via tweepy (I’m … pretty sure this doesn’t violate the Twitter developer terms).

Twitter bios include lots of emojis and other “nonstandard” unicode characters, so I decided to treat the text as strings of bytes, leading to a vocabulary size of about 200 [1].
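As a minimal sketch of what treating text as bytes means (not the actual encoder used here; `encode_bio`, `decode_bio` and the `STOP_CODE` value are hypothetical), each bio can be UTF-8 encoded so that every symbol is an integer below 256, with one extra code marking the end of the string:

```python
# Hypothetical byte-level codec; the names and the STOP_CODE value are
# illustrative assumptions, not taken from the original code.
STOP_CODE = 256  # one extra symbol beyond the 256 possible byte values


def encode_bio(text: str) -> list[int]:
    """UTF-8 encode a bio and append a stop symbol."""
    return list(text.encode("utf-8")) + [STOP_CODE]


def decode_bio(codes: list[int]) -> str:
    """Drop stop symbols and decode, replacing any invalid byte sequences."""
    raw = bytes(c for c in codes if c != STOP_CODE)
    return raw.decode("utf-8", errors="replace")


bio = "Artist 🎨 #Running"
codes = encode_bio(bio)
print(len(bio), len(codes))      # the emoji alone takes 4 bytes
print(decode_bio(codes) == bio)  # round trip back to the original text
```

Valid UTF-8 never uses some byte values (0xC0, 0xC1, and 0xF5 to 0xFF), and others may simply not occur in the data, which is consistent with a vocabulary somewhat below 256.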

Training was actually very easy. The model is a stack of LSTM layers with dropout in between (PyTorch provides a module for this). For the results shown above I trained for 30 epochs over the full dataset. The individual bio strings are relatively short, so truncated backprop-through-time was not necessary (that’s not to say it couldn’t have helped; I haven’t explored this, I just checked that gradients weren’t blowing up).
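A rough sketch of such a model in PyTorch, assuming an embedding layer in front and a linear readout on top. The class name `BioModel` and all sizes are made up for illustration, since the actual hyperparameters aren't stated here. `nn.LSTM` accepts `num_layers` and `dropout` arguments and applies the dropout between layers:

```python
import torch
import torch.nn as nn


class BioModel(nn.Module):
    """Hypothetical byte-level LSTM language model (sizes are placeholders)."""

    def __init__(self, vocab_size=257, embed_dim=64, hidden_dim=256,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # dropout is applied to the outputs of every LSTM layer but the last
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.readout = nn.Linear(hidden_dim, vocab_size)

    def forward(self, codes, state=None):
        # codes: (batch, time) integer tensor of byte codes
        hidden, state = self.lstm(self.embed(codes), state)
        return self.readout(hidden), state  # logits: (batch, time, vocab)


model = BioModel()
codes = torch.randint(0, 257, (8, 40))   # a fake batch of 8 sequences
logits, _ = model(codes[:, :-1])         # predict each next byte
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), codes[:, 1:].reshape(-1))
loss.backward()
# Clipping is one common guard against exploding gradients (an addition
# here, not something the post describes); the call also returns the total
# gradient norm, which is handy for monitoring.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```

Logging `grad_norm` over training is one simple way to check that gradients stay bounded when backpropagating through whole sequences.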

Sampling is done sequentially, one byte at a time: at each timestep a byte value is drawn from the model’s conditional output distribution, conditioned on the previous samples, and the resulting byte sequence is then decoded into a string. This involves a ‘temperature’ parameter T:

p(i) ∝ exp (z_i/T)

where z_i are the logit values at the top of the model. The samples above were generated with a temperature of 1.0. Lowering the temperature produces more likely strings (in the limit T → 0, sampling picks the most likely byte at each step):

# Draw 30 samples at a low temperature
for _ in range(30):
  bytestring, prob, entropy = model.sample(bc.STOP_CODE, maxlen=200, temperature=0.05)
  string = bc.to_string(bytestring)  # decode the sampled bytes to text
  print(string)
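
To make the effect of T concrete, here is a small self-contained sketch of temperature sampling over a vector of logits, using only the standard library. The function names are hypothetical, and this is not the `model.sample` used above:

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng=random):
    """Draw one index with probability proportional to exp(z_i / T)."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
for t in (1.0, 0.5, 0.05):
    # print the distribution at each temperature for comparison
    print(t, [round(p, 3) for p in temperature_probs(logits, t)])
```

As T shrinks, the probability mass concentrates on the largest logit, so each sampling step approaches an argmax over the next byte.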
