{1569} revision 0 modified: 05-17-2022 20:25 gmt

Deep symbolic regression for recurrent sequences

Surprisingly, they make no changes to the network structure: it's the Vaswani 2017 architecture w/ an 8-head, 8-layer transformer (sequence-to-sequence, not decoder-only) with a latent dimension of 512.  The significant work was in feature / representation engineering (e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers; both of these involve a vocabulary size of ~10k ... amazing still that this works) plus the significant training regimen they worked with (16 Turing GPUs, 32GB ea).  Note that they do perform a bit of beam search over the symbolic regressions, checking how well each candidate fits the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work.)
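To make the base-10k integer representation concrete, here is a minimal sketch of how an integer could be split into a sign token plus base-10,000 "digits", each digit being one vocabulary entry. The function names (encode_int, decode_int) and the exact token format are my own assumptions for illustration, not the paper's code.

```python
BASE = 10_000  # "base-10k" as described in the notes; token format below is assumed


def encode_int(n: int, base: int = BASE) -> list[str]:
    """Return a sign token followed by base-`base` digits, most significant first."""
    sign = "+" if n >= 0 else "-"
    n = abs(n)
    digits = []
    while True:
        n, remainder = divmod(n, base)
        digits.append(str(remainder))
        if n == 0:
            break
    return [sign] + digits[::-1]


def decode_int(tokens: list[str], base: int = BASE) -> int:
    """Invert encode_int: rebuild the integer from its sign and digit tokens."""
    sign = -1 if tokens[0] == "-" else 1
    value = 0
    for tok in tokens[1:]:
        value = value * base + int(tok)
    return sign * value


print(encode_int(123456789))  # ['+', '1', '2345', '6789']
```

With this scheme any integer below 10,000 in magnitude costs two tokens (sign + one digit), and the digit vocabulary alone has 10,000 entries, which is where the ~10k vocabulary size comes from.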

The paper does both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!
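A rough sketch of why the symbolic output is so useful: once you have a candidate recurrence, you can unroll it to extend the sequence indefinitely, and you can also score it against the observed terms, which is the spirit of the candidate check mentioned above. Everything here is an assumption for illustration: candidates are plain Python callables rather than the model's token output, and the absolute-error score is a stand-in for whatever fit measure the paper actually uses.

```python
from typing import Callable, Sequence

# A candidate recurrence maps (index n, all previous terms) to term n.
Recurrence = Callable[[int, Sequence[int]], int]


def unroll(rule: Recurrence, init: Sequence[int], length: int) -> list[int]:
    """Extend the initial terms with `rule` until `length` terms exist."""
    terms = list(init)
    for n in range(len(terms), length):
        terms.append(rule(n, terms))
    return terms[:length]


def fit_error(rule: Recurrence, observed: Sequence[int], n_init: int) -> int:
    """Sum of absolute errors when the rule is unrolled from the first n_init terms."""
    predicted = unroll(rule, observed[:n_init], len(observed))
    return sum(abs(p - o) for p, o in zip(predicted, observed))


def rank_candidates(candidates: dict[str, Recurrence],
                    observed: Sequence[int], n_init: int) -> list[str]:
    """Return candidate names sorted from best fit to worst."""
    return sorted(candidates,
                  key=lambda name: fit_error(candidates[name], observed, n_init))


observed = [1, 1, 2, 3, 5, 8, 13]
candidates = {
    "double": lambda n, u: 2 * u[n - 1],
    "plus_n": lambda n, u: u[n - 1] + n,
    "fib": lambda n, u: u[n - 1] + u[n - 2],
}
best = rank_candidates(candidates, observed, n_init=2)[0]
print(best)                                     # fib
print(unroll(candidates[best], observed[:2], 10))  # extrapolate beyond the data
```

The numeric model has to get every extrapolated term right on its own; the symbolic model only has to get the formula right once, after which extrapolation is exact.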

Analysis of how the transformers work for these problems is weak: only one figure, showing that the embeddings of the integers follow some meandering but continuous path in t-SNE space. Still, the trained transformer usually beats the hand-coded sequence inference engine(s) in Mathematica, and does so without memorizing all of the training data. Very impressive and important result, enough to convince me that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and took more effort.

It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.