m8ta

{1571}  
One model for the learning of language
A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer (in this case taken directly from Vaswani 2017: 8-head, 8-layer QKV with a latent dimension of 512) to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

While the language-learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception; both would be new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly; it also needs post-processing search. Also, the transformer algorithm has search, in the form of softmax, baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs with 32 GB of memory each. So, 4k GPU-hours x ~10 TFlops = 1.4e20 Flops. Compare this with the grammar learning above: 7 days on 32 cores at ~3 Gops/sec per core is ~6e16 ops. Much, much smaller compute. All of this suggests a central theme of computer science: a continuum between search and memorization.
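The back-of-envelope compute comparison above can be checked directly. A minimal sketch; note the ~3 Gops/sec figure is treated here as per-core (if it were the aggregate rate for all 32 cores, the grammar-learner total would be ~1.8e15 ops instead; either way the gap is orders of magnitude):

```python
# Rough compute comparison, using the figures quoted in the note.

# Transformer: 250 epochs x 1 h/epoch on 16 GPUs = 4000 GPU-hours,
# at an assumed ~10 TFLOP/s sustained per GPU.
gpu_hours = 250 * 1 * 16                        # 4000 GPU-hours
transformer_flops = gpu_hours * 3600 * 10e12    # ~1.4e20 FLOPs

# Grammar learner: 7 days on 32 CPU cores at ~3 Gops/s per core.
cpu_core_seconds = 7 * 24 * 3600 * 32
grammar_ops = cpu_core_seconds * 3e9            # ~5.8e16 ops

print(f"transformer: {transformer_flops:.1e} FLOPs")
print(f"grammar:     {grammar_ops:.1e} ops")
print(f"ratio:       {transformer_flops / grammar_ops:.0f}x")
```

The ratio works out to roughly 2500x, consistent with "much, much smaller compute."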
Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where visual perception sits on these axes (search, heuristic, memory). Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating EM?
{1569} 
ref: 2022
tags: symbolic regression facebook AI transformer
date: 05-17-2022 20:25 gmt
revision:0


Deep symbolic regression for recurrent sequences

Surprisingly, they do not make any changes to the network structure; it's Vaswani 2017 w/ an 8-head, 8-layer transformer (sequence-to-sequence, not decoder-only) with a latent dimension of 512. Significant work went into feature / representation engineering, e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers (both of which involve a vocabulary size of ~10k ... amazing still that this works), plus the significant training regimen they worked with (16 Volta GPUs, 32 GB each). Note that they do perform a bit of beam search over the symbolic regressions by checking how well each candidate fits the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work.)

The paper does both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

Analysis of how the transformers work on these problems is weak; only one figure shows that the embeddings of the integers follow a meandering but continuous path in t-SNE space. Still, the trained transformer is usually able to best the hand-coded sequence inference engine(s) in Mathematica, and does so without memorizing all of the training data. A very impressive and important result, enough to convince me that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and more effort. It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.
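To make the two ideas above concrete, here is a minimal sketch of (a) tokenizing an integer as a sign token plus base-10k "digit" tokens, and (b) re-ranking candidate recurrences by how well they reproduce the observed prefix. The token names and scoring rule are illustrative assumptions, not the paper's exact scheme:

```python
def tokenize_int(n, base=10_000):
    """Integer -> sign token + base-10k digit tokens (illustrative scheme).
    Each 'digit' is one of ~10k vocabulary entries, hence the large vocab."""
    toks = ["+" if n >= 0 else "-"]
    n = abs(n)
    digits = []
    while True:
        digits.append(n % base)
        n //= base
        if n == 0:
            break
    toks += [f"D{d}" for d in reversed(digits)]
    return toks

def prefix_error(f, seq):
    """Score a candidate order-2 recurrence u[n] = f(u[n-1], u[n-2])
    by total absolute error on the observed prefix."""
    return sum(abs(f(seq[i-1], seq[i-2]) - seq[i]) for i in range(2, len(seq)))

# Re-rank decoded candidates by fit to the starting sequence,
# as in the beam-search refinement described above.
fib = [1, 1, 2, 3, 5, 8, 13, 21]
candidates = {
    "u[n] = u[n-1] + u[n-2]": lambda a, b: a + b,
    "u[n] = 2*u[n-1] - u[n-2]": lambda a, b: 2 * a - b,
}
best = min(candidates, key=lambda k: prefix_error(candidates[k], fib))

print(tokenize_int(123456789))  # ['+', 'D1', 'D2345', 'D6789']
print(best)                     # u[n] = u[n-1] + u[n-2]
```

The point of the re-ranking step: the transformer's beam may contain several syntactically plausible recurrences, and checking each against the given terms is a cheap, exact filter.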