One model for the learning of language
 Yuan Yang and Steven T. Piantadosi
 Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
 Pinker's "poverty of the stimulus" argument, that children receive too little linguistic data to discern grammatical rules on their own, hence the rules must be innate, is thereby refuted.
 This is not the only refutation.
 An argument was made on Twitter that large language models also refute the poverty-of-the-stimulus argument. Meh; this paper does it far better, since the data used to train transformers is hardly small.
 Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
 This likelihood gives partial credit for near misses: generated strings within an edit distance of one of the desired strings still count toward the score. (I think.)
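A minimal sketch of what such a partial-credit likelihood might look like (my reconstruction, not the paper's exact formula; the probability values are made-up placeholders):

```python
import math

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                            # deletion
                          d[i][j - 1] + 1,                            # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))   # substitution
    return d[m][n]

def smoothed_log_likelihood(outputs, targets, p_hit=0.9, p_near=0.09, p_miss=0.01):
    """Score a hypothesis's generated strings against the target strings,
    giving partial credit for strings within edit distance 1."""
    total = 0.0
    for t in targets:
        dmin = min(edit_distance(o, t) for o in outputs)
        if dmin == 0:
            total += math.log(p_hit)
        elif dmin == 1:
            total += math.log(p_near)
        else:
            total += math.log(p_miss)
    return total
```

Exact hits dominate, but a hypothesis that produces almost-right strings is not scored as badly as one that produces garbage, which smooths the search landscape.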
 They use parallel tempering to search the space of programs.
 Roughly: keep many hypothesis lineages alive, each at a different temperature, and swap between them to avoid getting stuck in local minima.
 But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
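A toy illustration of parallel tempering over a hypothesis space (my own minimal sketch, not Fleet's implementation; the 'hypotheses' here are just integers and the score is a toy objective):

```python
import math
import random

def parallel_tempering(score, propose, init, temps, steps, seed=0):
    """Run one Metropolis chain per temperature; occasionally swap neighbors."""
    rng = random.Random(seed)
    chains = [init() for _ in temps]          # one hypothesis per temperature
    scores = [score(c) for c in chains]
    best, best_s = chains[0], scores[0]
    for _ in range(steps):
        # Metropolis step within each chain
        for i, T in enumerate(temps):
            cand = propose(chains[i], rng)
            s = score(cand)
            if s > scores[i] or rng.random() < math.exp((s - scores[i]) / T):
                chains[i], scores[i] = cand, s
            if scores[i] > best_s:
                best, best_s = chains[i], scores[i]
        # propose swapping a random pair of adjacent temperatures
        i = rng.randrange(len(temps) - 1)
        a = (scores[i + 1] - scores[i]) * (1 / temps[i] - 1 / temps[i + 1])
        if rng.random() < math.exp(min(0.0, a)):
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
            scores[i], scores[i + 1] = scores[i + 1], scores[i]
    return best, best_s

# Toy objective: find x maximizing -(x - 3)^2 by random-walk proposals.
best, best_s = parallel_tempering(
    score=lambda x: -(x - 3) ** 2,
    propose=lambda x, rng: x + rng.choice([-1, 1]),
    init=lambda: 0,
    temps=[0.1, 1.0, 10.0],
    steps=500,
)
```

Hot chains diffuse widely while the cold chain exploits; the swap move lets a good hypothesis found at high temperature migrate down to low temperature.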
 Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
 Larger hypotheses took up to 7 days to find (!)
 And these aren't especially complicated grammars.
 See earlier paper {1572}
 This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
 Proves once again that:
 Many domains of the world can be adequately described by relatively simple computational structures (it's a low-D, compositional world out there).
 See also the Johnson-Lindenstrauss lemma.
 You can find those hypotheses through brute-force + heuristic search (at least until you run into the curse of dimensionality).
A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer, in this case taken directly from Vaswani 2017 (8-head, 8-layer QKV with a latent dimension of 512), to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!
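A toy version of the symbolic half of the task (mine, not Meta's pipeline): check candidate recurrence relations against the observed prefix of a sequence, then use the winner to extrapolate. The candidate list here stands in for the transformer's ranked hypotheses:

```python
def check_recurrence(seq, f, order):
    """Does u_n = f(previous `order` terms) reproduce the observed sequence?"""
    return all(seq[n] == f(seq[n - order:n]) for n in range(order, len(seq)))

def extrapolate(seq, f, order, k):
    """Extend the sequence by k terms using recurrence f."""
    out = list(seq)
    for _ in range(k):
        out.append(f(out[-order:]))
    return out

# Toy candidate hypotheses (stand-ins for model outputs), as (f, order) pairs.
candidates = [
    (lambda p: p[-1] + p[-2], 2),   # Fibonacci-like
    (lambda p: 2 * p[-1], 1),       # doubling
    (lambda p: p[-1] + 2, 1),       # arithmetic
]

seq = [1, 1, 2, 3, 5, 8]
for f, order in candidates:
    if check_recurrence(seq, f, order):
        print(extrapolate(seq, f, order, 3))   # [1, 1, 2, 3, 5, 8, 13, 21, 34]
        break
```

This verification-and-extrapolation step is exactly the kind of cheap post-processing search that makes imperfect model outputs usable.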
While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception, both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly; it also needs post-processing search. Also, the transformer architecture has search, in the form of softmax attention, baked into it.)
This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs with 32 GB of memory each. So, 4k GPU-hours x ~10 TFLOPS = 1.4e20 FLOPs. Compare this with grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec is about 5.8e16 ops. Much, much smaller compute.
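The back-of-envelope comparison above, as a quick check (the per-device op rates are rough assumptions, not measured figures):

```python
# 16 GPUs x 250 epochs x 1 h/epoch x ~10 TFLOPS per GPU
gpu_flops = 16 * 250 * 3600 * 10e12
# 32 cores x 7 days x ~3 Gops/s per core
cpu_ops = 32 * 7 * 86400 * 3e9
print(f"{gpu_flops:.1e} FLOPs vs {cpu_ops:.1e} ops, ratio {gpu_flops / cpu_ops:.0f}x")
```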
All of this is to suggest a central theme of computer science: a continuum between search and memorization.
 The language paper does fast search, but does not learn from the process (bootstrap), and maintains little state/memory.
 The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great many heuristics for the problem domain.
Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where on these axes (search, heuristic, memory) visual perception is. Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energyminimizing system? Is recurrence approximating EM? 
Attention is all you need
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
 Good summary, along with: The Illustrated Transformer (please refer to this!)

 Łukasz Kaiser mentions a few times how fragile the network is: how easy it is to make something that doesn't train at all, and how many tricks by Google experts were needed to make things work properly. It might be bravado or bluffing, but this is arguably not the way that biology fails.
 Encoding:
 Input is words encoded as 512length vectors.
 Each vector is transformed into three length-64 vectors, query, key, and value, via differentiable weight matrices.
 Attention is computed as the dot product of the query (from the current word) with the keys (from the other words).
 These scores are scaled by 1/sqrt(d_k) and passed through a softmax, yielding attention weights that scale the values, which are then summed.
 Multiple heads' output are concatenated together, and this output is passed through a final weight matrix to produce a final value for the next layer.
 So, attention in this respect looks like a conditional gain field.
 'Final value' above is then passed through a position-wise feed-forward net (two linear layers with a ReLU between them), with a ResNet-style skip connection.
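The encoding steps above can be sketched as single-head scaled dot-product attention (stdlib-only toy; real implementations batch this with matrix libraries and learned projection matrices):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """For each query: score against all keys, softmax, then mix the values."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# A query aligned with the first key attends mostly to the first value.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, k, v))
```

This makes the 'conditional gain field' reading concrete: the softmax weights are a data-dependent gain applied to the value vectors.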
 Decoding:
 Use the attentional key value from the encoder to determine the first word through the output encoding (?) Not clear.
 Subsequent causal decodes depend on the already 'spoken' words, plus the key-values from the encoder.
 Output is a one-hot softmax layer fed from a feed-forward layer; the whole pipeline is differentiable from input to output, trained with cross-entropy loss or KL divergence.
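The causal ('already spoken') constraint is typically implemented as a mask that blocks attention to future positions before the softmax; a minimal sketch:

```python
import math

def causal_mask(scores):
    """scores[t][s] is the raw attention of output position t to position s.
    Set scores for future positions (s > t) to -inf so that, after softmax,
    each output token attends only to tokens already generated."""
    T = len(scores)
    return [[scores[t][s] if s <= t else -math.inf for s in range(T)]
            for t in range(T)]

masked = causal_mask([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0],
                      [7.0, 8.0, 9.0]])
# Row 0 sees only position 0; row 2 sees all three positions.
```

Since exp(-inf) = 0, the masked positions receive exactly zero attention weight, which is what makes the decode causal.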

