m8ta
{1571}
ref: -2022 tags: language learning symbolic regression Fleet meta search date: 06-04-2022 02:28 gmt revision:4 [3] [2] [1] [0] [head]

One model for the learning of language

  • Yuan Yang and Steven T. Piantadosi
  • Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
    • Pinker's argument that there is too little stimulus ("paucity of stimulus") for children to discern grammatical rules, hence that the rules must be innate, is thereby refuted.
      • This is not the only refutation.
      • An argument was made on Twitter that large language models also refute the paucity-of-stimulus argument. Meh, this paper does it far better -- the data used to train transformers is hardly small.
  • Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
    • This likelihood takes into account partial hits -- results that are within a small edit distance of the desired output strings. (I think)
  • They use parallel tempering to search the space of programs; see the sketch after this list.
    • Roughly: keep many hypothesis chains alive in parallel, and vary the temperature of each lineage to avoid getting stuck in local minima.
    • But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
  • Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
  • Larger hypotheses took up to 7 days to find (!)
    • And these aren't especially complicated grammars.

  • This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
  • Proves once again that:
    1. Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
      1. Or, the Johnson-Lindenstrauss lemma
    2. You can find those hypotheses through brute-force + heuristic search. (At least to the point that you run into the curse of dimensionality)
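
Fleet's actual search is in C++ (see the docs linked above); below is only a minimal Python sketch of the parallel-tempering idea, over a toy bit-string hypothesis space, with a made-up score() standing in for the smoothed Bayesian posterior over grammars:

```python
# Minimal parallel-tempering sketch (hypothetical; not Fleet's implementation).
# Hypotheses are bit-strings; score() is a toy stand-in for the posterior.
import math, random

def score(h):
    # toy objective: count of 1s, with a penalty that creates a local optimum
    return sum(h) - 5 * (h[0] != h[-1])

def propose(h):
    # flip one random bit
    i = random.randrange(len(h))
    return h[:i] + [1 - h[i]] + h[i + 1:]

def parallel_tempering(n_chains=8, n_bits=20, steps=2000):
    temps = [1.5 ** k for k in range(n_chains)]        # geometric temperature ladder
    chains = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(n_chains)]
    best = max(chains, key=score)
    for step in range(steps):
        # Metropolis update within each chain, at that chain's temperature
        for c, T in enumerate(temps):
            cand = propose(chains[c])
            d = score(cand) - score(chains[c])
            if d >= 0 or random.random() < math.exp(d / T):
                chains[c] = cand
        # occasionally try to swap states between adjacent temperatures
        if step % 10 == 0:
            c = random.randrange(n_chains - 1)
            d = (score(chains[c]) - score(chains[c + 1])) * (1 / temps[c] - 1 / temps[c + 1])
            if d <= 0 or random.random() < math.exp(-d):
                chains[c], chains[c + 1] = chains[c + 1], chains[c]
        best = max(chains + [best], key=score)
    return best

random.seed(0)
h = parallel_tempering()
print(h, score(h))
```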

A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer -- in this case taken directly from Vaswani 2017 (8 heads, 8 layers, QKV attention, latent dimension 512) -- to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!
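
For scale, a minimal sketch of a seq2seq model with those dimensions, using PyTorch's stock nn.Transformer; the vocabulary size is a placeholder, and the paper's actual tokenization of operators and numbers (and the causal masking / decoding loop) is not reproduced:

```python
# Hypothetical sketch: a seq2seq Transformer with the dimensions quoted above
# (8 heads, 8 encoder + 8 decoder layers, d_model = 512).
import torch
import torch.nn as nn

d_model, vocab = 512, 1024               # vocab size is a placeholder

embed = nn.Embedding(vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=8, num_decoder_layers=8,
                       dim_feedforward=2048, batch_first=True)
readout = nn.Linear(d_model, vocab)

src = torch.randint(0, vocab, (4, 32))   # batch of tokenized input sequences
tgt = torch.randint(0, vocab, (4, 16))   # tokenized targets (recurrence or terms)
logits = readout(model(embed(src), embed(tgt)))
print(logits.shape)                      # (4, 16, vocab)
```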

While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception -- both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly -- it also needs post-processing search. Also, the transformer algorithm has search (in the form of softmax) baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs with 32 GB of memory. So, ~4k GPU-hours x ~10 TFlops/sec ≈ 1.4e20 Flops. Compare this with the grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec is ~6e16 ops. Much, much smaller compute.
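
The back-of-envelope arithmetic, spelled out (the ~10 TFlop/s per Volta and ~3 Gop/s per core figures are the rough rates assumed above):

```python
# Order-of-magnitude compute comparison; rates are rough assumptions.
gpu_flops = 250 * 1 * 3600 * 16 * 10e12      # epochs * hr * s/hr * GPUs * Flop/s
cpu_ops   = 7 * 86400 * 32 * 3e9             # days * s/day * cores * ops/s
print(f"transformer:    {gpu_flops:.1e} Flops")   # ~1.4e20
print(f"grammar search: {cpu_ops:.1e} ops")       # ~5.8e16
print(f"ratio:          {gpu_flops / cpu_ops:.0f}x")  # ~2500x
```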

All of this is to suggest a central theme of computer science: a continuum between search and memorization.

  • The language paper does fast search, but does not learn from the process (bootstrap), and maintains little state/memory.
  • The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristics for the problem domain.

Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where visual perception sits on these axes (search, heuristics, memory). Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating E-M?

{1440}
ref: -2017 tags: attention transformer language model youtube google tech talk date: 02-26-2019 20:28 gmt revision:3 [2] [1] [0] [head]

Attention is all you need

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  • Łukasz Kaiser's tech talk (YouTube): "Attention is all you need: attentional neural network models"
  • Good summary, along with: The Illustrated Transformer (please refer to this!)
  • Łukasz Kaiser mentions a few times how fragile the network is -- how easy it is to make something that doesn't train at all, or how many tricks by Google experts were needed to make things work properly. It might be bravado or bluffing, but this is arguably not the way that biology fails.
  • Encoding:
  • Input is words encoded as 512-length vectors.
  • Each vector is projected into three length-64 vectors -- query, key, and value -- via learned (differentiable) weight matrices.
  • Attention is computed as the dot product of the query (for the current word) with the keys (of all the words in the sequence).
    • This dot product is scaled (by 1/sqrt(64)) and passed through a softmax, yielding attention weights that scale the values.
  • The outputs of the multiple heads are concatenated, and this is passed through a final weight matrix to produce the value sent to the next layer; see the sketch after this list.
    • So, attention in this respect looks like a conditional gain field.
  • This value is then passed through a position-wise feedforward net (two linear layers with a ReLU in between), with a resnet-style skip connection.
  • Decoding:
  • Use the attentional key value from the encoder to determine the first word through the output encoding (?) Not clear.
  • Subsequent causal decodes depend on the already 'spoken' words, plus the key-values from the encoder.
  • Output is a softmax over the vocabulary from a final feedforward layer; the whole network is differentiable from input to output, trained with cross-entropy loss or KL divergence.
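
A minimal NumPy sketch of the scaled dot-product, multi-head attention described above, with the paper's dimensions (d_model = 512, 8 heads of size 64); the weight matrices here are random stand-ins, not trained values:

```python
# Scaled dot-product attention with 8 heads of size 64, as described above.
# Weight matrices are random placeholders; a real model learns them.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, n_heads, d_k = 512, 8, 64
rng = np.random.default_rng(0)

def multi_head_attention(X):
    # X: (n_words, d_model) input embeddings
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                      for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (n_words, 64) each
        A = softmax(Q @ K.T / np.sqrt(d_k))          # attention weights per word pair
        heads.append(A @ V)                          # attention-weighted sum of values
    Wo = rng.normal(scale=1 / np.sqrt(d_model), size=(n_heads * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ Wo       # back to (n_words, d_model)

X = rng.normal(size=(10, d_model))                   # ten "words"
print(multi_head_attention(X).shape)                 # (10, 512)
```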

{442}
ref: bookmark-0 tags: language learning year french brain hack date: 09-03-2007 04:13 gmt revision:2 [1] [0] [head]

http://mirror.mricon.com/french/french.html -- "how i learned french in a year"

  • verbiste: a verb conjugator for Linux (GNOME)
  • When talking about software, it was Fred Brooks in The Mythical Man-Month who said that people will always reinvent the wheel, because it is intrinsically easier and more fun to write your own code than it is to read someone else's code.