{1571} revision 4 modified: 06-04-2022 02:28 gmt

One model for the learning of language

  • Yuan Yang and Steven T. Piantadosi
  • Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
    • Pinker's argument that there is too little stimulus ("poverty of the stimulus") for children to discern grammatical rules, hence the rules must be innate, is thereby refuted.
      • This is not the only refutation.
      • An argument was made on Twitter that large language models also refute the poverty-of-the-stimulus argument. Meh -- this paper does it far better; the data used to train transformers is hardly small.
  • Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
    • This likelihood gives partial credit -- outputs within a small edit distance of one of the desired sets of strings still count. (I think)
  • They use parallel tempering to search the space of programs.
    • Roughly: keep alive many different hypotheses, and vary the temperatures of each lineage to avoid getting stuck in local minima.
    • But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
  • Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
  • Larger hypotheses took up to 7 days to find (!)
    • These aren't especially complicated grammars...
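The smoothed, partial-credit likelihood can be sketched as follows. This is my reading of the idea, not Fleet's actual implementation -- the `alpha` penalty and the exp(-distance) form are assumptions for illustration:

```python
# Sketch of a smoothed likelihood: a hypothesis whose generated strings are
# close (in edit distance) to the observed strings still gets partial credit,
# rather than scoring zero for an inexact match.

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def log_likelihood(generated: set, observed: list, alpha: float = 1.0) -> float:
    """Each observed string contributes -alpha * (min edit distance to any
    string the hypothesis generates); exact hits cost nothing."""
    return sum(-alpha * min(edit_distance(o, g) for g in generated)
               for o in observed)

# A hypothesis generating {'ab', 'aabb'} scores better on 'aab' than one
# generating only {'ba'}:
print(log_likelihood({'ab', 'aabb'}, ['aab']))  # -1.0 (one edit away)
print(log_likelihood({'ba'}, ['aab']))          # -2.0
```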

  • This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
  • Proves once again that:
    1. Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
      1. Or, the Johnson-Lindenstrauss lemma
    2. You can find those hypotheses through brute-force + heuristic search. (At least to the point that you run into the curse of dimensionality)
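The parallel tempering search above can be sketched in toy form. This is not Fleet's implementation -- Fleet's state is a program, while here the state is just a real number on a bimodal landscape, which is enough to show the mechanism (hot chains hop between modes, cold chains refine, adjacent chains swap):

```python
# Toy parallel tempering: several Metropolis chains run at different
# temperatures on the same landscape; periodically, adjacent chains attempt
# to swap states, letting solutions found by hot chains flow to cold ones.
import math, random

random.seed(0)

def log_post(x):
    # Multimodal target with modes at x = -3 and x = +3; a single cold chain
    # started near 0 could easily get stuck on one side.
    return math.log(math.exp(-(x + 3)**2) + math.exp(-(x - 3)**2))

temps = [1.0, 2.0, 4.0, 8.0]                  # temperature ladder
states = [random.uniform(-1, 1) for _ in temps]

for step in range(5000):
    # Within-chain Metropolis move at each temperature.
    for k, T in enumerate(temps):
        prop = states[k] + random.gauss(0, 1)
        if math.log(random.random()) < (log_post(prop) - log_post(states[k])) / T:
            states[k] = prop
    # Swap attempt between a random adjacent pair of chains.
    k = random.randrange(len(temps) - 1)
    d = (1/temps[k] - 1/temps[k+1]) * (log_post(states[k+1]) - log_post(states[k]))
    if math.log(random.random()) < d:
        states[k], states[k+1] = states[k+1], states[k]

# The coldest chain ends up near one of the two modes.
print(states[0])
```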

A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer -- in this case, taken directly from Vaswani 2017 (8-head, 8-layer QKV with a latent dimension of 512) -- to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!
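The symbolic/numeric distinction in miniature (my toy framing with a tiny brute-force hypothesis space, not Meta's model): a symbolic fit recovers the recurrence itself, so extrapolation is exact, whereas a numeric fit only predicts the next few terms.

```python
# Given a sequence prefix, find a recurrence that explains it, then
# extrapolate by applying the recovered rule forward.

def fits(seq, f, order):
    """Does the recurrence u[n] = f(previous terms) explain seq?"""
    return all(seq[n] == f(seq[n - order:n]) for n in range(order, len(seq)))

# Candidate order-2 recurrences (a tiny, hand-written hypothesis space).
candidates = {
    'u[n-1] + u[n-2]':   lambda p: p[-1] + p[-2],
    'u[n-1] * u[n-2]':   lambda p: p[-1] * p[-2],
    '2*u[n-1] - u[n-2]': lambda p: 2 * p[-1] - p[-2],
}

seq = [1, 1, 2, 3, 5, 8, 13]                  # Fibonacci prefix
name, f = next((n, g) for n, g in candidates.items() if fits(seq, g, 2))
print(name)                                   # 'u[n-1] + u[n-2]'

# Symbolic extrapolation: exact continuation from the recovered rule.
ext = list(seq)
for _ in range(3):
    ext.append(f(ext[-2:]))
print(ext[-3:])                               # [21, 34, 55]
```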

While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception -- both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly -- it also needs post-processing search. Also, the transformer has search, in the form of the softmax, baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs with 32 GB of memory. So, 4k GPU-hours x ~10 TFlops = 1.4e20 Flops. Compare this with the grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec is ~6e16 ops. Much, much smaller compute.
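Spelling out the back-of-envelope arithmetic (order-of-magnitude only; the ~10 TFlops/GPU and ~3 Gops/core figures are the rough sustained rates assumed above):

```python
# Transformer side: 250 epochs x 1 h/epoch x 16 GPUs = 4000 GPU-hours.
gpu_hours = 250 * 1 * 16
transformer_flops = gpu_hours * 3600 * 10e12   # ~10 TFlops sustained per GPU
print(f"{transformer_flops:.1e}")              # ~1.4e+20

# Grammar-learning side: 7 days on 32 cores at ~3 Gops/s per core.
cpu_core_seconds = 7 * 24 * 3600 * 32
grammar_ops = cpu_core_seconds * 3e9
print(f"{grammar_ops:.1e}")                    # ~5.8e+16

# Roughly three and a half orders of magnitude apart.
print(round(transformer_flops / grammar_ops))
```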

All of this is to suggest a central theme of computer science: a continuum between search and memorization.

  • The language paper does fast search, but does not learn from the process (bootstrap), and maintains little state/memory.
  • The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristic knowledge about the problem domain.

Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where visual perception sits on these axes (search, heuristics, memory). Clearly there is a high degree of recurrence and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating expectation-maximization (E-M)?