{1571} revision 5 modified: 09-05-2022 01:57 gmt |

One model for the learning of language
*Yuan Yang and Steven T. Piantadosi*
- Idea: given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
- Pinker's argument that there is too little stimulus ("poverty of the stimulus") for children to discern grammatical rules, hence those rules must be innate, is thereby refuted.
- This is not the only refutation.
- An argument was made on Twitter that large language models also refute the poverty-of-the-stimulus hypothesis. Meh, this paper does it far better -- the data used to train transformers is *hardly small*.
- Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
- This likelihood takes partial hits into account -- generated strings within a small edit distance of one of the desired strings still earn (discounted) credit. (I think)
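A minimal sketch of what such a smoothed likelihood could look like (hypothetical -- this is not Fleet's actual scoring code; the per-edit penalty and the function names are my assumptions):

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def smoothed_log_likelihood(generated: set[str], observed: list[str],
                            edit_penalty: float = math.log(0.01)) -> float:
    """Each observed string scores by its distance to the nearest generated
    string: 0 edits -> full credit, d edits -> d * edit_penalty (log-space)."""
    total = 0.0
    for s in observed:
        d = min(levenshtein(s, g) for g in generated)
        total += d * edit_penalty
    return total
```

The point is just that near-miss hypotheses get a gradient to climb, rather than zero likelihood.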
- They use parallel tempering to search the space of programs.
- Roughly: keep alive many different hypotheses, and vary the temperatures of each lineage to avoid getting stuck in local minima.
- But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
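Parallel tempering itself is easy to sketch. Here is a toy version on a two-well 1-D energy (illustrative only -- Fleet moves through program space, not the real line, and the temperatures and proposal width below are made up):

```python
import math, random

def energy(x):
    # Two wells; a single cold MCMC chain easily gets stuck in one.
    return min((x + 2.0) ** 2, (x - 2.0) ** 2 + 0.5)

def parallel_tempering(steps=20000, temps=(0.05, 0.2, 1.0, 5.0), seed=0):
    rng = random.Random(seed)
    chains = [rng.uniform(-4, 4) for _ in temps]
    best_x, best_e = chains[0], energy(chains[0])
    for _ in range(steps):
        # Metropolis move within each chain at its own temperature.
        for i, T in enumerate(temps):
            prop = chains[i] + rng.gauss(0, 0.5)
            dE = energy(prop) - energy(chains[i])
            if dE < 0 or rng.random() < math.exp(-dE / T):
                chains[i] = prop
        # Propose swapping states between adjacent temperatures.
        i = rng.randrange(len(temps) - 1)
        e_i, e_j = energy(chains[i]), energy(chains[i + 1])
        log_a = (e_i - e_j) * (1 / temps[i] - 1 / temps[i + 1])
        if log_a >= 0 or rng.random() < math.exp(log_a):
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
        e0 = energy(chains[0])
        if e0 < best_e:
            best_x, best_e = chains[0], e0
    return best_x, best_e
```

The hot chains roam across the barrier; swaps hand good states down to the cold chain, which refines them -- the "vary the temperatures of each lineage" idea above.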
- Execution is on the CPU, across multiple cores / threads, and possibly across multiple servers.
- Larger hypotheses took up to 7 days to find (!)
- And these aren't especially complicated grammars..
- See earlier paper {1572}
- This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
- Proves once again that:
- Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
- Or, the Johnson-Lindenstrauss lemma
- You can find those hypotheses through brute-force + heuristic search. (At least to the point that you run into the curse of dimensionality)
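The Johnson-Lindenstrauss point can be checked numerically: a scaled Gaussian random projection preserves pairwise distances to within a few percent (the dimensions and point count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 3000, 1000                    # n points in d dims, projected to k dims
X = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)    # scaled Gaussian random projection
Y = X @ P

def pairwise_dists(M):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, clipped for numerical safety
    sq = (M ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * M @ M.T, 0.0))

mask = ~np.eye(n, dtype=bool)
ratios = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
print(ratios.min(), ratios.max())           # distance ratios concentrate near 1
```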
A more interesting result is *Deep symbolic regression for recurrent sequences*, where the authors (Facebook/Meta) use a Transformer -- in this case taken directly from Vaswani 2017 (8-head, 8-layer QKV, latent dimension 512) -- for both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

While the language-learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic-regression paper shows that Transformers can evince either amortized memory (less likely) or amortized search heuristics over the problem domain (more likely).

This is not a light architecture: they trained the Transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took one hour on 16 Volta GPUs with 32 GB of memory each. So, 4k GPU-hours x ~10 TFlops ≈ 1.4e20 flops. Compare this with the grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec each is ~6e16 ops. Much, much smaller compute.

All of this suggests a central theme of computer science: a continuum between search and memorization.
- The language paper does fast search, but does not learn from the process (no bootstrapping), and maintains little state/memory.
- The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristics for the problem domain.
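Redoing the back-of-envelope arithmetic (assumptions: ~10 TFlops sustained per V100, and all 32 CPU cores counted, which puts the grammar search at roughly 6e16 ops):

```python
gpu_seconds = 16 * 250 * 3600           # 16 GPUs x 250 one-hour epochs
transformer_flops = gpu_seconds * 10e12 # ~10 TFlops sustained per GPU
cpu_ops = 7 * 86400 * 32 * 3e9          # 7 days x 32 cores x ~3 Gops/s

print(f"transformer: {transformer_flops:.1e} flops")  # ~1.4e20
print(f"grammar search: {cpu_ops:.1e} ops")           # ~5.8e16
print(f"ratio: {transformer_flops / cpu_ops:.0f}x")   # ~2500x
```

Even with every core counted, the grammar search is more than three orders of magnitude cheaper than the Transformer training run.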
Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where visual perception sits on these axes (search, heuristics, memory). Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating E-M? |