ref: -2022 tags: language learning symbolic regression Fleet meta search date: 06-04-2022 02:28 gmt revision:4 [3] [2] [1] [0] [head]

One model for the learning of language

  • Yuan Yang and Steven T. Piantadosi
  • Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
    • Pinker's argument that there is too little stimulus ("paucity of stimulus") for children to discern grammatical rules, hence they must be innate, is thereby refuted.
      • This is not the only refutation.
      • An argument was made on Twitter that large language models also refute the paucity of stimuli hypothesis. Meh, this paper does it far better -- the data used to train transformers is hardly small.
  • Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
    • This likelihood takes into account partial hits -- results that are within an edit distance of one of the desired sets of strings. (I think)
  • They use parallel tempering to search the space of programs.
    • Roughly: keep alive many different hypotheses, and vary the temperatures of each lineage to avoid getting stuck in local minima.
    • But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
  • Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
  • Larger hypotheses took up to 7 days to find (!)
    • These grammars aren't that complicated.
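The parallel tempering idea above can be sketched in a few lines. This is a generic toy, not Fleet's implementation: `parallel_tempering`, `toy_energy` and `toy_propose` are hypothetical names, the "hypotheses" are plain integers rather than programs, and the energy stands in for a negative log posterior.

```python
import math
import random


def parallel_tempering(energy, propose, init, temps, steps, rng):
    """Toy parallel tempering: one Metropolis chain per temperature, plus swaps."""
    states = [init for _ in temps]
    energies = [energy(init) for _ in temps]
    best_state, best_energy = init, energies[0]
    for _ in range(steps):
        # Metropolis update inside each chain, at that chain's temperature.
        for i, t in enumerate(temps):
            cand = propose(states[i], rng)
            e = energy(cand)
            if e <= energies[i] or rng.random() < math.exp((energies[i] - e) / t):
                states[i], energies[i] = cand, e
                if e < best_energy:
                    best_state, best_energy = cand, e
        # Propose exchanging the states of one random pair of adjacent temperatures.
        i = rng.randrange(len(temps) - 1)
        delta = (1 / temps[i] - 1 / temps[i + 1]) * (energies[i] - energies[i + 1])
        if delta >= 0 or rng.random() < math.exp(delta):
            states[i], states[i + 1] = states[i + 1], states[i]
            energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return best_state, best_energy


def toy_energy(x):
    # Hypothetical rugged landscape: a broad bowl with many local minima.
    return (x - 17) ** 2 / 50 + 3 * math.sin(x)


def toy_propose(x, rng):
    # Hypothetical local move: step one integer left or right.
    return x + rng.choice((-1, 1))
```

Calling `parallel_tempering(toy_energy, toy_propose, 0, [0.5, 1.0, 2.0, 4.0], 2000, random.Random(0))` returns the lowest-energy state any chain visited. The hot chains wander across barriers and the swaps hand their finds down to the cold chains, which is the "avoid getting stuck" part.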

  • This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
  • Proves once again that:
    1. Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
      1. Or, the Johnson-Lindenstrauss lemma
    2. You can find those hypotheses through brute-force + heuristic search. (At least to the point that you run into the curse of dimensionality)

A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer -- in this case, directly taken from Vaswani 2017 (8-head, 8-layer QKV w/ a latent dimension of 512) -- to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception -- both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly -- it also needs post-processing search. Also, the transformer algorithm has search (in the form of softmax) baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs w/ 32 GB of memory. So, 4k GPU-hours x ~10 TFLOP/s = 1.4e20 FLOPs. Compare this with grammar learning above; 7 days on 32 cores, each operating at ~3 Gops/sec, is 5.8e16 ops (1.8e15 per core). Much, much smaller compute.
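The back-of-envelope comparison can be multiplied through explicitly. The throughput figures (~10 TFLOP/s per GPU, ~3 Gops/sec per core) are the rough assumptions from the paragraph above, not measured values, and the variable names are only for illustration.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86_400

# Transformer: 250 epochs x 1 hour per epoch x 16 GPUs.
gpu_hours = 250 * 1 * 16
transformer_flop = gpu_hours * SECONDS_PER_HOUR * 10e12  # assumes ~10 TFLOP/s per GPU

# Grammar learning: 7 days of wall-clock time on 32 cores.
per_core_ops = 7 * SECONDS_PER_DAY * 3e9  # assumes ~3 Gops/sec per core
grammar_ops = per_core_ops * 32

ratio = transformer_flop / grammar_ops

print(f"transformer: {transformer_flop:.2e} FLOPs over {gpu_hours} GPU-hours")
print(f"grammar:     {per_core_ops:.2e} ops per core, {grammar_ops:.2e} ops on 32 cores")
print(f"ratio:       {ratio:.0f}x")
```

Under these assumptions the per-core figure is about 1.8e15 ops and the 32-core total about 5.8e16, so the transformer run uses roughly three orders of magnitude more compute.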

All of this is to suggest a central theme of computer science: a continuum between search and memorization.

  • The language paper does fast search, but does not learn from the process (bootstrap), and maintains little state/memory.
  • The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristics for the problem domain.

Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where on these axes (search, heuristic, memory) visual perception is. Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating E-M?

ref: -2022 tags: symbolic regression facebook AI transformer date: 05-17-2022 20:25 gmt revision:0 [head]

Deep symbolic regression for recurrent sequences

Surprisingly, they do not make any network structure changes; it's Vaswani 2017 w/ an 8-head, 8-layer transformer (sequence to sequence, not decoder only) with a latent dimension of 512. Significant work went into feature / representation engineering (e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers; both of these involve a vocabulary size of ~10k ... amazing still that this works) plus the significant training regimen they worked with (16 Volta GPUs, 32 GB each). Note that they do perform a bit of beam search over the symbolic regressions by checking how well each node fits the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work.)
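A minimal sketch of what a base-10k integer representation could look like: a sign token followed by base-10000 "digits", most significant first. The token format and the names `encode_int` / `decode_int` are assumptions for illustration; the paper's exact tokenization may differ.

```python
BASE = 10_000  # each "digit" token takes one of 10k values, hence the ~10k vocabulary


def encode_int(n, base=BASE):
    """Encode an integer as a sign token plus base-`base` digit tokens."""
    sign = "+" if n >= 0 else "-"
    n = abs(n)
    digits = []
    while True:
        n, d = divmod(n, base)
        digits.append(str(d))
        if n == 0:
            break
    return [sign] + digits[::-1]


def decode_int(tokens, base=BASE):
    """Invert encode_int."""
    value = 0
    for tok in tokens[1:]:
        value = value * base + int(tok)
    return value if tokens[0] == "+" else -value


print(encode_int(123456789))  # → ['+', '1', '2345', '6789']
```

The appeal is that large sequence terms stay short (three tokens here instead of nine decimal digits), at the cost of the network having to learn arithmetic over 10k distinct symbols.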

The paper does both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!
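To make the symbolic route concrete, here is a toy version of the refinement step: given a few candidate recurrences (hand-written here, where the real system would get them from the transformer's beam), keep the one that best reproduces the observed terms, then use it to extrapolate. The candidate set and the names `fit_error` / `extrapolate` are hypothetical.

```python
def extrapolate(recurrence, seed, n_terms):
    """Extend `seed` to `n_terms` terms; recurrence(history, n) returns term n."""
    seq = list(seed)
    while len(seq) < n_terms:
        seq.append(recurrence(seq, len(seq)))
    return seq


def fit_error(recurrence, observed, order):
    """Count terms the recurrence gets wrong when fed the true preceding terms."""
    return sum(
        recurrence(observed[:n], n) != observed[n]
        for n in range(order, len(observed))
    )


# name -> (recurrence, order); order is how many initial terms it needs.
candidates = {
    "u[n-1] + u[n-2]": (lambda u, n: u[n - 1] + u[n - 2], 2),
    "2*u[n-1]": (lambda u, n: 2 * u[n - 1], 1),
    "u[n-1] + n": (lambda u, n: u[n - 1] + n, 1),
}

observed = [1, 1, 2, 3, 5, 8, 13]
best = min(candidates, key=lambda name: fit_error(*candidates[name], observed)
           if False else fit_error(candidates[name][0], observed, candidates[name][1]))
print(best, extrapolate(candidates[best][0], observed, 10))
```

Once the right expression is found it extrapolates exactly and indefinitely, which is one way to read "symbolic regression generalizes better": a numeric predictor has to get every subsequent term right on its own.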

Analysis of how the transformers work for these problems is weak; there is only one figure, showing that the embeddings of the integers follow some meandering but continuous path in t-SNE space. Still, the trained transformer usually bests the hand-coded sequence inference engine(s) in Mathematica, and does so without memorizing all of the training data. Very impressive and important result, enough to convince me that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and took more effort.

It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.

ref: -0 tags: kernel regression structure discovery fitting gaussian process date: 09-24-2018 22:09 gmt revision:1 [0] [head]

Structure discovery in Nonparametric Regression through Compositional Kernel Search

  • Use Gaussian process kernels (squared exponential, periodic, linear, and rational quadratic)
  • to model a kernel function, $k(x,x')$, which specifies how similar or correlated outputs $y$ and $y'$ are expected to be at two points $x$ and $x'$.
    • By defining the measure of similarity between inputs, the kernel determines the pattern of inductive generalization.
    • This is different than modeling the mapping $y = f(x)$.
    • It's something more like $y' = N(m(x') + k(x,x'))$ -- check the appendix.
    • See also: http://rsta.royalsocietypublishing.org/content/371/1984/20110550
  • Gaussian process models use a kernel to define the covariance between any two function values: $Cov(y,y') = k(x,x')$.
  • This kernel family is closed under addition and multiplication, and provides an interpretable structure.
  • Search for kernel structure greedily & compositionally,
    • then optimize parameters with conjugate gradients with restarts.
    • This seems straightforwardly intuitive...
  • Kernels are scored with the BIC.
  • C.f. {842} -- "Because we learn expressions describing the covariance structure rather than the functions themselves, we are able to capture structure which does not have a simple parametric form."
  • All their figure examples are 1-D time-series, which is kinda boring, but makes sense for creating figures.
    • Tested on multidimensional (d=4) synthetic data too.
    • Not sure how they back out modeling the covariance into actual predictions -- just draw (integrate) from the distribution?
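A rough sketch of the compositional idea, using standard textbook forms of the four base kernels on scalar inputs. The function names and parameterizations are illustrative assumptions, not the authors' code, and the BIC is written in its usual form, -2 log(marginal likelihood) + (number of parameters) x log(n).

```python
import math


def se(lengthscale=1.0, variance=1.0):
    """Squared exponential: smooth, local similarity."""
    return lambda x, y: variance * math.exp(-((x - y) ** 2) / (2 * lengthscale ** 2))


def periodic(period=1.0, lengthscale=1.0, variance=1.0):
    """Periodic: similarity repeats every `period`."""
    return lambda x, y: variance * math.exp(
        -2 * math.sin(math.pi * (x - y) / period) ** 2 / lengthscale ** 2
    )


def linear(offset=0.0, variance=1.0):
    """Linear: covariance grows with distance from `offset`."""
    return lambda x, y: variance * (x - offset) * (y - offset)


def rq(lengthscale=1.0, alpha=1.0, variance=1.0):
    """Rational quadratic: a scale mixture of squared exponentials."""
    return lambda x, y: variance * (1 + (x - y) ** 2 / (2 * alpha * lengthscale ** 2)) ** (-alpha)


def add(k1, k2):
    """Sum of two kernels is again a valid kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)


def mul(k1, k2):
    """Product of two kernels is again a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)


def bic(log_marginal_likelihood, n_params, n_points):
    """Lower is better; extra parameters are penalized by log(n) each."""
    return -2 * log_marginal_likelihood + n_params * math.log(n_points)


# Example structure: a locally periodic component plus a linear trend.
k = add(mul(se(lengthscale=2.0), periodic(period=1.0)), linear())
```

A greedy search would start from the base kernels, try each `add` / `mul` extension of the current best expression, refit the parameters, and keep the candidate with the lowest `bic`. In standard GP regression, predictions then come from conditioning the joint Gaussian over observed and unobserved outputs on the data.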

ref: math-0 tags: partial least squares PLS regression thesis italy date: 03-26-2007 16:48 gmt revision:2 [1] [0] [head]


  • The PDF does not seem to open in Linux? No, it doesn't open on Windows either -- the PDF is screwed up!
  • here is a published version of his work.

ref: bookmark-0 tags: statistics logistic regression binomial logit BIC AIC SPSS date: 0-0-2006 0:0 revision:0 [head]


  • Transform probabilities into log-odds = logits: logit(p) = log(p / (1 - p)).
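For reference, the transform and its inverse in a few lines, using the usual definition of the logit as the log-odds. The function names are just for illustration.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the log-odds, which range over the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Logistic function: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


print(logit(0.5))  # → 0.0
```

Logistic regression fits a linear model to the logit, so coefficients read as changes in log-odds per unit change in the predictor.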