{1547} revision 2 modified: 08-05-2021 01:07 gmt

Meta-Learning Update Rules for Unsupervised Representation Learning

  • Central idea: meta-train a training-network (a MLP) which trains a task-network (also a MLP) to do unsupervised learning on one dataset.
  • The training network is optimized through SGD based on small-shot linear learning on a test set, typically different from the unsupervised training set.
  • The training-network is a per-weight MLP which takes in layer input, layer output, and a synthetic error (denoted η\eta ), and generates a and b, which are then fed into an outer-product Hebbian learning rule.
  • η\eta itself is formed through a backward pass through weights VV , which affords something like backprop -- but not exactly backprop, of course. See the figure.
  • Training consists of building up very large, backward through time gradient estimates relative to the parameters of the training-network. (And there are a lot!)
  • Trained on CIFAR10, MNIST, FashionMNIST, IMDB sentiment prediction. All have their input permuted to keep the training-network from learning per-task weights. Instead the network should learn to interpret the statistics between datapoints.
  • Indeed, it does this -- albeit with limits. Performance is OK, but only if you only do supervised learning on the very limited dataset used in the meta-optimization.
    • In practice, it's possible to completely solve tasks like MNIST with supervised learning; this gets to about 80% accuracy.
  • Images were kept small -- about 20x20 -- to speed up the inner loop unsupervised learning. Still, this took on the order of 200 hours across ~500 TPUs.
  • See, as a comparison, Keren's paper, Meta-learning biologically plausible semi-supervised update rules. It's conceptually nice but only evaluates the two-moons and two-gaussian datasets.

This is a clearly-written, easy to understand paper. The results are not highly compelling, but as a first set of experiments, it's successful enough.

I wonder what more constraints (fewer parameters, per the genome), more options for architecture modifications (e.g. different feedback schemes, per neurobiology), and a black-box optimization algorithm (evolution) would do?