Meta-Learning Update Rules for Unsupervised Representation Learning
- Central idea: meta-train a training-network (an MLP) which trains a task-network (also an MLP) to do unsupervised learning on one dataset.
- The training-network is optimized via SGD on a meta-objective: few-shot linear classification on a held-out test set, typically different from the unsupervised training set.
- The training-network is a per-weight MLP which takes in the layer input, the layer output, and a synthetic error signal, and generates two factors, a and b, which are then fed into an outer-product Hebbian learning rule.
- The synthetic error signal is itself formed by a backward pass through a separate set of weights, which affords something like backprop -- but not exactly backprop, of course. See the figure.
- Meta-training consists of building up very long backward-through-time gradient estimates with respect to the parameters of the training-network. (And there are a lot of them!)
- Trained on CIFAR10, MNIST, FashionMNIST, and IMDB sentiment prediction. All inputs are permuted to keep the training-network from learning per-task weights; instead the network must learn to interpret the statistics across datapoints.
- Indeed, it does this -- albeit with limits. Performance is OK, but only when evaluated with supervised learning on the same very limited dataset regime used in the meta-optimization.
- For comparison, standard supervised learning essentially solves tasks like MNIST; this approach gets to about 80% accuracy.
- Images were kept small -- about 20x20 -- to speed up the inner-loop unsupervised learning. Still, this took on the order of 200 hours across ~500 TPUs.
- See, as a comparison, Keren's paper, Meta-learning biologically plausible semi-supervised update rules. It's conceptually nice, but only evaluates on the two-moons and two-Gaussians datasets.
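A minimal numpy sketch of the core mechanism described above: a small, fixed "update network" maps per-unit quantities (layer output plus a synthetic error signal, and layer input) to factors a and b, which drive an outer-product Hebbian-style weight update. All shapes, architectures, and the exact inputs to the MLPs here are my own simplifying assumptions; in the paper, the update network's weights are what get meta-trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task-network layer: 8 inputs -> 4 units.
n_in, n_out = 8, 4
W = rng.normal(scale=0.1, size=(n_out, n_in))

# Hypothetical stand-ins for the meta-learned update network: tiny MLPs
# producing one scalar per unit. In the paper these parameters are the
# meta-trained quantities; here they are just random.
theta_a = (rng.normal(scale=0.5, size=(2, 8)), rng.normal(scale=0.5, size=8))
theta_b = (rng.normal(scale=0.5, size=(1, 8)), rng.normal(scale=0.5, size=8))

def update_mlp(feats, theta):
    """One hidden layer, tanh nonlinearity; returns one scalar per unit."""
    W1, w2 = theta
    return np.tanh(feats @ W1) @ w2

x = rng.normal(size=n_in)        # layer input (pre-synaptic activity)
y = np.tanh(W @ x)               # layer output (post-synaptic activity)
err = rng.normal(size=n_out)     # synthetic error signal (stand-in for
                                 # the backward pass through separate weights)

a = update_mlp(np.stack([y, err], axis=-1), theta_a)  # shape (n_out,)
b = update_mlp(x[:, None], theta_b)                   # shape (n_in,)

# Outer-product Hebbian-style update generated by the update network.
lr = 0.01
W = W + lr * np.outer(a, b)
```

Meta-training would then unroll many such inner-loop updates and backprop through the whole trajectory into theta_a and theta_b, which is where the very long backward-through-time gradients come from.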
This is a clearly written, easy-to-understand paper. The results are not highly compelling, but as a first set of experiments, it's successful enough.
I wonder what more constraints (fewer parameters, per the genome), more options for architecture modification (e.g. different feedback schemes, per neurobiology), and a black-box optimization algorithm (evolution) would do?