Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
- Sergey Bartunov, Adam Santoro, Blake A. Richards, Luke Marris, Geoffrey E. Hinton, Timothy Lillicrap
- As is well known, many algorithms work well on MNIST but fail on more complicated tasks like CIFAR and ImageNet.
- In their experiments, backprop still fares better than any of the biologically inspired / biologically plausible learning rules. This includes:
- Feedback alignment {1432} {1423}
- Vanilla target propagation
- Problem: with convergent networks, the layer-wise inverses (top-down) map all items of the same class to one target vector in each layer, which is very limiting (see the toy sketch below).
- Hence this algorithm was not directly investigated.
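- A toy numpy sketch of why (names and shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(0.0, 0.1, size=(64, 10))  # illustrative inverse-model weights

def g(h_above):
    """Learned top-down inverse g(.; lambda): linear map + nonlinearity."""
    return np.tanh(V @ h_above)

# Vanilla TP sets the penultimate target to the inverse of the output target.
# The output target is the one-hot label, so the target is a function of the
# class alone: two different inputs of class 3 receive the identical target
# vector, and all input-specific information is discarded.
y = np.eye(10)[3]
target_x1 = g(y)  # target for input x1 (class 3)
target_x2 = g(y)  # target for input x2 (also class 3) -- x never enters
assert np.allclose(target_x1, target_x2)
```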
- Difference target propagation (Lee et al., 2015)
- Uses the per-layer target ĥ_l = h_l + g(ĥ_{l+1}; λ_{l+1}) − g(h_{l+1}; λ_{l+1}).
- Or, spelled out: ĥ_l = h_l + σ(V_{l+1} ĥ_{l+1} + c_{l+1}) − σ(V_{l+1} h_{l+1} + c_{l+1}), where λ = {V, c} are the parameters for the inverse model; g(h; λ) = σ(V h + c) is the sum and nonlinearity.
- That is, the target is modified à la the delta rule, by the difference between the inverse-propagated higher-layer target and the inverse-propagated higher-layer activity.
- Why? ĥ_l should approach h_l as ĥ_{l+1} approaches h_{l+1}.
- Otherwise, the parameters in the lower layers would continue to be updated even when low loss is reached in the upper layers (from the original paper).
- The penultimate-to-last layer weights are trained via backprop, to prevent the template-impoverishment problem noted above (see the sketch below).
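- A minimal numpy sketch of one DTP target pass (the layer sizes, tanh nonlinearities, and squared-error loss are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [784, 256, 128, 10]  # illustrative layer sizes
Ws = [rng.normal(0, 0.05, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # forward weights
Vs = [rng.normal(0, 0.05, (n, m)) for n, m in zip(sizes[:-1], sizes[1:])]  # inverse weights

f = lambda W, h: np.tanh(W @ h)  # forward module f(h; theta)
g = lambda V, h: np.tanh(V @ h)  # learned inverse g(h; lambda)

x, y = rng.normal(size=784), np.eye(10)[3]

# Forward pass: h_0 = x, h_{l+1} = f(h_l).
hs = [x]
for W in Ws:
    hs.append(f(W, hs[-1]))

# DTP top target: a gradient step on the output loss (squared error here),
# i.e. the last layer is still effectively trained by backprop.
lr = 0.1
L = len(hs) - 1
targets = {L: hs[L] - lr * (hs[L] - y)}

# Difference target propagation for the layers below:
#   h^_l = h_l + g(h^_{l+1}) - g(h_{l+1})
# The -g(h_{l+1}) term cancels the inverse's reconstruction error, so
# h^_l -> h_l as h^_{l+1} -> h_{l+1}, and lower layers stop updating once
# the upper layers have converged.
for l in range(L - 1, 0, -1):
    targets[l] = hs[l] + g(Vs[l], targets[l + 1]) - g(Vs[l], hs[l + 1])

# Each layer l then makes a local update on ||f(W_l, h_l) - h^_{l+1}||^2;
# the inverses are trained on a reconstruction loss (omitted here).
```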
- Simplified difference target propagation
- They substitute a biologically plausible learning rule for the penultimate layer: ĥ_{L−1} = h_{L−1} + g(ĥ_L; λ_L) − g(h_L; λ_L), with the target output ĥ_L set to the (one-hot) label,
- where there are L layers.
- It's the same rule as in the other layers.
- Hence it's subject to the impoverishment problem with low-entropy labels (see the continuation of the sketch below).
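- Continuing the sketch above, SDTP replaces the gradient-based top target with the same difference rule, using the label as the output target:

```python
# SDTP: no gradient step anywhere. The penultimate target uses the same
# difference rule, with the one-hot label as the output target h^_L = y:
targets[L - 1] = hs[L - 1] + g(Vs[L - 1], y) - g(Vs[L - 1], hs[L])
# Since y is one-hot, g(V, y) takes only num_classes distinct values, so
# little input-specific signal reaches the layers below (impoverishment).
```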
- Auxiliary output simplified difference target propagation
- Add a vector z to the last-layer activation, which carries information about the input vector.
- z is just a set of random features of the activation h_{L−1} (sketched below).
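- Continuing the sketch, one reading of the auxiliary-output variant (the fixed random projection and using z as its own target are my assumptions; the note above only says z is a set of random features):

```python
# Augment the last-layer activation with random features z of the
# penultimate activity; R is fixed and never trained.
R = rng.normal(0, 0.05, (32, sizes[-2]))  # 32 auxiliary features (assumed size)
z = np.tanh(R @ hs[L - 1])                # random features of h_{L-1}
out = np.concatenate([hs[L], z])          # last-layer activation [o; z]
out_target = np.concatenate([y, z])       # label target; z is its own target
# The inverse now maps this richer vector down, so the targets for two
# same-class inputs differ whenever their z features differ.
```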
- They used both fully-connected and locally-connected (i.e. convolution without weight sharing) networks.
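- For reference, a locally-connected layer is a convolution whose kernel weights are not shared across positions; a minimal 1-D numpy sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k, n_in = 3, 16                         # kernel size, input length
n_pos = n_in - k + 1                    # output positions ("valid" padding)
W_loc = rng.normal(0, 0.1, (n_pos, k))  # a separate kernel per position
x1d = rng.normal(size=n_in)

# Each output position applies its *own* weights to its local input patch;
# tying all rows of W_loc to one shared kernel would recover a convolution.
out = np.array([np.tanh(W_loc[p] @ x1d[p:p + k]) for p in range(n_pos)])
```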
- It's not so great: [results figure omitted]
- Target propagation seems like a weak learner, worse than feedback alignment; not only is the feedback limited, but it does not take advantage of the statistics of the input.
- Hence, some of these schemes may work better when combined with unsupervised learning rules.
- Still, in the original paper they use difference target propagation with autoencoders, and get reasonable stroke features.
- Their general result that networks and learning rules need to be tested on more difficult tasks rings true, and might well be the main point of this otherwise meh paper.