ref: -2020 tags: feedback alignment local hebbian learning rules stanford date: 04-22-2021 03:26 gmt revision:0 [head]

Two Routes to Scalable Credit Assignment without Weight Symmetry

This paper looks at five different learning rules, three purely local and two non-local, to see if they can work as well as backprop in training a deep convolutional net on ImageNet. The local learning networks all feature forward weights W and backward weights B: the forward weights (plus nonlinearities) carry the input forward to a classification; the backward weights carry the error, which is used to locally adjust the forward weights.

Hence, each fake neuron has locally available the forward activation, the backward error (or loss gradient), the forward weight, the backward weight, and Hebbian terms thereof (e.g. the outer product of the in-out vectors for both forward and backward passes). From these available variables, they construct the local learning rules:

  • Decay (exponentially decay the backward weights)
  • Amp (Hebbian learning)
  • Null (decay based on the product of the weight and local activation; this effects a Euclidean norm on reconstruction)

Each of these serves as a "regularizer term" on the feedback weights, which governs their learning dynamics. In the case of backprop, the backward weights B are just the instantaneous transpose of the forward weights W. A good local learning rule approximates this transpose progressively. They show that, with proper hyperparameter setting, this does indeed work nearly as well as backprop when training a ResNet-18 network.
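As a rough illustration of the three local rules, here is a minimal NumPy sketch of one gradient step on the backward weights for each. The gradient forms, the shape convention for B (lower-layer units by upper-layer units), the unit-norm activation, and the names `decay_grad`, `amp_grad`, and `null_grad` are all assumptions made for illustration, not the paper's code.

```python
import numpy as np

def decay_grad(B):
    # Assumed form: gradient of 0.5 * ||B||^2, which shrinks B toward zero.
    return B

def amp_grad(x_lo, x_hi):
    # Assumed Hebbian form: negative outer product of the two layers' activations.
    return -np.outer(x_lo, x_hi)

def null_grad(B, x_hi):
    # Assumed form: gradient of 0.5 * ||B @ x_hi||^2, a decay weighted by activation.
    return np.outer(B @ x_hi, x_hi)

rng = np.random.default_rng(0)
n_lo, n_hi = 4, 3
B = rng.normal(size=(n_lo, n_hi))        # backward weights: upper layer -> lower layer
x_lo = rng.normal(size=n_lo)             # lower-layer activation
x_hi = rng.normal(size=n_hi)
x_hi = x_hi / np.linalg.norm(x_hi)       # unit norm keeps the Null step easy to read

lr = 0.1
B_decayed = B - lr * decay_grad(B)       # equals (1 - lr) * B
B_amped = B - lr * amp_grad(x_lo, x_hi)  # adds lr * outer(x_lo, x_hi)
B_nulled = B - lr * null_grad(B, x_hi)   # shrinks B only along x_hi
```

Under these assumptions, Decay shrinks every backward weight uniformly, while Null shrinks only the part of B that responds to the current activation.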

But, hyperparameter settings don't translate to other network topologies. To allow this, they add in non-local learning rules:

  • Sparse (penalizes the Euclidean norm of the previous layer; gradient is the outer product of the (current layer activation, transposed) * B)
  • Self (directly measures the forward weights and uses them to update the backward weights)

In "Symmetric Alignment", the Self and Decay rules are employed. This is similar to backprop (the backward weights will track the forward ones) with L2 regularization, which is not new. It performs very similarly to backprop. In "Activation Alignment", Amp and Sparse rules are employed. I assume this is supposed to be more biologically plausible -- the Hebbian term can track the forward weights, while the Sparse rule regularizes and stabilizes the learning, such that overall dynamics allow the gradient to flow even if W and B aren't transposes of each other.
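To see why Self plus Decay makes the backward weights track the transpose, here is a minimal sketch with the forward weights held fixed. It assumes the Self gradient is -W^T and the Decay gradient is B, so that their sum is B - W^T; the learning rate and matrix sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))   # forward weights: 5 inputs -> 3 outputs (held fixed here)
B = rng.normal(size=(5, 3))   # backward weights, random start

lr = 0.2
for _ in range(100):
    # Assumed combined gradient: Self contributes -W.T, Decay contributes +B.
    grad = B - W.T
    B = B - lr * grad

# Distance between B and the transpose of W after the updates.
alignment_error = np.linalg.norm(B - W.T)
```

Each step multiplies the gap B - W^T by (1 - lr), so the gap decays geometrically. In real training W moves too, and B follows it with a lag set by the learning rate.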

Surprisingly, they find Symmetric Alignment to be more robust to the injection of Gaussian noise during training than backprop. Both SA and AA achieve similar accuracies on the ResNet benchmark. The authors then go on to explain the plausibility of non-local but approximate learning rules via regression discontinuity design, à la "Spiking allows neurons to estimate their causal effect".

This is a decent paper, reasonably well written. They thought through what variables are available to affect learning, and parameterized five combinations that work. Could they have done the full matrix of combinations, optimizing them just the same as the metaparameters? Perhaps, but that would be even more work ...

Regarding the desire to reconcile backprop and biology, this paper does not bring us much (if at all) closer. Biological neural networks have specific and local uses for error; even invoking 'error' has limited explanatory power on activity. Learning and firing dynamics, of course, of course. Is the brain then just an overbearing mess of details and overlapping rules? Yes, probably, but that doesn't mean that we humans can't find something simpler that works. The algorithms in this paper, for example, are well described by a bit of linear algebra, and yet they are performant.

ref: -0 tags: credit assignment distributed feedback alignment penn state MNIST fashion backprop date: 03-16-2019 02:21 gmt revision:1 [0] [head]

Conducting credit assignment by aligning local distributed representations

  • Alexander G. Ororbia, Ankur Mali, Daniel Kifer, C. Lee Giles
  • Propose two related algorithms: Local Representation Alignment (LRA)-diff and LRA-fdbk.
    • LRA-diff is basically a modified form of backprop.
    • LRA-fdbk is a modified version of feedback alignment. {1432} {1423}
  • Test on MNIST (easy -- many digits can be discriminated with one pixel!) and fashion-MNIST (harder -- humans only get about 85% right!)
  • Use a Cauchy or log-penalty loss at each layer, which is somewhat unique and interesting: $L(z,y) = \sum_{i=1}^n \log(1 + (y_i - z_i)^2)$.
    • This is hence a saturating loss.
  1. Normal multi-layer-perceptron feedforward network. Pre-activation $h^\ell$ and post-activation $z^\ell$ are stored.
  2. Update the weights to minimize loss. This gradient calculation is identical to backprop, only they constrain the update to have a norm no bigger than $c_1$. Z and Y are actual and desired output of the layer, as commented. Gradient includes the derivative of the nonlinear activation function.
  3. Generate update for the pre-nonlinearity $h^{\ell-1}$ to minimize the loss in the layer above. This again is very similar to backprop; it's the chain rule -- but the derivatives are vectors, of course, so those should be element-wise multiplication, not outer products (I think).
    1. Note $h$ is updated -- derivatives of two nonlinearities.
  4. Feedback-alignment version, with random matrix $E_\ell$ (elements drawn from a Gaussian distribution, $\sigma = 1$-ish).
    1. Only one nonlinearity derivative here -- bug?
  5. Move the pre and post activations in the specified gradient direction.
    1. Those $\bar{h}^{\ell-1}$ variables are temporary holding -- but note that both lower and higher layers are updated.
  6. Do this K times, K = 1-50.
  • In practice K=1, with the LRA-fdbk algorithm, for the majority of the paper -- it works much better than LRA-diff (interesting .. bug?). Hence, this basically reduces to feedback alignment.
  • Demonstrate that LRA works much better with small initial weights, but basically because they tweak the algorithm to do this.
    • Need to see a positive control for this to be conclusive.
    • Again, why is FA so different from LRA-fdbk? Suspicious. Positive controls.
  • Attempted a network with Local Winner Take All (LWTA), which is a hard nonlinearity that LRA was able to account for & train through.
  • Also used Bernoulli neurons, and were able to successfully train. Unlike drop-out, these were stochastic at test time, and things still worked OK.
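The log-penalty loss and the norm constraint on the update can be sketched as below. The gradient follows directly from the loss formula. The rescaling used to enforce the norm bound, and the function names, are assumptions, since the notes only say the update norm is kept no bigger than $c_1$.

```python
import numpy as np

def log_penalty_loss(z, y):
    # L(z, y) = sum_i log(1 + (y_i - z_i)^2)
    return np.sum(np.log1p((y - z) ** 2))

def log_penalty_grad(z, y):
    # dL/dz_i = -2 (y_i - z_i) / (1 + (y_i - z_i)^2)
    d = y - z
    return -2.0 * d / (1.0 + d ** 2)

def clip_norm(g, c):
    # Assumed constraint: rescale g so that its Euclidean norm is at most c.
    n = np.linalg.norm(g)
    return g if n <= c else g * (c / n)

z = np.array([0.0, 0.0, 0.0])
y = np.array([0.1, 1.0, 100.0])      # small, medium, and very large errors
grads = log_penalty_grad(z, y)       # magnitude peaks at |error| = 1, then falls off
clipped = clip_norm(grads, 0.5)
```

The gradient magnitude is largest at an error of 1 and shrinks for larger errors, which is the saturating behaviour: outliers pull on the weights less than they would under squared error.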

Lit review.
  • Logistic sigmoid can slow down learning, due to its non-zero mean (Glorot & Bengio 2010).
  • Recirculation algorithm (or generalized recirculation) is a precursor for target propagation.
  • Target propagation is all about the inverse of the forward propagation: if we had access to the inverse of the network of forward propagations, we could compute which input values at the lower levels of the network would result in better values at the top that would please the global cost.
    • This is a very different way of looking at it -- almost backwards!
    • And indeed, it's not really all that different from contrastive divergence. (even though CD doesn't work well with non-Bernoulli units)
  • Contrastive Hebbian learning also has two phases, one to fantasize, and one to try to make the fantasies look more like the input data.
  • Decoupled neural interfaces (Jaderberg et al 2016): learn a predictive model of error gradients (and inputs) instead of trying to use local information to estimate updated weights.

  • Yeah, call me a critic, but I'm not clear on the contribution of this paper; it smells precocious and over-sold.
    • Even the title. I was hoping for something more 'local' than per-layer computation. BP does that already!
  • They primarily report supportive tests, not discriminative or stressing tests; how does the algorithm fail?
    • Certainly a lot of work went into it..
  • I still don't see how the computation of a target through a random matrix, then using delta/loss/error between that target and the feedforward activation to update weights, is much different than propagating the errors directly through a random feedback matrix. E.g. subtract then multiply, or multiply then subtract?

ref: -2018 tags: biologically inspired deep learning feedback alignment direct difference target propagation date: 03-15-2019 05:51 gmt revision:5 [4] [3] [2] [1] [0] [head]

Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures

  • Sergey Bartunov, Adam Santoro, Blake A. Richards, Luke Marris, Geoffrey E. Hinton, Timothy Lillicrap
  • As is known, many algorithms work well on MNIST, but fail on more complicated tasks, like CIFAR and ImageNet.
  • In their experiments, backprop still fares better than any of the biologically inspired / biologically plausible learning rules. This includes:
    • Feedback alignment {1432} {1423}
    • Vanilla target propagation
      • Problem: with convergent networks, layer inverses (top-down) will map all items of the same class to one target vector in each layer, which is very limiting.
      • Hence this algorithm was not directly investigated.
    • Difference target propagation (2015)
      • Uses the per-layer target as $\hat{h}_l = g(\hat{h}_{l+1}; \lambda_{l+1}) + [h_l - g(h_{l+1};\lambda_{l+1})]$
      • Or: $\hat{h}_l = h_l + g(\hat{h}_{l+1}; \lambda_{l+1}) - g(h_{l+1};\lambda_{l+1})$, where $\lambda_l$ are the parameters for the inverse model; $g()$ is the sum and nonlinearity.
      • That is, the target is modified à la delta rule by the difference between inverse-propagated higher layer target and inverse-propagated higher level activity.
        • Why? $h_l$ should approach $\hat{h}_l$ as $h_{l+1}$ approaches $\hat{h}_{l+1}$.
        • Otherwise, the parameters in lower layers continue to be updated even when low loss is reached in the upper layers. (from original paper).
      • The last-to-penultimate layer weights are trained via backprop to prevent template impoverishment as noted above.
    • Simplified difference target propagation
      • They substitute a biologically plausible learning rule for the penultimate layer,
      • $\hat{h}_{L-1} = h_{L-1} + g(\hat{h}_L;\lambda_L) - g(h_L;\lambda_L)$ where there are $L$ layers.
      • It's the same rule as the other layers.
      • Hence subject to impoverishment problem with low-entropy labels.
    • Auxiliary output simplified difference target propagation
      • Add a vector $z$ to the last layer activation, which carries information about the input vector.
      • $z$ is just a set of random features from the activation $h_{L-1}$.
  • Used both fully connected and locally-connected (e.g. convolution without weight sharing) MLP.
  • It's not so great:
  • Target propagation seems like a weak learner, worse than feedback alignment; not only is the feedback limited, but it does not take advantage of the statistics of the input.
    • Hence, some of these schemes may work better when combined with unsupervised learning rules.
    • Still, in the original paper they use difference-target propagation with autoencoders, and get reasonable stroke features..
  • Their general result that networks and learning rules need to be tested on more difficult tasks rings true, and might well be the main point of this otherwise meh paper.
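The difference target propagation rule from the list above can be sketched in a few lines. The inverse model `g` here is a hypothetical single linear map followed by tanh, with a random matrix `V` standing in for the parameters lambda; the real inverse models are learned, so this is only meant to show the shape of the computation.

```python
import numpy as np

def g(h_upper, V):
    # Hypothetical inverse model: weighted sum followed by a nonlinearity.
    return np.tanh(V @ h_upper)

def dtp_target(h_l, h_upper, h_upper_target, V):
    # h_hat_l = h_l + g(h_hat_{l+1}) - g(h_{l+1})
    return h_l + g(h_upper_target, V) - g(h_upper, V)

rng = np.random.default_rng(2)
V = rng.normal(size=(4, 3))      # stand-in inverse parameters, layer l+1 -> layer l
h_l = rng.normal(size=4)         # activity in layer l
h_upper = rng.normal(size=3)     # activity in layer l+1

same = dtp_target(h_l, h_upper, h_upper, V)          # upper target already met
moved = dtp_target(h_l, h_upper, h_upper + 0.1, V)   # upper target differs from activity
```

When the upper-layer target equals the upper-layer activity, the two inverse terms cancel and the lower-layer target is just the current activity, so lower layers stop being pushed once the upper layers are satisfied.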

ref: -0 tags: feedback alignment Arild Nokland MNIST CIFAR date: 02-14-2019 02:15 gmt revision:0 [head]

Direct Feedback alignment provides learning in deep neural nets

  • from {1423}
  • Feedback alignment is able to provide zero training error even in convolutional networks and very deep networks, completely without error back-propagation.
  • Biologically plausible: error signal is entirely local, no symmetric or reciprocal weights required.
    • Still, it requires supervision.
  • Almost as good as backprop!
  • Clearly written, easy to follow math.
    • Though the proof that feedback-alignment direction is within 90 deg of backprop is a bit impenetrable, needs some reorganization or additional exposition / annotation.
  • 3x400 tanh network tested on MNIST; performs similarly to backprop, if faster.
  • Also able to train very deep networks on MNIST, CIFAR-10, and CIFAR-100, up to 100 layers (which actually hurts on this task).

ref: -2014 tags: Lillicrap Random feedback alignment weights synaptic learning backprop MNIST date: 02-14-2019 01:02 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-27824044 Random synaptic feedback weights support error backpropagation for deep learning.

  • "Here we present a surprisingly simple algorithm for deep learning, which assigns blame by multiplying error signals by random synaptic weights."
  • Backprop multiplies error signals e by the weight matrix $W^T$, the transpose of the forward synaptic weights.
  • But the feedback weights do not need to be exactly $W^T$; any matrix B will suffice, so long as on average:
  • $e^T W B e > 0$
    • Meaning that the teaching signal $B e$ lies within 90 deg of the signal used by backprop, $W^T e$
  • Feedback alignment actually seems to work better than backprop in some cases. This relies on starting the weights very small (can't be zero -- no output)
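A toy version of feedback alignment on a two-layer linear network is sketched below. The task (fitting a random linear map), the layer sizes, the learning rate, and the step count are all assumptions chosen for illustration, not the paper's experimental setup. The only change from backprop is that the hidden-layer error uses the fixed random matrix B instead of the transpose of W.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, n_out, n_samples = 10, 20, 5, 200

T = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)  # target linear map (toy task)
X = rng.normal(size=(n_in, n_samples))
Y = T @ X

W0 = 0.01 * rng.normal(size=(n_hid, n_in))       # input -> hidden, small start
W = 0.01 * rng.normal(size=(n_out, n_hid))       # hidden -> output, small start
B = rng.uniform(-0.5, 0.5, size=(n_hid, n_out))  # fixed random feedback, never trained

def loss(W0, W):
    E = W @ (W0 @ X) - Y
    return 0.5 * np.mean(np.sum(E ** 2, axis=0))

initial_loss = loss(W0, W)
lr = 0.02
for _ in range(2000):
    H = W0 @ X
    E = W @ H - Y                  # output error e, one column per sample
    dH = B @ E                     # feedback alignment: B e in place of W.T e
    W -= lr * (E @ H.T) / n_samples
    W0 -= lr * (dH @ X.T) / n_samples

final_loss = loss(W0, W)
# trace(W B) > 0 means e^T W B e > 0 on average over isotropic errors e.
alignment = np.trace(W @ B)
```

Comparing `final_loss` with `initial_loss`, and checking the sign of `alignment`, gives a quick view of whether the forward weights have come into alignment with the fixed feedback.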

Our proof says that weights $W_0$ and $W$ evolve to equilibrium manifolds, but simulations (Fig. 4) and analytic results (Supplementary Proof 2) hint at something more specific: that when the weights begin near 0, feedback alignment encourages $W$ to act like a local pseudoinverse of $B$ around the error manifold. This fact is important because if $B$ were exactly $W^+$ (the Moore-Penrose pseudoinverse of $W$), then the network would be performing Gauss-Newton optimization (Supplementary Proof 3). We call this update rule for the hidden units pseudobackprop and denote it by $\Delta h_{PBP} = W^+ e$. Experiments with the linear network show that the angle, $\Delta h_{FA} \angle \Delta h_{PBP}$ quickly becomes smaller than $\Delta h_{FA} \angle \Delta h_{BP}$ (Fig. 4b, c; see Methods). In other words feedback alignment, despite its simplicity, displays elements of second-order learning.
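The backprop and pseudobackprop directions for the hidden units can be compared directly with NumPy's pseudoinverse. The matrix sizes and random values below are placeholders, and `angle_deg` is a helper introduced only for this sketch.

```python
import numpy as np

def angle_deg(u, v):
    # Angle between two vectors, in degrees.
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(4)
W = rng.normal(size=(5, 20))    # forward weights: hidden (20) -> output (5)
e = rng.normal(size=5)          # output error

dh_bp = W.T @ e                 # backprop direction, W^T e
W_pinv = np.linalg.pinv(W)      # Moore-Penrose pseudoinverse, shape (20, 5)
dh_pbp = W_pinv @ e             # pseudobackprop direction, W^+ e

angle_bp_pbp = angle_deg(dh_bp, dh_pbp)
```

When W has full row rank, W @ dh_pbp reproduces e exactly, so a hidden-unit change of W^+ e maps back to the output error itself. The two directions always lie within 90 degrees of each other in that case, since their inner product equals e^T e.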