PMID-22325196 Backpropagation through time and the brain
- Timothy Lillicrap and Adam Santoro
- Backpropagation through time (BPTT): the 'canonical' extension of backprop for assigning credit in the recurrent neural networks used in machine learning.
- E.g. variable-length roll-outs, where the error is propagated many times through the recurrent weight matrix.
- This leads to the exploding or vanishing gradient problem.
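A minimal sketch (mine, not from the paper) of why this happens: backpropagating through T steps of a linear recurrence multiplies the gradient by the recurrent Jacobian T times, so its norm scales like the spectral radius of W raised to the T.

```python
import numpy as np

# Backprop through T steps of h_t = W @ h_{t-1} multiplies the gradient
# by W^T each step, so its norm scales like (spectral radius of W) ** T.
def gradient_norm_after(W, T):
    g = np.ones(W.shape[0])       # stand-in for dLoss/dh_T
    for _ in range(T):
        g = W.T @ g               # one step of backprop through the recurrence
    return np.linalg.norm(g)

W_small = 0.5 * np.eye(2)         # spectral radius 0.5 -> gradient vanishes
W_large = 1.5 * np.eye(2)         # spectral radius 1.5 -> gradient explodes
vanished = gradient_norm_after(W_small, 50)   # numerically ~0
exploded = gradient_norm_after(W_large, 50)   # astronomically large
```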
- TCA = temporal credit assignment. What lead to this reward or error? How to affect memory to encourage or avoid this?
- One approach is to simply truncate the error: truncated backpropagation through time (TBPTT). But this of course limits the horizon of learning.
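A toy sketch of the truncation (my own illustration, scalar RNN for clarity): the full gradient is a sum of per-timestep credit terms, and TBPTT simply drops every term older than the truncation window k, so distant causes get no credit.

```python
# Hedged sketch of truncated BPTT on a scalar linear RNN
# h_t = w * h_{t-1} + x_t with loss L = h_T. Full BPTT gives
# dL/dw = sum_t w**(T-t) * h_{t-1}; truncation keeps only the last k terms.
def dL_dw(w, xs, k):
    hs = [0.0]                     # h_0 = 0
    for x in xs:
        hs.append(w * hs[-1] + x)  # forward pass
    T = len(xs)
    start = max(0, T - k)          # truncate: ignore steps before T - k
    return sum(w ** (T - t) * hs[t - 1] for t in range(start + 1, T + 1))
```

With w = 1 and inputs [1, 1, 1, 1], the full gradient (k = 4) is 6, but truncating to k = 1 yields only 3: the earlier steps' credit is lost.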
- The brain may do BPTT via replay in both the hippocampus and cortex Nat. Neuroscience 2007, thereby alleviating the need to retain long time histories of neuron activations (needed for derivative and credit assignment).
- A less-known method of TCA uses RTRL (real-time recurrent learning), i.e. forward-mode differentiation -- the gradient is computed and maintained online, often with synaptic weight updates applied at each time step in which there is non-zero error. See "A learning algorithm for continually running fully recurrent neural networks" (Williams & Zipser, 1989).
- Big problem: a network with n recurrent units requires O(n^3) storage and O(n^4) computation at each time-step to maintain the sensitivities.
- Can be mitigated with Unbiased Online Recurrent Optimization (UORO), which stores approximate but unbiased gradient estimates to reduce computation / storage.
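A scalar sketch of the RTRL idea (mine, same toy RNN as above): the sensitivity dh/dw obeys its own forward recursion, so the exact gradient is available at every step with no stored activation history -- the cost, in the general n-unit case, is carrying that full sensitivity tensor forward.

```python
# Hedged sketch of RTRL (forward-mode) on a scalar RNN h_t = w * h_{t-1} + x_t.
# The sensitivity s_t = dh_t/dw obeys the forward recursion
# s_t = h_{t-1} + w * s_{t-1}, updated alongside the state itself.
def rtrl_gradients(w, xs):
    h, s, grads = 0.0, 0.0, []
    for x in xs:
        s = h + w * s          # update sensitivity before h is overwritten
        h = w * h + x          # ordinary forward step
        grads.append(s)        # dh_t/dw, usable for an online weight update
    return grads
```

Reassuringly, with w = 1 and inputs [1, 1, 1, 1] the final sensitivity is 6, matching what full BPTT would assign -- but computed strictly forward in time.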
- Attention seems like a much better way of approaching the TCA problem: past events are stored externally, and the network learns a differentiable attention-alignment module for selecting these events.
- Memory can be finite-size, ever-expanding, or self-compressing.
- Highlight the utility/necessity of content-addressable memory.
- Attentional gating can eliminate the exploding / vanishing / corrupting gradient problems -- the gradient paths are skip-connections.
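A small sketch of what such a read-out might look like (my illustration, standard softmax attention, not the paper's mechanism): each past event sits in an external (key, value) slot, and a query retrieves a softmax-weighted mixture. The gradient reaches each slot through a single weighted sum rather than a long chain of recurrent multiplications -- the skip-connection point above.

```python
import numpy as np

# Content-based attention over an external memory of (key, value) slots.
def attend(query, keys, values):
    scores = keys @ query                     # content-based addressing
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax attention weights
    return w @ values                         # differentiable read-out

keys = np.eye(3)                              # three stored events
values = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
read = attend(10.0 * keys[1], keys, values)   # sharp query -> recalls slot 1
```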
- Biologically plausible: partial reactivation of CA3 memories induces re-activation of neocortical neurons responsible for initial encoding PMID-15685217 The organization of recent and remote memories. 2005
- I remain reserved about the utility of thinking in terms of gradients when describing how the brain learns. Correlations, yes; causation, absolutely; credit assignment, for sure. Yet propagating gradients as a means for changing network weights seems at best a part of the puzzle. So much of behavior and internal cognitive life involves explicit, conscious computation of cause and credit.
- This leaves me much more sanguine about the use of external memory to guide behavior ... but differentiable attention? Hmm.