PMID26659050 Human-level concept learning through probabilistic program induction
 Preface:
 How do people learn new concepts from just one or a few examples?
 And how do people learn such abstract, rich, and flexible representations?
 How can learning from such sparse data succeed while also producing such rich representations?
 For any theory of learning, fitting a more complicated model requires more data, not less, to achieve good generalization, usually measured as the gap in performance between old and new examples.
 Learning proceeds by constructing programs that best explain the observations under a Bayesian criterion, and the model 'learns to learn' by developing hierarchical priors that allow previous experience with related concepts to ease learning of new concepts.
 These priors represent a learned inductive bias that abstracts the key regularities and dimensions of variation holding across both types of concepts and across instances.
 BPL can construct new programs by reusing pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.

 Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image.
 Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses.
 That is, they reduce the character to a set of lines (series of line segments), simplify the intersections of those lines, and run a series of parses to estimate how those lines were generated, with heuristic criteria to encourage continuity (e.g. no sharp angles, a penalty for abruptly changing direction).
 The most promising candidates are refined by continuous optimization and local search, forming a discrete approximation to the posterior distribution P(program, parameters | image).
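 A minimal sketch of that last step, assuming a handful of candidate parses with invented log-prior and log-likelihood scores (the names and numbers are illustrative, not from the paper): normalizing the joint scores over the retained candidates gives a discrete approximation to the posterior.

```python
import math

# Hypothetical candidate parses of one image (scores are made up).
# log_prior: a priori plausibility of the parse's program;
# log_lik: how well the rendered program explains the pixels.
candidates = [
    {"parse": "two-stroke", "log_prior": -4.0, "log_lik": -10.0},
    {"parse": "three-stroke", "log_prior": -6.0, "log_lik": -8.5},
    {"parse": "one-stroke", "log_prior": -3.0, "log_lik": -14.0},
]

def approximate_posterior(cands):
    """Normalize joint scores into a discrete approximation to
    P(program, parameters | image) over the candidate set."""
    joints = [c["log_prior"] + c["log_lik"] for c in cands]
    m = max(joints)  # log-sum-exp trick for numerical stability
    z = m + math.log(sum(math.exp(j - m) for j in joints))
    return {c["parse"]: math.exp(j - z) for c, j in zip(cands, joints)}

posterior = approximate_posterior(candidates)
# Weights sum to 1 over the retained candidates; the best-scoring
# parse carries the most posterior mass.
```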

PMID22325196 Backpropagation through time and the brain
 Timothy Lillicrap and Adam Santoro
 Backpropagation through time: the 'canonical' expansion of backprop to assign credit in recurrent neural networks used in machine learning.
 E.g. variable-length rollouts, where the error is propagated many times through the recurrent weight matrix, $W^T$.
 This leads to the exploding or vanishing gradient problem.
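 A quick numerical illustration of the problem (a toy linear recurrence, not from the review): repeatedly multiplying an error vector by $W^T$ shrinks or grows its norm geometrically, depending on the spectral radius of $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 32, 50  # recurrent units, rollout length

def backprop_norms(scale):
    """Norm of an error vector after each of T multiplications by W^T,
    as in BPTT through T steps of a linear recurrence."""
    # Scaled Gaussian matrix: spectral radius roughly equal to `scale`.
    W = scale * rng.standard_normal((N, N)) / np.sqrt(N)
    delta = rng.standard_normal(N)
    norms = []
    for _ in range(T):
        delta = W.T @ delta
        norms.append(np.linalg.norm(delta))
    return norms

shrinking = backprop_norms(0.5)  # radius < 1: vanishing gradients
growing = backprop_norms(2.0)    # radius > 1: exploding gradients
```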
 TCA = temporal credit assignment. What led to this reward or error? How should memory be adjusted to encourage or avoid it?
 One approach is to simply truncate the error: truncated backpropagation through time (TBPTT). But this of course limits the horizon of learning.
 The brain may do BPTT via replay in both the hippocampus and cortex (Nat. Neuroscience 2007), thereby alleviating the need to retain long time histories of neuron activations (needed for derivatives and credit assignment).
 A less well-known method of TCA uses RTRL (real-time recurrent learning), a forward-mode differentiation: $\partial h_t / \partial \theta$ is computed and maintained online, often with synaptic weight updates applied at each time step in which there is nonzero error. See "A learning algorithm for continually running fully recurrent neural networks."
 Big problem: A network with $N$ recurrent units requires $O(N^3)$ storage and $O(N^4)$ computation at each timestep.
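 A minimal sketch of where that cost comes from, for a toy tanh RNN (hypothetical setup, not from the review): the RTRL sensitivity $\partial h / \partial W$ alone has $N \times N^2$ entries, and its per-step update multiplies it by $W$, giving the $O(N^4)$ compute.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8  # recurrent units

W = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)
h = np.zeros(N)
# RTRL sensitivity dh/dW, flattened to N x (N*N): O(N^3) storage.
S = np.zeros((N, N * N))

def rtrl_step(h, S, x):
    """One forward step of h' = tanh(W h + x), carrying dh/dW along."""
    a = W @ h + x
    h_new = np.tanh(a)
    D = np.diag(1.0 - h_new**2)  # tanh'(a)
    # Direct contribution of each weight W_kl to a_i: delta_ik * h_l.
    direct = np.kron(h.reshape(1, N), np.eye(N))
    # W @ S is an (N, N*N) matmul -- the O(N^4) cost per step.
    S_new = D @ (W @ S + direct)
    return h_new, S_new

for _ in range(5):
    h, S = rtrl_step(h, S, rng.standard_normal(N))
# S now holds the full online gradient of h w.r.t. every recurrent weight.
```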
 Can be addressed with Unbiased Online Recurrent Optimization (UORO), which stores approximate but unbiased gradient estimates to reduce computation and storage.
 Attention seems like a much better way of approaching the TCA problem: past events are stored externally, and the network learns a differentiable attentionalignment module for selecting these events.
 Memory can be of fixed size, expanding, or self-compressing.
 Highlight the utility/necessity of content-addressable memory.
 Attentional gating can eliminate the exploding / vanishing / corrupting gradient problems: the gradient paths are skip connections.
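 A toy sketch of such a content-addressable read (the memory patterns and sizes are made up for illustration): the gradient of a loss on the read-out reaches an attended past event through a single weighted connection, not through $T$ applications of $W^T$.

```python
import numpy as np

def attention_read(query, memories):
    """Content-addressable read: softmax over query-memory similarity,
    then a weighted sum of the stored memories."""
    scores = memories @ query / np.sqrt(query.size)
    scores -= scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ memories, weights

# Toy store: 10 past events as distinct patterns; cue with event 3.
memories = 3.0 * np.eye(10, 16)
read, weights = attention_read(memories[3], memories)
# The matching memory gets the largest attention weight, so any
# gradient on `read` flows straight back to it -- a skip connection.
```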
 Biologically plausible: partial reactivation of CA3 memories induces reactivation of the neocortical neurons responsible for the initial encoding (PMID15685217, The organization of recent and remote memories, 2005).
 I remain reserved about the utility of thinking in terms of gradients when describing how the brain learns. Correlations, yes; causation, absolutely; credit assignment, for sure. Yet propagating gradients as a means for changing network weights seems at best a part of the puzzle. So much of behavior and internal cognitive life involves explicit, conscious computation of cause and credit.
 This leaves me much more sanguine about the use of external memory to guide behavior ... but differentiable attention? Hmm.
