m8ta

{1570}  
Kickback cuts Backprop's red-tape: Biologically plausible credit assignment in neural networks
Bit of a meh idea: rather than propagating error signals backwards through a hierarchy, you propagate only one layer and use a signed global reward signal. This works by keeping the network 'coherent': positive neurons have positive input weights, and negative neurons have negative weights, such that the overall effect of a weight change does not change sign when propagated forward through the network. This is kind of a lame shortcut, imho, as it limits the types of functions that the network can model and the computational structure of the network. This is already quite limited by the common dot-product-rectifier structure (as is used here). Much more interesting and possibly necessary (given much deeper architectures now) is to allow units to change sign. (Open question as to whether they actually frequently do!) As such, the model is in the vein of "how do we make backprop biologically plausible by removing features / communication" rather than "what sorts of signals and changes does the brain use to perceive and generate behavior". This is also related to the literature on what ResNets do; what are the skip connections for? Anthropic has some interesting analyses for Transformer architectures, but checking the literature on other ResNets is for another time.
{1544}  
The HSIC Bottleneck: Deep learning without Backpropagation
In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure. The information bottleneck was proposed by Bialek (spikes..) et al in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input: $\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$ where $T_i$ is the hidden representation at layer i (layer output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that $HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$ where $D = \{(x_1,y_1), ... (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$; that is, it's the kernel function applied to all pairs of (vectoral) input variables. $H$ is the centering matrix. The kernel is simply a Gaussian kernel, $k(x,y) = \exp(-\frac{1}{2} \|x - y\|^2 / \sigma^2)$. So, if all the x and y are on average independent, then the inner product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and positive: the trace of a product of two centered, positive semidefinite kernel matrices cannot be negative). In practice they use three different widths for their kernel, and they also center the kernel matrices. But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...
For example, a neural network could calculate and communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks, albeit in a much less intelligible way. Robust Learning with the Hilbert-Schmidt Independence Criterion is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given above nomenclature, $E_X( P_{T_i | X} I(X ; T_i) ) = 0$ (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.) As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)
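To make the estimator in the HSIC notes concrete, here is a minimal NumPy sketch of $(m-1)^{-2} tr(K_X H K_Y H)$ with a Gaussian kernel. It assumes a single kernel width `sigma` (the notes say the paper uses three), and the function names `gaussian_kernel` and `hsic`, as well as the toy data, are made up for illustration; this is not the authors' code.

```python
import numpy as np

def gaussian_kernel(z, sigma=1.0):
    """Gram matrix K_ij = exp(-0.5 * ||z_i - z_j||^2 / sigma^2) over rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    sq_norms = np.sum(z * z, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-0.5 * sq_dists / sigma**2)

def hsic(x, y, sigma=1.0):
    """Empirical HSIC: (m-1)^-2 * tr(K_X H K_Y H), H the centering matrix."""
    K_x = gaussian_kernel(x, sigma)
    K_y = gaussian_kernel(y, sigma)
    m = K_x.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(K_x @ H @ K_y @ H)) / (m - 1) ** 2

# Toy data (hypothetical): y_dep depends on x nonlinearly, y_ind does not.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = x**2 + 0.1 * rng.normal(size=(200, 1))
y_ind = rng.normal(size=(200, 1))

hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)
print("dependent:", hsic_dep, "independent:", hsic_ind)
```

Note that `y_dep` is uncorrelated with `x` but still dependent on it, which is the kind of relationship a kernel measure can pick up and a plain covariance cannot.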
{1543} 
ref: 2019
tags: backprop neural networks deep learning coordinate descent alternating minimization
date: 07-21-2021 03:07 gmt
revision:1
[0] [head]


Beyond Backprop: Online Alternating Minimization with Auxiliary Variables
This is interesting in that the weight updates can be done in parallel (perhaps more efficient), but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to autodiff + backprop, I can't see this being adopted broadly. That said, the idea of alternating minimization (which is used e.g. for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.
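As a reminder of what alternating minimization looks like in its simplest form, here is a small NumPy sketch on a toy problem: factor a matrix as $X \approx A B$ by solving for one factor while the other is held fixed. This is not the paper's algorithm (no network, no auxiliary 'codes', no online memory variables); the function name `alternating_least_squares` and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def alternating_least_squares(X, rank, n_iters=20, seed=0):
    """Minimize ||X - A B||^2 by alternating exact least-squares solves."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(size=(n, rank))
    B = rng.normal(size=(rank, d))
    losses = [float(np.sum((X - A @ B) ** 2))]  # loss at the random init
    for _ in range(n_iters):
        # Hold B fixed; the problem in A is linear least squares:
        # A B = X  <=>  B^T A^T = X^T
        A = np.linalg.lstsq(B.T, X.T, rcond=None)[0].T
        # Hold A fixed; solve for B the same way.
        B = np.linalg.lstsq(A, X, rcond=None)[0]
        losses.append(float(np.sum((X - A @ B) ** 2)))
    return A, B, losses

# Synthetic rank-2 matrix plus a little noise (made-up data).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 12))
X += 0.01 * rng.normal(size=X.shape)

A, B, losses = alternating_least_squares(X, rank=2)
print("initial loss:", losses[0], "final loss:", losses[-1])
```

Each sub-problem is convex and solved exactly, so the loss cannot go up from one sweep to the next, even though the joint problem in (A, B) is not convex.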
{1455}  
Conducting credit assignment by aligning local distributed representations
Lit review.
 
{1453}  
PMID22325196 Backpropagation through time and the brain
 
{1426}  
Training neural networks with local error signals
 
{1423}  
PMID27824044 Random synaptic feedback weights support error backpropagation for deep learning.
Our proof says that weights $W_0$ and $W$ evolve to equilibrium manifolds, but simulations (Fig. 4) and analytic results (Supplementary Proof 2) hint at something more specific: that when the weights begin near 0, feedback alignment encourages $W$ to act like a local pseudoinverse of $B$ around the error manifold. This fact is important because if $B$ were exactly $W^+$ (the Moore-Penrose pseudoinverse of $W$), then the network would be performing Gauss-Newton optimization (Supplementary Proof 3). We call this update rule for the hidden units pseudobackprop and denote it by $\Delta h_{PBP} = W^+ e$. Experiments with the linear network show that the angle $\Delta h_{FA} \angle \Delta h_{PBP}$ quickly becomes smaller than $\Delta h_{FA} \angle \Delta h_{BP}$ (Fig. 4b, c; see Methods). In other words feedback alignment, despite its simplicity, displays elements of second-order learning.
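The alignment effect described in the quote can be poked at with a toy linear network. The sketch below is an assumption-laden illustration, not the paper's experiment: the sizes, learning rate, linear teacher `T`, and the helper `cosine` are all made up. The hidden layer is trained with the fixed random matrix `B` in place of `W.T`, and the script tracks how well `W.T` lines up with `B`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, n_samples = 10, 20, 5, 100

# Made-up regression task: targets come from a random linear teacher T.
X = rng.normal(size=(n_in, n_samples))
T = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
Y_target = T @ X

# Forward weights start near zero; B is the fixed random feedback matrix.
W0 = 0.01 * rng.normal(size=(n_hid, n_in))
W = 0.01 * rng.normal(size=(n_out, n_hid))
B = rng.uniform(-0.5, 0.5, size=(n_hid, n_out))

def cosine(a, b):
    """Cosine of the angle between two arrays, treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

lr = 0.01
losses, alignments = [], []
for step in range(2000):
    H = W0 @ X                      # hidden activity (linear units)
    E = W @ H - Y_target            # output error, prediction minus target
    losses.append(float(0.5 * np.mean(np.sum(E**2, axis=0))))
    alignments.append(cosine(W.T, B))
    dH_fa = B @ E                   # feedback alignment; backprop would use W.T @ E
    W -= lr * (E @ H.T) / n_samples
    W0 -= lr * (dH_fa @ X.T) / n_samples

print("loss:", losses[0], "->", losses[-1])
print("cos(W^T, B):", alignments[0], "->", alignments[-1])
```

A positive cosine between `W.T` and `B` means the random feedback signal `B @ E` points within 90 degrees of the backprop signal `W.T @ E` on average, which is why the error still goes down.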
{1422}  
PMID29205151 Towards deep learning with segregated dendrites https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716677/
 
{699} 
ref: Harris2008.03
tags: retroaxonal retrosynaptic Harris learning cortex backprop
date: 12-07-2011 02:34 gmt
revision:2
[1] [0] [head]


PMID18255165[0] Stability of the fittest: organizing learning through retroaxonal signals
____References____
 
{862} 
ref: 0
tags: backpropagation cascade correlation neural networks
date: 12-20-2010 06:28 gmt
revision:1
[0] [head]


The Cascade-Correlation Learning Architecture
 
{634} 
ref: Rózsa2008.01
tags: nAChR nicotinic acetylcholine receptor interneurons backpropagating LTP hippocampus
date: 10-08-2008 17:37 gmt
revision:0
[head]


PMID18215234[0] Dendritic nicotinic receptors modulate backpropagating action potentials and long-term plasticity of interneurons.
____References____  