Local synaptic learning rules suffice to maximize mutual information in a linear network
 Ralph Linsker, 1992.
 A development of {1545}, this time with lateral inhibition trained through noise contrast and anti-Hebbian plasticity.
 {1545} does not perfectly maximize the mutual information between the input and output; this allegedly requires the inverse of the covariance matrix, $Q$.
 As before, infomax principles: maximize the mutual information $MI \propto H(Z) - H(Z|S)$, where $Z$ is the network output and $S$ is the signal input. (Note: the second term says to minimize the conditional entropy of the output given the signal.)
 For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$, where $Q$ is the covariance matrix; in this case $Q = E[Z Z^T]$.
 Since $Z = C(S + N)$ plus output noise, where $C$ is the weight matrix, $S$ the signal, and $N$ the input noise, $Q = C q C^T + r$, where $q$ is the covariance matrix of the input (signal plus noise) and $r$ is the cov. mtx. of the output noise.
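A quick sampling check of $Q = CqC^T + r$, in pure Python with illustrative 2-D numbers (not from the paper); $q$ and $r$ are taken diagonal for simplicity:

```python
import math
import random

random.seed(0)

# Sampling check of Q = C q C^T + r (illustrative numbers, not from the paper).
C = [[1.0, 0.5],
     [-0.3, 2.0]]
q_diag = [1.0, 4.0]    # covariance of the input (signal plus noise), diagonal
r_diag = [0.25, 0.25]  # covariance of the output noise, diagonal

N = 100_000
acc = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(N):
    x = [random.gauss(0, math.sqrt(v)) for v in q_diag]   # input sample
    nu = [random.gauss(0, math.sqrt(v)) for v in r_diag]  # output noise
    z = [C[i][0] * x[0] + C[i][1] * x[1] + nu[i] for i in range(2)]
    for i in range(2):
        for j in range(2):
            acc[i][j] += z[i] * z[j]

Q_emp = [[acc[i][j] / N for j in range(2)] for i in range(2)]
# Analytic Q = C q C^T + r with diagonal q and r
Q_th = [[sum(C[i][k] * q_diag[k] * C[j][k] for k in range(2))
         + (r_diag[i] if i == j else 0.0)
         for j in range(2)] for i in range(2)]
```

The empirical output covariance matches the analytic expression up to sampling error.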
 (somewhat confusing): $\partial H / \partial C = Q^{-1} C q$
 because the derivative of the determinant is complicated to take directly.
 Check the appendix for the derivation: $\ln \det Q = \operatorname{Tr} \ln Q$, so $dH = \frac{1}{2} d(\operatorname{Tr} \ln Q) = \frac{1}{2} \operatorname{Tr}(Q^{-1} dQ)$; this holds for positive-definite matrices like $Q$.
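The identity $dH = \frac{1}{2}\operatorname{Tr}(Q^{-1} dQ)$ can be sanity-checked by finite differences on a small positive-definite matrix; a pure-Python sketch with made-up numbers:

```python
import math

# Finite-difference check of d(ln det Q) = Tr(Q^{-1} dQ) on an
# illustrative 2x2 positive-definite Q.
def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d],
            [-A[1][0] / d, A[0][0] / d]]

def trace_prod(A, B):  # Tr(A @ B)
    return sum(A[i][k] * B[k][i] for i in range(2) for k in range(2))

Q = [[2.0, 0.3],
     [0.3, 1.0]]        # positive definite (det = 1.91 > 0)
dQ = [[1e-6, 2e-6],
      [2e-6, -1e-6]]    # small symmetric perturbation

Qp = [[Q[i][j] + dQ[i][j] for j in range(2)] for i in range(2)]
lhs = math.log(det2(Qp)) - math.log(det2(Q))  # d(ln det Q)
rhs = trace_prod(inv2(Q), dQ)                 # Tr(Q^{-1} dQ)
```

The two sides agree to first order in the perturbation.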
 From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation.
 The lateral connections have weight matrix $F = I - \alpha Q$ (again, $Q$ is the cov. mtx. of $Z$). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where $Y$ is the feedforward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:
x = randn(1000, 10);
Q = x' * x;            % (unnormalized) sample covariance
a = 0.001;
Y = randn(10, 1);
y = zeros(10, 1);
for i = 1:1000
    y = Y + (eye(10) - a*Q)*y;  % y(t+1) = Y + F*y(t), with F = I - a*Q
end
y - pinv(Q)*Y / a  % should be ~zero
 This recursive definition is a Jacobi iteration: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$ (the geometric series converges when the eigenvalues of $F$ lie inside the unit circle).
 Still, you need to estimate $Q$ through a running average, $\Delta Q_{nm} = \frac{1}{M}(Y_n Y_m + r_{nm} - Q_{nm})$, and since $F = I - \alpha Q$, $F$ is formed via anti-Hebbian terms.
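A minimal sketch of that running-average estimate (pure Python, hypothetical numbers): applying $\Delta Q_{nm} = \frac{1}{M}(Y_n Y_m + r_{nm} - Q_{nm})$ with fresh samples makes $Q$ settle near $E[YY^T] + r$.

```python
import random

random.seed(1)

# Running-average estimate of Q: dQ_nm = (1/M)(Y_n Y_m + r_nm - Q_nm).
# The true covariance of Y here is diag(1, 4), so Q should settle
# near diag(1, 4) + r. All numbers are illustrative.
M = 500.0
r = [[0.1, 0.0],
     [0.0, 0.1]]
Q = [[0.0, 0.0],
     [0.0, 0.0]]

for _ in range(50_000):
    Y = [random.gauss(0, 1.0), random.gauss(0, 2.0)]  # cov = diag(1, 4)
    for n in range(2):
        for m in range(2):
            Q[n][m] += (Y[n] * Y[m] + r[n][m] - Q[n][m]) / M
```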
To this is added a 'sensing' learning phase and a 'noise' unlearning phase: one maximizes $H(Z)$, the other minimizes $H(Z|S)$. Everything is then applied, similar to before, to Gaussian-filtered one-dimensional white-noise stimuli. He shows this results in band-pass filter behavior; quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even if this was 1992 (three decades ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the following papers? Olshausen & Field came out in 1996, and they applied their algorithm to real images.
In both Olshausen & Field and this work, no affordances are made for multiple layers. There have to be solutions out there...

Self-organization in a perceptual network
 Ralph Linsker, 1988.
 One of the first (verbose, slightly diffuse) investigations of how linear projection neurons (i.e. dot-product; no nonlinearity) can express useful tuning functions.
 'Useful' here means information-preserving, in the face of noise or dimensional bottlenecks (like PCA).
 Starts with Hebbian learning rules, and shows that these + white-noise sensory input + some local topology give you simple and complex visual cell responses.
 Ralph notes that neurons in primate visual cortex are tuned in utero, prior to real-world visual experience! Wow. (Who did these studies?)
 This is a very minimalistic starting point; there aren't even structured stimuli (!)
 Single neurons (and later, multiple neurons) are purely feedforward; the author cautions that the lack of feedback is not biologically realistic.
 Also note that this was back in the Motorola 680x0 days ... computers were not that powerful (but certainly could handle more than 12 neurons!)
 Linear algebra shows that Hebbian synapses cause a linear layer to learn the covariance function of their inputs, $Q$ , with no dependence on the actual layer activity.
 When looked at in terms of an energy function, this is equivalent to gradient ascent that maximizes the layer-output variance.
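That variance-gradient claim can be checked numerically: averaged over inputs, the Hebbian update $\langle x\,(w \cdot x)\rangle$ equals $Qw$, the gradient of $\frac{1}{2} w^T Q w$. A pure-Python sketch with an illustrative diagonal $Q$ (numbers are mine, not the paper's):

```python
import random

random.seed(3)

# Averaged Hebbian update <x (w.x)> equals Q w, the gradient of the
# output variance (1/2) w^T Q w. Illustrative 2-D case with Q = diag(4, 1).
w = [0.3, 0.8]
N = 100_000
upd = [0.0, 0.0]
for _ in range(N):
    x = [random.gauss(0, 2.0), random.gauss(0, 1.0)]  # cov = diag(4, 1)
    y = w[0] * x[0] + w[1] * x[1]                     # linear output
    upd[0] += x[0] * y / N
    upd[1] += x[1] * y / N
# Analytically, Q w = [4 * 0.3, 1 * 0.8] = [1.2, 0.8]
```

Note the update depends only on the input statistics through $Q$, matching the "no dependence on the actual layer activity" point above.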
 He also hits on:
 Hopfield networks,
 PCA,
 Oja's constrained Hebbian rule $\delta w_i \propto \langle L_2 (L_{1,i} - L_2 w_i) \rangle$ (that is, a quadratic constraint on the weights that keeps $\sum_i w_i^2 \approx 1$)
 Optimal linear reconstruction in the presence of noise
 Mutual information between layer input and output (I found this to be a bit handwavey)
 Yet he notes critically: "but it is not true that maximum information rate and maximum activity variance coincide when the probability distribution of signals is arbitrary".
 Indeed. The world is characterized by very nonGaussian structured sensory stimuli.
 Redundancy and diversity in a 2-neuron coding model.
 Role of infomax in maximizing the determinant of the weight matrix, sorta.
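Oja's rule from the list above can be sketched in a few lines of pure Python (illustrative 2-D numbers and learning rate, not from the paper): the weight vector aligns with the high-variance input direction while the decay term keeps it near unit norm.

```python
import math
import random

random.seed(2)

# Oja's rule on 2-D inputs with variances 4 and 0.25: w should align
# with the high-variance direction and end up with norm ~ 1.
w = [0.6, 0.6]
eta = 0.005  # illustrative learning rate
for _ in range(40_000):
    x = [random.gauss(0, 2.0), random.gauss(0, 0.5)]
    y = w[0] * x[0] + w[1] * x[1]            # output L2
    for i in range(2):
        w[i] += eta * y * (x[i] - y * w[i])  # Hebb term + Oja decay

norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
```

The decay term $-y^2 w_i$ is what enforces the quadratic constraint $\sum_i w_i^2 \approx 1$ without any explicit normalization step.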
One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit-assignment problem is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both backprop and infomax are maybe avoiding the problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits, and motivated by statistical reality, to find unsupervised-learning networks?
You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't put a lot of weight on the infomax principle per se. It could be useful, but it is only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient.
Lol. 