 m8ta
{1546} hide / / print ref: -1992 tags: Linsker infomax Hebbian anti-hebbian linear perceptron unsupervised learning date: 08-04-2021 00:20 gmt revision:2   [head]

Ralph Linsker, 1992.

A development upon {1545} -- this time with lateral inhibition trained through noise-contrast and anti-Hebbian plasticity. {1545} does not perfectly maximize the mutual information between the input and output -- doing so allegedly requires the inverse of the covariance matrix, $Q$.

As before, infomax principles: maximize the mutual information $MI \propto H(Z) - H(Z|S)$, where Z is the network output and S is the signal input. (Note: minimize the conditional entropy of the output given the input.) For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$, where Q is the covariance matrix. In this case $Q = E[Z Z^T]$; since $Z = C(S + N)$, where C are the weights, S is the signal, and N is the noise, $Q = C q C^T + r$, where q is the covariance matrix of the input and r is the covariance matrix of the output noise.

(Somewhat confusing): $\delta H / \delta C = Q^{-1} C q$, because ... the derivative of the determinant is complicated. Check the appendix for the derivation: $\ln \det Q = Tr \ln Q$, so $dH = \frac{1}{2} d(Tr \ln Q) = \frac{1}{2} Tr(Q^{-1} dQ)$ -- this holds for positive-definite matrices like Q.

From this he derives a set of rules whereby the feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation. The lateral activation has weight matrix $F = I - \alpha Q$ (again, Q is the covariance matrix of Z). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where Y is the feed-forward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:

  x = randn(1000, 10);
  Q = x' * x;
  a = 0.001;
  Y = randn(10, 1);
  y = zeros(10, 1);
  for i = 1:1000
      y = Y + (eye(10) - a*Q)*y;
  end
  y - pinv(Q)*Y / a   % should be ~zero

This recursive definition is from Jacobi: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
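The appendix identity can likewise be checked numerically. A minimal Python/numpy sketch (my own check, not from the paper; all variable names are illustrative), verifying that $\frac{1}{2} Tr(Q^{-1} dQ)$ matches a direct finite difference of $H = \frac{1}{2} \ln \det Q$:

```python
import numpy as np

# Numerical check (not from the paper) of the identity
# dH = 1/2 Tr(Q^{-1} dQ) for H = 1/2 ln det Q,
# with Q a positive-definite covariance estimate.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 10))
Q = x.T @ x / 1000                      # positive-definite, eigenvalues ~1

def H(M):
    return 0.5 * np.log(np.linalg.det(M))

# small symmetric perturbation dQ
dQ = rng.standard_normal((10, 10)) * 1e-6
dQ = (dQ + dQ.T) / 2

dH_numeric = H(Q + dQ) - H(Q)                       # finite difference
dH_trace = 0.5 * np.trace(np.linalg.solve(Q, dQ))   # 1/2 Tr(Q^{-1} dQ)
err = abs(dH_numeric - dH_trace)                    # second-order small
```

The agreement is to second order in $dQ$, which is why the perturbation is taken tiny.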
Still, you need to estimate Q through a running average, $\Delta Q = \frac{1}{M} (Y_n Y_m + r_{nm} - Q_{nm})$, and since $F = I - \alpha Q$, F is formed via anti-Hebbian terms. To this is added a 'sensing' learning phase and a 'noise' unlearning phase -- one maximizes $H(Z)$, the other minimizes $H(Z|S)$.

Everything is then applied, similar to before, to gaussian-filtered one-dimensional white-noise stimuli. He shows this results in bandpass filter behavior -- quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even though this was 1992 (nearly thirty years ago!), it would have been nice to see the method applied to a more realistic dataset; perhaps some of the following papers did this? Olshausen & Field came out in 1996 -- and they applied their algorithm to real images. In both Olshausen & Field and this work, no affordances are made for multiple layers. There have to be solutions out there...

{1545} hide / / print ref: -1988 tags: Linsker infomax linear neural network hebbian learning unsupervised date: 08-03-2021 06:12 gmt revision:2   [head]

Ralph Linsker, 1988.

One of the first (verbose, slightly diffuse) investigations of the properties of linear projection neurons (e.g. dot-product; no non-linearity) for expressing useful tuning functions. 'Useful' here means information-preserving, in the face of noise or dimensional bottlenecks (like PCA). He starts with Hebbian learning rules, and shows that these + white-noise sensory input + some local topology yield simple- and complex-cell-like visual responses. Linsker notes that neurons in primate visual cortex are tuned in utero -- prior to any real-world visual experience! Wow. (Who did these studies?) This is a very minimalistic starting point; there aren't even structured stimuli (!). The single neuron (and later, multiple neurons) is purely feed-forward; the author cautions that the lack of feedback is not biologically realistic. Also note that this was back in the Motorola 680x0 days ...
computers were not that powerful (but could certainly handle more than 1-2 neurons!).

Linear algebra shows that Hebbian synapses cause a linear layer to learn the covariance function of its inputs, $Q$, with no dependence on the actual layer activity. Viewed in terms of an energy function, this is equivalent to gradient descent that maximizes the layer-output variance.

He also hits on:
- Hopfield networks
- PCA
- Oja's constrained Hebbian rule, $\delta w_i \propto \langle L_2 (L_1 - L_2 w_i) \rangle$ (that is, a quadratic constraint on the weights to keep $\Sigma w^2 \sim 1$)
- Optimal linear reconstruction in the presence of noise
- Mutual information between layer input and output (I found this to be a bit hand-wavy)
- Redundancy and diversity in a 2-neuron coding model
- The role of infomax in maximizing the determinant of the weight matrix, sorta

Yet he notes critically: "but it is not true that maximum information rate and maximum activity variance coincide when the probability distribution of signals is arbitrary". Indeed. The world is characterized by very non-Gaussian, structured sensory stimuli.

One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what the upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit assignment is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both backprop and infomax are maybe avoiding the real problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits and motivated by statistical reality, to find unsupervised-learning networks? You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't per se put a lot of weight on the infomax principle.
It could be useful, but is only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient. Lol.
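Oja's constrained Hebbian rule from the {1545} notes is easy to sketch numerically. A minimal Python/numpy illustration (my own; the input distribution, learning rate, and dimensions are chosen for illustration, not taken from either paper): the weight vector self-normalizes toward $\Sigma w^2 \sim 1$ and aligns with the principal eigenvector of the input covariance.

```python
import numpy as np

# Minimal sketch of Oja's rule dw ~ <y (x - y w)>, with y = w.x.
# Illustrative parameters, not from Linsker's papers.
rng = np.random.default_rng(1)

A = np.diag([3.0, 1.0, 1.0, 1.0])  # input std 3 on axis 0 -> variance 9
w = 0.1 * rng.standard_normal(4)   # small random initial weights
eta = 0.001                        # learning rate

for _ in range(100000):
    x = A @ rng.standard_normal(4) # zero-mean anisotropic Gaussian input
    y = w @ x                      # linear unit output
    w += eta * y * (x - y * w)     # Hebbian term + quadratic decay

norm = np.linalg.norm(w)           # self-normalizes toward 1
alignment = abs(w[0]) / norm       # near 1: aligned with top eigenvector
```

The `y * x` term is plain Hebbian learning; the `-y^2 w` decay is the quadratic constraint that keeps the weights bounded without explicit renormalization.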