{1546}

Local synaptic learning rules suffice to maximize mutual information in a linear network

  • Ralph Linsker, 1992.
  • A development upon {1545} -- this time with lateral inhibition trained through noise-contrast and anti-Hebbian plasticity.
  • {1545} does not perfectly maximize the mutual information between the input and output -- this allegedly requires the inverse of the covariance matrix, $Q$.
    • As before, infomax principles: maximize the mutual information $MI \propto H(Z) - H(Z|S)$, where $Z$ is the network output and $S$ is the signal input. (Note: minimize the conditional entropy of the output given the input.)
    • For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$ (up to an additive constant), where $Q$ is the covariance matrix. In this case $Q = E[Z Z^T]$.
    • Since $Z = C(S + N)$ plus output noise, where $C$ is the weight matrix, $S$ the signal, and $N$ the input noise, $Q = C q C^T + r$, where $q$ is the covariance matrix of the input $(S+N)$ and $r$ is the cov. mtx. of the output noise.
    • (somewhat confusing): $\delta H / \delta C = Q^{-1} C q$
      • because .. the derivative of the determinant is complicated.
      • Check the appendix for the derivation. $\ln \det Q = \text{Tr} \ln Q$ and $dH = \frac{1}{2} d(\text{Tr} \ln Q) = \frac{1}{2} \text{Tr}(Q^{-1} dQ)$ -- this holds for positive-definite matrices like $Q$.
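This gradient can be checked numerically. A minimal sketch (the sizes and covariances below are arbitrary choices of mine, not from the paper), comparing $Q^{-1}Cq$ to a finite-difference gradient of $H = \frac{1}{2}\ln\det(CqC^T + r)$:
% finite-difference check of dH/dC = Q^{-1} C q  for  H = 1/2 ln det(C q C' + r)
n_in = 6; n_out = 4;
A = randn(n_in);  q = A*A' + eye(n_in);    % positive-definite input covariance
B = randn(n_out); r = B*B' + eye(n_out);   % positive-definite output-noise covariance
C = randn(n_out, n_in);
H = @(C) 0.5 * log(det(C*q*C' + r));
Q = C*q*C' + r;
grad_analytic = Q \ (C*q);                 % Q^{-1} C q
grad_numeric = zeros(size(C));
ep = 1e-6;
for i = 1:numel(C)
	Cp = C; Cp(i) = Cp(i) + ep;
	Cm = C; Cm(i) = Cm(i) - ep;
	grad_numeric(i) = (H(Cp) - H(Cm)) / (2*ep);
end
max(abs(grad_analytic(:) - grad_numeric(:))) % should be ~1e-8, i.e. numerically zero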

  • From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation.
  • The lateral activation has a weight matrix $F = I - \alpha Q$ (again, $Q$ is the cov. mtx. of $Z$). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where $Y$ is the feed-forward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:
% check that the lateral recursion y(t+1) = Y + F*y(t), with F = I - a*Q,
% converges to y(inf) = Q^{-1}*Y / a  (requires the eigenvalues of a*Q to lie in (0,2))
x = randn(1000, 10);
Q = x' * x;          % positive-definite covariance-like matrix, eigenvalues ~1000
a = 0.001;           % so a*Q has eigenvalues near 1 and the iteration converges quickly
Y = randn(10, 1);    % feed-forward activation
y = zeros(10, 1);
for i = 1:1000
	y = Y + (eye(10) - a*Q)*y;
end

y - pinv(Q)*Y / a % should be ~zero, confirming alpha*y(inf) = Q^{-1}*Y
  • This recursive definition is from Jacobi: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
  • Still, you need to estimate $Q$ through a running average, $\Delta Q_{nm} = \frac{1}{M}(Y_n Y_m + r_{nm} - Q_{nm})$, and since $F = I - \alpha Q$, $F$ is formed via anti-Hebbian terms.
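A minimal sketch of that running-average estimate and the resulting lateral weights (the sample stream, M, alpha, and r below are stand-ins of mine, not Linsker's simulation values):
% running-average estimate of the output covariance, Qhat -> E[Y*Y'] + r,
% and the lateral weights F = I - alpha*Qhat built from it; the off-diagonal
% entries of F are -alpha*<Y_n*Y_m>, i.e. anti-Hebbian
n_out = 10; M = 100; alpha = 0.001;
r = 0.1 * eye(n_out);              % assumed output-noise covariance
Qhat = eye(n_out);                 % initial estimate
for t = 1:5000
	Y = randn(n_out, 1);           % stand-in for the feed-forward activation
	Qhat = Qhat + (1/M) * (Y*Y' + r - Qhat);   % Delta Q_nm = (Y_n*Y_m + r_nm - Q_nm)/M
	F = eye(n_out) - alpha * Qhat; % lateral (anti-Hebbian) weight matrix
end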

To this is added a 'sensing' learning phase and a 'noise' unlearning phase -- one maximizes $H(Z)$, the other minimizes $H(Z|S)$. Everything is then applied, similar to before, to a Gaussian-filtered one-dimensional white-noise stimulus. He shows this results in bandpass filter behavior -- quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even if this was 1992 (nearly thirty years ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the papers that followed did? Olshausen & Field came out in 1996 -- but they applied their algorithm to real images.
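For concreteness, here is a sketch of that two-phase objective in matrix form -- using the full-matrix gradient rather than Linsker's local rule, with a Gaussian-shaped signal covariance standing in for the filtered white noise; all sizes, noise levels, and the learning rate are my own illustrative choices:
% gradient ascent on MI = H(Z) - H(Z|S) for the linear-Gaussian model:
% the 'sensing' term uses the full input covariance qS + qN, the 'noise'
% term uses the noise covariance qN alone
n_in = 20; n_out = 10; eta = 1e-3; sigma = 2;
t = (1:n_in)';
qS = exp(-(t - t').^2 / (2*sigma^2));         % smooth (gaussian-filtered) signal covariance
qN = 0.1 * eye(n_in);  r = 0.1 * eye(n_out);  % white input noise and output noise
C = 0.1 * randn(n_out, n_in);                 % feed-forward weights
for it = 1:2000
	Q  = C*(qS + qN)*C' + r;                  % output covariance ('sensing' phase)
	QN = C*qN*C' + r;                         % output covariance given the signal ('noise' phase)
	dC = (Q \ (C*(qS + qN))) - (QN \ (C*qN)); % gradient of H(Z) minus gradient of H(Z|S)
	C  = C + eta * dC;
end
% inspect the learned filters, e.g. plot(C') or abs(fft(C, [], 2))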

In both Olshausen & Field's work and this one, no affordances are made for multiple layers. There have to be solutions out there...