ref: -2021 tags: gated multi layer perceptrons transformers ML Quoc_Le Google_Brain date: 08-05-2021 06:00 gmt revision:4 [3] [2] [1] [0] [head]

Pay attention to MLPs

  • Using bilinear / multiplicative gating + deep / wide networks, you can attain similar accuracies to Transformers on vision and masked language modeling tasks! No attention needed, just an in-network multiplicative term.
  • And the math is quite straightforward (a numerical sketch follows this list). Per layer:
    • $Z = \sigma(X U), \quad \hat{Z} = s(Z), \quad Y = \hat{Z} V$
      • Where X is the layer input, $\sigma$ is the nonlinearity (GeLU), U is a weight matrix, $\hat{Z}$ is the spatially-gated Z, and V is another weight matrix.
    • $s(Z) = Z_1 \odot (W Z_2 + b)$
      • Where Z is divided into two parts along the channel dimension, $Z_1$ and $Z_2$; $\odot$ is element-wise multiplication, and W is a weight matrix that acts along the spatial (token) dimension.
  • You of course need a lot of compute; this paper has nice figures of model accuracy scaling vs. depth / number of parameters / size. I guess you can do this if you're Google.
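For concreteness, here is a minimal numerical sketch of the forward pass through one such block, in the same Octave/Matlab style as the snippet further down the page. The dimensions, the small random initializations, and the near-identity gate initialization are illustrative assumptions, not the paper's code:

n = 16;                    % sequence length (tokens / patches) -- illustrative
d = 32;                    % model (channel) dimension -- illustrative
dff = 64;                  % expanded channel dimension after U -- illustrative
X = randn(n, d);           % layer input, one token per row
U = 0.1 * randn(d, dff);   % channel projection in
V = 0.1 * randn(dff/2, d); % channel projection out
W = 0.01 * randn(n, n);    % spatial weight matrix (mixes across tokens), near zero
b = ones(n, 1);            % spatial gating bias, near one
gelu = @(x) 0.5 .* x .* (1 + erf(x ./ sqrt(2)));
Z = gelu(X * U);           % Z = sigma(X U)
Z1 = Z(:, 1:dff/2);        % split Z along the channel dimension
Z2 = Z(:, dff/2+1:end);
Zhat = Z1 .* (W * Z2 + repmat(b, 1, dff/2));  % s(Z) = Z1 .* (W Z2 + b): spatial gating
Y = Zhat * V;              % Y = Zhat V, back to n x d
size(Y)                    % [n d]

With W near zero and b near one, the gate starts out close to the identity, so at initialization the block behaves like a plain two-layer MLP.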

Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess!

ref: -1992 tags: Linsker infomax Hebbian anti-hebbian linear perceptron unsupervised learning date: 08-04-2021 00:20 gmt revision:2 [1] [0] [head]

Local synaptic learning rules suffice to maximize mutual information in a linear network

  • Ralph Linsker, 1992.
  • A development upon {1545} -- this time with lateral inhibition trained through noise-contrast and anti-Hebbian plasticity.
  • {1545} does not perfectly maximize the mutual information between the input and output -- this allegedly requires the inverse of the covariance matrix, $Q$.
    • As before, infomax principles; maximize mutual information $MI \propto H(Z) - H(Z|S)$ where Z is the network output and S is the signal input. (note: minimize the conditional entropy of the output given the input).
    • For a gaussian variable, $H = \frac{1}{2} \ln \det Q$ (plus a constant), where Q is the covariance matrix. In this case $Q = E[Z Z^T]$.
    • Since $Z = C(S,N)$, where C are the weights, S is the signal, and N is the noise, $Q = C q C^T + r$, where q is the covariance matrix of the input noise and r is the cov. mtx. of the output noise.
    • (somewhat confusing): $\delta H / \delta C = Q^{-1} C q$
      • because .. the derivative of the determinant is complicated.
      • Check the appendix for the derivation: $\ln \det Q = Tr \ln Q$ and $dH = \frac{1}{2} d(Tr \ln Q) = \frac{1}{2} Tr(Q^{-1} dQ)$ -- this holds for positive definite matrices like Q.
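A quick numerical check of those two identities, in the same Octave style as the snippet below (the toy dimensions and perturbation size are arbitrary):

x = randn(200, 5);
Q = x' * x / 200;             % a positive definite covariance matrix
log(det(Q)) - trace(logm(Q))  % ln det Q = Tr ln Q: should be ~zero
dQ = 1e-6 * randn(5);
dQ = (dQ + dQ') / 2;          % small symmetric perturbation
dH_true = 0.5 * (log(det(Q + dQ)) - log(det(Q)));
dH_lin  = 0.5 * trace(Q \ dQ);               % dH = 1/2 Tr(Q^{-1} dQ)
dH_true - dH_lin              % should be ~zero, to first order in dQ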

  • From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation.
  • The lateral activation has a weight matrix $F = I - \alpha Q$ (again, Q is the cov. mtx. of Z). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where Y is the feed-forward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:
x = randn(1000, 10);
Q = x' * x;               % (unnormalized) covariance matrix of the input
a = 0.001;                % alpha; small enough that all eig(a*Q) < 2, so the iteration converges
Y = randn(10, 1);         % feed-forward activation
y = zeros(10, 1);
for i = 1:1000
	y = Y + (eye(10) - a*Q)*y;   % y(t+1) = Y + F*y(t), with F = I - a*Q
end

y - pinv(Q)*Y / a % should be ~zero
  • This recursive definition is from Jacobi: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
  • Still, you need to estimate Q through a running average, $\Delta Q_{nm} = \frac{1}{M}( Y_n Y_m + r_{nm} - Q_{nm} )$, and since $F = I - \alpha Q$, F is formed via anti-Hebbian terms.
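A small sketch of what that running-average estimate and the resulting anti-Hebbian lateral weights could look like, again in Octave; the constants (M, the noise term r) and the stand-in input statistics are made-up illustration values, not Linsker's:

d = 10; M = 100; a = 0.001; r = 0.01;   % illustrative constants
Q = zeros(d);                 % running estimate of the output covariance
for t = 1:5000
	Y = randn(d, 1);             % stand-in for the feed-forward activation on one trial
	Q = Q + (Y*Y' + r*eye(d) - Q) / M;   % dQ_nm = (Y_n Y_m + r_nm - Q_nm)/M, elementwise
end
F = eye(d) - a*Q;             % lateral weights; the -a*Q part is the anti-Hebbian term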

To this are added a 'sensing' learning phase and a 'noise' unlearning phase -- one maximizes $H(Z)$, the other minimizes $H(Z|S)$. Everything is then applied, as before, to gaussian-filtered one-dimensional white-noise stimuli. He shows this results in bandpass filter behavior -- quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even if this was 1992 (nearly thirty years ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the following papers did? Olshausen & Field came out in 1996 -- but they applied their algorithm to real images.

In both Olshausen & Field's work and this one, no affordances are made for multiple layers. There have to be solutions out there...

ref: bookmark-0 tags: neural_networks machine_learning matlab toolbox supervised_learning PCA perceptron SOM EM date: 0-0-2006 0:0 revision:0 [head]

http://www.ncrg.aston.ac.uk/netlab/index.php n.b. kinda old. (or does that just mean well established?)