Pay attention to MLPs
 Using bilinear / multiplicative gating plus deep / wide networks, you can attain accuracies similar to Transformers on vision and masked language modeling tasks! No attention needed, just an in-network multiplicative term.
 And the math is quite straightforward. Per layer:
 $Z = \sigma(X U), \quad \hat{Z} = s(Z), \quad Y = \hat{Z} V$
 Where $X$ is the layer input, $\sigma$ is the nonlinearity (GELU), $U$ is a weight matrix, $\hat{Z}$ is the spatially-gated $Z$, and $V$ is another weight matrix.
 $s(Z) = Z_1 \odot (W Z_2 + b)$
 Where $Z$ is divided into two parts, $Z_1$ and $Z_2$, along the channel dimension, $\odot$ is elementwise multiplication, and $W$ is a weight matrix acting along the spatial (token) dimension.
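 A toy sketch of one such block in Octave/MATLAB (the sizes, initializations, and variable names are my own assumptions, not the paper's):
n = 16; d = 8; d_ff = 12;                  % tokens, model width, hidden width (toy sizes)
X = randn(n, d);                           % layer input, one row per token
U = randn(d, 2*d_ff); V = randn(d_ff, d);  % channel-mixing weights
W = 0.01 * randn(n, n); b = ones(n, 1);    % spatial gating weights (paper starts W near zero, b near one)
gelu = @(x) 0.5 * x .* (1 + erf(x / sqrt(2)));
Z = gelu(X * U);                           % Z = sigma(X U)
Z1 = Z(:, 1:d_ff); Z2 = Z(:, d_ff+1:end);  % split along the channel dimension
Zhat = Z1 .* (W * Z2 + b);                 % s(Z) = Z1 .* (W Z2 + b), gating across tokens
Y = Zhat * V;                              % Y = Zhat V, back to (n x d)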

 You of course need a lot of compute; this paper has nice figures of model accuracy scaling vs. depth / number of parameters / size. I guess you can do this if you're Google.
Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess! 
Local synaptic learning rules suffice to maximize mutual information in a linear network
 Ralph Linsker, 1992.
 A development of {1545}, this time with lateral inhibition trained through noise contrast and anti-Hebbian plasticity.
 {1545} does not perfectly maximize the mutual information between the input and output; doing so allegedly requires the inverse of the output covariance matrix, $Q$.
 As before, infomax principles: maximize the mutual information $MI \propto H(Z) - H(Z|S)$, where $Z$ is the network output and $S$ is the signal input. (That is: maximize the output entropy while minimizing the conditional entropy of the output given the signal.)
 For a Gaussian variable, $H = \frac{1}{2} \ln \det Q$ (up to an additive constant), where $Q$ is the covariance matrix; in this case $Q = E[Z Z^T]$.
 Since $Z$ depends linearly on the signal $S$ and the noise $N$ through the weights $C$, $Q = C q C^T + r$, where $q$ is the covariance matrix of the input noise and $r$ is the covariance matrix of the output noise.
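 Spelling that covariance step out (my notation: take the output to be $Z = C\xi + \nu$, with input $\xi$ of covariance $q$ and independent output noise $\nu$ of covariance $r$):
 $Q = E[Z Z^T] = E[(C\xi + \nu)(C\xi + \nu)^T] = C\, E[\xi \xi^T]\, C^T + E[\nu \nu^T] = C q C^T + r$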
 (somewhat confusing): $\partial H / \partial C = Q^{-1} C q$
 because the derivative of the determinant is complicated.
 Check the appendix for the derivation: $\ln \det Q = \mathrm{Tr} \ln Q$, and $dH = \frac{1}{2} d(\mathrm{Tr} \ln Q) = \frac{1}{2} \mathrm{Tr}(Q^{-1} dQ)$; this holds for positive definite matrices like $Q$.
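 Filling in that step (my own derivation, not quoted from the appendix; uses $Q = C q C^T + r$ with $q$ and $Q$ symmetric):
 $dQ = dC\, q\, C^T + C\, q\, dC^T$
 $dH = \tfrac{1}{2} \mathrm{Tr}(Q^{-1} dQ) = \tfrac{1}{2}\left[ \mathrm{Tr}(q C^T Q^{-1}\, dC) + \mathrm{Tr}(Q^{-1} C q\, dC^T) \right] = \mathrm{Tr}\left( (Q^{-1} C q)^T dC \right)$
 which is exactly $\partial H / \partial C = Q^{-1} C q$.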
 From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation.
 The lateral activation has a weight matrix $F = I - \alpha Q$ (again $Q$ is the covariance matrix of $Z$). If $y(0) = Y$ and $y(t+1) = Y + F y(t)$, where $Y$ is the feedforward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:
x = randn(1000, 10);       % 1000 samples of a 10-dimensional input
Q = x' * x;                % (unnormalized) covariance matrix
a = 0.001;                 % alpha; must be small enough that I - a*Q is a contraction
Y = randn(10, 1);          % feedforward activation
y = zeros(10, 1);
for i = 1:1000
    y = Y + (eye(10) - a*Q) * y;   % y(t+1) = Y + F y(t), with F = I - alpha*Q
end
y - pinv(Q)*Y / a          % should be ~zero: alpha * y(inf) = Q^-1 * Y
 This recursive definition is from Jacobi iteration: $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$.
 Still, you need to estimate $Q$ through a running average, $\Delta Q_{nm} = \frac{1}{M}( Y_n Y_m + r_{nm} - Q_{nm} )$, and since $F = I - \alpha Q$, $F$ is formed via anti-Hebbian terms.
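 A toy sketch of that estimator (again my own code; the averaging constant, noise covariance, and stand-in activity are assumptions, not the paper's):
M = 100;                      % averaging time constant (assumed)
r = 0.01 * eye(10);           % assumed output-noise covariance
Qest = zeros(10);             % running estimate of the output covariance
a = 0.001;
for t = 1:10000
    Y = randn(10, 1);                    % stand-in for the feedforward activation
    Qest = Qest + (Y*Y' + r - Qest)/M;   % Delta Q_nm = (Y_n Y_m + r_nm - Q_nm)/M
end
F = eye(10) - a*Qest;         % lateral weights: anti-Hebbian in the output activity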
To this is added a 'sensing' learning phase and a 'noise' unlearning phase; one optimizes $H(Z)$, the other minimizes $H(Z|S)$. Everything is then applied, similar to before, to Gaussian-filtered, one-dimensional white-noise stimuli. He shows this results in band-pass filter behavior; quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even if this was 1992 (nearly thirty years ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the following papers? Olshausen & Field came out in 1996, but they applied their algorithm to real images.
In both Olshausen & Field's work and this one, no affordances are made for multiple layers. There have to be solutions out there...
