Pay attention to MLPs
 Using bilinear / multiplicative gating + deep / wide networks, you can attain similar accuracies to Transformers on vision and masked language modeling tasks! No attention needed, just an in-network multiplicative term.
 And the math is quite straightforward. Per layer:
 $Z = \sigma(X U), \quad \hat{Z} = s(Z), \quad Y = \hat{Z} V$
 Where $X$ is the layer input, $\sigma$ is the nonlinearity (GeLU), $U$ is a weight matrix, $\hat{Z}$ is the spatially gated $Z$, and $V$ is another weight matrix.
 $s(Z) = Z_1 \odot (W Z_2 + b)$
 Where $Z$ is split into two parts, $Z_1$ and $Z_2$, along the channel dimension, $\odot$ is elementwise multiplication, and $W$ is a weight matrix acting on the spatial (token) dimension.
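 The layer above is simple enough to sketch in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the shapes, initializations, and GeLU approximation are my assumptions, though the near-zero $W$ / near-one $b$ initialization (so each block starts close to an identity gate) does follow the paper's description.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def spatial_gating_unit(Z, W, b):
    # s(Z) = Z1 * (W Z2 + b): split Z along the channel dimension,
    # then mix Z2 across the *spatial* (token) dimension with W.
    Z1, Z2 = np.split(Z, 2, axis=-1)
    return Z1 * (W @ Z2 + b)

def gmlp_block(X, U, V, W, b):
    Z = gelu(X @ U)                        # Z = sigma(X U)
    Z_hat = spatial_gating_unit(Z, W, b)   # Z_hat = s(Z)
    return Z_hat @ V                       # Y = Z_hat V

# Toy dimensions (my choice): 4 tokens, model width 8, hidden width 16.
n, d, d_ffn = 4, 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
U = rng.normal(size=(d, d_ffn))
V = rng.normal(size=(d_ffn // 2, d))       # half width, since Z was split in two
W = np.zeros((n, n))                       # spatial weights start near zero
b = np.ones((n, 1))                        # bias near one: gate starts as identity

Y = gmlp_block(X, U, V, W, b)
print(Y.shape)  # (4, 8) -- same shape as X, so blocks can be stacked
```

 Note that $W$ is $n \times n$ in the token dimension, which is where the attention-like cross-token interaction happens, while $U$ and $V$ are ordinary per-token channel projections.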

 You of course need a lot of compute; the paper has nice figures of model accuracy scaling vs. depth, number of parameters, and model size. I guess you can do this if you're Google.
Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess! 