You are not authenticated, login.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -2021 tags: gated multi layer perceptrons transformers ML Quoc_Le Google_Brain date: 08-05-2021 06:00 gmt revision:4 [3] [2] [1] [0] [head]

Pay attention to MLPs

  • Using bilinear / multiplicative gating + deep / wide networks, you can attain similar accuracies as Transformers on vision and masked language learning tasks! No attention needed, just a in-network multiplicative term.
  • And the math is quite straightforward. Per layer:
    • Z=σ(XU),,Z^=s(Z),,Y=Z^V Z = \sigma(X U) ,, \hat{Z} = s(Z) ,, Y = \hat{Z} V
      • Where X is the layer input, σ\sigma is the nonlinearity (GeLU), U is a weight matrix, Z^\hat{Z} is the spatially-gated Z, and V is another weight matrix.
    • s(Z)=Z 1(WZ 2+b) s(Z) = Z_1 \odot (W Z_2 + b)
      • Where Z is divided into two parts along the channel dimension, Z 1Z 2Z_1 Z_2 . 'circleDot' is element-wise multiplication, and W is a weight matrix.
  • You of course need a lot of compute; this paper has nice figures of model accuracy scaling vs. depth / number of parameters / size. I guess you can do this if you're Google.

Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess!

hide / / print
ref: -0 tags: lillicrap segregated dendrites deep learning backprop date: 01-31-2019 19:24 gmt revision:2 [1] [0] [head]

PMID-29205151 Towards deep learning with segregated dendrites https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716677/

  • Much emphasis on the problem of credit assignment in biological neural networks.
    • That is: given complex behavior, how do upstream neurons change to improve the task of downstream neurons?
    • Or: given downstream neurons, how do upstream neurons receive ‘credit’ for informing behavior?
      • I find this a very limiting framework, and is one of my chief beefs with the work.
      • Spatiotemporal Bayesian structure seems like a much better axis (axes) to cast function against.
      • Or, it could be segregation into ‘signal’ and ‘error’ or ‘figure/ground’ based on hierarchical spatio-temporal statistical properties that matters ...
      • ... with proper integration of non-stochastic spike timing + neoSTDP.
        • This still requires some solution of the credit-assignment problem, i know i know.
  • Outline a spiking neuron model with zero one or two hidden layers, and a segregated apical (feedback) and basal (feedforward) dendrites, as per a layer 5 pyramidal neuron.
  • The apical dendrites have plateau potentials, which are stimulated through (random) feedback weights from the output neurons.
  • Output neurons are forced to one-hot activation at maximum firing rate during training.
    • In order to assign credit, feedforward information must be integrated separately from any feedback signals used to calculate error for synaptic updates (the error is indicated here with δ). (B) Illustration of the segregated dendrites proposal. Rather than using a separate pathway to calculate error based on feedback, segregated dendritic compartments could receive feedback and calculate the error signals locally.
  • Uses the MNIST database, naturally.
  • Poisson spiking input neurons, 784, again natch.
  • Derive local loss function learning rules to make the plateau potential (from the feedback weights) match the feedforward potential
    • This encourages the hidden layer -> output layer to approximate the inverse of the random feedback weight network -- which it does! (At least, the jacobians are inverses of each other).
    • The matching is performed in two phases -- feedforward and feedback. This itself is not biologically implausible, just unlikely.
  • Achieved moderate performance on MNIST, ~ 4%, which improved with 2 hidden layers.
  • Very good, interesting scholarship on the relevant latest findings ‘’in vivo’’.
  • While the model seems workable though ad-hoc or just-so, the scholarship points to something better: use of multiple neuron subtypes to accomplish different elements (variables) in the random-feedback credit assignment algorithm.
    • These small models can be tuned to do this somewhat simple task through enough fiddling & manual (e.g. in the algorithmic space, not weight space) backpropagation of errors.
  • They suggest that the early phases of learning may entail learning the feedback weights -- fascinating.
  • ‘’Things are definitely moving forward’’.