ref: -2021 tags: gated multi layer perceptrons transformers ML Quoc_Le Google_Brain date: 08-05-2021 06:00 gmt

Pay attention to MLPs

  • Using bilinear / multiplicative gating + deep / wide networks, you can attain accuracies similar to Transformers on vision and masked language learning tasks! No attention needed, just an in-network multiplicative term.
  • And the math is quite straightforward (a minimal sketch follows this list). Per layer:
    • $Z = \sigma(XU), \quad \hat{Z} = s(Z), \quad Y = \hat{Z}V$
      • Where $X$ is the layer input, $\sigma$ is the nonlinearity (GELU), $U$ is a weight matrix, $\hat{Z}$ is the spatially-gated $Z$, and $V$ is another weight matrix.
    • $s(Z) = Z_1 \odot (W Z_2 + b)$
      • Where $Z$ is divided into two parts, $Z_1$ and $Z_2$, along the channel dimension; $\odot$ is element-wise multiplication, and $W$ is a weight matrix.
  • You of course need a lot of compute; this paper has nice figures of model accuracy scaling vs. depth / number of parameters / size. I guess you can do this if you're Google.
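A minimal numpy sketch of one such layer, assuming token-major shapes (X is [n_tokens, d_model] and the spatial gating matrix W acts across tokens); all names here are mine, not the paper's:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_layer(X, U, V, W, b):
    """X: [n, d]; U: [d, d_ffn]; W: [n, n]; b: [n, 1]; V: [d_ffn // 2, d]."""
    Z = gelu(X @ U)                    # Z = sigma(X U)
    Z1, Z2 = np.split(Z, 2, axis=1)    # split along the channel dimension
    Z_hat = Z1 * (W @ Z2 + b)          # s(Z) = Z1 (*) (W Z2 + b): spatial gating
    return Z_hat @ V                   # Y = Z_hat V
```

If I read the paper right, W is initialized near zero and b at one, so each layer starts out close to a plain MLP.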

Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess!

ref: -2018 tags: luke metz meta learning google brain sgd model mnist Hebbian date: 08-05-2021 01:07 gmt

Meta-Learning Update Rules for Unsupervised Representation Learning

  • Central idea: meta-train a training network (an MLP) which trains a task network (also an MLP) to do unsupervised learning on one dataset.
  • The training network is optimized through SGD, based on few-shot linear learning on a test set, typically different from the unsupervised training set.
  • The training network is a per-weight MLP which takes in the layer input, layer output, and a synthetic error (denoted $\eta$), and generates $a$ and $b$, which are then fed into an outer-product Hebbian learning rule (sketched after this list).
  • $\eta$ itself is formed through a backward pass through weights $V$, which affords something like backprop -- but not exactly backprop, of course. See the figure.
  • Training consists of building up very large backward-through-time gradient estimates relative to the parameters of the training network. (And there are a lot of them!)
  • Trained on CIFAR10, MNIST, FashionMNIST, IMDB sentiment prediction. All have their inputs permuted to keep the training network from learning per-task weights. Instead the network should learn to interpret the statistics between datapoints.
  • Indeed, it does this -- albeit with limits. Performance is OK, but only when the few-shot supervised evaluation uses the same very limited datasets as the meta-optimization.
    • For comparison, plain supervised learning can essentially solve tasks like MNIST; this approach reaches about 80% accuracy.
  • Images were kept small -- about 20x20 -- to speed up the inner loop unsupervised learning. Still, this took on the order of 200 hours across ~500 TPUs.
  • See, as a comparison, Keren's paper, Meta-learning biologically plausible semi-supervised update rules. It's conceptually nice but only evaluates the two-moons and two-Gaussians datasets.
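To make the update-rule structure concrete, here is a minimal numpy sketch of the outer-product Hebbian step; update_net is a hypothetical stand-in for the meta-trained per-synapse MLP, and all shapes and names are my assumptions, not the paper's:

```python
import numpy as np

def update_net(pre, post, err, theta):
    # Stand-in for the meta-trained update network: maps the layer input
    # (pre), layer output (post), and synthetic error (eta) to the Hebbian
    # factors a (one per output unit) and b (one per input unit).
    h = np.tanh(theta["W1"] @ np.concatenate([pre, post, err]))
    return theta["Wa"] @ h, theta["Wb"] @ h

def hebbian_step(W, pre, post, err, theta, lr=1e-2):
    a, b = update_net(pre, post, err, theta)
    return W + lr * np.outer(a, b)   # outer-product Hebbian update of W
```

Meta-optimization then differentiates the downstream few-shot performance with respect to theta, backward through many of these steps.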

This is a clearly-written, easy to understand paper. The results are not highly compelling, but as a first set of experiments, it's successful enough.

I wonder what more constraints (fewer parameters, per the genome), more options for architecture modifications (e.g. different feedback schemes, per neurobiology), and a black-box optimization algorithm (evolution) would do?

ref: -2017 tags: google deepmind compositional variational autoencoder date: 04-08-2020 01:16 gmt

SCAN: learning hierarchical compositional concepts

  • From DeepMind, first version Jul 2017 / v3 June 2018.
  • Starts broad and strong:
    • "The seemingly infinite diversity of the natural world from a relatively small set of coherent rules"
      • Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details..
    • "We conjecture that these rules dive rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
    • "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
    • "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication.
    • This addresses the limitations of deep learning, which are overly data hungry (low sample efficiency), tend to overfit the data, and require human supervision.
  • Approach:
    • Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
    • Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the $\beta$-VAE) that the examples have in common.
      • I.e. this is purely associative learning, with a finite one-layer association matrix.
    • Test in both image-to-symbol and symbol-to-image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper..)
    • Add in a third module, which allows learning of compositions of the features, a la set notation: AND ($\cup$), IN-COMMON ($\cap$) & IGNORE ($\setminus$ or '-'). This is via a low-parameter convolutional model.
  • Notation:
    • $q_\phi(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
    • $p_\theta(x|z_x)$ is the decoder model, with $\hat{x} \sim p_\theta(x|z_x)$. $\theta$ are the decoder parameters; $\hat{x}$ is the reconstructed scene.
  • From this, the loss function of the beta-VAE is (a numeric sketch of the KL term follows this list):
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_\phi(z_x|x)}[\log p_\theta(x|z_x)] - \beta D_{KL}(q_\phi(z_x|x) \| p(z_x))$ where $\beta > 1$
      • That is, maximize the autoencoder fit (the expectation of the decoder, over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$
        • $p(z) = \mathcal{N}(0, I)$ -- a diagonal unit normal prior.
        • $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
        • $\max_{\phi,\theta} \mathbb{E}_{x \sim D}[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]]$ subject to $D_{KL}(q_\phi(z|x) \| p(z)) < \epsilon$, where $D$ is the domain of images etc.
      • They claim that this loss function tips the scale too far away from accurate reconstruction once there is sufficient visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising autoencoder (ref below), which uses a feature-space L2 norm instead of the pixel log-likelihood:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_\phi(z_x|x)} \|J(\hat{x}) - J(x)\|_2^2 - \beta D_{KL}(q_\phi(z_x|x) \| p(z_x))$ where $J: \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
      • This $J(x)$ comes from another neural network (transfer learning) which learns the features beforehand.
      • It's a multilayer perceptron denoising autoencoder [Vincent 2010].
  • The SCAN architecture includes an additional element: another VAE, trained simultaneously on the labeled inputs $y$ and the latent outputs $z_x$ from the encoder given $x$.
  • In this way, they can present a description $y$ to the network, which is recomposed into $z_y$, which then produces an image $\hat{x}$.
    • The whole network is trained by minimizing:
    • $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = \text{1st term} - \text{2nd term} - \text{3rd term}$
      • 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y}(y|z_y)]$, the log-likelihood of the decoded symbols given the encoded latents $z_y$
      • 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) \| p(z_y))$, the weighted KL divergence between the encoded latents and the diagonal normal prior.
      • 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) \| q_{\phi_y}(z_y|y))$, the weighted KL divergence between the latents from the image and the latents from the description $y$.
        • They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
  • Final element! A convolutional recombination element, implemented as a tensor product between $z_{y_1}$ and $z_{y_2}$, which outputs a one-hot encoding of the set operation that's fed to a (hardcoded?) transformation matrix.
    • I don't think this is any great shakes. Could have done this with a small function; no need for a neural network.
    • Trained with a loss function very similar to that of SCAN or the beta-VAE.
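As a concrete anchor for the KL term above, a minimal numpy sketch of the beta-VAE objective, assuming the encoder outputs a diagonal Gaussian (mu, logvar) and the reconstruction term is handed in already computed:

```python
import numpy as np

def beta_vae_objective(recon_loglik, mu, logvar, beta=4.0):
    """recon_loglik: E_q[log p(x|z)] (or the -L2 feature-space variant);
    mu, logvar: encoder outputs for q(z|x) = N(mu, diag(exp(logvar))).
    KL(N(mu, sigma^2) || N(0, I)) has the closed form used below."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return recon_loglik - beta * kl   # maximize this; beta > 1
```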

  • Testing:
  • They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implemented easily, e.g. a single-pixel detector for the wall color. Quite disappointing.
  • This is marginally more interesting -- the network learns to eliminate latent factors as it's exposed to examples (much as a Bayesian network would).
  • Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

ref: -2002 tags: hashing frequent items count sketch algorithm google date: 03-30-2020 02:04 gmt

Finding frequent items in data streams

  • Notation:
    • $S$ is a data stream, $S = q_1, q_2, ..., q_n$, of length $n$.
    • Each object $q_i \in O = \{o_1, ..., o_m\}$. That is, there are $m$ total possible objects (e.g. English words).
    • Object $o_i$ occurs $n_i$ times in $S$. The $o_i$ are ordered so that $n_1 \geq n_2 \geq ... \geq n_m$.
  • Task:
    • Given an input stream $S$, integer $k$, and real $\epsilon$:
    • Output a list of $k$ elements from $S$ such that each element has $n_i > (1-\epsilon)n_k$.
      • That is, if the ordering is perfect, $n_i \geq n_k$, with equality on the last element.
  • Algorithm (a minimal sketch follows this list):
    • $h_1, ..., h_t$ hash from object $q$ to buckets $\{1, ..., b\}$
    • $s_1, ..., s_t$ hash from object $q$ to $\{-1, +1\}$
    • For each symbol, add it to the 2D hash array by hashing first with $h_i$, then incrementing that counter by $s_i$.
      • The double hashing is to reduce the effect of collisions with high-frequency items.
    • When querying for the frequency of an object, hash as above, and take the median over $i$ of the counts at $h_i(q)$, each multiplied by $s_i(q)$.
    • $t = O(\log(\frac{n}{\delta}))$, where the algorithm fails with probability at most $\delta$.
  • Demonstrate proof of convergence / correctness on Zipfian distributions with varying exponent. (I did not read through this.)
  • Also showed that it's possible to compare these hash-count arrays directly to see what's changed, or importantly, whether two documents differ.
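A minimal Python sketch of the structure described above; real implementations use pairwise-independent hash families, whereas here Python's built-in hash (stable within one run) stands in for them:

```python
import numpy as np

class CountSketch:
    def __init__(self, t=5, b=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.counts = np.zeros((t, b))
        self.salts = [int(s) for s in rng.integers(1, 2**31, size=t)]
        self.b = b

    def _h(self, q, i):   # bucket hash h_i(q) -> {0, ..., b-1}
        return hash((self.salts[i], 0, q)) % self.b

    def _s(self, q, i):   # sign hash s_i(q) -> {-1, +1}
        return 1 if hash((self.salts[i], 1, q)) % 2 else -1

    def add(self, q, w=1.0):
        for i in range(len(self.salts)):
            self.counts[i, self._h(q, i)] += w * self._s(q, i)

    def estimate(self, q):
        # Median over the t rows de-biases collisions with heavy hitters.
        return np.median([self.counts[i, self._h(q, i)] * self._s(q, i)
                          for i in range(len(self.salts))])
```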

Mission: Ultra large-scale feature selection using Count-Sketches
  • Task:
    • Given a labeled dataset $(X_i, y_i)$ for $i \in \{1, 2, ..., n\}$, with $X_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$
    • Find the k-sparse feature vector / linear regression for the least-squares problem $\min_{\|\beta\|_0 = k} \|y - X\beta\|_2$
      • $\|\beta\|_0 = k$ counts the non-zero elements in the feature vector.
    • The number of features $p$ is so large that a dense $\beta$ cannot be stored in memory. ($X$ is of course sparse.)
  • Such data may be from ad click-throughs, or from genomic analyses ...
  • Use the count-sketch algorithm (above) to capture and continually update the features for the gradient update (a sketch follows this list).
    • That is, treat the stream of gradient updates, $g_i = 2\lambda (y_i - X_i^T \beta_i) X_i$, as the stream $S$ used above.
  • Compare this with greedy thresholding and iterative hard thresholding (IHT), which e.g. throw away gradient information after each batch.
    • This discards small gradients which may be useful for the regression problem.
  • Works better, but not necessarily better than straight feature hashing (FH).
  • Meh.
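A hypothetical sketch of one such update, reusing the CountSketch above to hold the otherwise un-storable parameter vector; a top-k heap over the sketch estimates (not shown) would maintain the active feature set. All names here are mine:

```python
def mission_step(cs, x_idx, x_val, y, lr=0.1):
    """cs: a CountSketch; (x_idx, x_val): the sparse nonzero entries of X_i;
    y: the label. Computes the squared-error gradient on the sparse support
    and accumulates it into the sketch instead of a dense beta."""
    y_hat = sum(v * cs.estimate(j) for j, v in zip(x_idx, x_val))
    resid = y - y_hat
    for j, v in zip(x_idx, x_val):
        cs.add(j, lr * 2.0 * resid * v)   # sketched gradient step on feature j
```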

ref: -2019 tags: lillicrap google brain backpropagation through time temporal credit assignment date: 03-14-2019 20:24 gmt

PMID-22325196 Backpropagation through time and the brain

  • Timothy Lillicrap and Adam Santoro
  • Backpropagation through time: the 'canonical' expansion of backprop to assign credit in recurrent neural networks used in machine learning.
    • E.g. variable roll-outs, where the error is propagated many times through the recurrent weight matrix, $W^T$.
    • This leads to the exploding or vanishing gradient problem (illustrated in the sketch after this list).
  • TCA = temporal credit assignment. What led to this reward or error? How to affect memory to encourage or avoid this?
  • One approach is to simply truncate the error: truncated backpropagation through time (TBPTT). But this of course limits the horizon of learning.
  • The brain may do BPTT via replay in both the hippocampus and cortex (Nat. Neuroscience 2007), thereby alleviating the need to retain long time-histories of neuron activations (needed for derivatives and credit assignment).
  • A less-known method of TCA uses RTRL, real-time recurrent learning, i.e. forward-mode differentiation -- $\delta h_t / \delta \theta$ is computed and maintained online, often with synaptic weight updates applied at each time step in which there is non-zero error. See A learning algorithm for continually running fully recurrent neural networks.
    • Big problem: a network with $N$ recurrent units requires $O(N^3)$ storage and $O(N^4)$ computation at each time-step.
    • Can be mitigated with Unbiased Online Recurrent Optimization, which stores approximate but unbiased gradient estimates to reduce computation / storage.
  • Attention seems like a much better way of approaching the TCA problem: past events are stored externally, and the network learns a differentiable attention-alignment module for selecting these events.
    • Memory can be finite size, extending, or self-compressing.
    • Highlight the utility/necessity of content-addressable memory.
    • Attentional gating can eliminate the exploding / vanishing / corrupting gradient problems -- the gradient paths are skip-connections.
  • Biologically plausible: partial reactivation of CA3 memories induces re-activation of neocortical neurons responsible for initial encoding PMID-15685217 The organization of recent and remote memories. 2005
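An illustration (mine, not the paper's) of the exploding/vanishing point above: under BPTT the error signal reaches T steps back only after T multiplications by $W^T$, so its norm scales roughly like the spectral radius of W raised to the power T:

```python
import numpy as np

def bptt_gradient_norm(W, T):
    # Product of T Jacobian factors W^T, ignoring the nonlinearity's
    # diagonal terms for simplicity.
    J = np.eye(W.shape[0])
    for _ in range(T):
        J = W.T @ J
    return np.linalg.norm(J)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)) * 0.05   # contractive: gradient vanishes
print(bptt_gradient_norm(W, 50))           # vanishingly small; scale W up and it explodes
```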

  • I remain reserved about the utility of thinking in terms of gradients when describing how the brain learns. Correlations, yes; causation, absolutely; credit assignment, for sure. Yet propagating gradients as a means of changing network weights seems at best a part of the puzzle. So much of behavior and internal cognitive life involves explicit, conscious computation of cause and credit.
  • This leaves me much more sanguine about the use of external memory to guide behavior ... but differentiable attention? Hmm.

ref: -2017 tags: attention transformer language model youtube google tech talk date: 02-26-2019 20:28 gmt

Attention is all you need

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  • Attention is all you need neural network models
  • Good summary, along with: The Illustrated Transformer (please refer to this!)
  • Łukasz Kaiser mentions a few times how fragile the network is -- how easy it is to make something that doesn't train at all, or how many tricks by Google experts were needed to make things work properly. It might be bravado or bluffing, but this is arguably not the way that biology fails.
  • Encoding:
  • Input is words encoded as 512-length vectors.
  • Vectors are transformed into length-64 vectors -- query, key, and value -- via differentiable weight matrices.
  • Attention is computed as the dot-product of the query (current input word) with the keys (of the other words); see the single-head sketch after this list.
    • This value is scaled and passed through a softmax function, resulting in one attentional signal that scales the value.
  • Multiple heads' output are concatenated together, and this output is passed through a final weight matrix to produce a final value for the next layer.
    • So, attention in this respect looks like a conditional gain field.
  • 'Final value' above is then passed through a single layer feedforward net, with resnet style jump.
  • Decoding:
  • Use the attentional key value from the encoder to determine the first word through the output encoding (?) Not clear.
  • Subsequent causal decodes depend on the already 'spoken' words, plus the key-values from the encoder.
  • Output is a one-hot softmax layer from a feedforward layer; the sum total is differentiable from input to output using cross-entropy loss or KL divergence.
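A single-head numpy sketch of the encoding-side computation described above, with dimensions as in the notes (512-d tokens projected to 64-d query/key/value); multi-head concatenation and the output projection are omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """X: [n_tokens, 512]; Wq, Wk, Wv: [512, 64]."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-products of queries and keys
    return softmax(scores) @ V               # attention-weighted sum of values
```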

ref: -0 tags: Hinton google tech talk dropout deep neural networks Boltzmann date: 02-12-2019 08:03 gmt

Brains, sex, and machine learning -- Hinton google tech talk.

  • Hinton believes in the power of crowds -- he thinks that the brain fits many, many different models to the data, then selects afterward.
    • Random forests, as used in Predator, are an example of this: they average many simple-to-fit, simple-to-run decision trees. (This is apparently what Kinect does.)
  • Talk focuses on dropout, a clever new form of model averaging where only half of the units in the hidden layers are trained for a given example (a minimal sketch follows this list).
    • He is inspired by biological evolution, where sexual reproduction often spontaneously adds or removes genes, hence individual genes or small linked genes must be self-sufficient. This equates to a 'rugged individualism' of units.
    • Likewise, dropout forces neurons to be robust to the loss of co-workers.
    • This is also great for parallelization: each unit or sub-network can be trained independently, on its own core, with little need for communication! Later, the units can be combined via genetic algorithms then re-trained.
  • Hinton then observes that sending a real value p (output of logistic function) with probability 0.5 is the same as sending 0.5 with probability p. Hence, it makes sense to try pure binary neurons, like biological neurons in the brain.
    • Indeed, if you replace the backpropagation with single bit propagation, the resulting neural network is trained more slowly and needs to be bigger, but it generalizes better.
    • Neurons (allegedly) do something very similar to this by Poisson spiking. Hinton claims this is the right thing to do (rather than sending real numbers via precise spike timing) if you want to robustly fit models to data.
      • Sending stochastic spikes is a very good way to average over the large number of models fit to incoming data.
      • Yes but this really explains little in neuroscience...
  • Paper referred to in intro: Livnat, Papadimitriou and Feldman, PMID-19073912 and later by the same authors PMID-20080594
    • A mixability theory for the role of sex in evolution. -- "We define a measure that represents the ability of alleles to perform well across different combinations and, using numerical iterations within a classical population-genetic framework, show that selection in the presence of sex favors this ability in a highly robust manner"
    • Plus David MacKay's concise illustration of why you need sex, pg 269, __Information theory, inference, and learning algorithms__
      • With rather simple assumptions, asexual reproduction yields 1 bit per generation,
      • whereas sexual reproduction yields $\sqrt{G}$ bits, where $G$ is the genome size.
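A minimal numpy sketch of the dropout idea from the talk. Note this uses the now-standard "inverted" scaling; Hinton's original formulation instead halves the outgoing weights at test time:

```python
import numpy as np

def dropout(h, p_keep=0.5, train=True, rng=np.random.default_rng()):
    if not train:
        return h                         # no-op at test time with inverted scaling
    mask = rng.random(h.shape) < p_keep  # each unit kept with probability p_keep
    return h * mask / p_keep             # rescale so expected activation is unchanged
```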

ref: -0 tags: google glucose sensing contact lens date: 04-28-2016 19:41 gmt

A contact lens with embedded sensor for monitoring tear glucose level

  • PMID-21257302
  • Metal stack: Ti 10 nm / Pd 10 nm / Pt 100 nm,
  • on a 100 µm thick PET film.
  • A 30 µL glucose oxidase solution (10 mg/mL) was dropped onto the sensor area.
  • Then the sensor was suspended vertically above a titanium isopropoxide solution in a sealed dish for 6 h to create a GOD/titania sol-gel membrane, just as reported (Yu et al., 2003).
  • After forming the sol-gel membrane, a 30 µL aliquot of Nafion® solution was dropped onto the same area of the sensor, and allowed to dry in air for about 20 min.
  • Yet the interference rejection from Nafion is imperfect; at 100 µM concentrations, glucose is indistinguishable from ascorbic acid + lactate + urea.
  • Sensor drifts to 55% of original performance after 4 days: figure 6.
    • The sensor was stored in a buffer at 4 °C.
    • Probably OK for contact lenses, though.

ref: -0 tags: Moshe looks automatic programming google tech talk links date: 11-07-2012 07:38 gmt

List of links from Moshe Looks google tech talk:

ref: notes-0 tags: data effectiveness Norvig google statistics machine learning date: 12-06-2011 07:15 gmt

The unreasonable effectiveness of data.

  • counterpoint to Eugene Wigner's "The Unreasonable effectiveness of mathematics in the natural sciences"
    • that is, math is not so effective in domains involving people.
    • we should not look for elegant theories, rather embrace complexity and make use of extensive data. (google's mantra!!)
  • in 2006 Google released a trillion-word corpus, with counts of all word sequences up to 5 words long.
  • document translation and voice transcription are successful mostly because people need the services - there is demand.
    • Traditional natural language processing does not have such demand as of yet. Furthermore, it has required human-annotated data, which is expensive to produce.
  • simple models and a lot of data trump more elaborate models based on less data.
    • for translation and any other application of ML to web data, n-gram models or linear classifiers work better than elaborate models that try to discover general rules.
  • much web data consists of individually rare but collectively frequent events.
  • because of a huge shared cognitive and cultural context, linguistic expression can be highly ambiguous and still often be understood correctly.
  • mentions Project Halo - $10,000 per page of a chemistry textbook (funded by DARPA).
  • ultimately suggests that there is so much to explore now - just use unlabeled data with an unsupervised learning algorithm.

ref: notes-0 tags: android google date: 07-09-2008 03:33 gmt

brilliant!! source: android winners

ref: bookmark-0 tags: google parallel_computing GFS algorithm mapping reducing date: 0-0-2006 0:0