SCAN: learning hierarchical compositional visual concepts
- From DeepMind, first version Jul 2017 / v3 June 2018.
- Starts broad and strong:
- "The seemingly infinite diversity of the natural world from a relatively small set of coherent rules"
- Relative to what? What's the order of magnitude here? In my experience, each domain involves a large pile of relevant details.
- "We conjecture that these rules dive rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
- "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
- "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication.
- This addresses the limitations of deep learning, which is overly data-hungry (low sample efficiency), tends to overfit the data, and requires human supervision.
- Approach:
- Factorize the visual world with a β-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
- Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the β-VAE) that the examples have in common.
- I.e., this is purely associative learning, with a finite one-layer association matrix.
- Test in both the image → symbol and symbol → image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper).
- Add in a third module, which allows learning of compositions of the features, à la set notation: AND (∪), IN-COMMON (∩) & IGNORE (∖, or '-'). This is via a low-parameter convolutional model; a toy sketch of the set semantics follows below.
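- To make the set semantics concrete, here is a toy sketch in Python. The operator-to-set-operation mapping follows the paper; the attribute names are my own invention:

```python
# Treat a concept as the set of visual attributes it specifies; the three
# recombination operators then act as ordinary set operations.
# Attribute names are hypothetical, for illustration only.

def AND(a: set, b: set) -> set:
    return a | b        # conjunction: specify both concepts' attributes

def IN_COMMON(a: set, b: set) -> set:
    return a & b        # keep only the attributes both concepts share

def IGNORE(a: set, b: set) -> set:
    return a - b        # drop the attributes the second concept specifies

blue_suitcase = {"object=suitcase", "colour=blue"}
blue_hat      = {"object=hat", "colour=blue"}

print(AND(blue_suitcase, {"wall=white"}))          # adds the wall constraint
print(IN_COMMON(blue_suitcase, blue_hat))          # {'colour=blue'}
print(IGNORE(blue_suitcase, {"object=suitcase"}))  # {'colour=blue'}
```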
- Notation:
- $q_\phi(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, and $z_x$ are the latent parameters inferred from the scene.
- $p_\theta(x|z_x)$ is the decoder model. $\theta$ are the decoder parameters. $\hat{x}$ is now the reconstructed scene.
- From this, the loss function of the β-VAE is:
- $$\mathcal{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_\phi(z_x|x)}\left[\log p_\theta(x|z_x)\right] - \beta\, D_{KL}\!\left(q_\phi(z_x|x) \,\|\, p(z_x)\right)$$
- where $p(z_x) = \mathcal{N}(0, I)$.
- That is, maximize the auto-encoder fit (the expectation of the decoder over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$ -- a normal with diagonal (identity) covariance.
- $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
- $$\max_{\theta, \phi}\; \mathbb{E}_{x \sim D}\left[\mathbb{E}_{q_\phi(z_x|x)}\left[\log p_\theta(x|z_x)\right]\right] \quad \text{subject to} \quad D_{KL}\!\left(q_\phi(z_x|x) \,\|\, p(z_x)\right) < \epsilon$$ where $D$ is the domain of images etc.
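- A minimal PyTorch sketch of this objective may help. It is written as a loss (the negated form of the expression above), assumes a Bernoulli pixel likelihood and a diagonal-Gaussian encoder, and all names are mine, not the paper's:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=4.0):
    """Negative beta-VAE objective for one batch (minimize this).

    x          -- target images, values in [0, 1]
    x_logits   -- decoder output (pre-sigmoid) for one reparameterised z_x
    mu, logvar -- parameters of the diagonal-Gaussian encoder q(z_x|x)
    beta       -- KL weight; beta = 1 recovers the plain VAE
    """
    # -E_q[log p(x|z_x)] under a Bernoulli likelihood = pixel-wise BCE.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```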
- Claim: this loss function tips the scales too far away from accurate reconstruction once there is sufficient visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising auto-encoder [Vincent 2010], which uses a feature-space L2 norm instead of the pixel log-likelihood:
- $$\mathcal{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_\phi(z_x|x)}\left[\|J(\hat{x}) - J(x)\|_2^2\right] - \beta\, D_{KL}\!\left(q_\phi(z_x|x) \,\|\, p(z_x)\right)$$ where $J$ maps from images to high-level features.
- This is from another neural network (transfer learning) which learns features beforehand.
- It's a multilayer perceptron denoising autoencoder [Vincent 2010].
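- Under the same assumptions as the earlier sketch, the substitution is small: score the reconstruction through the frozen DAE features rather than in pixel space. Again a sketch, with `J` as a stand-in for the pre-trained feature extractor:

```python
import torch

def feature_space_beta_vae_loss(x, x_hat, mu, logvar, J, beta=4.0):
    """beta-VAE loss with ||J(x_hat) - J(x)||^2 replacing the pixel term.

    J -- encoder of a pre-trained denoising autoencoder, parameters frozen
         (requires_grad=False); gradients still flow through x_hat.
    """
    with torch.no_grad():
        target = J(x)                        # J(x): fixed feature target
    recon = torch.sum((J(x_hat) - target) ** 2)
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```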
- The SCAN architecture includes an additional element: another VAE, $q_{\phi_y}(z_y|y)$, which is trained simultaneously on the labelled inputs $y$ and on the latent outputs $z_x$ from the image encoder $q_\phi(z_x|x)$ given $x$.
- In this way, they can present a description $y$ to the network, which is then recomposed into latents $z_y$ that in turn produce an image $\hat{x}$.
- The whole network is trained by maximizing:
- $$\mathcal{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = \mathbb{E}_{q_{\phi_y}(z_y|y)}\left[\log p_{\theta_y}(y|z_y)\right] - \beta\, D_{KL}\!\left(q_{\phi_y}(z_y|y) \,\|\, p(z_y)\right) - \lambda\, D_{KL}\!\left(q_\phi(z_x|x) \,\|\, q_{\phi_y}(z_y|y)\right)$$
- 1st term: log-likelihood of the decoded symbols $y$ given the encoded latents $z_y$.
- 2nd term: weighted KL divergence between the encoded latents $z_y$ and the diagonal unit normal prior $p(z_y) = \mathcal{N}(0, I)$.
- 3rd term: weighted KL divergence between the latents from the images, $q_\phi(z_x|x)$, and the latents from the description, $q_{\phi_y}(z_y|y)$.
- They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
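- To make the direction explicit, here is a sketch of the full objective using the closed-form diagonal-Gaussian KL. The argument order follows the paper's description; the weight values are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * torch.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def scan_loss(y, y_logits, mu_y, logvar_y, mu_x, logvar_x,
              beta=1.0, lam=1.0):
    # 1st term: -log-likelihood of the decoded symbols (BCE, one sample).
    recon = F.binary_cross_entropy_with_logits(y_logits, y, reduction="sum")
    # 2nd term: KL( q(z_y|y) || N(0, I) ).
    kl_prior = -0.5 * torch.sum(1.0 + logvar_y - mu_y.pow(2) - logvar_y.exp())
    # 3rd term: KL( q(z_x|x) || q(z_y|y) ) -- image latents on the LEFT,
    # so q(z_y|y) must cover everything the image posterior does; swapping
    # the arguments changes the behaviour, hence "direction matters".
    # detach(): the image-side beta-VAE is already trained and held fixed.
    kl_cross = gaussian_kl(mu_x.detach(), logvar_x.detach(), mu_y, logvar_y)
    return recon + beta * kl_prior + lam * kl_cross
```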
- Final element! A convolutional recombination module, implemented as a tensor product between $z_{y_1}$ and $z_{y_2}$ plus a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix.
- I don't think this is any great shakes. It could have been done with a small function (see the sketch below); no need for a neural network.
- Trained with a loss function very similar to that of SCAN / the β-VAE.
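- For flavor, a guess at what that "small function" could look like: recombine two diagonal-Gaussian symbol latents directly on their parameters, calling a dimension "specified" when its variance is well below the prior's. This is entirely my sketch, not the paper's module:

```python
import torch

PRIOR_MU, PRIOR_VAR = 0.0, 1.0   # N(0, I) prior fills unspecified dims
SPECIFIED = 0.5                  # variance threshold; hypothetical value

def recombine(op, mu1, var1, mu2, var2):
    """Combine two symbol latents (mu, var) per the chosen set operation."""
    spec1, spec2 = var1 < SPECIFIED, var2 < SPECIFIED
    mu = torch.full_like(mu1, PRIOR_MU)
    var = torch.full_like(var1, PRIOR_VAR)
    if op == "AND":           # union of specified dims (1st wins on clashes)
        mu = torch.where(spec1, mu1, torch.where(spec2, mu2, mu))
        var = torch.where(spec1, var1, torch.where(spec2, var2, var))
    elif op == "IN_COMMON":   # intersection: both concepts must specify
        both = spec1 & spec2
        mu, var = torch.where(both, mu1, mu), torch.where(both, var1, var)
    elif op == "IGNORE":      # difference: drop what the 2nd specifies
        keep = spec1 & ~spec2
        mu, var = torch.where(keep, mu1, mu), torch.where(keep, var1, var)
    return mu, var
```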
- Testing:
- They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implemented easily, e.g. with a single-pixel detector for the wall color. Quite disappointing.
- Marginally more interesting: the network learns to eliminate irrelevant latent factors as it's exposed to more examples (much as a Bayesian network might).
- Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.