One-shot learning by inverting a compositional causal process
 Brenden Lake, Russ Salakhutdinov, Josh Tenenbaum
 This is the paper that preceded the 2015 Science publication "Human-level concept learning through probabilistic program induction"
 Because it's a NIPS paper, and not a Science paper, this one is a bit more accessible: the logic behind the details and developments is apparent.
 General idea: build up a fully probabilistic model of multi-language (Omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via Bezier splines), where strokes attach to others (spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probabilities are left to the supplemental material.
 They fit the complete model to the Omniglot data using gradient descent + image-space noising, i.e. tweak the free parameters of the model to generate images that look like the human-created characters. (This too is in the supplement).
 Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
 The probabilistic model then assigns a log-likelihood to each of the parses.
 They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse, and they additionally sample $\psi$ (the character type) to get a greater weighted diversity of types.
 Surprisingly, they don't estimate the image likelihood (which is expensive); here they just redo the parsing based on aggregate info embedded in the statistical model. Clever.
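 To make the "sample a region around each parse" step concrete, here is a minimal random-walk Metropolis-Hastings sketch. It is not the paper's sampler: `toy_log_score` is a made-up Gaussian bump standing in for the model's log-score, and `parse`, `step` and `n_steps` are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_score, theta0, n_steps=2000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings started at theta0."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    current = log_score(theta)
    samples = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.size)
        proposed = log_score(proposal)
        # The proposal is symmetric, so acceptance depends only on the score ratio.
        if np.log(rng.uniform()) < proposed - current:
            theta, current = proposal, proposed
        samples[i] = theta
    return samples

# Hypothetical stand-in for the model: a unit Gaussian bump centred on one "parse".
parse = np.array([1.0, -2.0])

def toy_log_score(theta):
    return -0.5 * np.sum((theta - parse) ** 2)

samples = metropolis_hastings(toy_log_score, theta0=parse)
print(samples.mean(axis=0), samples.std(axis=0))
```

 The samples stay in a neighbourhood of the starting parse, which is the behaviour the notes describe; the real model's score is of course far less smooth than this toy.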
 $\psi$ is the character type (a, b, c..), $\psi = \{ \kappa, S, R \}$ where $\kappa$ is the number of strokes, $S$ is a set of parameterized strokes, and $R$ are the relations between strokes.
 $\theta$ are the per-token stroke parameters.
 $I$ is the image, obvi.
 Classification task: one image of a new character (c) vs. 20 new characters from the same alphabet (test, (t)). Among the 20 there is one character of the same type; the task is to find it.
 With 'hierarchical Bayesian program learning' (HBPL), they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters to the image using gradient descent.
 They subsequently refit the parses of the test image to the class image (c).
 Hence the best classification is the one where both are in the best agreement: $\underset{c}{\arg\max} \frac{P(c|t)}{P(c)} P(t|c)$ where $P(c)$ is approximated by the parse weights.
 Again, this is clever as it allows significant information leakage between (c) and (t) ...
 The other models (Affine, Deep Boltzmann Machines, Hierarchical Deep Model) have nothing like this; they are feedforward.
 No wonder HBPL performs better. It's a better model of the data, that has a bidirectional fitting routine.
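 The two-way classification rule is easiest to see in log space, where the product and ratio become sums and differences. A small sketch, assuming the three log-scores per candidate have already been computed somehow; the numbers below are invented for illustration and only three candidates are used instead of 20.

```python
import numpy as np

def classify(log_p_c_given_t, log_p_t_given_c, log_p_c):
    """Return the index c maximizing P(c|t) / P(c) * P(t|c), plus all log-scores."""
    score = (np.asarray(log_p_c_given_t)
             - np.asarray(log_p_c)
             + np.asarray(log_p_t_given_c))
    return int(np.argmax(score)), score

# Made-up log-scores for three candidate characters.
log_p_c_given_t = np.array([-40.0, -12.0, -35.0])
log_p_t_given_c = np.array([-38.0, -15.0, -30.0])
log_p_c = np.array([-20.0, -18.0, -22.0])

best, score = classify(log_p_c_given_t, log_p_t_given_c, log_p_c)
print(best, score)  # 1 [-58.  -9. -43.]
```

 A candidate only wins if it explains the test image and is explained by it; a purely one-directional score would use just one of the first two arrays.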
 As I read the paper, I had a few vague 'hedons':
 Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion, many fitting and inference problems are solved. (Such is my intuition)
 As a corollary of this, having both forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
 The fitting process has to be multi-pass or at least reentrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
 The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly reentrant to support hierarchical planning ...

SCAN: learning hierarchical compositional concepts
 From DeepMind, first version Jul 2017 / v3 June 2018.
 Starts broad and strong:
 "The seemingly infinite diversity of the natural world arises from a relatively small set of coherent rules"
 Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details..
 "We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
 "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
 "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication."
 This addresses the limitations of deep learning models, which are overly data hungry (low sample efficiency), tend to overfit the data, and require human supervision.
 Approach:
 Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
 Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the $\beta$-VAE) that the examples have in common.
 I.e. this is purely associative learning, with a finite one-layer association matrix.
 Test in both the image-to-symbol and symbol-to-image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper..)
 Add in a third module, which allows learning of compositions of the features, à la set notation: AND ( $\cup$ ), IN COMMON ( $\cap$ ) & IGNORE ( $\setminus$ or '-' ). This is via a low-parameter convolutional model.
 Notation:
 $q_{\phi}(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
 $p_{\theta}(x|z_x)$ is the decoder model. $x \sim p_{\theta}(x|z_x)$ , $\theta$ are the decoder parameters. $x$ is now the reconstructed scene.
 From this, the loss function of the $\beta$-VAE is:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $\beta \gt 1$
 That is, maximize the autoencoder fit (the expectation of the decoder over the encoder output, aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$
 $p(z) = \mathcal{N}(0, I)$ , a unit Gaussian with diagonal (identity) covariance.
 $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
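 $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x) \| p(z)) \lt \epsilon$ where $D$ is the domain of images etc. Writing the constraint with multiplier $\beta$ gives $\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \beta (D_{KL}(q_{\phi}(z|x) \| p(z)) - \epsilon)$ ; dropping the constant $\beta \epsilon$ leaves the $\beta$-weighted objective.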
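 As a worked instance of the $\beta$-weighted objective, the sketch below evaluates it for one image with NumPy. The assumptions are mine, not the paper's: a Bernoulli pixel likelihood, a diagonal-Gaussian encoder output given as `mu` and `log_var`, a single decoded sample `x_hat`, and made-up numbers throughout.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x | z) for binary pixels x and decoder probabilities x_hat."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def beta_vae_objective(x, x_hat, mu, log_var, beta=4.0):
    """Single-sample estimate of E_q[log p(x|z)] - beta * KL(q || p); to be maximized."""
    return bernoulli_log_likelihood(x, x_hat) - beta * kl_to_standard_normal(mu, log_var)

x = np.array([1.0, 0.0, 1.0, 1.0])      # toy 4-pixel "image"
x_hat = np.array([0.9, 0.2, 0.8, 0.7])  # decoder output for one sampled z
mu = np.array([0.5, -0.5])              # encoder mean
log_var = np.array([0.0, 0.0])          # encoder log-variance (unit variance)

print(kl_to_standard_normal(mu, log_var))  # 0.25
print(beta_vae_objective(x, x_hat, mu, log_var, beta=1.0))
print(beta_vae_objective(x, x_hat, mu, log_var, beta=4.0))
```

 Raising `beta` lowers the objective by exactly the extra KL penalty, which is the pressure toward the prior that the constraint $\epsilon$ expresses.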
 They claim that with $\beta$ large enough for sufficient visual disentangling, this loss function tips the scale too far away from accurate reconstruction (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising autoencoder (ref), which uses the feature L2 norm instead of the pixel log-likelihood:
 $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} \| J(\hat{x}) - J(x) \|_2^2 - \beta D_{KL} (q_{\phi}(z_x|x) \| p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features. (The squared feature error is a cost, so it has to enter with a minus sign if this expression is maximized like the pixel-likelihood version.)
 This $J(x)$ is from another neural network (transfer learning) which learns features beforehand.
 It's a multilayer perceptron denoising autoencoder [Vincent 2010].
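 A sketch of what measuring reconstruction error in feature space looks like. Here `J` is a hypothetical stand-in (a fixed random projection followed by `tanh`), not the pretrained denoising autoencoder from the paper; the image size and feature dimension are likewise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W, H, C, N = 8, 8, 3, 16

# Hypothetical frozen feature extractor standing in for the pretrained DAE.
J_weights = rng.standard_normal((W * H * C, N)) / np.sqrt(W * H * C)

def J(image):
    """Map a W x H x C image to an N-dimensional feature vector."""
    return np.tanh(image.reshape(-1) @ J_weights)

def feature_reconstruction_error(x, x_hat):
    """|| J(x_hat) - J(x) ||_2^2"""
    diff = J(x_hat) - J(x)
    return float(diff @ diff)

x = rng.uniform(size=(W, H, C))
x_hat = x + 0.05 * rng.standard_normal(x.shape)  # a slightly-off reconstruction

print(feature_reconstruction_error(x, x_hat))
print(feature_reconstruction_error(x, x))  # 0.0
```

 The point of the swap is that the error is weighted by whatever `J` considers important rather than by raw pixel counts; with a random `J`, as here, that benefit is of course absent.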
 The SCAN architecture includes an additional element, another VAE which is trained simultaneously on the labeled inputs $y$ and the latent outputs from encoder $z_x$ given $x$ .

 In this way, they can present a description $y$ to the network, which is then recomposed into $z_y$ , that then produces an image $\hat{x}$ .
 The whole network is trained by maximizing:
 $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$
 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y} (y|z_y)]$ , the log-likelihood of the decoded symbols given encoded latents $z_y$
 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) \| p(z_y))$ , the weighted KL divergence between encoded latents and diagonal normal prior.
 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) \| q_{\phi_y}(z_y|y))$ , the weighted KL divergence between latents from the images and latents from the description $y$ .
 They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
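 Why the direction matters can be checked numerically with two diagonal Gaussians. In this sketch (my numbers, not the paper's) the image posterior is narrow on both latent factors, while the symbol posterior is narrow on the factor the symbol specifies and left at unit variance on the one it does not.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) )."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

# Image latents z_x: both factors pinned down by the image.
mu_x, var_x = np.array([1.0, 0.0]), np.array([0.1, 0.1])
# Symbol latents z_y: second factor is irrelevant to the symbol, so left broad.
mu_y, var_y = np.array([1.0, 0.0]), np.array([0.1, 1.0])

forward = kl_diag_gaussians(mu_x, var_x, mu_y, var_y)  # KL(z_x || z_y)
reverse = kl_diag_gaussians(mu_y, var_y, mu_x, var_x)  # KL(z_y || z_x)
print(forward, reverse)
```

 The forward direction charges little for the symbol latents staying broad on the irrelevant factor, whereas the reverse direction penalizes it heavily; that fits with letting irrelevant attributes be filled in from the prior.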
 Final element! A convolutional recombination element, implemented as a tensor product between $z_{y1}$ and $z_{y2}$ that outputs a one-hot encoding of the set operation that's fed to a (hardcoded?) transformation matrix.
 I don't think this is great shakes. Could have done this with a small function; no need for a neural network.
 Trained with a very similar loss function to SCAN or the $\beta$-VAE.
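 For comparison, here is the kind of "small function" meant above: the three operators written directly on symbolic concepts. This is only a sketch of the set semantics, assuming each concept is a dict from factor name to value (the factor names are invented); the paper's module operates on latent distributions, not on dicts.

```python
def concept_and(a, b):
    """AND: union of the specified factors (assumes a and b do not conflict)."""
    merged = dict(a)
    merged.update(b)
    return merged

def concept_in_common(a, b):
    """IN COMMON: keep only the factors on which both concepts agree."""
    return {k: v for k, v in a.items() if b.get(k) == v}

def concept_ignore(a, b):
    """IGNORE: drop from a every factor that b specifies."""
    return {k: v for k, v in a.items() if k not in b}

blue = {"wall_colour": "blue"}
suitcase = {"object": "suitcase"}
hat = {"object": "hat"}

blue_suitcase = concept_and(blue, suitcase)
blue_hat = concept_and(blue, hat)

print(blue_suitcase)                               # {'wall_colour': 'blue', 'object': 'suitcase'}
print(concept_in_common(blue_suitcase, blue_hat))  # {'wall_colour': 'blue'}
print(concept_ignore(blue_suitcase, suitcase))     # {'wall_colour': 'blue'}
```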
 Testing:

 They seem to have used a very limited subset of "DeepMind Lab"; all of the concept or class labels could have been implemented easily, e.g. a single-pixel detector for the wall color. Quite disappointing.

 This is marginally more interesting: the network learns to eliminate latent factors as it's exposed to examples (just like perhaps a Bayesian network.)
 Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.
