Improved asymmetric locality sensitive hashing for maximum inner product search
 Like many other papers, this one is based on a long lineage of locality-sensitive hashing papers.
 Key innovation, from [23] "The power of asymmetry in binary hashing", was the development of asymmetric hashing: the hash function for the query is different from the hash function used for storage. Roughly, this allows additional degrees of freedom, since the similarity function is (in the non-normalized case) non-symmetric.
 For example, take query Q = [1 1] with keys A = [1 1] and B = [3 3]. The nearest neighbor is A (distance 0), whereas the maximum inner product is with B (inner product 6).
 Alternately: the self-inner product for Q and A is 2, whereas for B it's 18. Self-similarity is not the highest under the inner product.
 The norm of the query does not affect the arg max of the search, though. Hence, the paper assumes that the query has been normalized for MIPS.
 In this paper, they instead convert MIPS into an approximate cosine-similarity search (which is like normalized MIPS), which can be efficiently solved with signed random projections.
 (Established): LSH for L2 distance:
 Sample a random vector a, entries i.i.d. normal N(0,1)
 Sample a scalar b uniformly between 0 and r
 r is the window size / radius (a free parameter)
 The hash function is then $h(x) = \lfloor (a^T x + b) / r \rfloor$
 I'm not sure how the floor op is converted to bits of the actual hash; presumably several such integer values are concatenated to form the bucket key?
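A minimal numpy sketch of this hash family. The sizes (dim, r, K) are illustrative, and concatenating K integer hash values into a bucket key is one common convention, not necessarily the paper's:

```python
import numpy as np

def l2_lsh(x, a, b, r):
    # h(x) = floor((a . x + b) / r), with a ~ N(0, I) and b ~ Uniform[0, r)
    return int(np.floor((a @ x + b) / r))

rng = np.random.default_rng(0)
dim, r, K = 8, 2.0, 4                      # illustrative sizes
a = rng.standard_normal((K, dim))          # one random direction per hash
b = rng.uniform(0.0, r, size=K)            # one random offset per hash

x = rng.standard_normal(dim)
bucket = tuple(l2_lsh(x, a[k], b[k], r) for k in range(K))
# Nearby points tend to land in the same bucket; far points tend not to.
```

The per-hash collision probability decays with $\|x - y\|_2$, which is what makes the concatenated key useful for nearest-neighbor lookup.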
 (Established): LSH-correlation, signed random projections $h^{sign}$:
 Hash is the sign of the inner product of the input vector with a random vector a (entries i.i.d. normal).
 This is a one-bit random projection [13][14].
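A sketch of signed random projections (the hash width and input dimension are illustrative):

```python
import numpy as np

def srp_hash(x, A):
    # One bit per row of A: the sign of each random projection
    return (A @ x > 0).astype(np.uint8)

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 8))    # 16-bit hash for 8-dim inputs (illustrative)
u = rng.standard_normal(8)

bits_u = srp_hash(u, A)
bits_neg = srp_hash(-u, A)          # opposite direction: every bit flips
bits_scaled = srp_hash(2.0 * u, A)  # positive scaling: no bit changes
# P(bit match) = 1 - theta(u, v) / pi, so Hamming distance estimates the angle;
# i.e. this hash family targets cosine similarity, not the raw inner product.
```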
 (New) Asymmetric-LSH-L2 (ALSH):
 $P(x) = [x; \|x\|_2^2; \|x\|_2^4; \dots; \|x\|_2^{2^m}]$  this is the preprocessing hashing of the 'keys'.
 Requires that the norm of these keys satisfies $\|x\|_2 \leq U < 1$
 $m \geq 3$
 $Q(x) = [x; 1/2; 1/2; \dots; 1/2]$  hashing of the queries.
 See the mathematical explanation in the paper, but roughly: the transformations P and Q, when norms are less than 1, provide a correction to the L2 distance $\|Q(q) - P(x_i)\|_2$, making its rank correlate with the un-normalized inner product.
 They then change the augmentation to:
 $P(x) = [x; 1/2 - \|x\|_2^2; 1/2 - \|x\|_2^4; \dots; 1/2 - \|x\|_2^{2^m}]$
 $Q(x) = [x; 0; ...; 0]$
 This allows signed random projections (signed nearest-neighbor search) to be used on the MIPS problem: the hash is the sign of the random projection of P(x) and Q(q), per above; still one bit per projection.
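A sketch of this second (P, Q) augmentation with m = 3 (the key scaling and dimensions are illustrative). Note that $Q(q) \cdot P(x) = q \cdot x$ exactly (the appended query entries are zero), while working through the algebra gives $\|P(x)\|_2^2 = 3/4 + \|x\|_2^{16}$, which is nearly constant for $\|x\|_2 < 1$; this is why cosine similarity on the augmented vectors ranks like the raw inner product:

```python
import numpy as np

def P(x, m=3):
    # Key transform: append 1/2 - ||x||^(2^i) for i = 1..m
    norms = np.array([np.linalg.norm(x) ** (2 ** i) for i in range(1, m + 1)])
    return np.concatenate([x, 0.5 - norms])

def Q(q, m=3):
    # Query transform: append m zeros (query assumed normalized)
    return np.concatenate([q, np.zeros(m)])

rng = np.random.default_rng(2)
q = rng.standard_normal(4)
q /= np.linalg.norm(q)                    # normalized query
keys = 0.2 * rng.standard_normal((5, 4))  # scaled so ||x||_2 <= U < 1

ip_orig = keys @ q                              # raw inner products
ip_aug = np.array([Q(q) @ P(x) for x in keys])  # identical, term by term
```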
 They then extend the $U$, $m$ tradeoff function $\rho$ to allow for non-normalized queries. $U$ depends on $m$ and $c$ ($m$ is the codeword extension, and $c$ is the ratio between on-target and off-target hash hits).
 Tested on the Movielens and Netflix datasets, using SVD preprocessing on the user-item matrix (a full matrix indicating every user's rating on every movie, mostly zeros!) to get at the latent vectors.

 In the above plots, recall (hah) that as the number of draws k increases, precision is the number of true positives / the number of draws k, and recall is the number of true positives / the size of the true top-N set.
 Clearly, the curve bends up and to the right when there are a lot of hash tables K.
 Example datapoint: 50% precision at 40% recall, top-5. So on average you get 2 correct hits in 4 draws. Or: 40% precision, 20% recall, top-10: 2 hits in 5 draws. 20% precision / 40% recall, top-10: 4 hits in 20 draws. (hit: an item correctly within the top-N)
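The arithmetic behind those datapoints, as a quick check (top_n is the size of the true top-N list):

```python
def hits_and_draws(precision, recall, top_n):
    # Invert the two definitions: recall = hits / top_n, precision = hits / draws
    hits = recall * top_n
    draws = hits / precision
    return hits, draws

print(hits_and_draws(0.50, 0.40, 5))    # about 2 hits in 4 draws
print(hits_and_draws(0.40, 0.20, 10))   # about 2 hits in 5 draws
print(hits_and_draws(0.20, 0.40, 10))   # about 4 hits in 20 draws
```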
 So ... it's not that great.
Use case: Capsule: a camera-based positioning system using learning
 Uses 512 SIFT features as keys and queries to the LSH. Hashing is computed via a sparse addition/subtraction algorithm (not quite random projections), with K bits per hash table and L hash tables; K = 22 and L = 24. ~1000 training images.
 Best matching image is used as the location of the current image.

Winner-take-all Autoencoders
 During training of fully-connected layers, they enforce a winner-take-all lifetime sparsity constraint.
 That is: when training using minibatches, they keep the k% largest activations of a given hidden unit, measured across all samples presented in the minibatch. The remainder of the activations are set to zero. The units are not competing with each other; each unit competes with its own activations on the other samples.
 The rest of the network is a stack of ReLU layers (upon which the sparsity constraint is applied) followed by a linear decoding layer (which makes interpretation simple).
 They stack them via sequential training: train one layer on the output of the previous one, without backpropagating the errors through the stack.
 Works, with lower sparsity targets, also for RBMs.
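A numpy sketch of the lifetime sparsity step described above (shapes and k are illustrative; a real training loop would also mask the gradients accordingly):

```python
import numpy as np

def lifetime_sparsity(h, k_percent):
    # Per hidden unit (column), keep the top k% of activations across the
    # minibatch (rows) and zero the rest; units compete only with themselves.
    n = h.shape[0]
    k = max(1, int(np.ceil(k_percent / 100.0 * n)))
    top = np.argpartition(h, -k, axis=0)[-k:]     # row indices of the k largest
    out = np.zeros_like(h)
    cols = np.arange(h.shape[1])
    out[top, cols] = h[top, cols]
    return out

rng = np.random.default_rng(3)
h = rng.standard_normal((100, 16))     # minibatch of 100 samples, 16 hidden units
s = lifetime_sparsity(h, k_percent=5)  # keep the top 5 activations per unit
```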
 Extended the result to WTA convnets: here they enforce both spatial and temporal (minibatch) sparsity.
 Spatial sparsity involves selecting the single largest hidden unit activity within each feature map. The other activities and derivatives are set to zero.
 At test time, this sparsity constraint is released, and instead they use a 4 x 4 max-pooling layer & use that for classification or deconvolution.
 To apply both spatial and temporal sparsity: select the highest spatial response (e.g. one unit in a 2D plane of convolutions; all share the same weights) for each feature map. Do this for every image in a minibatch, and then apply the temporal sparsity: each feature map gets to be active exactly once per minibatch, and at that time only one hidden unit (really, one location of the input under the shared weights, depending on stride) undergoes SGD.
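A numpy sketch of the spatial-sparsity step on a batch of feature maps (shapes are illustrative; the temporal step would then keep, per feature map, only the winner from the single best image in the minibatch):

```python
import numpy as np

def spatial_sparsity(fmaps):
    # Keep only the single largest activation in each (image, feature map) plane
    b, c, h, w = fmaps.shape
    flat = fmaps.reshape(b, c, -1)
    winners = flat.argmax(axis=-1)                          # (b, c) winner indices
    out = np.zeros_like(flat)
    bi, ci = np.meshgrid(np.arange(b), np.arange(c), indexing="ij")
    out[bi, ci, winners] = flat[bi, ci, winners]
    return out.reshape(fmaps.shape)

rng = np.random.default_rng(4)
fmaps = rng.standard_normal((2, 3, 8, 8))   # (batch, feature maps, H, W)
sparse = spatial_sparsity(fmaps)            # one nonzero entry per plane
```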
 Seems like it might train very slowly. Authors didn't note how many epochs were required.
 This, too, can be stacked.
 To train on larger image sets, they first extract 48 x 48 patches & again stack...
 Tested on MNIST, SVHN, and CIFAR-10: works OK, and well even with few labeled examples (which is consistent with their goals)

