Finding frequent items in data streams
 Notation:
 S is a data stream, $S = q_1, q_2, ..., q_n$, of length n.
 Each object $q_i \in O = \{o_1, ..., o_m\}$. That is, there are m total possible objects (e.g. English words).
 Object $o_i$ occurs $n_i$ times in S. The $o_i$ are ordered so that $n_1 \geq n_2 \geq ... \geq n_m$.
 Task:
 Given an input stream S, integer k, and real $\epsilon$
 Output a list of k elements from S such that each element has $n_i \gt (1-\epsilon)n_k$.
 That is, if the ordering is perfect, $n_i \geq n_k$ , with equality on the last element.
 Algorithm:
 $h_1, ..., h_t$: hashes from object q to buckets $\{1, ..., b\}$
 $s_1, ..., s_t$: hashes from object q to $\{-1, +1\}$
 For each symbol q in the stream, update the 2D array of counters C: for each row i, hash with $h_i$ to pick a bucket, then add $s_i(q)$ to that counter.
 The double hashing (the random signs) reduces the effect of collisions with high-frequency items.
 When querying for the frequency of an object q, hash it the same way and take the median over i of $C[i][h_i(q)] \cdot s_i(q)$.
 $t = O(\log(\frac{n}{\delta}))$, where the algorithm fails with probability at most $\delta$.
 They prove convergence / correctness for Zipfian distributions with varying exponent. (I did not read through this.)
 Also showed that these hash counts can be compared directly to see what's changed, or more importantly whether two documents are different.
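The update and query steps above can be sketched as follows. This is my own minimal illustration (class and variable names are mine, not the paper's); the per-row hashes $h_i$ and $s_i$ are approximated here by seeding Python's built-in hash, rather than the pairwise-independent families the analysis assumes.

```python
import numpy as np

class CountSketch:
    """Minimal Count-Sketch: a t x b array of signed counters."""

    def __init__(self, t=5, b=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.t, self.b = t, b
        self.C = np.zeros((t, b), dtype=np.int64)
        # One seed per row stands in for independent hash functions h_i, s_i.
        self.seeds = [int(s) for s in rng.integers(0, 2**31, size=t)]

    def _hashes(self, q):
        # Yield (row i, bucket h_i(q) in {0..b-1}, sign s_i(q) in {-1,+1}).
        for i, sd in enumerate(self.seeds):
            h = hash((sd, q))
            yield i, h % self.b, 1 if (h >> 16) & 1 else -1

    def add(self, q, count=1):
        # For each row i, add s_i(q) * count to counter C[i][h_i(q)].
        for i, bucket, sign in self._hashes(q):
            self.C[i, bucket] += sign * count

    def estimate(self, q):
        # Median over rows of C[i][h_i(q)] * s_i(q).
        return int(np.median([self.C[i, bucket] * sign
                              for i, bucket, sign in self._hashes(q)]))
```

The random signs mean a colliding heavy item adds to the counter in some rows and subtracts in others, so the median across rows is robust to a few corrupted cells.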
MISSION: Ultra large-scale feature selection using Count-Sketches
 Task:
 Given a labeled dataset $(X_i, y_i)$ for $i \in \{1, 2, ..., n\}$, with $X_i \in \mathbb{R}^p, y_i \in \mathbb{R}$
 Find the k-sparse feature vector / linear regression for the least-squares problem $\min_{\|\beta\|_0 = k} \|y - X\beta\|_2$
 $\|\beta\|_0 = k$ counts the nonzero elements in the feature vector.
 The number of features $p$ is so large that a dense $\beta$ cannot be stored in memory. (X is of course sparse.)
 Such data may be from ad clickthroughs, or from genomic analyses ...
 Use the Count-Sketch algorithm (above) for storing & continually updating the feature weights during gradient updates.
 That is, treat the stream of stochastic gradient updates, of the usual least-squares form $g_i = -2 \lambda (y_i - X_i^\top \beta)\, X_i$ (with $\lambda$ a step size), as the stream $S$ used above.
 Compare this with greedy thresholding and Iterative Hard Thresholding (IHT), which throw away gradient information after each batch.
 This discards small gradients which may be useful for the regression problem.
 Works better than those, but not necessarily better than straight feature hashing (FH).
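The sketched-gradient idea can be illustrated in a few lines: the weight vector $\beta$ is never materialized; each coordinate lives only inside a Count-Sketch, written with the SGD update and read back by median query. This is a toy sketch under my own assumptions (dense sketch array, Python hashes, plain SGD without the top-k heap the paper maintains), not the paper's implementation.

```python
import numpy as np

def mission_sgd(data, t=5, b=2**12, lr=0.1, seed=0):
    """Sketched SGD for least squares: beta is stored only in a Count-Sketch.

    data: iterable of (x, y) where x is a sparse dict {feature index: value}.
    Returns a query function giving the estimated coefficient of a feature.
    """
    rng = np.random.default_rng(seed)
    seeds = [int(s) for s in rng.integers(0, 2**31, size=t)]
    C = np.zeros((t, b))

    def hs(j):
        # (row i, bucket, sign) for feature index j, as in Count-Sketch.
        for i, sd in enumerate(seeds):
            h = hash((sd, j))
            yield i, h % b, 1 if (h >> 16) & 1 else -1

    def read(j):
        # Median-of-rows estimate of beta_j.
        return float(np.median([C[i, bk] * s for i, bk, s in hs(j)]))

    for x, y in data:
        pred = sum(v * read(j) for j, v in x.items())
        for j, v in x.items():
            # Gradient of (y - x.beta)^2 w.r.t. beta_j is -2 (y - pred) x_j;
            # add the SGD step -lr * g_j into the sketch, coordinate-wise.
            g = -2.0 * (y - pred) * v
            for i, bk, s in hs(j):
                C[i, bk] += s * (-lr * g)

    return read
```

Because the gradient of a sparse example touches only its nonzero features, each SGD step writes only a handful of sketch cells, so memory stays O(t b) regardless of p.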

 Meh.

PMID29123069 A neural algorithm for a fundamental computing problem
 General idea: locality-sensitive hashing, i.e. hashing that preserves the high-dimensional locality of the input space, can be implemented efficiently by a circuit inspired by the insect olfactory system.
 Here, activation of 50 different types of ORNs is mapped to 50 projection neurons, which 'centers the mean': concentration dependence is removed.
 This is then projected via a random matrix of sparse binary weights to a much larger set of Kenyon cells, which in turn are inhibited by one APL neuron.
 Normal locality-sensitive hashing uses dense matrices of Gaussian-distributed random weights, which means higher computational complexity...
 ... these projections are governed by the Johnson-Lindenstrauss lemma, which says that projection from a high-dimensional to a low-dimensional space can preserve locality (distances between points) within an error bound.
 They show that winner-take-all (WTA) selection of the top 5% plus random binary weights preserves locality, as measured by overlap with exact input-space neighbors, on toy datasets including MNIST and SIFT.
 Flashy title as much as anything else got this into Science... indeed, it has only been cited 6 times in PubMed.
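The pipeline described above (mean-centering, sparse binary expansion to Kenyon cells, APL-like winner-take-all keeping the top 5%) can be sketched as a toy function. The dimensions and connection probability here are my own illustrative choices, not the paper's exact parameters, and the sparse binary matrix stands in for the PN-to-KC projection.

```python
import numpy as np

def fly_hash(x, n_kc=2000, p_conn=0.1, wta=0.05, seed=0):
    """Toy fly-inspired LSH: returns a sparse binary tag for input x."""
    rng = np.random.default_rng(seed)
    d = len(x)
    # Sparse binary projection: each "Kenyon cell" samples ~p_conn of inputs.
    W = (rng.random((n_kc, d)) < p_conn).astype(float)
    x = x - x.mean()            # "center the mean": remove concentration
    a = W @ x                   # Kenyon cell activations
    k = int(wta * n_kc)         # APL-like inhibition: keep only the top 5%
    tag = np.zeros(n_kc, dtype=bool)
    tag[np.argsort(a)[-k:]] = True
    return tag
```

Similar inputs should produce tags with large overlap, while unrelated inputs overlap only at chance level (~$k^2 / n_{kc}$ active cells in common).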

