Finding frequent items in data streams
 Notation:
 $S$ is a data stream, $S = q_1, q_2, \dots, q_n$, of length $n$.
 Each object $q_i \in O = \{o_1, \dots, o_m\}$. That is, there are $m$ total possible objects (e.g. English words).
 Object $o_i$ occurs $n_i$ times in $S$. The $o_i$ are ordered so that $n_1 \geq n_2 \geq \dots \geq n_m$.
 Task:
 Given an input stream $S$, an integer $k$, and a real $\epsilon$,
 output a list of $k$ elements from $S$ such that each element has $n_i > (1-\epsilon)n_k$.
 That is, if the ordering were perfect, each element would have $n_i \geq n_k$, with equality on the last element.
 Algorithm:
 $h_1, \dots, h_t$: hashes from object $q$ to buckets $\{1, \dots, b\}$.
 $s_1, \dots, s_t$: hashes from object $q$ to $\{-1, +1\}$.
 For each symbol $q$, add it to the 2D hash array: for each row $i$, increment the counter at $h_i(q)$ by $s_i(q)$.
 The double hashing reduces the bias from collisions with high-frequency items.
 When querying the frequency of an object $q$, hash it the same way and take the median over $i$ of $s_i(q) \cdot C[i, h_i(q)]$, where $C$ is the counter array.
 $t = O(\log(\frac{n}{\delta}))$ rows suffice; the algorithm fails with probability at most $\delta$.
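The update and query steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the hash functions here are derived from `hashlib.blake2b` purely for convenience, where the paper assumes pairwise-independent families.

```python
import hashlib
import statistics

class CountSketch:
    """Minimal Count Sketch: t rows of b signed counters each."""

    def __init__(self, t, b):
        self.t = t
        self.b = b
        self.table = [[0] * b for _ in range(t)]

    def _hashes(self, q, i):
        # Row-seeded hashes: h_i(q) in {0,...,b-1}, s_i(q) in {-1,+1}.
        d = hashlib.blake2b(f"{i}:{q}".encode(), digest_size=8).digest()
        h = int.from_bytes(d[:4], "little") % self.b
        s = 1 if d[4] & 1 else -1
        return h, s

    def add(self, q):
        # For each row i, increment counter h_i(q) by s_i(q).
        for i in range(self.t):
            h, s = self._hashes(q, i)
            self.table[i][h] += s

    def estimate(self, q):
        # Median over rows of s_i(q) * C[i, h_i(q)].
        return statistics.median(
            s * self.table[i][h]
            for i in range(self.t)
            for h, s in [self._hashes(q, i)]
        )
```

With a heavy item and a few light ones, the median keeps the heavy item's estimate close to its true count even when some rows suffer collisions.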
 They demonstrate proofs of convergence / correctness under Zipfian distributions with varying exponent. (I did not read through this.)
 They also show that two sketches can be compared directly to see what has changed, or more importantly whether two documents differ.
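The direct comparison works because the sketch is linear in the stream: subtracting two counter tables built with the same hash functions yields a sketch of the difference of the two frequency vectors. A minimal illustration, assuming the tables are stored as nested lists (a hypothetical layout, not the paper's):

```python
def sketch_difference(table_a, table_b):
    """Bucket-wise difference of two count-sketch tables built with the
    SAME hash functions; the result sketches stream A minus stream B."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]

# Two toy 2-row, 4-bucket tables from identical streams cancel exactly.
t1 = [[3, 0, -1, 2], [0, 4, 1, -2]]
t2 = [[3, 0, -1, 2], [0, 4, 1, -2]]
diff = sketch_difference(t1, t2)  # all-zero table: no change detected
```

Querying the difference table with the usual median estimator then gives the change in any item's count.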
MISSION: Ultra Large-Scale Feature Selection using Count-Sketches
 Task:
 Given a labeled dataset $(X_i, y_i)$ for $i \in \{1, 2, \dots, n\}$, with $X_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$,
 find the $k$-sparse feature vector / linear regression for the least-squares problem $\min_{\|\beta\|_0 = k} \|y - X\beta\|_2$.
 The constraint $\|\beta\|_0 = k$ counts the nonzero elements in the feature vector.
 The number of features $p$ is so large that a dense $\beta$ cannot be stored in memory. ($X$ is of course sparse.)
 Such data may be from ad clickthroughs, or from genomic analyses ...
 Use the count-sketch algorithm (above) to store and continually update the feature weights across gradient updates.
 That is, treat the stream of gradient updates for the squared loss, $g_i = -2\lambda\,(y_i - X_i \beta_i)\,X_i^\top$, as the semi-continuous time series used above as $S$.
 Compare this with greedy thresholding and iterative hard thresholding (IHT), which throw away gradient information after each batch.
 That discards small gradients which may still be useful for the regression problem.
 MISSION works better than IHT, but not necessarily better than straight feature hashing (FH).
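A rough sketch of the idea under stated assumptions: squared loss, a count sketch standing in for the weight vector $\beta$, and illustrative `blake2b`-seeded hashes. The real algorithm also maintains a top-$k$ heap of heavy coordinates, omitted here; names like `SketchedWeights` and `sgd_step` are my own.

```python
import hashlib
import statistics

def _hs(key, i, b):
    # Illustrative per-row hash: bucket h in {0..b-1}, sign s in {-1,+1}.
    d = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d[:4], "little") % b, (1 if d[4] & 1 else -1)

class SketchedWeights:
    """Approximate storage for a huge weight vector beta in a count sketch."""

    def __init__(self, t=5, b=1024):
        self.t, self.b = t, b
        self.table = [[0.0] * b for _ in range(t)]

    def update(self, j, delta):
        # beta[j] += delta, applied with the sign hash in every row.
        for i in range(self.t):
            h, s = _hs(j, i, self.b)
            self.table[i][h] += s * delta

    def get(self, j):
        # Median-of-rows estimate of beta[j].
        return statistics.median(
            s * self.table[i][h]
            for i in range(self.t)
            for h, s in [_hs(j, i, self.b)]
        )

def sgd_step(weights, x_sparse, y, lr=0.5):
    """One sketched SGD step for squared loss; x_sparse maps feature -> value."""
    resid = y - sum(v * weights.get(j) for j, v in x_sparse.items())
    for j, v in x_sparse.items():
        # Negative gradient of (y - x.beta)^2 w.r.t. beta_j is 2 * resid * v.
        weights.update(j, lr * 2 * resid * v)
```

Only the nonzero coordinates of each sparse $X_i$ touch the sketch, so memory is $O(t \cdot b)$ regardless of $p$.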

 Meh.
