“Visualizing data using tSNE”
 Laurens van der Maaten, Geoffrey Hinton.
 SNE: stochastic neighbor embedding, Hinton 2002.
 Idea: model the data conditional pairwise distribution as a gaussian, with one variance per data point, $p(x_i  x_j)$
 in the mapped data, this pairwise distribution is modeled as a fixedvariance gaussian, too, $q(y_i  y_j)$
 Goal is to minimize the KullbackLeibler divergence $\Sigma_i KL(p_i  q_i)$ (summed over all data points)
 Perdata point variance is found via binary search to match a userspecified perplexity. This amounts to setting a number of nearest neighbors, somewhere between 5 and 50 work ok.
 Cost function is minimized via gradient descent, starting with a random distribution of points yi, with plenty of momentum to speed up convergence, and noise to effect simulated annealing.
 Cost function is remarkably simple to reduce, gradient update: $\frac{\delta C}{\delta y_i} = 2 \Sigma_j(p_{ji}  q_{ji} + p_{ij}  q_{ij})(y_i  y_j)$
 tSNE differs from SNE (above) in that it addresses difficulty in optimizing the cost function, and crowding.
 Uses a simplified symmetric cost function (symmetric conditional probability, rather than joint probability) with simpler gradients
 Uses the student’s tdistribution in the lowdimensional map q to reduce crowding problem.
 The crowding problem is roughly resultant from the fact that, in highdimensional spaces, the volume of the local neighborhood scales as $r^m$ , whereas in 2D, it’s just $r^2$ . Hence there is costincentive to pushing all the points together in the map  points are volumetrically closer together in high dimensions than they can be in 2D.
 This can be alleviated by using a oneDOF student distribution, which is the same as a Cauchy distribution, to model the probabilities in map space.
 Smart  they plot the topology of the gradients to gain insight into modeling / convergence behavior.
 Don’t need simulated annealing due to balanced attractive and repulsive effects (see figure).
 Enhance the algorithm further by keeping it compact at the beginning, so that clusters can move through each other.
 Look up: dbits parity task by Bengio 2007
