 m8ta
 {1564} hide / / print ref: -2008 tags: t-SNE dimensionality reduction embedding Hinton date: 01-25-2022 20:39 gmt revision:2   [head] “Visualizing data using t-SNE” Laurens van der Maaten, Geoffrey Hinton. SNE: stochastic neighbor embedding, Hinton 2002. Idea: model the data conditional pairwise distribution as a gaussian, with one variance per data point, $p(x_i | x_j)$ in the mapped data, this pairwise distribution is modeled as a fixed-variance gaussian, too, $q(y_i | y_j)$ Goal is to minimize the Kullback-Leibler divergence $\Sigma_i KL(p_i || q_i)$ (summed over all data points) Per-data point variance is found via binary search to match a user-specified perplexity. This amounts to setting a number of nearest neighbors, somewhere between 5 and 50 work ok. Cost function is minimized via gradient descent, starting with a random distribution of points yi, with plenty of momentum to speed up convergence, and noise to effect simulated annealing. Cost function is remarkably simple to reduce, gradient update: $\frac{\delta C}{\delta y_i} = 2 \Sigma_j(p_{j|i} - q_{j-i} + p_{i|j} - q_{i|j})(y_i - y_j)$ t-SNE differs from SNE (above) in that it addresses difficulty in optimizing the cost function, and crowding. Uses a simplified symmetric cost function (symmetric conditional probability, rather than joint probability) with simpler gradients Uses the student’s t-distribution in the low-dimensional map q to reduce crowding problem. The crowding problem is roughly resultant from the fact that, in high-dimensional spaces, the volume of the local neighborhood scales as $r^m$ , whereas in 2D, it’s just $r^2$ . Hence there is cost-incentive to pushing all the points together in the map -- points are volumetrically closer together in high dimensions than they can be in 2D. This can be alleviated by using a one-DOF student distribution, which is the same as a Cauchy distribution, to model the probabilities in map space. Smart -- they plot the topology of the gradients to gain insight into modeling / convergence behavior. Don’t need simulated annealing due to balanced attractive and repulsive effects (see figure). Enhance the algorithm further by keeping it compact at the beginning, so that clusters can move through each other. Look up: d-bits parity task by Bengio 2007