“Visualizing Data using t-SNE”
 Laurens van der Maaten, Geoffrey Hinton.
 SNE: stochastic neighbor embedding, Hinton & Roweis 2002.
 Idea: model the data's conditional pairwise distribution as a Gaussian, with one variance per data point: $p(x_i \mid x_j)$
 In the mapped data, this pairwise distribution is modeled as a fixed-variance Gaussian, too: $q(y_i \mid y_j)$
 Goal is to minimize the Kullback-Leibler divergence $\sum_i \mathrm{KL}(P_i \| Q_i)$ (summed over all data points)
 Per-data-point variance is found via binary search to match a user-specified perplexity. This amounts to setting a soft number of nearest neighbors; values between 5 and 50 work OK.
 Cost function is minimized via gradient descent, starting from a random distribution of points $y_i$, with plenty of momentum to speed up convergence, and noise to effect simulated annealing.
 The gradient is remarkably simple: $\frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$
 t-SNE differs from SNE (above) in that it addresses the difficulty of optimizing SNE's cost function, and the crowding problem.
 Uses a simplified symmetric cost function (a single joint probability $p_{ij}$, rather than conditional probabilities) with simpler gradients
 Uses the Student's t-distribution in the low-dimensional map $q$ to reduce the crowding problem.
 The crowding problem roughly results from the fact that, in high-dimensional spaces, the volume of a local neighborhood scales as $r^m$, whereas in 2D it's just $r^2$. Hence there is a cost incentive to push all the points together in the map: points are volumetrically closer together in high dimensions than they can be in 2D.
 This can be alleviated by using a one-DOF Student's t-distribution, which is the same as a Cauchy distribution, to model the probabilities in map space.
 Smart - they plot the topology of the gradients to gain insight into modeling / convergence behavior.
 Don’t need simulated annealing due to balanced attractive and repulsive effects (see figure).
 Enhance the algorithm further by keeping the map compact at the beginning, so that clusters can move through each other.
 Look up: d-bits parity task, Bengio 2007
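
 The pieces above (perplexity-matching binary search, symmetric joint $P$, Cauchy kernel in map space, momentum gradient descent) fit together as in this minimal numpy sketch. This is not the authors' implementation; the function names, learning rate, iteration count, and initialization scale are illustrative choices.

```python
import numpy as np

def cond_probs(dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search the Gaussian precision beta_i = 1/(2*sigma_i^2) for one
    point so that p_{j|i} over its neighbors hits the requested perplexity
    (2^entropy). dists_i: squared distances to all *other* points."""
    target_h = np.log2(target_perplexity)
    lo, hi, beta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        p = np.exp(-dists_i * beta)
        p /= p.sum()
        h = -np.sum(p * np.log2(p + 1e-12))    # entropy in bits
        if abs(h - target_h) < tol:
            break
        if h > target_h:                        # too flat -> raise precision
            lo, beta = beta, beta * 2 if hi == np.inf else (beta + hi) / 2
        else:                                   # too peaked -> lower precision
            hi, beta = beta, (lo + beta) / 2
    return p

def joint_P(X, perplexity):
    """Symmetric joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = X.shape[0]
    d2 = np.square(X[:, None] - X[None, :]).sum(-1)
    P = np.zeros((n, n))
    for i in range(n):
        idx = np.arange(n) != i
        P[i, idx] = cond_probs(d2[i, idx], perplexity)
    return (P + P.T) / (2 * n)

def tsne(X, n_iter=200, dim=2, perplexity=5.0, lr=100.0, momentum=0.5, seed=0):
    """Minimal t-SNE: one-DOF Student-t (Cauchy) kernel in map space,
    gradient descent with momentum from a small random initialization."""
    rng = np.random.default_rng(seed)
    P = joint_P(X, perplexity)
    Y = rng.normal(scale=1e-2, size=(X.shape[0], dim))
    Y_prev = Y.copy()
    for _ in range(n_iter):
        d2 = np.square(Y[:, None] - Y[None, :]).sum(-1)
        num = 1.0 / (1.0 + d2)                  # Cauchy kernel
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()
        # dC/dy_i = 4 * sum_j (p_ij - q_ij) * num_ij * (y_i - y_j)
        PQ = (P - Q) * num
        grad = 4.0 * (np.diag(PQ.sum(1)) - PQ) @ Y
        Y, Y_prev = Y - lr * grad + momentum * (Y - Y_prev), Y
    return Y
```

 Note the search is over the precision, not the variance directly: larger beta means a narrower Gaussian and lower entropy, so the bracket moves up when the distribution is too flat.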

Dimensionality Reduction by Learning an Invariant Mapping
 Raia Hadsell, Sumit Chopra, Yann LeCun
 Central idea: learn an invariant mapping of the input by minimizing mapped distance (i.e. the distance between outputs) when the samples are categorized as the same (e.g. the same digits in MNIST), and maximizing mapped distance when the samples are categorized as different.
 Two loss functions for same vs different.
 This is an attraction-repulsion spring analogy.
 Use gradient descent to change the weights to satisfy these two competing losses.
 Resulting convolutional neural nets can extract camera pose information from the NORB dataset.

 Surprising how simple analogies like this, when iterated across a great many samples, pull out intuitively correct invariances.
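
 The two competing losses, i.e. the spring analogy, can be sketched as below. The margin value and quadratic form follow the usual contrastive-loss formulation; the specific margin of 1.0 is an illustrative default, not a value taken from the paper's experiments.

```python
import numpy as np

def contrastive_loss(y1, y2, same, margin=1.0):
    """Contrastive loss over a pair of mapped outputs y1, y2.
    Similar pairs are pulled together (a spring with rest length 0);
    dissimilar pairs are pushed apart, but only while they sit
    inside the margin, so far-apart negatives exert no force."""
    d = np.linalg.norm(y1 - y2)
    if same:
        return 0.5 * d ** 2                     # attraction term
    return 0.5 * max(0.0, margin - d) ** 2      # repulsion term
```

 Gradient descent on the network weights then trades these two terms off across many sampled pairs.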

PMID: 15013228 - Information processing, dimensionality reduction, and reinforcement learning in the basal ganglia (2003)
 long paper! looks like they used LaTeX.

 they focus on a 'new model' for the basal ganglia: reinforcement driven dimensionality reduction (RDDR)
 in order to make sense of the system - according to them - any model must ignore huge amounts of information about the studied areas.
 ventral striatum = nucleus accumbens!
 striatum is broken into two, rough, parts: ventral and dorsal
 dorsal striatum: comprises the caudate and putamen.
 ventral striatum: the nucleus accumbens, medial and ventral portions of the caudate and putamen, and striatal cells of the olfactory tubercle (!) and anterior perforated substance.
 ~90% of neurons in the striatum are medium spiny neurons
 dendrites fill 0.5mm^3
 cells have up and down states.
 the states are controlled by intrinsic connections
 project to GPe GPi & SNr (primarily), using GABA.
 1-2% of neurons in the striatum are tonically active neurons (TANs)
 use acetylcholine (among others)
 fewer spines
 more sensitive to input
 TANs encode information relevant to reinforcement or incentive behavior