 m8ta
{1546} ref: -1992 tags: Linsker infomax Hebbian anti-hebbian linear perceptron unsupervised learning date: 08-04-2021 00:20 gmt revision:2 [head]

Ralph Linsker, 1992. A development upon {1545} -- this time with lateral inhibition trained through noise-contrast and anti-Hebbian plasticity. {1545} does not perfectly maximize the mutual information between the input and output -- this allegedly requires the inverse of the covariance matrix, $Q$.

As before, infomax principles: maximize mutual information $MI \propto H(Z) - H(Z | S)$, where Z is the network output and S is the signal input. (Note: this means minimizing the conditional entropy of the output given the input.)

For a gaussian variable, $H = \frac{1}{2} \ln \det Q$ (up to an additive constant), where Q is the covariance matrix. In this case $Q = E[Z Z^T]$. Since $Z = C(S,N)$, where C are the weights, S is the signal, and N is the noise, $Q = C q C^T + r$, where q is the covariance matrix of the input (signal plus input noise) and r is the cov.mtx. of the output noise.

(Somewhat confusing): $\partial H / \partial C = Q^{-1} C q$, because ... the derivative of the determinant is complicated. Check the appendix of the paper for the derivation. $\ln \det Q = Tr \ln Q$ and $dH = \frac{1}{2} d(Tr \ln Q) = \frac{1}{2} Tr( Q^{-1} dQ )$ -- this holds for positive definite matrices like Q (a singular Q would make $\ln \det Q$ diverge).

From this he comes up with a set of rules whereby feedforward weights are trained in a Hebbian fashion, but based on activity after lateral activation. The lateral activation has a weight matrix $F = I - \alpha Q$ (again Q is the cov.mtx. of Z). If $y(0) = Y; y(t+1) = Y + F y(t)$, where Y is the feed-forward activation, then $\alpha y(\infty) = Q^{-1} Y$. This checks out:

    x = randn(1000, 10);
    Q = x' * x;            % unnormalized covariance of x, 10 x 10
    a = 0.001;             % alpha
    Y = randn(10, 1);      % feed-forward activation
    y = zeros(10, 1);
    for i = 1:1000
        y = Y + (eye(10) - a*Q)*y;   % lateral iteration with F = I - a*Q
    end
    y - pinv(Q)*Y / a      % should be zero

This recursive definition is from Jacobi. $\alpha y(\infty) = \alpha \sum_{t=0}^{\infty} F^t Y = \alpha (I - F)^{-1} Y = Q^{-1} Y$. (The series converges only if the spectral radius of F is below 1, i.e. $0 < \alpha \lambda < 2$ for every eigenvalue $\lambda$ of Q.)
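The gradient quoted earlier in this entry, $\delta H / \delta C = Q^{-1} C q$, can be recovered in a few lines. This is only a sketch, not Linsker's appendix verbatim. It assumes $H = \frac{1}{2} \ln \det Q$ with $Q = C q C^T + r$, that q and Q are symmetric, that r does not depend on C, and the convention that $dH = Tr(G^T dC)$ implies $\partial H / \partial C = G$. The `align*` environment assumes amsmath.

```latex
\begin{align*}
dQ &= dC\, q\, C^T + C\, q\, dC^T \\
dH &= \tfrac{1}{2}\,\mathrm{Tr}\!\left(Q^{-1}\, dQ\right)
    = \tfrac{1}{2}\,\mathrm{Tr}\!\left(Q^{-1}\, dC\, q\, C^T\right)
    + \tfrac{1}{2}\,\mathrm{Tr}\!\left(Q^{-1}\, C\, q\, dC^T\right) \\
   &= \mathrm{Tr}\!\left(q\, C^T Q^{-1}\, dC\right) \\
\frac{\partial H}{\partial C} &= \left(q\, C^T Q^{-1}\right)^T = Q^{-1}\, C\, q
\end{align*}
```

The two trace terms in the second line are equal: the trace is invariant under transposition and cyclic permutation, and both $Q^{-1}$ and q are symmetric. That is where the factor of 1/2 disappears.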
Still, you need to estimate Q through a running-average, $\Delta Q_{nm} = \frac{1}{M}( Y_n Y_m + r_{nm} - Q_{nm} )$, and since $F = I - \alpha Q$, F is formed via anti-Hebbian terms. To this is added a 'sensing' learning and 'noise' unlearning phase -- one optimizes $H(Z)$, the other minimizes $H(Z|S)$.

Everything is then applied, similar to before, to a gaussian-filtered one-dimensional white-noise stimulus. He shows this results in bandpass filter behavior -- quite weak sauce in an era where ML papers are expected to test on five or so datasets. Even though this was 1992 (nearly thirty years ago!), it would have been nice to see this applied to a more realistic dataset; perhaps some of the following papers? Olshausen & Field came out in 1996 -- but they applied their algorithm to real images. In both Olshausen & this work, no affordances are made for multiple layers. There have to be solutions out there...

{1545} ref: -1988 tags: Linsker infomax linear neural network hebbian learning unsupervised date: 08-03-2021 06:12 gmt revision:2 [head]

Ralph Linsker, 1988. One of the first (verbose, slightly diffuse) investigations of the properties of linear projection neurons (e.g. dot-product; no non-linearity) to express useful tuning functions. 'Useful' is here information-preserving, in the face of noise or dimensional bottlenecks (like PCA).

Starts with Hebbian learning functions, and shows that with this + white-noise sensory input + some local topology, you can get simple and complex visual cell responses. Ralph notes that neurons in primate visual cortex are tuned in utero -- prior to real-world visual experience! Wow. (Who did these studies?) This is a very minimalistic starting point; there aren't even structured stimuli (!) Single neuron (and later, multiple neurons) are purely feed-forward; author cautions that a lack of feedback is not biologically realistic. Also note that this was back in the Motorola 680x0 days ...
computers were not that powerful (but certainly could handle more than 1-2 neurons!)

Linear algebra shows that Hebbian synapses cause a linear layer to learn the covariance function of their inputs, $Q$, with no dependence on the actual layer activity. When looked at in terms of an energy function, this is equivalent to gradient descent on that energy, which maximizes the layer-output variance.

He also hits on:
- Hopfield networks, PCA, Oja's constrained Hebbian rule $\delta w_i \propto \langle L_2 (L_1 - L_2 w_i) \rangle$ (that is, a quadratic constraint on the weight to make $\sum_i w_i^2 \sim 1$).
- Optimal linear reconstruction in the presence of noise.
- Mutual information between layer input and output (I found this to be a bit hand-wavey). Yet he notes critically: "but it is not true that maximum information rate and maximum activity variance coincide when the probability distribution of signals is arbitrary". Indeed. The world is characterized by very non-Gaussian structured sensory stimuli.
- Redundancy and diversity in a 2-neuron coding model.
- Role of infomax in maximizing the determinant of the weight matrix, sorta.

One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit-assignment is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both this and infomax are maybe avoiding the problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits, and motivated by statistical reality, to find unsupervised-learning networks? You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't per se put a lot of weight in the infomax principle.
It could be useful, but is only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient. Lol.

{1492} ref: -2016 tags: spiking neural network self supervised learning date: 12-10-2019 03:41 gmt revision:2 [head]

This is a meandering, somewhat long-winded, and complicated paper, even for the journal Science. It's not been cited a great many times, but none-the-less is of interest. The goal of the derived network is to detect fixed-pattern presynaptic sequences, and fire a prespecified number of spikes to each occurrence.

One key innovation is the use of a spike-threshold-surface (STS) for a 'tempotron', the derivative of which is used to update the weights of synapses after trials. As the author says, spikes are hard to differentiate; the STS makes this more possible. This is hence standard gradient descent: if the neuron missed a spike then the weight is increased based on aggregate STS (for the whole trial -- hence the neuron / SGD has to perform temporal and spatial credit assignment). As is common, the SGD is augmented with a momentum term.

Since STS differentiation is biologically implausible -- where would the memory lie? -- he also implements a correlational synaptic eligibility trace. The correlation is between the postsynaptic voltage and the EPSC, which seems kinda circular. Unsurprisingly, it does not work as well as the SGD approximation. But it does work...

The second innovation is the incorporation of self-supervised learning: a 'supervisory' neuron integrates the activity of a number (50) of feature detector neurons, and reinforces them to basically all fire at the same event, WTA style. This effects an unsupervised feature detection. This system can be used with sort-of lateral inhibition to reinforce multiple features. Not so dramatic -- continuous feature maps.

Editorializing a bit: I said this was interesting, but why?
The first part of the paper is another form of SGD, albeit in a spiking neural network, where the gradient is harder to compute and hence is done numerically. It's the aggregate part that is new -- pulling in repeated patterns through synaptic learning rules. Of course, to do this, the full trace of pre and post synaptic activity must be recorded (??) for estimating the STS (I think). An eligibility trace moves in the right direction as a biologically plausible approximation, but as always nothing matches the precision of SGD. Can the eligibility trace be amended with e.g. neuromodulators to push the performance near that of SGD?

The next step of adding self supervised singular and multiple features is perhaps toward the way the brain organizes itself -- small local feedback loops. These features annotate repeated occurrences of stimuli, or tile a continuous feature space. Still, the fact that I haven't seen any follow-up work is suggestive...

Editorializing further, there is a limited quantity of work that a single human can do. In this paper, it's a great deal of work, no doubt, and the author offers some good intuitions for the design decisions. Yet still, the total complexity that even a very determined individual can amass is limited, and likely far below the structural complexity of a mammalian brain. This implies that inference either must be distributed and compositional (the normal path of science), or the process of evaluating & constraining models must be significantly accelerated. This latter option is appealing, as current progress in neuroscience seems highly technology-limited -- old results become less meaningful when the next wave of measurement tools comes around, irrespective of how much work went into them. (Though: the impetus for measuring a particular thing in biology is only discovered through these 'less meaningful' studies...).
A third option, perhaps one which many theoretical neuroscientists believe in, is that there are some broader, physics-level organizing principles to the brain. Karl Friston's free energy principle is a good example of this. Perhaps at a meta level some organizing theory can be found, or likely a set of theories; but IMHO, you'll need at least one theory per brain area, just the same as each area is morphologically, cytoarchitecturally, and topologically distinct. (There may be only a few theories of the cortex, despite all the areas, which is why so many are eager to investigate it!)

So what constitutes a theory? Well, you have to meaningfully describe what a brain region does. (Why is almost as important; how is more important to the path there.) From a sensory standpoint: what information is stored? What processing gain is enacted? How does the stored information impress itself on behavior? From a motor standpoint: how are goals selected? How are the behavioral segments to attain them sequenced? Is the goal / behavior even a reasonable way of factoring the problem?

Our dual problem, building the bridge from the other direction, is perhaps easier. Or it could be that a lot more money has gone into it. Either way, much progress has been made in AI. One arm is deep function approximation / database compression for fast and organized indexing, aka deep learning. Many people are thinking about that; no need to add to the pile; anyway, as OpenAI has proven, the common solution to many problems is to simply throw more compute at it. A second is deep reinforcement learning, which is hideously sample and path inefficient, hence ripe for improvement. One side is motor: rather than indexing raw motor variables (LRUD in a video game, or joint torques with a robot ...) you can index motor primitives, perhaps hierarchically built; likewise, for the sensory input, the model needs to infer structure about the world.
This inference should decompose overwhelming sensory experience into navigable causes ... But how can we do this decomposition? The cortex is more than adept at it, but now we're at the original problem, one that the paper above purports to make a stab at.

{1454} ref: -2011 tags: Andrew Ng high level unsupervised autoencoders date: 03-15-2019 06:09 gmt revision:7 [head]

Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng

Input data: 10M random 200x200 frames from youtube. Each video contributes only one frame. Used local receptive fields, to reduce the communication requirements. 1000 computers, 16 cores each, 3 days.

"Strongly influenced by" Olshausen & Field {1448} -- but this is limited to a shallow architecture. Lee et al 2008 show that stacked RBMs can model simple functions of the cortex. Lee et al 2009 show that a convolutional DBN trained on faces can learn a face detector.

Their architecture: sparse deep autoencoder with
- Local receptive fields: each feature of the autoencoder can connect to only a small region of the lower layer (i.e. non-convolutional). Purely linear layer. More biologically plausible & allows the learning of more invariances other than translational invariances (Le et al 2010). No weight sharing means the network is extra large == 1 billion weights. Still, the human visual cortex is about a million times larger in neurons and synapses.
- L2 pooling (Hyvarinen et al 2009), which allows the learning of invariant features. I.e. this is the square root of the sum of the squares of its inputs. Square root nonlinearity.
- Local contrast normalization -- subtractive and divisive (Jarrett et al 2009).

Encoding weights $W_1$ and decoding weights $W_2$ are adjusted to minimize the reconstruction error, penalized by 0.1 * the sparse pooling layer activation. The latter term encourages the network to find invariances.
$\min_{W_1, W_2} \sum_{i=1}^m \left( \| W_2 W_1^T x^{(i)} - x^{(i)} \|^2_2 + \lambda \sum_{j=1}^k \sqrt{\epsilon + H_j (W_1^T x^{(i)})^2} \right)$

$H_j$ are the weights to the j-th pooling element, $\lambda = 0.1$; m examples; k pooling units. This is also known as reconstruction Topographic Independent Component Analysis. Weights are updated through asynchronous SGD. Minibatch size 100. Note deeper autoencoders don't fare consistently better.

{20} ref: bookmark-0 tags: neural_networks machine_learning matlab toolbox supervised_learning PCA perceptron SOM EM date: 0-0-2006 0:0 revision:0 [head]

http://www.ncrg.aston.ac.uk/netlab/index.php n.b. kinda old. (or does that just mean well established?)