{1545}
ref: -1988 tags: Linsker infomax linear neural network hebbian learning unsupervised date: 08-03-2021 06:12 gmt revision:2 [1] [0] [head]

Self-organization in a perceptual network

  • Ralph Linsker, 1988.
  • One of the first (verbose, slightly diffuse) investigations of the properties of linear projection neurons (e.g. dot-product; no non-linearity) to express useful tuning functions.
  • 'Useful' here means information-preserving, in the face of noise or dimensional bottlenecks (like PCA).
  • Starts with Hebbian learning rules, and shows that with these + white-noise sensory input + some local topology, you can get simple- and complex-cell-like visual responses.
    • Ralph notes that neurons in primate visual cortex are tuned in utero -- prior to real-world visual experience! Wow. (Who did these studies?)
    • This is a very minimalistic starting point; there isn't even structured stimuli (!)
    • Single neuron (and later, multiple neurons) are purely feed-forward; author cautions that a lack of feedback is not biologically realistic.
      • Also note that this was back in the Motorola 680x0 days ... computers were not that powerful (but certainly could handle more than 1-2 neurons!)
  • Linear algebra shows that Hebbian synapses cause a linear layer to learn the covariance function of their inputs, $Q$, with no dependence on the actual layer activity.
  • When looked at in terms of an energy function, this is equivalent to gradient descent to maximize the layer-output variance.
  • He also hits on:
    • Hopfield networks,
    • PCA,
    • Oja's constrained Hebbian rule $\delta w_i \propto \langle L_2(L_1 - L_2 w_i) \rangle$ (that is, a quadratic constraint on the weights to keep $\Sigma w^2 \sim 1$; see the sketch after this list)
    • Optimal linear reconstruction in the presence of noise
    • Mutual information between layer input and output (I found this to be a bit hand-wavey)
      • Yet he notes critically: "but it is not true that maximum information rate and maximum activity variance coincide when the probability distribution of signals is arbitrary".
        • Indeed. The world is characterized by very non-Gaussian structured sensory stimuli.
    • Redundancy and diversity in 2-neuron coding model.
    • Role of infomax in maximizing the determinant of the weight matrix, sorta.
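
As an aside, a minimal numpy sketch of Oja's rule (my own illustration, not Linsker's simulations): a single linear neuron trained this way converges to the principal eigenvector of the input covariance $Q$, i.e. the direction of maximal output variance.

```python
import numpy as np

# Oja's rule for a single linear neuron: dw ~ y * (x - y * w),
# i.e. Hebbian growth plus an implicit constraint keeping ||w|| ~ 1.
rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0],
              [1.0, 1.0]])                        # input covariance (arbitrary choice)
X = rng.multivariate_normal([0.0, 0.0], Q, size=20000)

w = rng.normal(size=2) * 0.1
eta = 0.005
for x in X:
    y = w @ x                                     # linear output (L2 = w . L1)
    w += eta * y * (x - y * w)                    # Oja's constrained Hebbian update

evals, evecs = np.linalg.eigh(Q)
print(w / np.linalg.norm(w))                      # ~ +/- the principal eigenvector of Q
print(evecs[:, np.argmax(evals)])
```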

One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit-assignment is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both this and infomax are maybe avoiding the problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits, and motivated by statistical reality, to find unsupervised-learning networks?

You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't per se put a lot of weight in the infomax principle. It could be useful, but is only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient.

Lol.

{1544}
ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 07-21-2021 16:28 gmt revision:4 [3] [2] [1] [0] [head]

The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Bialek (spikes..) et al in 1999, and aims to maximize the mutual information between the hidden representation and the labels while minimizing the mutual information between the representation and the layer input:

$\min_{P(T_i | X)} I(X; T_i) - \beta \, I(T_i; Y)$

Where $T_i$ is the hidden representation at layer i (later the output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that

$HSIC(D) = (m-1)^{-2} \, tr(K_X H K_Y H)$

Where $D = \{(x_1,y_1), ..., (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, the kernel function applied to all pairs of (vectoral) input variables. H is the centering matrix. The kernel is simply a Gaussian kernel, $k(x,y) = exp(-\frac{1}{2} ||x-y||^2 / \sigma^2)$. So, if all the x and y are on average independent, then the inner product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and negative, I think). In practice they use three different widths for their kernel, and they also center the kernel matrices.
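
To make the estimator concrete, here is a minimal numpy sketch of the (biased) HSIC with a Gaussian kernel; the median-heuristic kernel width and the toy data are my own choices, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(Z, sigma=None):
    # Pairwise squared distances between rows of Z, then the Gaussian (RBF) kernel.
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    if sigma is None:                      # median heuristic for the kernel width
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    return np.exp(-0.5 * d2 / sigma**2)

def hsic(X, Y):
    # Biased estimator: HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m    # centering matrix
    return np.trace(gaussian_kernel(X) @ H @ gaussian_kernel(Y) @ H) / (m - 1)**2

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
print(hsic(X, rng.normal(size=(256, 10))))                    # independent: ~0
print(hsic(X, X[:, :5] + 0.1 * rng.normal(size=(256, 5))))    # dependent: >> 0
```

Replacing $I()$ with this estimator, each layer's objective is then (roughly) to minimize $HSIC(X, T_i) - \beta \, HSIC(T_i, Y)$.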

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...

For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks albeit in a much less intelligible way.

{1543}
ref: -2019 tags: backprop neural networks deep learning coordinate descent alternating minimization date: 07-21-2021 03:07 gmt revision:1 [0] [head]

Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

  • This paper is sort-of interesting: rather than back-propagating the errors, you optimize auxiliary variables -- pre-nonlinearity 'codes' -- in a last-to-first layer order. The optimization is done to minimize a multinomial logistic loss function; the math is not worked out for other loss functions, but presumably this is not a fundamental limit. The loss function also includes a quadratic term on the weights. (A rough sketch follows this list.)
  • After the 'codes' are set, optimization can proceed in parallel on the weights. This is done with either straight SGD or adaptive ADAM.
  • Weight L2 penalty is scheduled over time.
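
A rough sketch of the alternating-minimization idea on a two-layer net -- my own toy version with a quadratic loss and batch least-squares weight updates, not the paper's online multinomial-logistic formulation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 20))              # inputs
Y = rng.normal(size=(128, 5))               # targets (toy regression stand-in)
W1 = rng.normal(size=(20, 30)) * 0.1
W2 = rng.normal(size=(30, 5)) * 0.1
mu = 1.0                                    # penalty tying codes to the layer below

for epoch in range(50):
    # 1) Code update: optimize the pre-nonlinearity codes Z
    #    (here a few gradient steps, warm-started from the forward pass).
    Z = X @ W1
    for _ in range(10):
        grad = mu * (Z - X @ W1) + ((relu(Z) @ W2 - Y) @ W2.T) * (Z > 0)
        Z -= 0.05 * grad
    # 2) Weight update: with Z fixed, each layer is an independent least-squares
    #    problem and could be solved in parallel.
    W1 = np.linalg.lstsq(X, Z, rcond=None)[0]
    W2 = np.linalg.lstsq(relu(Z), Y, rcond=None)[0]

print(np.mean((relu(X @ W1) @ W2 - Y) ** 2))   # loss decreases without backprop
```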

This is interesting in that the weight updates can be done in parallel - perhaps more efficient - but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to auto-diff + backprop, I can't see this being adopted broadly.

That said, the idea of alternating minimization (which is used eg for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.

{1542}
ref: -0 tags: gpu burn stress test github cuda date: 07-13-2021 21:32 gmt revision:0 [head]

https://github.com/wilicc/gpu-burn Multi-GPU stress test.

Are your GPUs overclocked to the point of overheating / being unreliable?

{1541}
ref: -0 tags: machine learning blog date: 04-22-2021 15:43 gmt revision:0 [head]

Paper notes by Vitaly Kurin

Like this blog but 100% better!

{1540}
ref: -2020 tags: feedback alignment local hebbian learning rules stanford date: 04-22-2021 03:26 gmt revision:0 [head]

Two Routes to Scalable Credit Assignment without Weight Symmetry

This paper looks at five different learning rules, three purely local, and two non-local, to see if they can work as well as backprop in training a deep convolutional net on ImageNet. The local learning networks all feature forward weights W and backward weights B; the forward weights (+ nonlinearities) pass the information to lead to a classification; the backward weights pass the error, which is used to locally adjust the forward weights.

Hence, each fake neuron has locally the forward activation, the backward error (or loss gradient), the forward weight, backward weight, and Hebbian terms thereof (e.g the outer product of the in-out vectors for both forward and backward passes). From these available variables, they construct the local learning rules:

  • Decay (exponentially decay the backward weights)
  • Amp (Hebbian learning)
  • Null (decay based on the product of the weight and local activation; this effects a Euclidean norm on reconstruction)

Each of these serves as a "regularizer term" on the feedback weights, which governs their learning dynamics. In the case of backprop, the backward weights B are just the instantaneous transpose of the forward weights W. A good local learning rule approximates this transpose progressively. They show that, with proper hyperparameter setting, this does indeed work nearly as well as backprop when training a ResNet-18 network.

But, hyperparameter settings don't translate to other network topologies. To allow this, they add in non-local learning rules:

  • Sparse (penalizes the Euclidean norm of the previous layer; the gradient is the outer product of the current layer activation with its transpose, times B)
  • Self (directly measures the forward weights and uses them to update the backward weights)

In "Symmetric Alignment", the Self and Decay rules are employed. This is similar to backprop (the backward weights will track the forward ones) with L2 regularization, which is not new. It performs very similarly to backprop. In "Activation Alignment", Amp and Sparse rules are employed. I assume this is supposed to be more biologically plausible -- the Hebbian term can track the forward weights, while the Sparse rule regularizes and stabilizes the learning, such that overall dynamics allow the gradient to flow even if W and B aren't transposes of each other.

Surprisingly, they find Symmetric Alignment to be more robust to the injection of Gaussian noise during training than backprop. Both SA and AA achieve similar accuracies on the ResNet benchmark. The authors then go on to explain the plausibility of non-local but approximate learning rules with Regression discontinuity design ala Spiking allows neurons to estimate their causal effect.


This is a decent paper, reasonably well written. They thought through what variables are available to affect learning, and parameterized five combinations that work. Could they have done the full matrix of combinations, optimizing them just as they did the metaparameters? Perhaps, but that would be even more work ...

Regarding the desire to reconcile backprop and biology, this paper does not bring us much (if at all) closer. Biological neural networks have specific and local uses for error; even invoking 'error' has limited explanatory power on activity. Learning and firing dynamics, of course of course. Is the brain then just an overbearing mess of details and overlapping rules? Yes, probably, but that doesn't mean that we humans can't find something simpler that works. The algorithms in this paper, for example, are well described by a bit of linear algebra, and yet they are performant.

{1539}
ref: -0 tags: saab EPC date: 03-22-2021 01:29 gmt revision:0 [head]

https://webautocats.com/epc/saab/sbd/ -- Online, free parts look-up for Saab cars. Useful.

{1538}
ref: -2010 tags: neural signaling rate code patch clamp barrel cortex date: 03-18-2021 18:41 gmt revision:0 [head]

PMID-20596024 Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex

  • How did I not know of this paper before.
  • Solid study showing that, while a single spike can elicit 28 spikes in post-synaptic neurons, the associated level of noise is indistinguishable from intrinsic noise.
  • Hence the cortex should communicate / compute in rate codes or large synchronized burst firing.
    • They found large bursts to be infrequent, timing precision to be low, hence rate codes.
    • Of course other examples, e.g auditory cortex, exist.

Cortical reliability amid noise and chaos

  • Noise is primarily of synaptic origin. (Dropout)
  • Recurrent cortical connectivity supports sensitivity to precise timing of thalamocortical inputs.

{1537}
ref: -0 tags: cortical computation learning predictive coding reviews date: 02-23-2021 20:15 gmt revision:2 [1] [0] [head]

PMID-30359606 Predictive Processing: A Canonical Cortical Computation

  • Georg B Keller, Thomas D Mrsic-Flogel
  • Their model includes two error signals: positive and negative, for reconciling the sensory experience with the top-down predictions. I haven't read the full article, and disagree that such errors are explicit in the form of neurons, but the model is plausible. Hence worth recording the paper here.

PMID-23177956 Canonical microcircuits for predictive coding

  • Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, Karl J Friston
  • We revisit the established idea that message passing among hierarchical cortical areas implements a form of Bayesian inference, paying careful attention to the implications for intrinsic connections among neuronal populations.
  • Have these algorithms been put to practical use? I don't know...

Control of synaptic plasticity in deep cortical networks

  • Pieter R. Roelfsema & Anthony Holtmaat
  • Basically argue for a many-factor learning rule at the feedforward and feedback synapses, taking into account pre, post, attention, and reinforcement signals.
  • See comment by Tim Lillicrap and Blake Richards.

{1536}
ref: -0 tags: protein engineering structure evolution date: 02-23-2021 19:57 gmt revision:1 [0] [head]

From Protein Structure to Function with Bioinformatics

  • Dense and useful resource!
  • Few new folds have been discovered since 2010 -- the total number of extant protein folds is around 100,000. Evolution re-uses existing folds + the protein fold space is highly convergent. Amazing. link

{1532}
ref: -2013 tags: larkum calcium spikes dendrites association cortex binding date: 02-23-2021 19:52 gmt revision:3 [2] [1] [0] [head]

PMID-23273272 A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex

  • Distal tuft dendrites have a second spike-initiation zone, where depolarization can induce a calcium plateau of up to 50ms long.  This depolarization can cause multiple spikes in the soma, and can be more effective at inducing spikes than depolarization through the basal dendrites.  Such spikes are frequently bursts of 2-4 at 200hz. 
  • Bursts of spikes can also be triggered by backpropagation activated calcium (BAC), which can halve the current threshold for a dendritic spike. That is, there is enough signal propagation for information to propagate both down the dendritic arbor and up, and the two interact non-linearly.
  • This nonlinear calcium-dependent association pairing can be blocked by inhibition to the dendrites (presumably apical?). 
    • Larkum argues that the different timelines of GABA inhibition offer 'exquisite control' of the dendrites; but these sorts of arguments as to computational power always seem lame compared to stating what their actual role might be. 
  • Quote: "Dendritic calcium spikes have been recorded in vivo [57, 84, 85] that correlate to behavior [78, 86]." The recordings are population-level, though, and do not seem to measure individual dendrites (?).

See also:

PMID-25174710 Sensory-evoked LTP driven by dendritic plateau potentials in vivo

  • We demonstrate that rhythmic sensory whisker stimulation efficiently induces synaptic LTP in layer 2/3 (L2/3) pyramidal cells in the absence of somatic spikes.
  • It instead depends on NMDA-dependent dendritic spikes.
  • And this is dependent on afferents from the POm thalamus.

And: The binding solution?, a blog post covering Bittner 2015 that looks at rapid dendritic plasticity in the hippocampus as a means of binding stimuli to place fields.

{1523}
ref: -0 tags: tennenbaum compositional learning character recognition one-shot learning date: 02-23-2021 18:56 gmt revision:2 [1] [0] [head]

One-shot learning by inverting a compositional causal process

  • Brenden Lake, Russ Salakhutdinov, Josh Tenenbaum
  • This is the paper that preceded the 2015 Science publication "Human-level concept learning through probabilistic program induction"
  • Because it's a NIPS paper, and not a science paper, this one is a bit more accessible: the logic to the details and developments is apparent.
  • General idea: build up a fully probabilistic model of multi-language (omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via bezier splines), where strokes attach to others (spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probs are left to the supplemental material.
  • They fit the complete model to the Omniglot data using gradient descent + image-space noising, e.g tweak the free parameters of the model to generate images that look like the human created characters. (This too is in the supplement).
  • Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
    • The probabilistic model then assigns a log-likelihood to each of the parses.
    • They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse -- and they additionally sample $\psi$ (the character type) to get a greater weighted diversity of types.
      • Surprisingly, they don't estimate the image likelihood - which is expensive - they here just re-do the parsing based on aggregate info embedded in the statistical model. Clever.
  • $\psi$ is the character type (a, b, c..), $\psi = \{ \kappa, S, R \}$ where $\kappa$ is the number of strokes, S is a set of parameterized strokes, R are the relations between strokes.
  • $\theta$ are the per-token stroke parameters.
  • $I$ is the image, obvi.
  • Classification task: one image of a new character (c) vs. 20 new characters from the same alphabet (test, (t)). Among the 20 there is one character of the same type -- the task is to find it.
  • With 'hierarchical bayesian program learning', they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters using gradient descent to the image.
    • Subsequently parses the test image onto the class image (c)
    • Hence the best classification is the one where both are in the best agreement: $\underset{c}{argmax} \; \frac{P(c|t)}{P(c)} P(t|c)$ where $P(c)$ is approximated as the parse weights.
      • Again, this is clever as it allows significant information leakage between (c) and (t) ...
      • The other models (Affine, Deep Boltzman Machines, Hierarchical Deep Model) have nothing like this -- they are feed-forward.
  • No wonder HBPL performs better. It's a better model of the data, that has a bidirectional fitting routine.

  • As i read the paper, had a few vague 'hedons':
    • Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion many fitting and inference problems are solved. (Such is my intuition)
      • As a corollary of this, having both forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
    • The fitting process has to be multi-pass or at least re-entrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
    • The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly re-entrant to support hierarchical planning ...

{1526}
ref: -0 tags: neuronal assemblies maass hebbian plasticity simulation austria fMRI date: 02-23-2021 18:49 gmt revision:1 [0] [head]

PMID-32381648 A model for structured information representation in neural networks in the brain

  • Using randomly connected E/I networks, suggests that information can be "bound" together using fast Hebbian STDP.
  • That is, 'assemblies' in higher-level areas reference sensory information through patterns of bidirectional connectivity.
  • These patterns emerge spontaneously following disinhibition of the higher-level areas.
  • I find the results underwhelming, but the discussion is more interesting.
    • E.g. there have been a lot of theoretical and computational-experimental work for how concepts are bound together into symbols or grammars.
    • The referenced fMRI studies are interesting, too: they imply that you can observe the results of structural binding in activity of the superior temporal gyrus.
  • I'm more in favor of dendritic potentials or neuronal up/down states to be a fast and flexible way of maintaining 'symbol membership' --
    • But it's not as flexible as synaptic plasticity, which, obviously, populates the outer product between 'region a' and 'region b' with a memory substrate, thereby spanning the range of plausible symbol-bindings.
    • Inhibitory interneurons can then gate the bindings, per morphological evidence.
    • But but, I don't think anyone has shown that you need protein synthesis for perception, as you do for LTP (modulo AMPAR cycling).
      • Hence I'd argue that localized dendritic potentials can serve as the flexible outer-product 'memory tag' for presence in an assembly.
        • Or maybe they are used primarily for learning, who knows!

{1535}
ref: -2019 tags: deep double descent lottery ticket date: 02-23-2021 18:47 gmt revision:2 [1] [0] [head]

Reconciling modern machine-learning practice and the classical bias–variance trade-off

A formal publication of the effect famously discovered at OpenAI & publicized on their blog. Goes into some details on Fourier features & runs experiments to verify the OpenAI findings. The result stands.

An interesting avenue of research is using genetic algorithms to perform the search over neural network parameters (instead of backprop) in reinforcement-learning tasks. Ben Phillips has a blog post on some of the recent results, which show that it does work for certain 'hard' problems in RL. Of course, this is the dual of the 'lottery ticket' hypothesis and the deep double descent, above; large networks are likely to have solutions 'close enough' to solve a given problem.

That said, genetic algorithms don't necessarily perform gradient descent to tweak the weights for optimal behavior once they are within the right region of RL behavior. See {1530} for more discussion on this topic, as well as {1525} for a more complete literature survey.

{1534}
ref: -2020 tags: current opinion in neurobiology Kriegeskorte review article deep learning neural nets circles date: 02-23-2021 17:40 gmt revision:2 [1] [0] [head]

Going in circles is the way forward: the role of recurrence in visual inference

I think the best part of this article are the references -- a nicely complete listing of, well, the current opinion in Neurobiology! (Note that this issue is edited by our own Karel Svoboda, hence there are a good number of Janelians in the author list..)

The gestalt of the review is that deep neural networks need to be recurrent, not purely feed-forward. This results in savings in overall network size, and an increase in the achievable computational complexity, perhaps via the incorporation of priors and temporal-spatial information. All this again makes perfect sense and matches my sense of prevailing opinion. Of course, we are left wanting more: all this recurrence ought to be structured in some way.

To me, a rather naive way of thinking about it is that feed-forward layers cause weak activations, which are 'amplified' or 'selected for' in downstream neurons. These neurons proximally code for 'causes' or local reasons, based on the supported hypothesis that the brain has a good temporal-spatial model of the visuo-motor world. The causes then can either explain away the visual input, leading to balanced E-I, or fail to explain it, in which case the excess activity is either rectified by engaging more circuits or by engaging synaptic plasticity.

A critical part of this hypothesis is some degree of binding / disentanglement / spatio-temporal re-assignment. While not all models of computation require registers / variables -- RNNs are Turing-complete, e.g. -- I remain stuck on the idea that, to explain phenomenological experience and practical cognition, the brain must have some means of 'binding'. A reasonable place to look is the apical tuft dendrites, which are capable of storing temporary state (calcium spikes, NMDA spikes), undergo rapid synaptic plasticity, and are so dense that they can reasonably store the outer-product space of binding.

There is mounting evidence for apical tufts working independently / in parallel from investigations of high-gamma in ECoG: PMID-32851172 Dissociation of broadband high-frequency activity and neuronal firing in the neocortex. "High gamma" shows little correlation with MUA when you differentiate early-deep and late-superficial responses, "consistent with the view it reflects dendritic processing separable from local neuronal firing".

{1533}
ref: -2009 tags: Baldwin effect finches date: 02-22-2021 17:35 gmt revision:0 [head]

Evolutionary significance of phenotypic accommodation in novel environments: an empirical test of the Baldwin effect

Up until reading this, I had thought that the Baldwin effect refers to the fact that when animals gain an ability to learn, this allows them to take new ecological roles without genotypic adaptation. This is a component of the effect, but is not the original meaning, which is opposite: when species adapt to a novel environment through phenotypic adaptation (say adapting to colder weather through within-lifetime variation), evolution tends to push these changes into the germ line. This is something to the effect of Lamarckian evolution.

In the case of house finches, as discussed in the link above, this pertains to increased brood variability and sexual dimorphism due to varied maternal habits and hormones under environmental stress. This variance is then rapidly operated on by natural selection to tune the finch to its new environment, including Montana, where the single author did most of his investigation.

There are of course countless other details here, but still this is an illuminating demonstration of how evolution works to move information into the genome.

{1531}
ref: -2013 tags: synaptic learning rules calcium harris stdp date: 02-18-2021 19:48 gmt revision:3 [2] [1] [0] [head]

PMID-24204224 The Convallis rule for unsupervised learning in cortical networks 2013 - Pierre Yger  1 , Kenneth D Harris

This paper aims to unify and reconcile experimental evidence of in-vivo learning rules with established STDP rules. In particular, the STDP rule fails to accurately predict change in strength in response to spike triplets, e.g. pre-post-pre or post-pre-post. Their model instead involves the competition between two coincidence detectors / threshold circuits with different time constants, one of which controls LTD and the other LTP, and is thus an extension of the classical BCM rule. (BCM: inputs that drive postsynaptic activity below a threshold weaken a synapse; those above it strengthen it -- sketched below.)
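
For reference, a minimal sketch of the classical BCM rule mentioned above (not the Convallis rule itself); the sliding threshold tracks the mean squared postsynaptic rate, and all parameters here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=(10000, 20))          # presynaptic rates, one pattern per row
w = np.abs(rng.normal(size=20)) * 0.1        # non-negative initial weights
theta = 1.0                                  # sliding threshold on postsynaptic activity
eta, tau = 1e-4, 0.01

for xi in x:
    y = w @ xi                               # postsynaptic rate (linear neuron)
    w += eta * xi * y * (y - theta)          # BCM: LTD below theta, LTP above it
    w = np.clip(w, 0.0, None)                # keep weights non-negative
    theta += tau * (y**2 - theta)            # threshold tracks <y^2>

print(w, theta)                              # the threshold settles near <y^2>
```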

They derive the model from an optimization criterion: neurons should try to maximize the skewness of the distribution of their membrane potential -- much time spent either firing spikes or strongly inhibited. This maps to an objective function F that looks like a valley - hence the 'convallis' in the name (Latin for valley); the objective is differentiated to yield a weighting function for weight changes; they also add a shrinkage function (line + Heaviside function) to gate weight changes 'off' at resting membrane potential.

A network of firing neurons successfully groups correlated rate-encoded inputs, better than the STDP rule. It can also cluster auditory inputs of spoken digits converted into a cochleogram. But this all seems relatively toy-like: of course algorithms can associate inputs that co-occur. The same result was found for a recurrent balanced E-I network with the same cochleogram, and Convallis performed better than STDP. Meh.

Perhaps the biggest thing I got from the paper was how poorly STDP fares with spike triplets:

Pre following post does not 'necessarily' cause LTD; it's more complicated than that, and more consistent with the two different-timeconstant coincidence detectors.  This is satisfying as it allows for apical dendritic depolarization to serve as a contextual binding signal - without negatively impacting the associated synaptic weights. 

{1530}
ref: -2017 tags: deep neuroevolution jeff clune Uber genetic algorithms date: 02-18-2021 18:27 gmt revision:1 [0] [head]

Deep Neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning -- Uber AI labs; Jeff Clune.

  • In this paper, they used a (fairly generic) genetic algorithm to tune the weights of a relatively large (4M parameters) convolutional neural net to play 13 atari games. 
  • The GA used truncation selection, population of ~ 1k individuals, no crossover, and gaussian mutation.
  • To speed up and streamline this algo, they encoded the weights not directly but as an initialization seed to the RNG (log2 of the number of parameters, approximately), plus seeds to generate the per-generation mutations (~ 28 bits). This substantially decreased the required storage space and communication costs when running the GA in parallel on their cluster; they only had to transmit the RNG seed sequence (see the sketch after this list).
  • Quite surprisingly, the GA was good at typically 'hard' games like Frostbite and Skiing, whereas it fared poorly on games like Atlantis (which is a fixed-gun shooter game) and Assault.
  • Performance was compared to Deep-Q-networks (DQN), Evolutionary search (which used stochastic gradient approximates), Asynchronous Advantage Actor-critic (A3C), and random search (RS)
  • They surmise that some games were thought to be hard, but are actually fairly easy, albeit with many local minima. This is why search around the origin (near the initialization of the networks, which was via the Xavier method) is sufficient to solve the tasks.
  • Also noted that frequently the GA would find individuals with good performance in ~10 generations, further supporting the point above. 
  • The GA provide very consistent performance across the entirety of a trial, which, they suggest, may offer a cleaner signal to selection as to the quality of each of the individuals (debatable!).
  • Of course, for some tasks, the GA fails woefully; it was not able to quickly learn to control a humanoid robot, which involves mapping a ~370-dimensional vector into ~17 joint torques.  Evolutionary search was able to perform this task, which is not surprising as the gradient here should be smooth.
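
A minimal sketch of that seed-chain encoding (my reconstruction from the description above; the initialization scale and mutation width are placeholders, not the paper's values):

```python
import numpy as np

N_PARAMS = 4_000_000      # ~4M parameters, as in the paper
SIGMA = 0.005             # mutation std (placeholder value)

def decode(seeds):
    # An individual is just a list of RNG seeds; its weight vector is rebuilt by
    # replaying the initialization plus each generation's Gaussian mutation.
    init, *mutations = seeds
    theta = np.random.default_rng(init).normal(size=N_PARAMS) * 0.05   # stand-in for Xavier init
    for s in mutations:
        theta += SIGMA * np.random.default_rng(s).normal(size=N_PARAMS)
    return theta

# A child genome is the parent's seed list plus one new seed; only these few
# integers ever need to be stored or sent between workers.
parent = [12345, 777]
child = parent + [int(np.random.default_rng().integers(2**32))]
theta = decode(child)
print(theta.shape, child)
```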

The result is indeed surprising, but it also feels lazy -- the total effort or information that they put into writing the actual algorithm is small; as mentioned in the introduction, this is a case of old algorithms with modern levels of compute.  Analogously, compare Go-Explore, also by Uber AI labs, vs Agent57 by DeepMind; the Agent57 paper blithely dismisses the otherwise breathless Go-Explore result as feature engineering and unrealistic free backtracking / game-resetting (which is true..) It's strange that they did not incorporate crossover aka recombination, as David MacKay clearly shows that recombination allows for much higher mutation rates and much better transmission of information through a population.  (Chapter 'Why have sex').  They also perhaps more reasonably omit developmental encoding, where network weights are tied or controlled through development, again in an analogy to biology. 

A better solution, as they point out, would be some sort of hybrid GA / ES / A3C system which used both gradient-based tuning, random stochastic gradient-based exploration, and straight genetic optimization, possibly all in parallel, with global selection as the umbrella.  They mention this, but to my current knowledge this has not been done. 

{1529}
ref: -2020 tags: dreamcoder ellis program induction ai date: 02-01-2021 18:39 gmt revision:0 [head]

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

  • Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum

This paper describes a system for adaptively finding programs which succinctly and accurately produce desired output.  These desired outputs are provided by the user / test system, and come from a number of domains:

  • list (as in lisp) processing,
  • text editing,
  • regular expressions,
  • line graphics,
  • 2d lego block stacking,
  • symbolic regression (ish),
  • functional programming,
  • and physical laws.
Some of these domains are naturally toy-like, e.g. the text processing, but others are deeply impressive: the system was able to "re-derive" basic physical laws of vector calculus in the process of looking for S-expression forms of cheat-sheet physics equations. These advancements result from a long lineage of work, perhaps starting from the Helmholtz machine PMID-7584891 introduced by Peter Dayan, Geoff Hinton and others, where one model is trained to generate patterns given context (e.g.) while a second recognition module is trained to invert this model: derive context from the patterns. The two work simultaneously to allow model-exploration in high dimensions.

Also in the lineage is the EC2 algorithm, which most of the same authors above published in 2018. EC2 centers around the idea of "explore - compress": explore solutions to your program induction problem during the 'wake' phase, then compress the observed programs into a library by extracting/factoring out commonalities during the 'sleep' phase. This of course is one of the core algorithms of human learning: explore options, keep track of both what worked and what didn't, search for commonalities among the options & their effects, and use these inferred laws or heuristics to further guide search and goal-setting, thereby building a buffer to attack the curse of dimensionality. Making the inferred laws themselves functions in a programming library allows hierarchically factoring the search task, making exploration of unbounded spaces possible. This advantage is unique to the program synthesis approach.

This much is said in the introduction, though perhaps with more clarity. DreamCoder is an improved, more-accessible version of EC2, though the underlying ideas are the same. It differs in that the method for constructing libraries has improved through the addition of a powerful version space for enumerating and evaluating refactors of the solutions generated during the wake phase. (I will admit that I don't much understand the version space system.) This version space allows DreamCoder to collapse the search space for re-factorings by many orders of magnitude, and seems to be a clear advancement. Furthermore, DreamCoder incorporates a second phase of sleep: "dreaming", hence the moniker. During dreaming the library is used to create 'dreams' consisting of combinations of the library primitives, which are then executed with training data as input. These dreams are then used to train up a neural network to predict which library and atomic objects to use in given contexts. Context in this case is where in the parse tree a given object has been inserted (its parent and which argument number it sits in); how the data-context is incorporated to make this decision is not clear to me (???).

This neural dream and replay-trained neural network is either a GRU recurrent net with 64 hidden states, or a convolutional network feeding into a RNN. The final stage is a linear ReLU (???); again it is not clear how it feeds into the prediction of "which unit to use when". The authors clearly demonstrate that the network, or the probabilistic context-free grammar that it controls (?), is capable of straightforward optimizations, like breaking symmetries due to commutativity, avoiding adding zero, avoiding multiplying by one, etc. Beyond this, they do demonstrate via an ablation study that the presence of the neural network affords significant algorithmic leverage in all of the problem domains tested. The network also seems to learn a reasonable representation of the sub-type of task encountered -- but a thorough investigation of how it works, or how it might be made to work better, remains desired.

I've spent a little time looking around the code, which is a mix of python high-level experimental control code, and lower-level OCaml code responsible for running (emulating) the lisp-like DSL, inferring type in its polymorphic system / reconciling types in evaluated program instances, maintaining the library, and recompressing it using aforementioned version spaces. The code, like many things experimental, is clearly a work in progress, with some old or unused code scattered about, glue to run the many experiments & record / analyze the data, and personal notes from the first author for making his job talks (! :). The description in the supplemental materials, which is satisfyingly thorough (if again impenetrable wrt version spaces), is readily understandable, suggesting that one (presumably the first) author has a clear understanding of the system. It doesn't appear that much is being hidden or glossed over, which is not the case for all scientific papers.


With the caveat that I don't claim to understand the system to completion, there are some clear areas where the existing system could be augmented further. The 'recognition' or perceptual module, which guides actual synthesis of candidate programs, realistically can use as much information as is available in DreamCoder: full lexical and semantic scope, full input-output specifications, type information, possibly runtime binding of variables when filling holes. This is motivated by the way that humans solve problems, at least as observed by introspection:
  • Examine problem, specification; extract patterns (via perceptual modules)
  • Compare patterns with existing library (memory) of compositionally-factored 'useful solutions' (this is identical to the library in DreamCoder)
  • Do something like beam-search or quasi-stochastic search on selected useful solutions. This is the same as DreamCoder, however human engineers make decisions progressively, at runtime so-to-speak: you fill not one hole per cycle, but many holes. The addition of recursion to DreamCoder, provided a wider breadth of input information, could support this functionality.
  • Run the program to observe input-output .. but also observe the inner workings of the program, eg. dataflow patterns.  These dataflow patterns are useful to human engineers when both debugging and when learning-by-inspection what library elements do.   DreamCoder does not really have this facility. 
  • Compare the current program results to the desired program output. Make a stochastic decision whether to try to fix it, or to try another beam in the search. Since this would be on a computer, this could be in parallel (as DreamCoder is); the ability to 'fix' or change a DUT is directly absent in DreamCoder. As a 'deeply philosophical' aside, this loop itself might be the effect of running a language-of-thought program, as was suggested by pioneers in AI (ref). The loop itself is subject to modification and replacement based on goal-seeking success in the domain of interest, in a deeply-satisfying and deeply recursive manner ...
At each stage in the pipeline, the perceptual modules would have access to relevant variables in the current problem-solving context.  This is modeled on Jacques Pitrat's work.  Humans of course are even more flexible than that -- context includes roughly the whole brain, and if anything we're mushy on which level of the hierarchy we are working. 

Critical to making this work is to have, as I've written in my notes many years ago, a 'self compressing and factorizing memory'.  The version space magic + library could be considered a working example of this.  In the realm of ANNs, per recent OpenAI results with CLIP and Dall-E, really big transformers also seem to have strong compositional abilities, with the caveat that they need to be trained on segments of the whole web.  (This wouldn't be an issue here, as Dreamcoder generates a lot of its own training data via dreams).  Despite the data-inefficiency of DNN / transformers, they should be sufficient for making something in the spirit of above work, with a lot of compute, at least until more efficient models are available (which they should be shortly; see AlphaZero vs MuZero). 

{1528}
ref: -2015 tags: olshausen redwood autoencoder VAE MNIST faces variation date: 11-27-2020 03:04 gmt revision:0 [head]

Discovering hidden factors of variation in deep networks

  • Well, they are not really that deep ...
  • Use a VAE to encode both a supervised signal (class labels) as well as unsupervised latents.
  • Penalize a combination of the MSE of reconstruction, logits of the classification error, and a special cross-covariance term to decorrelate the supervised and unsupervised latent vectors.
  • Cross-covariance penalty: penalizes the squared entries of the cross-covariance matrix between the supervised and unsupervised latents, computed over a minibatch (a sketch follows this list).
  • Tested on
    • MNIST -- discovered style / rotation of the characters
    • Toronto faces database -- seven expressions, many individuals; extracted eigen-emotions sorta.
    • Multi-PIE --many faces, many viewpoints ; was able to vary camera pose and illumination with the unsupervised latents.
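
A minimal sketch of one plausible form of such a cross-covariance penalty (my assumption of the general shape; the paper's exact term may differ): the sum of squared entries of the minibatch cross-covariance between the supervised latent y and the unsupervised latent z.

```python
import numpy as np

def xcov_penalty(y, z):
    # Center each latent group over the batch, form the cross-covariance matrix,
    # and penalize all of its entries -- driving the two groups to be decorrelated.
    yc = y - y.mean(axis=0)
    zc = z - z.mean(axis=0)
    C = yc.T @ zc / y.shape[0]
    return 0.5 * np.sum(C**2)

rng = np.random.default_rng(0)
y = rng.normal(size=(64, 10))
print(xcov_penalty(y, rng.normal(size=(64, 6))))                    # independent: small
print(xcov_penalty(y, y[:, :6] + 0.1 * rng.normal(size=(64, 6))))   # correlated: large
```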