m8ta
{1539}
ref: -0 tags: saab EPC date: 03-22-2021 01:29 gmt revision:0 [head]

https://webautocats.com/epc/saab/sbd/ -- Online, free parts look-up for Saab cars. Useful.

{1538}
ref: -2010 tags: neural signaling rate code patch clamp barrel cortex date: 03-18-2021 18:41 gmt revision:0 [head]

PMID-20596024 Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex

  • How did I not know of this paper before.
  • Solid study showing that, while a single spike can elicit 28 spikes in post-synaptic neurons, the associated level of noise is indistinguishable from intrinsic noise.
  • Hence the cortex should communicate / compute in rate codes or large synchronized burst firing.
    • They found large bursts to be infrequent, timing precision to be low, hence rate codes.
    • Of course other examples, e.g. auditory cortex, exist.

Cortical reliability amid noise and chaos

  • Noise is primarily of synaptic origin. (Dropout)
  • Recurrent cortical connectivity supports sensitivity to precise timing of thalamocortical inputs.

{1537}
ref: -0 tags: cortical computation learning predictive coding reviews date: 02-23-2021 20:15 gmt revision:2 [1] [0] [head]

PMID-30359606 Predictive Processing: A Canonical Cortical Computation

  • Georg B Keller, Thomas D Mrsic-Flogel
  • Their model includes two error signals, positive and negative, for reconciling the sensory experience with the top-down predictions (toy sketch below). I haven't read the full article, and I doubt that such errors are explicitly represented by dedicated neurons, but the model is plausible. Hence worth recording the paper here.
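
A minimal toy sketch of the two-error-channel idea as I read it (rectified positive and negative prediction-error units). This is my own illustration, not the authors' model:

```python
import numpy as np

def prediction_errors(sensory, prediction):
    """Split the sensory / top-down mismatch into two rectified channels:
    positive errors (more input than predicted) and negative errors
    (less input than predicted)."""
    diff = np.asarray(sensory) - np.asarray(prediction)
    e_pos = np.maximum(diff, 0.0)   # input exceeds the top-down prediction
    e_neg = np.maximum(-diff, 0.0)  # prediction exceeds the input
    return e_pos, e_neg

e_pos, e_neg = prediction_errors([1.0, 0.2, 0.5], [0.8, 0.4, 0.5])
```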

PMID-23177956 Canonical microcircuits for predictive coding

  • Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, Karl J Friston
  • We revisit the established idea that message passing among hierarchical cortical areas implements a form of Bayesian inference, paying careful attention to the implications for intrinsic connections among neuronal populations.
  • Have these algorithms been put to practical use? I don't know...

Control of synaptic plasticity in deep cortical networks

  • Pieter R. Roelfsema & Anthony Holtmaat
  • Basically argue for a many-factor learning rule at the feedforward and feedback synapses, taking into account pre, post, attention, and reinforcement signals.
  • See comment by Tim Lillicrap and Blake Richards.

{1536}
ref: -0 tags: protein engineering structure evolution date: 02-23-2021 19:57 gmt revision:1 [0] [head]

From Protein Structure to Function with Bioinformatics

  • Dense and useful resource!
  • Few new folds have been discovered since 2010 -- the total number of extant protein folds is around 100,000. Evolution re-uses existing folds + the protein fold space is highly convergent. Amazing. link

{1532}
ref: -2013 tags: larkum calcium spikes dendrites association cortex binding date: 02-23-2021 19:52 gmt revision:3 [2] [1] [0] [head]

PMID-23273272 A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex

  • Distal tuft dendrites have a second spike-initiation zone, where depolarization can induce a calcium plateau of up to 50 ms long.  This depolarization can cause multiple spikes in the soma, and can be more effective at inducing spikes than depolarization through the basal dendrites.  Such spikes are frequently bursts of 2-4 at 200 Hz.
  • Bursts of spikes can also be triggered by backpropagation-activated calcium (BAC), which can halve the current threshold for a dendritic spike. That is, there is enough signal propagation for information to propagate both down the dendritic arbor and up, and the two interact non-linearly.
  • This nonlinear calcium-dependent association pairing can be blocked by inhibition to the dendrites (presumably apical?). 
    • Larkum argues that the different time courses of GABA inhibition offer 'exquisite control' of the dendrites; but these sorts of arguments as to computational power always seem lame compared to stating what their actual role might be.
  • Quote: "Dendritic calcium spikes have been recorded in vivo [57, 84, 85] that correlate to behavior [78, 86]."  The recordings are population-level, though, and do not seem to measure individual dendrites (?).

See also:

PMID-25174710 Sensory-evoked LTP driven by dendritic plateau potentials in vivo

  • We demonstrate that rhythmic sensory whisker stimulation efficiently induces synaptic LTP in layer 2/3 (L2/3) pyramidal cells in the absence of somatic spikes.
  • It instead depends on NMDA-dependent dendritic spikes.
  • And this is dependent on afferents from the POm thalamus.

And: The binding solution?, a blog post covering Bittner 2015 that looks at rapid dendritic plasticity in the hippocampus as a means of binding stimuli to place fields.

{1523}
ref: -0 tags: tennenbaum compositional learning character recognition one-shot learning date: 02-23-2021 18:56 gmt revision:2 [1] [0] [head]

One-shot learning by inverting a compositional causal process

  • Brenden Lake, Russ Salakhutdinov, Josh Tenenbaum
  • This is the paper that preceded the 2015 Science publication "Human-level concept learning through probabilistic program induction"
  • Because it's a NIPS paper, and not a Science paper, this one is a bit more accessible: the logic behind the details and developments is apparent.
  • General idea: build up a fully probabilistic model of multi-language (omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via bezier splines), where strokes attach to others (spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probs are left to the supplemental material.
  • They fit the complete model to the Omniglot data using gradient descent + image-space noising, e.g. tweak the free parameters of the model to generate images that look like the human-created characters. (This too is in the supplement).
  • Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
    • The probabilistic model then assigns a log-likelihood to each of the parses.
    • They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse -- and they extra-sample $\psi$ (the character type) to get a greater weighted diversity of types.
      • Surprisingly, they don't estimate the image likelihood - which is expensive - they here just re-do the parsing based on aggregate info embedded in the statistical model. Clever.
  • $\psi$ is the character type (a, b, c, ...), $\psi = \{ \kappa, S, R \}$ where $\kappa$ is the number of strokes, $S$ is a set of parameterized strokes, and $R$ are the relations between strokes.
  • $\theta$ are the per-token stroke parameters.
  • $I$ is the image, obvi.
  • Classification task: one image of a new character (c) vs. 20 new characters from the same alphabet (test, (t)). Among the 20 there is one character of the same type -- the task is to find it.
  • With 'hierarchical bayesian program learning', they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters using gradient descent to the image.
    • Subsequently parses the test image onto the class image (c)
    • Hence the best classification is the one where both are in the best agreement: $\underset{c}{argmax} \frac{P(c|t)}{P(c)} P(t|c)$, where $P(c)$ is approximated as the parse weights.
      • Again, this is clever as it allows significant information leakage between (c) and (t) ...
      • The other models (Affine, Deep Boltzman Machines, Hierarchical Deep Model) have nothing like this -- they are feed-forward.
  • No wonder HBPL performs better. It's a better model of the data, that has a bidirectional fitting routine.

  • As i read the paper, had a few vague 'hedons':
    • Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion, many fitting and inference problems are solved. (Such is my intuition)
      • As a corollary of this, having both forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
    • The fitting process has to be multi-pass or at least re-entrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
    • The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly re-entrant to support hierarchical planning ...

{1526}
ref: -0 tags: neuronal assemblies maass hebbian plasticity simulation austria fMRI date: 02-23-2021 18:49 gmt revision:1 [0] [head]

PMID-32381648 A model for structured information representation in neural networks in the brain

  • Using randomly connected E/I networks, suggests that information can be "bound" together using fast Hebbian STDP.
  • That is, 'assemblies' in higher-level areas reference sensory information through patterns of bidirectional connectivity.
  • These patterns emerge spontaneously following disinhibition of the higher-level areas.
  • I find the results underwhelming, but the discussion is more interesting.
    • E.g. there has been a lot of theoretical and computational-experimental work on how concepts are bound together into symbols or grammars.
    • The referenced fMRI studies are interesting, too: they imply that you can observe the results of structural binding in activity of the superior temporal gyrus.
  • I'm more in favor of dendritic potentials or neuronal up/down states as a fast and flexible way of maintaining 'symbol membership' --
    • But it's not as flexible as synaptic plasticity, which, obviously, populates the outer product between 'region a' and 'region b' with a memory substrate, thereby spanning the range of plausible symbol-bindings.
    • Inhibitory interneurons can then gate the bindings, per morphological evidence.
    • But but, I don't think anyone has shown that you need protein synthesis for perception, as you do for LTP (modulo AMPAR cycling).
      • Hence I'd argue that localized dendritic potentials can serve as the flexible outer-product 'memory tag' for presence in an assembly.
        • Or maybe they are used primarily for learning, who knows!

{1535}
ref: -2019 tags: deep double descent lottery ticket date: 02-23-2021 18:47 gmt revision:2 [1] [0] [head]

Reconciling modern machine-learning practice and the classical bias–variance trade-off

A formal publication of the effect famously discovered at OpenAI & publicized on their blog. Goes into some details on fourier features & runs experiments to verify the OpenAI findings. The result stands.

An interesting avenue of research is using genetic algorithms to perform the search over neural network parameters (instead of backprop) in reinforcement-learning tasks. Ben Phillips has a blog post on some of the recent results, which show that it does work for certain 'hard' problems in RL. Of course, this is the dual of the 'lottery ticket' hypothesis and the deep double descent, above; large networks are likely to have solutions 'close enough' to solve a given problem.

That said, genetic algorithms don't necessarily perform gradient descent to tweak the weights for optimal behavior once they are within the right region of RL behavior. See {1530} for more discussion on this topic, as well as {1525} for a more complete literature survey.

{1534}
ref: -2020 tags: current opinion in neurobiology Kriegeskorte review article deep learning neural nets circles date: 02-23-2021 17:40 gmt revision:2 [1] [0] [head]

Going in circles is the way forward: the role of recurrence in visual inference

I think the best part of this article is the references -- a nicely complete listing of, well, the current opinion in Neurobiology! (Note that this issue is edited by our own Karel Svoboda, hence there are a good number of Janelians in the author list.)

The gestalt of the review is that deep neural networks need to be recurrent, not purely feed-forward. This results in savings in overall network size, and an increase in the achievable computational complexity, perhaps via the incorporation of priors and temporal-spatial information. All this again makes perfect sense and matches my sense of prevailing opinion. Of course, we are left wanting more: all this recurrence ought to be structured in some way.

To me, a rather naive way of thinking about it is that feed-forward layers cause weak activations, which are 'amplified' or 'selected for' in downstream neurons. These neurons proximally code for 'causes' or local reasons, based on the supported hypothesis that the brain has a good temporal-spatial model of the visuo-motor world. The causes then can either explain away the visual input, leading to balanced E-I, or fail to explain it, in which case the excess activity is either rectified by engaging more circuits or by engaging synaptic plasticity.

A critical part of this hypothesis is some degree of binding / disentanglement / spatio-temporal re-assignment. While not all models of computation require registers / variables -- RNNs are Turing-complete, for example -- I remain stuck on the idea that, to explain phenomenological experience and practical cognition, the brain must have some means of 'binding'. A reasonable place to look is the apical tuft dendrites, which are capable of storing temporary state (calcium spikes, NMDA spikes), undergo rapid synaptic plasticity, and are so dense that they can reasonably store the outer-product space of binding.

There is mounting evidence for apical tufts working independently / in parallel from investigations of high-gamma in ECoG: PMID-32851172 Dissociation of broadband high-frequency activity and neuronal firing in the neocortex. "High gamma" shows little correlation with MUA when you differentiate early-deep and late-superficial responses, "consistent with the view it reflects dendritic processing separable from local neuronal firing".

{1533}
ref: -2009 tags: Baldwin effect finches date: 02-22-2021 17:35 gmt revision:0 [head]

Evolutionary significance of phenotypic accommodation in novel environments: an empirical test of the Baldwin effect

Up until reading this, I had thought that the Baldwin effect refers to the fact that when animals gain an ability to learn, this allows them to take new ecological roles without genotypic adaptation. This is a component of the effect, but is not the original meaning, which is the opposite: when species adapt to a novel environment through phenotypic adaptation (say adapting to colder weather through within-lifetime variation), evolution tends to push these changes into the germ line. This is something to the effect of Lamarckian evolution.

In the case of house finches, as discussed in the link above, this pertains to increased brood variability and sexual dimorphism due to varied maternal habits and hormones due to environmental stress. This variance is then rapidly operated on by natural selection to tune the finch to its new environment, including Montana, where the single author did most of his investigation.

There are of course countless other details here, but still this is an illuminating demonstration of how evolution works to move information into the genome.

{1531}
ref: -2013 tags: synaptic learning rules calcium harris stdp date: 02-18-2021 19:48 gmt revision:3 [2] [1] [0] [head]

PMID-24204224 The Convallis rule for unsupervised learning in cortical networks (2013) - Pierre Yger, Kenneth D Harris

This paper aims to unify and reconcile experimental evidence of in-vivo learning rules with established STDP rules.  In particular, the STDP rule fails to accurately predict change in strength in response to spike triplets, e.g. pre-post-pre or post-pre-post.  Their model instead involves the competition between two threshold circuits / coincidence detectors with different time constants, one which controls LTD and another LTP, and is thus an extension of the classical BCM rule.  (BCM: inputs below a threshold will weaken a synapse; those above it will strengthen.)

They derive the model from an optimization criterion: neurons should try to optimize the skewness of the distribution of their membrane potential -- much time spent either firing spikes or strongly inhibited.  This maps to an objective function F that looks like a valley -- hence the 'Convallis' in the name (Latin for valley); the objective is differentiated to yield a weighting function for weight changes; they also add a shrinkage function (line + Heaviside function) to gate weight changes 'off' at resting membrane potential.
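
A minimal caricature of this valley-shaped, voltage-gated idea -- not the authors' actual Convallis rule; the threshold, gate offset, and learning-rate constants below are made-up placeholders:

```python
import numpy as np

def convallis_like_update(w, pre_rate, v_post, v_rest=-70.0, v_theta=-55.0, eta=1e-3):
    """Caricature of a valley-shaped (BCM-like) voltage-dependent rule:
    depolarization above v_theta potentiates, depolarization between rest
    and v_theta depresses, and a Heaviside-style gate turns plasticity off
    near resting potential.  All constants are placeholders."""
    gate = np.heaviside(v_post - v_rest - 2.0, 0.0)   # no change near rest
    phi = v_post - v_theta                            # sign flips at the threshold
    dw = eta * gate * phi * pre_rate                  # Hebbian: scaled by presynaptic drive
    return w + dw

# toy usage: one synapse, post cell mildly vs strongly depolarized
w = 0.5
w = convallis_like_update(w, pre_rate=20.0, v_post=-60.0)  # below threshold -> LTD
w = convallis_like_update(w, pre_rate=20.0, v_post=-45.0)  # above threshold -> LTP
```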

A network of firing neurons successfully groups correlated rate-encoded inputs, better than the STDP rule.  It can also cluster auditory inputs of spoken digits converted into a cochleogram.  But this all seems relatively toy-like: of course algorithms can associate inputs that co-occur.  The same result was found for a recurrent balanced E-I network with the same cochleogram, and Convallis performed better than STDP.   Meh.

Perhaps the biggest thing I got from the paper was how poorly STDP fares with spike triplets:

Pre following post does not 'necessarily' cause LTD; it's more complicated than that, and more consistent with the two coincidence detectors with different time constants.  This is satisfying as it allows for apical dendritic depolarization to serve as a contextual binding signal -- without negatively impacting the associated synaptic weights.

{1530}
ref: -2017 tags: deep neuroevolution jeff clune Uber genetic algorithms date: 02-18-2021 18:27 gmt revision:1 [0] [head]

Deep Neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. Uber AI labs; Jeff Clune.

  • In this paper, they used a (fairly generic) genetic algorithm to tune the weights of a relatively large (4M parameters) convolutional neural net to play 13 atari games. 
  • The GA used truncation selection, population of ~ 1k individuals, no crossover, and gaussian mutation.
  • To speed up and streamline this algo, they encoded the weights not directly but as an initialization seed to the RNG (log2 of the number of parameters, approximately), plus seeds to generate the per-generation mutation (~ 28 bits).  This substantially decreased the required storage space and communication costs when running the GA in parallel on their cluster; they only had to transmit the rng seed sequence (see the sketch after this list).
  • Quite surprisingly, the GA was good at typically 'hard' games like frostbite and skiing, whereas it fared poorly on games like atlantis (which is a fixed-gun shooter game) and assault
  • Performance was compared to Deep-Q-networks (DQN), Evolutionary search (which used stochastic gradient approximates), Asynchronous Advantage Actor-critic (A3C), and random search (RS)
  • They surmise that some games were thought to be hard, but are actually fairly easy, albeit with many local minima. This is why search around the origin (near the initialization of the networks, which was via the Xavier method) is sufficient to solve the tasks.
  • Also noted that frequently the GA would find individuals with good performance in ~10 generations, further supporting the point above. 
  • The GA provide very consistent performance across the entirety of a trial, which, they suggest, may offer a cleaner signal to selection as to the quality of each of the individuals (debatable!).
  • Of course, for some tasks, the GA fails woefully; it was not able to quickly learn to control a humanoid robot, which involves mapping a ~370-dimensional vector into ~17 joint torques.  Evolutionary search was able to perform this task, which is not surprising as the gradient here should be smooth.
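
A sketch of that seed-chain encoding -- my reconstruction of the idea; the initialization scheme and mutation scale sigma here are placeholders, not Uber's actual hyperparameters:

```python
import numpy as np

def reconstruct_params(seed_chain, n_params=1000, sigma=0.002):
    """Rebuild a parameter vector from a chain of RNG seeds: the first seed
    fixes the initialization, and each subsequent seed regenerates one
    generation's Gaussian mutation.  Only the integer seeds need to be
    stored or transmitted."""
    rng = np.random.default_rng(seed_chain[0])
    theta = rng.standard_normal(n_params) * 0.05            # stand-in for Xavier init
    for s in seed_chain[1:]:                                 # one seed per generation
        rng = np.random.default_rng(s)
        theta += sigma * rng.standard_normal(n_params)       # Gaussian mutation
    return theta

# an individual after three generations of mutation is just four integers on the wire:
individual = [12345, 99, 7, 3021]
theta = reconstruct_params(individual)
```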

The result is indeed surprising, but it also feels lazy -- the total effort or information that they put into writing the actual algorithm is small; as mentioned in the introduction, this is a case of old algorithms with modern levels of compute.  Analogously, compare Go-Explore, also by Uber AI labs, vs Agent57 by DeepMind; the Agent57 paper blithely dismisses the otherwise breathless Go-Explore result as feature engineering and unrealistic free backtracking / game-resetting (which is true..).  It's strange that they did not incorporate crossover aka recombination, as David MacKay clearly shows that recombination allows for much higher mutation rates and much better transmission of information through a population (chapter 'Why have sex').  They also perhaps more reasonably omit developmental encoding, where network weights are tied or controlled through development, again in an analogy to biology.

A better solution, as they point out, would be some sort of hybrid GA / ES / A3C system which used both gradient-based tuning, random stochastic gradient-based exploration, and straight genetic optimization, possibly all in parallel, with global selection as the umbrella.  They mention this, but to my current knowledge this has not been done. 

{1529}
ref: -2020 tags: dreamcoder ellis program induction ai date: 02-01-2021 18:39 gmt revision:0 [head]

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

  • Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum

This paper describes a system for adaptively finding programs which succinctly and accurately produce desired output.  These desired outputs are provided by the user / test system, and come from a number of domains:

  • list (as in lisp) processing,
  • text editing,
  • regular expressions,
  • line graphics,
  • 2d lego block stacking,
  • symbolic regression (ish),
  • functional programming,
  • and physical laws.
Some of these domains are naturally toy-like, e.g. the text processing, but others are deeply impressive: the system was able to "re-derive" basic physical laws of vector calculus in the process of looking for S-expression forms of cheat-sheet physics equations.  These advancements result from a long lineage of work, perhaps starting from the Helmholtz machine PMID-7584891 introduced by Peter Dayan, Geoff Hinton and others, where one model is trained to generate patterns given context (e.g.) while a second recognition module is trained to invert this model: derive context from the patterns.  The two work simultaneously to allow model-exploration in high dimensions.

Also in the lineage is the EC2 algorithm, which most of the same authors above published in 2018.  EC2 centers around the idea of "explore - compress": explore solutions to your program induction problem during the 'wake' phase, then compress the observed programs into a library by extracting/factoring out commonalities during the 'sleep' phase.  This of course is one of the core algorithms of human learning: explore options, keep track of both what worked and what didn't, search for commonalities among the options & their effects, and use these inferred laws or heuristics to further guide search and goal-setting, thereby building a buffer against the curse of dimensionality.  Making the inferred laws themselves functions in a programming library allows hierarchically factoring the search task, making exploration of unbounded spaces possible.  This advantage is unique to the program synthesis approach.

This much is said in the introduction, though perhaps with more clarity.  DreamCoder is an improved, more-accessible version of EC2, though the underlying ideas are the same.   It differs in that the method for constructing libraries has improved through the addition of a powerful version space for enumerating and evaluating refactors of the solutions generated during the wake phase.  (I will admit that I don't much understand the version space system.)  This version space allows DreamCoder to collapse the search space for re-factorings by many orders of magnitude, and seems to be a clear advancement.  Furthermore, DreamCoder incorporates a second phase of sleep: "dreaming", hence the moniker.  During dreaming the library is used to create 'dreams' consisting of combinations of the library primitives, which are then executed with training data as input.  These dreams are then used to train up a neural network to predict which library and atomic objects to use in given contexts.  Context in this case is where in the parse tree a given object has been inserted (its parent and which argument number it sits in); how the data-context is incorporated to make this decision is not clear to me (???).

This neural dream and replay-trained neural network is either a GRU recurrent net with 64 hidden states, or a convolutional network feeding into a RNN.  The final stage is a linear ReLU (???); again it is not clear how it feeds into the prediction of "which unit to use when".  The authors clearly demonstrate that the network, or the probabilistic context-free grammar that it controls (?), is capable of straightforward optimizations, like breaking symmetries due to commutativity, avoiding adding zero, avoiding multiplying by one, etc.  Beyond this, they do demonstrate via an ablation study that the presence of the neural network affords significant algorithmic leverage in all of the problem domains tested.  The network also seems to learn a reasonable representation of the sub-type of task encountered -- but a thorough investigation of how it works, or how it might be made to work better, remains desired.

I've spent a little time looking around the code, which is a mix of python high-level experimental control code, and lower-level OCaml code responsible for running (emulating) the lisp-like DSL, inferring types in its polymorphic system / reconciling types in evaluated program instances, maintaining the library, and recompressing it using aforementioned version spaces.  The code, like many things experimental, is clearly a work in progress, with some old or unused code scattered about, glue to run the many experiments & record / analyze the data, and personal notes from the first author for making his job talks (! :).  The description in the supplemental materials, which is satisfyingly thorough (if again impenetrable wrt version spaces), is readily understandable, suggesting that one (presumably the first) author has a clear understanding of the system.  It doesn't appear that much is being hidden or glossed over, which is not the case for all scientific papers.


With the caveat that I don't claim to understand the system to completion, there are some clear areas where the existing system could be augmented further.  The 'recognition' or perceptual module, which guides actual synthesis of candidate programs, realistically can use as much information as is available in DreamCoder: full lexical and semantic scope, full input-output specifications, type information, possibly runtime binding of variables when filling holes.  This is motivated by the way that humans solve problems, at least as observed by introspection:
  • Examine problem, specification; extract patterns (via perceptual modules)
  • Compare patterns with existing library (memory) of compositionally-factored 'useful solutions' (this is identical to the library in DreamCoder).
  • Do something like beam-search or quasi-stochastic search on selected useful solutions.  This is the same as DreamCoder; however, human engineers make decisions progressively, at runtime so-to-speak: you fill not one hole per cycle, but many holes.  The addition of recursion to DreamCoder, provided a wider breadth of input information, could support this functionality.
  • Run the program to observe input-output .. but also observe the inner workings of the program, e.g. dataflow patterns.  These dataflow patterns are useful to human engineers when both debugging and when learning-by-inspection what library elements do.   DreamCoder does not really have this facility.
  • Compare the current program results to the desired program output.  Make a stochastic decision whether to try to fix it, or to try another beam in the search.  Since this would be on a computer, this could be in parallel (as DreamCoder is); the ability to 'fix' or change a DUT is directly absent in DreamCoder.   As a 'deeply philosophical' aside, this loop itself might be the effect of running a language-of-thought program, as was suggested by pioneers in AI (ref).  The loop itself is subject to modification and replacement based on goal-seeking success in the domain of interest, in a deeply-satisfying and deeply recursive manner ...
At each stage in the pipeline, the perceptual modules would have access to relevant variables in the current problem-solving context.  This is modeled on Jacques Pitrat's work.  Humans of course are even more flexible than that -- context includes roughly the whole brain, and if anything we're mushy on which level of the hierarchy we are working. 

Critical to making this work is to have, as I've written in my notes many years ago, a 'self compressing and factorizing memory'.  The version space magic + library could be considered a working example of this.  In the realm of ANNs, per recent OpenAI results with CLIP and Dall-E, really big transformers also seem to have strong compositional abilities, with the caveat that they need to be trained on segments of the whole web.  (This wouldn't be an issue here, as Dreamcoder generates a lot of its own training data via dreams).  Despite the data-inefficiency of DNN / transformers, they should be sufficient for making something in the spirit of above work, with a lot of compute, at least until more efficient models are available (which they should be shortly; see AlphaZero vs MuZero). 

{1528}
ref: -2015 tags: olshausen redwood autoencoder VAE MNIST faces variation date: 11-27-2020 03:04 gmt revision:0 [head]

Discovering hidden factors of variation in deep networks

  • Well, they are not really that deep ...
  • Use a VAE to encode both a supervised signal (class labels) as well as unsupervised latents.
  • Penalize a combination of the MSE of reconstruction, logits of the classification error, and a special cross-covariance term to decorrelate the supervised and unsupervised latent vectors.
  • Cross-covariance penalty: penalizes the cross-covariance between the supervised (class) units and the unsupervised latents, computed over a mini-batch (see the sketch after this list).
  • Tested on
    • MNIST -- discovered style / rotation of the characters
    • Toronto faces database -- seven expressions, many individuals; extracted eigen-emotions sorta.
    • Multi-PIE --many faces, many viewpoints ; was able to vary camera pose and illumination with the unsupervised latents.
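
A sketch of how such a cross-covariance penalty might be computed -- my reconstruction from the description above; the exact normalization in the paper may differ:

```python
import numpy as np

def xcov_penalty(y, z):
    """Cross-covariance penalty between supervised units y (N x K) and
    unsupervised latents z (N x D): sum of squared entries of the
    mini-batch cross-covariance matrix."""
    yc = y - y.mean(axis=0)
    zc = z - z.mean(axis=0)
    C = yc.T @ zc / y.shape[0]          # K x D cross-covariance
    return 0.5 * np.sum(C ** 2)

# toy usage: the penalty is ~0 when the two codes are statistically independent
rng = np.random.default_rng(0)
print(xcov_penalty(rng.standard_normal((256, 10)), rng.standard_normal((256, 32))))
```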

{1527}
ref: -0 tags: inductive logic programming deepmind formal propositions prolog date: 11-21-2020 04:07 gmt revision:0 [head]

Learning Explanatory Rules from Noisy Data

  • From a dense background of inductive logic programming (ILP): given a set of statements, and rules for transformation and substitution, generate clauses that satisfy a set of 'background knowledge'.
  • Programs like Metagol can do this using search and simplify logic built into Prolog.
    • Actually kinda surprising how very dense this program is -- only 330 lines!
  • This task can be transformed into a SAT problem via rules of logic, for which there are many fast solvers.
  • The trick here (instead) is that a neural network is used to turn 'on' or 'off' clauses that fit the background knowledge
    • BK is typically very small, a few examples, consistent with the small size of the learned networks.
  • These weight matrices are represented as the outer product of composed or combined clauses, which makes the weight matrix very large!
  • They then do gradient descent, while passing the cross-entropy errors through nonlinearities (including clauses themselves? I think this is how recursion is handled.) to update the weights.
    • Hence, SGD is used as a means of heuristic search.
  • Compare this to Metagol, which is brittle to any noise in the input; unsurprisingly, due to SGD, this is much more robust.
  • Way too many words and symbols in this paper for what it seems to be doing. Just seems to be obfuscating the work (which is perfectly good). Again: Metagol is only 330 lines!

{1490}
ref: -2011 tags: two photon cross section fluorescent protein photobleaching Drobizhev gcamp date: 11-04-2020 18:07 gmt revision:9 [8] [7] [6] [5] [4] [3] [head]

PMID-21527931 Two-photon absorption properties of fluorescent proteins

  • Significant 2-photon cross section of red fluorescent proteins (same chromophore as DsRed) in the 700 - 770nm range, accessible to Ti:sapphire lasers ...
    • This corresponds to an $S_0 \rightarrow S_n$ transition
    • But but, photobleaching is an order of magnitude slower when excited by the direct $S_0 \rightarrow S_1$ transition (but the fluorophores can be significantly less bright in this regime).
      • Quote: the photobleaching of DsRed slows down by an order of magnitude when the excitation wavelength is shifted to the red, from 750 to 950 nm (32).
    • See also PMID-18027924
  • Further work by same authors: Absolute Two-Photon Absorption Spectra and Two-Photon Brightness of Orange and Red Fluorescent Proteins
    • " TagRFP possesses the highest two-photon cross section, σ2 = 315 GM, and brightness, σ2φ = 130 GM, where φ is the fluorescence quantum yield. At longer wavelengths, 1000–1100 nm, tdTomato has the largest values, σ2 = 216 GM and σ2φ = 120 GM, per protein chain. Compared to the benchmark EGFP, these proteins present 3–4 times improvement in two-photon brightness."
    • "Single-photon properties of the FPs are poor predictors of which fluorescent proteins will be optimal in two-photon applications. It follows that additional mutagenesis efforts to improve two-photon cross section will benefit the field."
  • 2P cross-section in both the 700-800nm and 1000-1100 nm range corresponds to the chromophore polarizability, and is not related to 1p cross section.
  • This can be useful for multicolor imaging: excitation of the higher S0 → Sn transition of TagRFP simultaneously with the first, S0 → S1, transition of mKalama1 makes dual-color two-photon imaging possible with a single excitation laser wavelength (13)
  • Why are red GECIs based on mApple (rGECO1) or mRuby (RCaMP)? dsRed2 or TagRFP are much better .. but maybe they don't have CP variants.
  • from https://elifesciences.org/articles/12727

{1525}
ref: -0 tags: double descent complexity construction gradient descent date: 10-26-2020 03:23 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Why deep learning works even though it shouldn't, instigated a fun thread thinking about "complexity of model" vs "complexity of solution".

  • The blog post starts from the position that modern deep learning should not work because the models are much too complex for the datasets they are trained on -- they should not generalize well.
    • Quote: "why do models get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller."
  • Argument: in high-dimensional spaces, all solutions are about the same distance from each other. This means that high dimensional spaces are very well connected. (Seems hand-wavy?)
    • Sub-argument: with billions of dimensions, it is exponentially unlikely that all gradients will be positive, e.g. you are in a local minimum. Much more likely that about half are positive, half are negative -> saddle.
    • This is of course looking at it in terms of gradient descent, which is not probably how biological systems build complexity. See also the saddle paper.
  • Claim: Early stopping is better regularization than any hand-picked a priori regularization, including implicit regularization like model size.
    • Well, maybe; stopping early of course is normally thought to prevent over-fitting or over-memorization of the dataset; but see also Double Descent, below.
    • Also: "that weight distributions are highly non-independent even after only a few hundred iterations" -- from the abstract of The Early Phase of Neural Network Training.
    • Or: "We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. "
  • Claim: SGD, ADAM, etc. do not train to a minimum.
    • I think this is broadly supportable via the high-dimensional saddle argument.
    • He relates this to distillation: a large model can infer 'good structure', possibly via the good luck of having a very large parameter space; a small model can learn these features with fewer parameters, and hopefully there will be less 'nuisance' dimensions in the distilled data.
  • discussion on Hacker News

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  • This paper, well written & insightful.
  • Core idea: train up a network, not necessarily to completion or zero test error.
  • Prune away the smallest ~90% of the weights. Pruning is not at all a new idea.
    • For larger networks, they propose iterative pruning: train for a while, prune away connections that don't matter, continue.
      • Does this sound like human neural development? Yes!
  • Re-start training from the initial weights, with most of the network pruned away. This network will train up faster, to equivalent accuracy, compared to the original full network (see the sketch after this list).
  • This seems to work well for MNIST and CIFAR10.
  • From this, they hypothesize that within a large network there is a 'lottery ticket' sub-network that can be trained well to represent the training / test dataset well.
    • "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective"
  • However, either pruning the network (setting the weights to zero) before training, or re-initializing the weights in the trained network from the initialization distribution does not work.
  • "Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket"
    • The blessing of dimensionality!
  • Complementary with dropout, at least the iterative pruning regime.
  • But only with a slow learning rate (?) or learning rate warmup for deeper nets.
  • Very complete appendix, as necessitated by the submission to ICLR. Within it there is a little of The truth wears off effect (or: more caves of complexity).
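
A sketch of the train / prune / rewind loop described above (my paraphrase of iterative magnitude pruning; `train_fn` and the 20%-per-round pruning fraction are placeholders for an actual training setup):

```python
import numpy as np

def iterative_magnitude_pruning(init_w, train_fn, rounds=5, prune_frac=0.2):
    """Lottery-ticket style procedure: train the current subnetwork, prune the
    smallest-magnitude surviving weights, rewind survivors to their *initial*
    values, and repeat.  train_fn(weights, mask) -> trained weights is a
    placeholder for your real training loop."""
    mask = np.ones_like(init_w)
    for _ in range(rounds):
        w = train_fn(init_w * mask, mask)                  # train the current subnetwork
        alive = np.abs(w[mask == 1])
        threshold = np.quantile(alive, prune_frac)         # prune prune_frac of survivors
        mask = np.where((np.abs(w) >= threshold) & (mask == 1), 1.0, 0.0)
    return init_w * mask, mask                             # the 'winning ticket' at init

# dummy usage: identity "training" just to exercise the loop
w0 = np.random.default_rng(1).standard_normal(1000)
ticket, mask = iterative_magnitude_pruning(w0, train_fn=lambda w, m: w)
```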

Stabilizing the lottery ticket hypothesis

  • With deeper neural networks, you can't prune away most of the weights before at least some training has occurred.
  • Instead, train the network partly, then do iterative magnitude pruning to select important weights.
  • Even with early training, this works well up to 80% sparsity on Imagenet.
  • Given the previous results, this doesn't seem so surprising..

OpenAI Deep Double Descent

  • Original phenomenon discovered in Reconciling modern machine learning practice and the bias-variance trade-off
  • Why is bigger always better?
  • Another well-written and easily understood post.
  • At the interpolation threshold, there are relatively few models that fit the training data well, and label noise can easily mess up their global structure; beyond this interpolation threshold, there are many good models, and SGD somehow has implicit bias (??) to select models that are parsimonious and hence generalize well.
    • This despite the fact that classical statistics would suggest that the models are very over-parameterized.
    • Maybe it's the noise (the S in SGD) which acts as a regularizer? That plus the fact that the networks imperfectly represent structure in the data?
      • When there is near-zero training error, what does SGD do ??
  • Understanding deep double descent
    • Quote: but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?
  • Alternate hypothesis, from lesser wrong: ensembling improves generalization. "Which is something we've known for a long time".
    • the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...
      • That’s exactly correct. You can prove it via the Laplace approximation: the “width” of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue $\lambda_i$ contributes $-\frac{1}{2} \log(\lambda_i)$ to the marginal log likelihood $\log P[data | model]$. So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by $\frac{1}{2} \log(2)$, or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.
      • Ensembling does not explain the lottery ticket hypothesis.

  • Critical learning periods in deep neural networks
    • Per above, it also does not explain this result -- that the trace of the Fisher Information Matrix goes up then down with training; the SGD consolidates the weights so that 'fewer matter'.
    • FIM, reminding myself: the expected value [of the outer product of the derivative (the score) [of the log-likelihood function, f(data; parameters)]], which is all a function of the parameters (definition below).
      • Expected value is taken over the data.
      • Derivative is with respect to the parameters. partial derivative = score; high score = data has a high local dependence on parameters, or equivalently, the parameters should be easier to estimate.
      • log-likelihood because that's the way it is; or: probabilities are best understood in decibels.
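
For reference, the standard definition -- the expectation (over the data) of the outer product of the score:

$$ F(\theta) = E_{x \sim p(x|\theta)} \left[ \nabla_\theta \log p(x|\theta) \; \nabla_\theta \log p(x|\theta)^T \right] $$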

  • Understanding deep-learning requires re-thinking generalization
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
  • state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data
    • 95.2% accuracy is still very surprising for a million random labels from 1000 categories.
    • Training time increases by a small scalar factor with random labels.
    • Regularization via weight decay, dropout, and data augmentation do not eliminate fitting of random labels.
  • Works even when the true images are replaced by random noise too.
  • Depth two neural networks have perfect sample expressivity as soon as parameters > data points.
  • The second bit of meat in the paper is section 5, Implicit regularization: an appeal to linear models.
  • Have $n$ data points $\{x_i, y_i\}$ where $x_i$ are d-dimensional feature vectors, and $y_i$ are the labels.
    • if we want to solve the fitting problem, $\min_{w \in R^d} \Sigma_{i=1}^{n} loss(w^T x_i, y_i)$ -- this is just linear regression, and if d > n, it can fit exactly.
    • The Hessian of this function is degenerate -- the curvature is meaningless, and does not inform generalization.
  • With SGD, $w_{t+1} = w_t - \eta e_t x_{i_t}$ where $e_t$ is the prediction error.
  • If we start at w = 0, $w = \Sigma_{i=1}^{n} \alpha_i x_i$ for some coefficients $\alpha$.
  • Hence, $w = X^T \alpha$ -- the weights are in the span of the data points.
  • If we interpolate perfectly, then $X w = y$.
  • Substitute, and get $X X^T \alpha = y$.
    • This is the "kernel trick" (Scholkopf et al 2001)
    • Depends only on all the dot-products between all the datapoints -- it's an n*n linear system that can be solved exactly for small sets. (not pseudo-inverse!)
    • On mnist, this results in a 1.2% test error (!)
    • With Gabor wavelet pre-processing, the error is 0.6%!
  • Out of all models, SGD will converge to the model with the minimum norm (without weight decay)
    • Norm is only a small part of the generalization puzzle.
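
A quick numerical check of the span / kernel-trick argument above, on toy data: solving the small n-by-n system gives an exact interpolator, and it coincides with the minimum-norm (pseudo-inverse) solution, consistent with the minimum-norm claim above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                           # fewer data points than features
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# kernel-trick solution: w = X^T alpha, with (X X^T) alpha = y
alpha = np.linalg.solve(X @ X.T, y)
w_kernel = X.T @ alpha

# minimum-norm interpolator via the pseudo-inverse, for comparison
w_pinv = np.linalg.pinv(X) @ y

print(np.allclose(X @ w_kernel, y))      # both interpolate the data exactly
print(np.allclose(w_kernel, w_pinv))     # and they coincide: the min-norm solution
```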

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

Random deep neural networks are biased towards simple functions

Reconciling modern machine learning practice and the bias-variance trade-off

{1524}
ref: -2020 tags: replay hippocampus variational autoencoder date: 10-11-2020 04:09 gmt revision:1 [0] [head]

Brain-inspired replay for continual learning with artificial neural networks

  • Gido M van de Ven, Hava Siegelmann, Andreas Tolias
  • In the real world, samples are not replayed in shuffled order -- they occur in a sequence, typically few times. Hence, for training an ANN (or NN?), you need to 'replay' samples.
    • Perhaps, to get at hidden structure not obvious on first pass through the sequence.
    • In the brain, reactivation / replay likely to stabilize memories.
      • Strong evidence that this occurs through sharp-wave ripples (or the underlying activity associated with this).
  • Replay is also used to combat a common problem in training ANNs - catastrophic forgetting.
    • Generally you just re-sample from your database (easy), though in real-time applications, this is not possible.
      • It might also take a lot of memory (though that is cheap these days) or violate privacy (though again who cares about that)

  • They study two different classification problems:
    • Task incremental learning (Task-IL)
      • Agent has to serially learn distinct tasks
      • OK for Atari, doesn't make sense for classification
    • Class incremental learning (Class-IL)
      • Agent has to learn one task incrementally, one/few classes at a time.
      • Like learning a 2 digits at a time in MNIST
        • But is tested on all digits shown so far.
  • Solved via Generative Replay (GR, ~2017)
  • Use a recursive formulation: 'old' generative model is used to generate samples, which are then classified and fed, interleaved with the new samples, to the new network being trained.
    • 'Old' samples can be infrequent -- it's easier to reinforce an existing memory rather than create a new one.
    • Generative model is a VAE.
  • Compared with some existing solutions to catastrophic forgetting:
    • Methods to protect parameters in the network important for previous tasks
      • Elastic weight consolidation (EWC)
      • Synaptic intelligence (SI)
        • Both methods maintain estimates of how influential parameters were for previous tasks, and penalize changes accordingly.
        • "metaplasticity"
        • Synaptic intelligence: measure the loss change relative to the individual weights.
        • $\delta L = \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \delta t$; converted into discrete time / SGD: $L = \Sigma_k \omega_k = \Sigma \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \delta t$
        • $\omega_k$ are then the weightings for how much parameter change contributed to the training improvement.
        • Use this as a per-parameter regularization strength, scaled by one over the square of 'how far it moved'.
        • This is added to the loss, so that the network is penalized for moving important weights (see the sketch after this list).
    • Context-dependent gating (XdG)
      • To reduce interference between tasks, a random subset of neurons is gated off (inhibition), depending on the task.
    • Learning without forgetting (LwF)
      • Method replays current task input after labeling them (incorrectly?) using the model trained on the previous tasks.
  • Generative replay works on Class-IL!
  • And is robust -- not to many samples or hidden units needed (for MNIST)
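
A sketch of the synaptic-intelligence bookkeeping described above -- my paraphrase of the rule as summarized in these notes (Zenke et al.'s SI); the damping constant xi and regularization strength c are placeholders:

```python
import numpy as np

class SynapticIntelligence:
    """Track per-parameter importance: integrate -grad * dtheta along the
    training path, normalize by how far each parameter moved, and use the
    result to penalize later changes to important weights."""
    def __init__(self, theta0, xi=1e-3):
        self.omega_running = np.zeros_like(theta0)   # path integral per parameter
        self.importance = np.zeros_like(theta0)      # consolidated omega_k per task
        self.theta_task_start = theta0.copy()
        self.xi = xi

    def accumulate(self, grad, dtheta):
        # running sum of how much each parameter's motion reduced the loss
        self.omega_running += -grad * dtheta

    def consolidate(self, theta_end):
        moved = theta_end - self.theta_task_start
        self.importance += self.omega_running / (moved ** 2 + self.xi)
        self.omega_running[:] = 0.0
        self.theta_task_start = theta_end.copy()

    def penalty(self, theta, c=0.1):
        # added to the task loss: punish moving weights that mattered before
        return c * np.sum(self.importance * (theta - self.theta_task_start) ** 2)
```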

  • Yet the generative replay system does not scale to CIFAR or permuted MNIST.
  • E.g. if you take the MNIST pixels, permute them based on a 'task', and ask a network to still learn the character identities, it can't do it ... though synaptic intelligence can.
  • Their solution is to make 'brain-inspired' modifications to the network:
    • RtF, Replay-through-feedback: the classifier and generator network are fused. Latent vector is the hippocampus. Cortex is the VAE / classifier.
    • Con, Conditional replay: normal prior for the VAE is replaced with multivariate class-conditional Gaussian.
      • Not sure how they sample from this, check the methods.
    • Gat, Gating based on internal context.
      • Gating is only applied to the feedback layers, since for classification ... you don't a priori know the class!
    • Int, Internal replay. This is maybe the most interesting: rather than generating pixels, feedback generates hidden layer activations.
      • First layer of a network is convolutional, dependent on visual feature statistics, and should not change much.
        • Indeed, for CIFAR, they use pre-trained layers.
      • Internal replay proved to be very important!
    • Dist, Soft target labeling of the generated targets; cross-entropy loss when training the classifier on generated samples. Aka distillation.
  • Results suggest that regularization / metaplasticity (keeping memories in parameter space) and replay (keeping memories in function space) are complementary strategies,
    • And that the brain uses both to create and protect memories.

  • When I first read this paper, it came across as a great story -- well thought out, well explained, a good level of detail, and sufficiently supported by data / lesioning experiments.
  • However, looking at the first author's pub record, it seems that he's been at this for >2-3 years ... things take time to do & publish.
  • Folding in of the VAE is satisfying -- taking one function approximator and use it to provide memory for another function approximator.
  • Also satisfying are the neurological inspirations -- and that full feedback to the pixel level was not required!
    • Maybe the hippocampus does work like this, providing high-level feature vectors to the cortex.
    • And it's likely that the cortex has some features of a VAE, e.g. able to perceive and imagine through the same nodes, just run in different directions.
      • The fact that both concepts led to an engineering solution is icing on the cake!

{1522}
ref: -2017 tags: schema networks reinforcement learning atari breakout vicarious date: 09-29-2020 02:32 gmt revision:2 [1] [0] [head]

Schema networks: zero-shot transfer with a generative causal model of intuitive physics

  • Like a lot of papers, the title has more flash than the actual results.
  • Results which would be state of the art (as of 2017) in playing Atari breakout, then transferring performance to modifications of the game (paddle moved up a bit, wall added in the middle of the bricks, brick respawning, juggling).
  • Schema network is based on 'entities' (objects) which have binary 'attributes'. These attributes can include continuous-valued signals, in which case each binary variable is like a place field (I think).
    • This is clever and interesting -- rather than just low-level features pointing to high-level features, this means that high-level entities can have records of low-level features -- an arrow pointing in the opposite direction, one which can (also) be learned.
    • The same idea is present in other Vicarious work, including the CAPTCHA paper and more-recent (and less good) Bio-RNN paper.
  • Entities and attributes are propagated forward in time based on 'ungrounded schemas' -- basically free-floating transition matrices. The grounded schemas are entities and action groups that have evidence in observation.
    • There doesn't seem to be much math describing exactly how this works; only exposition. Or maybe it's all hand-waving over the actual, much simpler math.
      • Get the impression that the authors are reaching for a level of formalism when in fact they just made something that works for the breakout task... I infer Dileep prefers the empirical to the formal, so this is likely primarily the first author.
  • There are no perceptual modules here -- game state is fed to the network directly as entities and attributes (and, to be fair, to the A3C model).
  • Entity-attribute vectors are concatenated into a column vector of length $NT$, where $N$ is the number of entities, and $T$ is the number of time slices.
    • For each entity of N over time T, a row-vector is made of length $MR$, where $M$ is the number of attributes (fixed per task) and $R-1$ is the number of neighbors in a fixed radius. That is, each entity is related to its neighbors' attributes over time.
    • This is a (large, sparse) binary matrix, $X$.
  • $y$ is the vector of actions; the task is to predict actions from $X$.
    • How is X learned?? Very unclear in the paper vs. figure 2.
  • The solution is approximated as $y = X W \bar{1}$ where $W$ is a binary weight matrix (see the sketch after this list).
    • Minimize the solution based on an objective function on the error and the complexity of $w$.
    • This is found via linear programming relaxation. "This procedure monotonically decreases the prediction error of the overall schema network, while increasing its complexity".
      • As it's an issue of binary conjunctions, this seems like a SAT problem!
    • Note that it's not probabilistic: "For this algorithm to work, no contradictions can exist in the input data" -- they instead remove them!
  • Actual behavior includes maximum-product belief propagation, to look for series of transitions that set the reward variable without setting the fail variable.
    • Because the network is loopy, this has to occur several times to set entity variables, e.g. it includes backtracking.
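
One possible reading of the $y = X W \bar{1}$ prediction -- my guess at the binary conjunction / OR semantics, since the paper's exposition is unclear to me; not the authors' actual formulation:

```python
import numpy as np

def schema_predict(X, W):
    """Each column of the binary matrix W is read as a schema: a conjunction
    over the entity-attribute row.  The output fires if any schema finds all
    of its required attributes present in the row (an OR over ANDs)."""
    X = np.asarray(X, dtype=bool)
    W = np.asarray(W, dtype=bool)
    required = W.sum(axis=0)                      # attributes each schema needs
    matched = X.astype(int) @ W.astype(int)       # attributes each row actually has
    fires = matched == required                   # conjunction satisfied?
    return fires.any(axis=1).astype(int)          # OR over schemas

# toy: 2 rows (entity-time slices), 3 attributes, 1 schema requiring attrs 0 and 2
X = [[1, 0, 1],
     [1, 1, 0]]
W = [[1], [0], [1]]
print(schema_predict(X, W))   # -> [1 0]
```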

  • Have there been any further papers exploring schema networks? What happened to this?
  • The later paper from Vicarious on zero-shot task transfer is rather less interesting (to me) than this.

{1521}
ref: -2005 tags: dimensionality reduction contrastive gradient descent date: 09-13-2020 02:49 gmt revision:2 [1] [0] [head]

Dimensionality reduction by learning an invariant mapping

  • Raia Hadsell, Sumit Chopra, Yann LeCun
  • Central idea: learn an invariant mapping of the input by minimizing mapped distance (e.g. the distance between outputs) when the samples are categorized as the same (same numbers in MNIST, e.g.), and maximizing mapped distance when the samples are categorized as different (see the sketch at the end of this list).
    • Two loss functions for same vs different.
  • This is an attraction-repulsion spring analogy.
  • Use gradient descent to change the weights to satisfy these two competing losses.
  • Resulting convolutional neural nets can extract camera pose information from the NORB dataset.
  • Surprising how simple analogies like this, when iterated across a great many samples, pull out intuitively correct invariances.
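
A minimal version of the two spring-like loss terms (attraction for 'same' pairs, margin-limited repulsion for 'different' pairs); the margin value is a placeholder hyperparameter:

```python
import numpy as np

def contrastive_loss(z1, z2, same, margin=1.0):
    """Spring analogy: pull mapped outputs together for 'same' pairs,
    push them apart (only within the margin) for 'different' pairs.
    z1, z2 are the network outputs for the two samples."""
    d = np.linalg.norm(z1 - z2)
    if same:
        return 0.5 * d ** 2                        # attraction
    return 0.5 * max(0.0, margin - d) ** 2         # repulsion inside the margin

print(contrastive_loss(np.array([0.1, 0.2]), np.array([0.1, 0.3]), same=True))
print(contrastive_loss(np.array([0.1, 0.2]), np.array([0.9, 0.8]), same=False))
```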