m8ta

{1545}  
Self-organization in a perceptual network
One may critically challenge the infomax idea: we very much need to (and do) throw away spurious or irrelevant information in our sensory streams; what upper layers 'care about' when making decisions is certainly relevant to the lower layers. This credit-assignment problem is neatly solved by backprop, and there are a number of 'biologically plausible' means of performing it, but both this and infomax are maybe avoiding the problem. What might the upper layers really care about? Likely 'care about' is an emergent property of the interacting local learning rules and network structure. Can you search directly in these domains, within biological limits, and motivated by statistical reality, to find unsupervised-learning networks? You'll still need a way to rank the networks, hence an objective 'care about' function. Sigh. Either way, I don't per se put a lot of weight in the infomax principle. It could be useful, but it's only part of the story. Otherwise Linsker's discussion is accessible, lucid, and prescient. Lol.  
{1544}  
The HSIC Bottleneck: Deep Learning without Back-Propagation
In this work, the authors use a kernelized estimate of statistical dependence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the dependence measure. The information bottleneck was proposed by Bialek (Spikes..) et al in 1999, and aims to minimize the mutual information between the layer input and the hidden representation while maximizing the mutual information between the hidden representation and the labels: $\min_{P(T_i \mid X)} I(X; T_i) - \beta I(T_i; Y)$ where $T_i$ is the hidden representation at layer $i$ (later the output), $X$ is the layer input, and $Y$ are the labels. By replacing $I(\cdot)$ with the HSIC, and some derivation (?), they show that $\mathrm{HSIC}(D) = (m-1)^{-2} \operatorname{tr}(K_X H K_Y H)$ where $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ (that is, the kernel function applied to all pairs of (vectorial) input variables), and $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ is the centering matrix. The kernel is simply Gaussian, $k(x, y) = \exp(-\frac{1}{2}\|x - y\|^2 / \sigma^2)$. So, if the x and y are independent, the two similarity structures are unrelated, and after centering the trace is near zero; if the pairwise similarities in X predict the pairwise similarities in Y (within the length-scale of the kernel), the trace, and hence the HSIC, will be large. (The estimator is non-negative.) In practice they use three different widths for their kernel, and they also center the kernel matrices. But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this... 
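The empirical estimator is compact enough to sketch directly. Here is a minimal numpy version of the biased HSIC estimate with a Gaussian kernel; the toy data and variable names are mine, not the paper's:

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # Pairwise Gaussian kernel matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)).
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-0.5 * d2 / sigma ** 2)

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC: (m-1)^{-2} tr(K_X H K_Y H).
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m       # centering matrix
    Kx = gaussian_kernel(X, sigma)
    Ky = gaussian_kernel(Y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Y_dep = X + 0.1 * rng.standard_normal((200, 3))   # strongly dependent on X
Y_ind = rng.standard_normal((200, 3))             # independent of X
# the dependent pair should score much higher than the independent one
```

The per-layer objective in the paper is then a weighted difference of two such terms, HSIC with the input minus beta times HSIC with the labels.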
For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per-layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks, albeit in a much less intelligible way.  
{1543} 
ref: 2019
tags: backprop neural networks deep learning coordinate descent alternating minimization
date: 07-21-2021 03:07 gmt
revision:1
[0] [head]


Beyond Backprop: Online Alternating Minimization with Auxiliary Variables
This is interesting in that the weight updates can be done in parallel (perhaps more efficient), but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to autodiff + backprop, I can't see this being adopted broadly. That said, the idea of alternating minimization (which is used e.g. for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.  
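The core loop is easy to caricature: replace each layer's activations with free auxiliary 'codes', then alternate between fitting the codes and fitting the weights. Below is my own toy version for a two-layer regression net, with a quadratic penalty `lam` tying codes to layer outputs and plain gradient / least-squares steps; it is not the paper's online algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy regression data and a tanh network: X -> W1 -> tanh -> W2 -> Y.
X = rng.standard_normal((64, 5))
Y = np.tanh(X @ rng.standard_normal((5, 2)))

W1 = 0.1 * rng.standard_normal((5, 8))
W2 = 0.1 * rng.standard_normal((8, 2))
A = X @ W1                      # auxiliary 'codes': free copies of the hidden pre-activations
lam, lr = 1.0, 0.1

def loss():
    # Data-fit term plus penalty tying the codes to what layer 1 produces.
    return (np.mean((np.tanh(A) @ W2 - Y) ** 2)
            + lam * np.mean((X @ W1 - A) ** 2))

loss0 = loss()
for step in range(200):
    # (1) descend on the codes A, holding both weight matrices fixed
    h = np.tanh(A)
    gA = (2 * ((h @ W2 - Y) @ W2.T) * (1 - h ** 2) / Y.size
          + 2 * lam * (A - X @ W1) / A.size)
    A -= lr * gA
    # (2) exactly minimize over the weights, holding the codes fixed
    W1 = np.linalg.lstsq(X, A, rcond=None)[0]
    W2 = np.linalg.lstsq(np.tanh(A), Y, rcond=None)[0]
```

Note that no global backward pass appears: each weight update sees only its adjacent codes, yet error information still flows backward through the code-fitting step.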
{1542}  
https://github.com/wilicc/gpu-burn Multi-GPU stress test. Are your GPUs overclocked to the point of overheating / being unreliable?  
{1541}  
Like this blog but 100% better!  
{1540}  
Two Routes to Scalable Credit Assignment without Weight Symmetry
This paper looks at five different learning rules, three purely local, and two non-local, to see if they can work as well as backprop in training a deep convolutional net on ImageNet. The local learning networks all feature forward weights W and backward weights B; the forward weights (+ nonlinearities) pass the information to lead to a classification; the backward weights pass the error, which is used to locally adjust the forward weights. Hence, each fake neuron locally has the forward activation, the backward error (or loss gradient), the forward weight, backward weight, and Hebbian terms thereof (e.g. the outer product of the input vectors for both the forward and backward passes). From these available variables, they construct the local learning rules:
Each of these serves as a "regularizer term" on the feedback weights, which governs their learning dynamics. In the case of backprop, the backward weights B are just the instantaneous transpose of the forward weights W. A good local learning rule approximates this transpose progressively. They show that, with proper hyperparameter setting, this does indeed work nearly as well as backprop when training a ResNet18 network. But, hyperparameter settings don't translate to other network topologies. To allow this, they add in nonlocal learning rules:
In "Symmetric Alignment", the Self and Decay rules are employed. This is similar to backprop (the backward weights will track the forward ones) with L2 regularization, which is not new. It performs very similarly to backprop. In "Activation Alignment", the Amp and Sparse rules are employed. I assume this is supposed to be more biologically plausible: the Hebbian term can track the forward weights, while the Sparse rule regularizes and stabilizes the learning, such that the overall dynamics allow the gradient to flow even if W and B aren't transposes of each other. Surprisingly, they find Symmetric Alignment to be more robust to the injection of Gaussian noise during training than backprop. Both SA and AA achieve similar accuracies on the ResNet benchmark. The authors then go on to explain the plausibility of non-local but approximate learning rules with regression discontinuity design, a la Spiking allows neurons to estimate their causal effect. This is a decent paper, reasonably well written. They thought through what variables are available to affect learning, and parameterized five combinations that work. Could they have done the full matrix of combinations, optimizing them just the same as the metaparameters? Perhaps, but that would be even more work... Regarding the desire to reconcile backprop and biology, this paper does not bring us much (if at all) closer. Biological neural networks have specific and local uses for error; even invoking 'error' has limited explanatory power on activity. Learning and firing dynamics, of course of course. Is the brain then just an overbearing mess of details and overlapping rules? Yes, probably, but that doesn't mean that we humans can't find something simpler that works. The algorithms in this paper, for example, are well described by a bit of linear algebra, and yet they are performant.  
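To make "the backward weights track the forward ones" concrete: a rough sketch (my own parameterization, not the paper's exact regularizer decomposition) is to run gradient descent on $\|W - B^T\|^2$, which expands into a weight-decay term on B plus a term pulling B toward the transpose of the independently changing forward weights:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 6))   # forward weights, driven by the task
B = np.zeros((6, 4))              # feedback weights, learned locally
eta = 0.1

for step in range(500):
    # Stand-in for task-driven updates: the forward weights drift slowly.
    W += 0.01 * rng.standard_normal(W.shape)
    # Descend on ||W - B^T||^2: a decay term (-B) plus a 'mirror' term (+W^T),
    # so B progressively approximates the transpose of the forward weights.
    B += eta * (W.T - B)

# Cosine similarity between vec(W^T) and vec(B); should approach 1.
align = np.sum(W.T * B) / (np.linalg.norm(W) * np.linalg.norm(B))
```

Once the alignment is high, passing errors through B is nearly the same as passing them through W transposed, i.e. nearly backprop.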
{1539}  
https://webautocats.com/epc/saab/sbd/  Online, free parts lookup for Saab cars. Useful.  
{1538}  
PMID-20596024 Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex
Cortical reliability amid noise and chaos
 
{1537} 
ref: 0
tags: cortical computation learning predictive coding reviews
date: 02-23-2021 20:15 gmt
revision:2
[1] [0] [head]


PMID-30359606 Predictive Processing: A Canonical Cortical Computation
PMID-23177956 Canonical microcircuits for predictive coding
Control of synaptic plasticity in deep cortical networks
 
{1536}  
From Protein Structure to Function with Bioinformatics
 
{1532}  
PMID-23273272 A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex
See also: PMID-25174710 Sensory-evoked LTP driven by dendritic plateau potentials in vivo
And: The binding solution?, a blog post covering Bittner 2015 that looks at rapid dendritic plasticity in the hippocampus as a means of binding stimuli to place fields.  
{1523} 
ref: 0
tags: tennenbaum compositional learning character recognition oneshot learning
date: 02-23-2021 18:56 gmt
revision:2
[1] [0] [head]


One-shot learning by inverting a compositional causal process
 
{1526} 
ref: 0
tags: neuronal assemblies maass hebbian plasticity simulation austria fMRI
date: 02-23-2021 18:49 gmt
revision:1
[0] [head]


PMID-32381648 A model for structured information representation in neural networks in the brain
 
{1535}  
Reconciling modern machine-learning practice and the classical bias–variance trade-off
A formal publication of the effect famously discovered at OpenAI & publicized on their blog. Goes into some details on Fourier features & runs experiments to verify the OpenAI findings. The result stands. An interesting avenue of research is using genetic algorithms to perform the search over neural network parameters (instead of backprop) in reinforcement-learning tasks. Ben Phillips has a blog post on some of the recent results, which show that it does work for certain 'hard' problems in RL. Of course, this is the dual of the 'lottery ticket' hypothesis and the deep double descent, above; large networks are likely to have solutions 'close enough' to solve a given problem. That said, genetic algorithms don't necessarily perform gradient descent to tweak the weights for optimal behavior once they are within the right region of RL behavior. See {1530} for more discussion on this topic, as well as {1525} for a more complete literature survey.  
{1534}  
Going in circles is the way forward: the role of recurrence in visual inference
I think the best part of this article is the references: a nicely complete listing of, well, the current opinion in Neurobiology! (Note that this issue is edited by our own Karel Svoboda, hence there are a good number of Janelians in the author list..) The gestalt of the review is that deep neural networks need to be recurrent, not purely feedforward. This results in savings in overall network size, and an increase in the achievable computational complexity, perhaps via the incorporation of priors and temporal-spatial information. All this again makes perfect sense and matches my sense of prevailing opinion. Of course, we are left wanting more: all this recurrence ought to be structured in some way. To me, a rather naive way of thinking about it is that feedforward layers cause weak activations, which are 'amplified' or 'selected for' in downstream neurons. These neurons proximally code for 'causes' or local reasons, based on the supported hypothesis that the brain has a good temporal-spatial model of the visuomotor world. The causes then can either explain away the visual input, leading to balanced E-I, or fail to explain it, in which case the excess activity is either rectified by engaging more circuits or by engaging synaptic plasticity. A critical part of this hypothesis is some degree of binding / disentanglement / spatiotemporal reassignment. While not all models of computation require registers / variables (RNNs are Turing-complete, e.g.), I remain stuck on the idea that, to explain phenomenological experience and practical cognition, the brain must have some means of 'binding'. A reasonable place to look is the apical tuft dendrites, which are capable of storing temporary state (calcium spikes, NMDA spikes), undergo rapid synaptic plasticity, and are so dense that they can reasonably store the outer-product space of binding. 
There is mounting evidence for apical tufts working independently / in parallel in investigations of high-gamma in ECoG: PMID-32851172 Dissociation of broadband high-frequency activity and neuronal firing in the neocortex. "High gamma" shows little correlation with MUA when you differentiate early-deep and late-superficial responses, "consistent with the view it reflects dendritic processing separable from local neuronal firing"  
{1533}  
Up until reading this, I had thought that the Baldwin effect refers to the fact that when animals gain an ability to learn, this allows them to take new ecological roles without genotypic adaptation. This is a component of the effect, but is not the original meaning, which is the opposite: when species adapt to a novel environment through phenotypic adaptation (say, adapting to colder weather through within-lifetime variation), evolution tends to push these changes into the germ line. This is something to the effect of Lamarckian evolution. In the case of house finches, as discussed in the link above, this pertains to increased brood variability and sexual dimorphism due to varied maternal habits and hormones under environmental stress. This variance is then rapidly operated on by natural selection to tune the finch to its new environment, including Montana, where the single author did most of his investigation. There are of course countless other details here, but still this is an illuminating demonstration of how evolution works to move information into the genome.  
{1531}  
PMID-24204224 The Convallis rule for unsupervised learning in cortical networks 2013, Pierre Yger & Kenneth D Harris
This paper aims to unify and reconcile experimental evidence of in-vivo learning rules with established STDP rules. In particular, the STDP rule fails to accurately predict the change in strength in response to spike triplets, e.g. pre-post-pre or post-pre-post. Their model instead involves the competition between two different-time-constant threshold circuits / coincidence detectors, one of which controls LTD and the other LTP, and is as such an extension of the classical BCM rule. (BCM: postsynaptic activity below a threshold weakens a synapse; activity above it strengthens it.) They derive the model from an optimization criterion: neurons should try to maximize the skewness of the distribution of their membrane potential, i.e. much time spent either firing spikes or strongly inhibited. This maps to an objective function F that looks like a valley, hence the 'convallis' in the name (Latin for valley); the objective is differentiated to yield a weighting function for weight changes; they also add a shrinkage function (line + Heaviside function) to gate weight changes 'off' at resting membrane potential. A network of firing neurons successfully groups correlated rate-encoded inputs, better than the STDP rule. It can also cluster auditory inputs of spoken digits converted into a cochleogram. But this all seems relatively toy-like: of course algorithms can associate inputs that co-occur. The same result was found for a recurrent balanced E-I network with the same cochleogram, and Convallis performed better than STDP. Meh. Perhaps the biggest thing I got from the paper was how poorly STDP fares with spike triplets: pre following post does not 'necessarily' cause LTD; it's more complicated than that, and more consistent with the two different-time-constant coincidence detectors. 
This is satisfying, as it allows for apical dendritic depolarization to serve as a contextual binding signal - without negatively impacting the associated synaptic weights.  
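Since Convallis is pitched as an extension of BCM, the baseline BCM dynamics are worth sketching. Below is a minimal rate-based BCM neuron with a sliding threshold tracking $\langle y^2 \rangle$; the toy patterns and parameters are mine, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([0.5, 0.5])           # two synaptic weights
theta = 0.1                        # sliding modification threshold
eta, tau = 0.005, 0.99
pat_a = np.array([1.0, 0.2])       # stronger input pattern
pat_b = np.array([0.1, 0.5])       # weaker input pattern

for step in range(5000):
    x = pat_a if rng.random() < 0.5 else pat_b
    y = max(w @ x, 0.0)                        # rectified output rate
    w += eta * x * y * (y - theta)             # LTD below theta, LTP above
    w = np.clip(w, 0.0, 5.0)                   # keep weights bounded
    theta = tau * theta + (1 - tau) * y ** 2   # threshold tracks <y^2>
```

The sliding threshold is what stabilizes the rule and makes the neuron selective: responses above it potentiate, responses below it depress, so the neuron tends to commit to the pattern that drives it hardest.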
{1530} 
ref: 2017
tags: deep neuroevolution jeff clune Uber genetic algorithms
date: 02-18-2021 18:27 gmt
revision:1
[0] [head]


Deep Neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. Uber AI labs; Jeff Clune.
The result is indeed surprising, but it also feels lazy: the total effort or information that they put into writing the actual algorithm is small; as mentioned in the introduction, this is a case of old algorithms with modern levels of compute. Analogously, compare Go-Explore, also by Uber AI labs, vs Agent57 by DeepMind; the Agent57 paper blithely dismisses the otherwise breathless Go-Explore result as feature engineering and unrealistic free backtracking / game-resetting (which is true..) It's strange that they did not incorporate crossover aka recombination, as David MacKay clearly shows that recombination allows for much higher mutation rates and much better transmission of information through a population (chapter 'Why have sex'). They also, perhaps more reasonably, omit developmental encoding, where network weights are tied or controlled through development, again in an analogy to biology. A better solution, as they point out, would be some sort of hybrid GA / ES / A3C system which used both gradient-based tuning, random stochastic gradient-based exploration, and straight genetic optimization, possibly all in parallel, with global selection as the umbrella. They mention this, but to my current knowledge this has not been done.  
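The algorithm really is old and small: a population, Gaussian mutation, truncation selection, no crossover. A toy version on a stand-in fitness function (the paper evolves million-parameter policy nets; the dimension, hyperparameters, and quadratic fitness here are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
dim, pop_size, n_parents, sigma = 10, 64, 8, 0.1
target = rng.standard_normal(dim)        # stand-in for an optimal parameter vector

def fitness(theta):
    # Toy fitness: negative squared distance to the target parameters.
    return -np.sum((theta - target) ** 2)

pop = [rng.standard_normal(dim) for _ in range(pop_size)]
best0 = max(map(fitness, pop))

for gen in range(100):
    parents = sorted(pop, key=fitness, reverse=True)[:n_parents]  # truncation selection
    pop = list(parents)                                           # elitism
    while len(pop) < pop_size:
        p = parents[rng.integers(n_parents)]
        pop.append(p + sigma * rng.standard_normal(dim))          # mutation only, no crossover

best = max(map(fitness, pop))
```

Note the floor this scheme hits: without a shrinking sigma or gradient steps, progress stalls once the best individual is within roughly one mutation's reach of the optimum, which is exactly the fine-tuning gap noted above.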
{1529}  
DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
This paper describes a system for adaptively finding programs which succinctly and accurately produce desired output. These desired outputs are provided by the user / test system, and come from a number of domains:
Also in the lineage is the EC2 algorithm, which most of the same authors above published in 2018. EC2 centers around the idea of "explore-compress": explore solutions to your program induction problem during the 'wake' phase, then compress the observed programs into a library by extracting / factoring out commonalities during the 'sleep' phase. This of course is one of the core algorithms of human learning: explore options, keep track of both what worked and what didn't, search for commonalities among the options & their effects, and use these inferred laws or heuristics to further guide search and goal-setting, thereby building a buffer to attack the curse of dimensionality. Making the inferred laws themselves functions in a programming library allows hierarchically factoring the search task, making exploration of unbounded spaces possible. This advantage is unique to the program synthesis approach. This much is said in the introduction, though perhaps with more clarity. DreamCoder is an improved, more-accessible version of EC2, though the underlying ideas are the same. It differs in that the method for constructing libraries has improved through the addition of a powerful version space for enumerating and evaluating refactorings of the solutions generated during the wake phase. (I will admit that I don't much understand the version space system.) This version space allows DreamCoder to collapse the search space for refactorings by many orders of magnitude, and seems to be a clear advancement. Furthermore, DreamCoder incorporates a second phase of sleep: "dreaming", hence the moniker. During dreaming the library is used to create 'dreams' consisting of combinations of the library primitives, which are then executed with training data as input. These dreams are then used to train up a neural network to predict which library and atomic objects to use in given contexts. 
Context in this case is where in the parse tree a given object has been inserted (its parent and which argument number it sits in); how the data context is incorporated to make this decision is not clear to me (???). This neural dream and replay-trained neural network is either a GRU recurrent net with 64 hidden states, or a convolutional network feeding into an RNN. The final stage is a linear ReLU (???), where again it is not clear how it feeds into the prediction of "which unit to use when". The authors clearly demonstrate that the network, or the probabilistic context-free grammar that it controls (?), is capable of straightforward optimizations, like breaking symmetries due to commutativity, avoiding adding zero, avoiding multiplying by one, etc. Beyond this, they do demonstrate via an ablation study that the presence of the neural network affords significant algorithmic leverage in all of the problem domains tested. The network also seems to learn a reasonable representation of the subtype of task encountered, but a thorough investigation of how it works, or how it might be made to work better, remains desired. I've spent a little time looking around the code, which is a mix of Python high-level experimental control code, and lower-level OCaml code responsible for running (emulating) the lisp-like DSL, inferring types in its polymorphic system / reconciling types in evaluated program instances, maintaining the library, and recompressing it using aforementioned version spaces. The code, like many things experimental, is clearly a work-in-progress, with some old or unused code scattered about, glue to run the many experiments & record / analyze the data, and personal notes from the first author for making his job talks (! :). The description in the supplemental materials, which is satisfyingly thorough (if again impenetrable wrt version spaces), is readily understandable, suggesting that one (presumably the first) author has a clear understanding of the system. 
It doesn't appear that much is being hidden or glossed over, which is not the case for all scientific papers. With the caveat that I don't claim to understand the system to completion, there are some clear areas where the existing system could be augmented further. The 'recognition' or perceptual module, which guides actual synthesis of candidate programs, realistically can use as much information as is available in DreamCoder: full lexical and semantic scope, full input-output specifications, type information, possibly runtime binding of variables when filling holes. This is motivated by the way that humans solve problems, at least as observed by introspection:
Critical to making this work is to have, as I've written in my notes many years ago, a 'self-compressing and factorizing memory'. The version-space magic + library could be considered a working example of this. In the realm of ANNs, per recent OpenAI results with CLIP and DALL-E, really big transformers also seem to have strong compositional abilities, with the caveat that they need to be trained on segments of the whole web. (This wouldn't be an issue here, as DreamCoder generates a lot of its own training data via dreams.) Despite the data-inefficiency of DNNs / transformers, they should be sufficient for making something in the spirit of the above work, with a lot of compute, at least until more efficient models are available (which they should be shortly; see AlphaZero vs MuZero).  
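The explore-compress loop is easy to caricature without any of the real machinery. Below is a toy wake phase (brute-force enumeration of pipelines over a tiny DSL) and an abstraction sleep phase (fuse the most common adjacent pair of primitives into a new library routine), so that a task too deep for the original search budget becomes reachable. The DSL, names, and compression heuristic are all mine, far simpler than DreamCoder's version spaces:

```python
from itertools import product

# Toy DSL: programs are pipelines of unary integer functions.
library = {"inc": lambda x: x + 1, "double": lambda x: x * 2}

def run(prog, x):
    for name in prog:
        x = library[name](x)
    return x

def wake(tasks, max_len):
    # Enumerate all pipelines up to max_len; keep the first program
    # that reproduces each task's input -> output example.
    solved = {}
    for n in range(1, max_len + 1):
        for prog in product(library, repeat=n):
            for t, (inp, out) in tasks.items():
                if t not in solved and run(prog, inp) == out:
                    solved[t] = prog
    return solved

def sleep_compress(solutions):
    # Abstract the most common adjacent pair of primitives into a new routine.
    pairs = {}
    for prog in solutions.values():
        for a, b in zip(prog, prog[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
    if pairs:
        (a, b), _ = max(pairs.items(), key=lambda kv: kv[1])
        f, g = library[a], library[b]
        library[a + ">" + b] = lambda x: g(f(x))

tasks = {"t1": (3, 8), "t2": (5, 12), "t3": (2, 6)}   # all solved by (inc, double)
solved = wake(tasks, max_len=2)
sleep_compress(solved)                                # library gains a fused 'inc>double'
# (3, 18) needs depth 4 in the base DSL, but only depth 2 with the new abstraction.
harder = wake({"t4": (3, 18)}, max_len=2)
```

The point of the caricature: compression doesn't just save space, it shortens the description length of future solutions, which is exactly what makes the hierarchically factored search tractable.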
{1528}  
Discovering hidden factors of variation in deep networks
