m8ta

{1544}  
The HSIC Bottleneck: Deep learning without Backpropagation
In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure. The information bottleneck was proposed by Bialek (spikes..) et al in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input: $\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$ where $T_i$ is the hidden representation at layer i (layer output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that $HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$ where $D = \{(x_1,y_1), \ldots, (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$; that is, it's the kernel function applied to all pairs of (vectoral) input variables. $H = I - \frac{1}{m} 1 1^T$ is the centering matrix. The kernel is simply a Gaussian kernel, $k(x,y) = \exp(-\frac{1}{2} \|x-y\|^2/\sigma^2)$.
So, if all the x and y are on average independent, then the inner product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and positive: the trace of a product of two centered, positive-semidefinite kernel matrices cannot be negative). In practice they use three different widths for their kernel, and they also center the kernel matrices. But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...
For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks, albeit in a much less intelligible way.
Robust Learning with the Hilbert-Schmidt Independence Criterion is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given above nomenclature, $E_X( P_{T_i | X} I(X ; T_i) ) = 0$ (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.) As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)
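To make the empirical estimate $HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$ concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the single kernel width `sigma=1.0` (the paper reportedly uses three widths), and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel_matrix(points, sigma):
    """K[i][j] = exp(-0.5 * ||p_i - p_j||^2 / sigma^2) for a list of vectors."""
    m = len(points)
    K = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = math.exp(-0.5 * sq_dist / sigma ** 2)
    return K


def center(K):
    """Return H K H, where H = I - (1/m) 1 1^T (double centering)."""
    m = len(K)
    row_mean = [sum(row) / m for row in K]
    col_mean = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    grand_mean = sum(row_mean) / m
    return [[K[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(m)] for i in range(m)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: (m-1)^-2 * tr(K_X H K_Y H)."""
    m = len(xs)
    Kx = center(gaussian_kernel_matrix(xs, sigma))
    Ky = center(gaussian_kernel_matrix(ys, sigma))
    # H is idempotent, so tr(K_X H K_Y H) = tr((H K_X H)(H K_Y H)),
    # and tr(A B) = sum_ij A[i][j] * B[j][i].
    trace = sum(Kx[i][j] * Ky[j][i] for i in range(m) for j in range(m))
    return trace / (m - 1) ** 2


# Toy data (an assumption, not from the paper): y = x^2 is fully dependent
# on x yet linearly uncorrelated with it; the second y is drawn independently.
rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0)] for _ in range(60)]
ys_dependent = [[x[0] ** 2] for x in xs]
ys_independent = [[rng.gauss(0.0, 1.0)] for _ in range(60)]
hsic_dep = hsic(xs, ys_dependent)
hsic_ind = hsic(xs, ys_independent)
print("dependent:", hsic_dep, "independent:", hsic_ind)
```

The dependent pair should score noticeably higher than the independent pair, and neither value should be negative.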
{1552}  
Modularizing Deep Learning via Pairwise Learning With Kernels
I think in general this is an important result, even if it's not wholly unique / somewhat anticipated (it's a year old at the time of writing). Modular training of neural networks is great for efficiency, parallelization, and biological implementations! Transport of weights between layers is hence nonessential. Class labels still are essential, but I wonder if temporal continuity can solve some of these problems? (There is plenty of other effort in this area; see also {1544})
{1543} 
ref: 2019
tags: backprop neural networks deep learning coordinate descent alternating minimization
date: 07-21-2021 03:07 gmt
revision:1
[0] [head]


Beyond Backprop: Online Alternating Minimization with Auxiliary Variables
This is interesting in that the weight updates can be done in parallel (perhaps more efficient), but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to autodiff + backprop, I can't see this being adopted broadly. That said, the idea of alternating minimization (which is used e.g. for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices.
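As a reminder of what alternating minimization looks like in its simplest form, here is a small sketch using 1-D k-means, a hard-assignment relative of EM clustering. This is not the paper's algorithm (no auxiliary 'codes', no covariance memory); the function name and the toy data are assumptions for illustration only.

```python
def alternating_minimization(points, init_centers, iterations=10):
    """1-D k-means: alternately minimize one cost over two blocks of variables."""
    centers = list(init_centers)
    assign = [0] * len(points)
    costs = []
    for _ in range(iterations):
        # Block 1: hold the centers fixed, pick the closest center for each point.
        assign = [min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
                  for p in points]
        # Block 2: hold the assignments fixed, move each center to its cluster mean.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = sum(members) / len(members)
        # Neither block can increase the squared-error cost, so it never goes up.
        costs.append(sum((p - centers[a]) ** 2 for p, a in zip(points, assign)))
    return centers, assign, costs


points = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
centers, assign, costs = alternating_minimization(points, [0.0, 1.0])
print(centers, assign, costs[-1])
```

Each block is an easy subproblem even though the joint problem is not convex, which is the appeal of the approach.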
{1535}  
Reconciling modern machine-learning practice and the classical bias–variance trade-off
A formal publication of the effect famously discovered at OpenAI & publicized on their blog. Goes into some details on Fourier features & runs experiments to verify the OpenAI findings. The result stands.
An interesting avenue of research is using genetic algorithms to perform the search over neural network parameters (instead of backprop) in reinforcement-learning tasks. Ben Phillips has a blog post on some of the recent results, which show that it does work for certain 'hard' problems in RL. Of course, this is the dual of the 'lottery ticket' hypothesis and the deep double descent, above; large networks are likely to have solutions 'close enough' to solve a given problem. That said, genetic algorithms don't necessarily perform gradient descent to tweak the weights for optimal behavior once they are within the right region of RL behavior. See {1530} for more discussion on this topic, as well as {1525} for a more complete literature survey.
{1534}  
Going in circles is the way forward: the role of recurrence in visual inference
I think the best part of this article is the references: a nicely complete listing of, well, the current opinion in Neurobiology! (Note that this issue is edited by our own Karel Svoboda, hence there are a good number of Janelians in the author list..) The gestalt of the review is that deep neural networks need to be recurrent, not purely feedforward. This results in savings in overall network size and an increase in the achievable computational complexity, perhaps via the incorporation of priors and temporal-spatial information. All this again makes perfect sense and matches my sense of prevailing opinion. Of course, we are left wanting more: all this recurrence ought to be structured in some way.
To me, a rather naive way of thinking about it is that feedforward layers cause weak activations, which are 'amplified' or 'selected for' in downstream neurons. These neurons proximally code for 'causes' or local reasons, based on the supported hypothesis that the brain has a good temporal-spatial model of the visuomotor world. The causes then can either explain away the visual input, leading to balanced E-I, or fail to explain it, in which case the excess activity is rectified either by engaging more circuits or by engaging synaptic plasticity.
A critical part of this hypothesis is some degree of binding / disentanglement / spatiotemporal reassignment. While not all models of computation require registers / variables (RNNs are Turing-complete, e.g.), I remain stuck on the idea that, to explain phenomenological experience and practical cognition, the brain must have some means of 'binding'. A reasonable place to look is the apical tuft dendrites, which are capable of storing temporary state (calcium spikes, NMDA spikes), undergo rapid synaptic plasticity, and are so dense that they can reasonably store the outer-product space of binding.
There is mounting evidence for apical tufts working independently / in parallel from investigations of high-gamma in ECoG: PMID32851172 Dissociation of broadband high-frequency activity and neuronal firing in the neocortex. "High gamma" shows little correlation with MUA when you differentiate early-deep and late-superficial responses, "consistent with the view it reflects dendritic processing separable from local neuronal firing"
{1530} 
ref: 2017
tags: deep neuroevolution jeff clune Uber genetic algorithms
date: 02-18-2021 18:27 gmt
revision:1
[0] [head]


Deep Neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning
Uber AI labs; Jeff Clune.
The result is indeed surprising, but it also feels lazy: the total effort or information that they put into writing the actual algorithm is small; as mentioned in the introduction, this is a case of old algorithms with modern levels of compute. Analogously, compare Go-Explore, also by Uber AI labs, vs Agent57 by DeepMind; the Agent57 paper blithely dismisses the otherwise breathless Go-Explore result as feature engineering and unrealistic free backtracking / game-resetting (which is true..) It's strange that they did not incorporate crossover aka recombination, as David MacKay clearly shows that recombination allows for much higher mutation rates and much better transmission of information through a population. (Chapter 'Why have sex'). They also perhaps more reasonably omit developmental encoding, where network weights are tied or controlled through development, again in an analogy to biology. A better solution, as they point out, would be some sort of hybrid GA / ES / A3C system which used gradient-based tuning, random stochastic gradient-based exploration, and straight genetic optimization, possibly all in parallel, with global selection as the umbrella. They mention this, but to my current knowledge this has not been done.
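For reference, a mutation-only genetic algorithm of the kind discussed above (no crossover) fits in a few lines. This is a toy sketch, not the paper's implementation: truncation selection, elitism, the hyperparameters, the function name `evolve`, and the quadratic stand-in for an RL return are all assumptions for illustration.

```python
import random


def evolve(fitness, dim, pop_size=50, n_parents=10, sigma=0.1,
           generations=100, seed=0):
    """Mutation-only GA: truncation selection plus one unmodified elite."""
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:n_parents]   # truncation selection: keep the top few
        best_history.append(fitness(ranked[0]))
        children = [ranked[0]]         # elitism: the best survives unchanged
        while len(children) < pop_size:
            parent = rng.choice(parents)
            # Gaussian mutation of every weight; no crossover / recombination.
            children.append([w + sigma * rng.gauss(0.0, 1.0) for w in parent])
        population = children
    best = max(population, key=fitness)
    return best, best_history


# Hypothetical stand-in for an episode return: closeness to a target vector.
TARGET = [1.0, -2.0, 0.5]


def fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


best, best_history = evolve(fitness, dim=len(TARGET))
print(best, fitness(best))
```

Adding recombination would mean building each child from two parents before mutating, which is the piece the note above finds conspicuously absent.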
{1527} 
ref: 0
tags: inductive logic programming deepmind formal propositions prolog
date: 11-21-2020 04:07 gmt
revision:0
[head]


Learning Explanatory Rules from Noisy Data
 
{1510} 
ref: 2017
tags: google deepmind compositional variational autoencoder
date: 04-08-2020 01:16 gmt
revision:7
[6] [5] [4] [3] [2] [1] [head]


SCAN: learning hierarchical compositional concepts
 
{1500}  
PMID31942076 A distributional code for value in dopamine based reinforcement learning
 
{1505}  
Scalable and sustainable deep learning via randomized hashing
 
{1482}  
Rapid learning or feature reuse? Towards understanding the effectiveness of MAML
 
{1441}  
Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
 
{1439} 
ref: 2006
tags: hinton contrastive divergence deep belief nets
date: 02-20-2019 02:38 gmt
revision:0
[head]


PMID16764513 A fast learning algorithm for deep belief nets.
 
{1419}  
All-optical machine learning using diffractive deep neural networks
 
{1174}  
Brains, sex, and machine learning: Hinton Google tech talk.
 
{1422}  
PMID29205151 Towards deep learning with segregated dendrites https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716677/
 
{1412} 
ref: 0
tags: deeplabcut markerless tracking DCN transfer learning
date: 10-03-2018 23:56 gmt
revision:0
[head]


Markerless tracking of user-defined features with deep learning
 
{1408}  
LDMNet: Low dimensional manifold regularized neural nets.
 
{1333}  
 
{1269} 
ref: 0
tags: hinton convolutional deep networks image recognition 2012
date: 01-11-2014 20:14 gmt
revision:0
[head]

