Beyond Backprop: Online Alternating Minimization with Auxiliary Variables
 This paper is sort of interesting: rather than backpropagating the errors, you optimize auxiliary variables, pre-nonlinearity 'codes', in a last-to-first layer order. The optimization is done to minimize a multinomial logistic loss function; the math is not worked out for other loss functions, but presumably this is not a limitation. The loss function also includes a quadratic term on the weights.
 After the 'codes' are set, optimization can proceed in parallel on the weights. This is done with either straight SGD or adaptive ADAM.
 Weight L2 penalty is scheduled over time.
This is interesting in that the weight updates can be done in parallel (perhaps more efficient), but you are still propagating errors backward, albeit via optimizing 'codes'. Given the vast infrastructure devoted to autodiff + backprop, I can't see this being adopted broadly.
That said, the idea of alternating minimization (which is used eg for EM clustering) is powerful, and this paper does describe (though I didn't read it) how there are guarantees on the convexity of the alternating minimization. Likewise, the authors show how to improve the performance of the online / minibatch algorithm by keeping around memory variables, in the form of covariance matrices. 
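As a reminder of what alternating minimization looks like in the simplest case, here is a toy sketch. It is not the paper's algorithm: it factorizes a matrix M ~ U V^T by alternating ridge-regression solves, and the function name `als_factorize` and its parameters are my own invention. The point is only that each sub-problem (U with V fixed, V with U fixed) is convex even though the joint problem is not.

```python
import numpy as np

def als_factorize(M, rank=2, iters=50, reg=1e-3, seed=0):
    """Toy alternating minimization: fit M ~ U @ V.T with alternating ridge solves."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.normal(size=(n, rank))
    V = rng.normal(size=(m, rank))
    ridge = reg * np.eye(rank)
    losses = []
    for _ in range(iters):
        # Fix V, solve exactly for U (convex ridge-regression problem).
        U = np.linalg.solve(V.T @ V + ridge, V.T @ M.T).T
        # Fix U, solve exactly for V (also convex).
        V = np.linalg.solve(U.T @ U + ridge, U.T @ M).T
        fit = np.sum((M - U @ V.T) ** 2)
        penalty = reg * (np.sum(U ** 2) + np.sum(V ** 2))
        losses.append(fit + penalty)
    return U, V, losses
```

Because each half-step exactly minimizes the regularized objective with the other block held fixed, the recorded loss should never increase from one iteration to the next.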
Reconciling modern machine-learning practice and the classical bias-variance trade-off
A formal publication of the effect famously discovered at OpenAI & publicized on their blog. Goes into some details on Fourier features & runs experiments to verify the OpenAI findings. The result stands.
An interesting avenue of research is using genetic algorithms to perform the search over neural network parameters (instead of backprop) in reinforcement-learning tasks. Ben Phillips has a blog post on some of the recent results, which show that it does work for certain 'hard' problems in RL. Of course, this is the dual of the 'lottery ticket' hypothesis and the deep double descent, above; large networks are likely to have solutions 'close enough' to solve a given problem.
That said, genetic algorithms don't necessarily perform gradient descent to tweak the weights for optimal behavior once they are within the right region of RL behavior. See {1530} for more discussion on this topic, as well as {1525} for a more complete literature survey.
Why deep learning works even though it shouldn't, instigated a fun thread thinking about "complexity of model" vs "complexity of solution".
 The blog post starts from the position that modern deep learning should not work because the models are much too complex for the datasets they are trained on, so they should not generalize well.
 Quote: "why do models get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller."
 Argument: in high-dimensional spaces, all solutions are about the same distance from each other. This means that high-dimensional spaces are very well connected. (Seems hand-wavy?)
 Sub-argument: with billions of dimensions, it is exponentially unlikely that the curvature will be positive along every direction, i.e. that you are in a local minimum. Much more likely that about half are positive, half are negative -> saddle.
 This is of course looking at it in terms of gradient descent, which is probably not how biological systems build complexity. See also the saddle paper.
 Claim: Early stopping is better regularization than any hand-picked a priori regularization, including implicit regularization like model size.
 Well, maybe; stopping early of course is normally thought to prevent overfitting or over-memorization of the dataset; but see also Double Descent, below.
 Also: "that weight distributions are highly non-independent even after only a few hundred iterations" (abstract of The Early Phase of Neural Network Training)
 Or: "We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. "
 Claim: SGD, Adam, etc. do not train to a minimum.
 I think this is broadly supportable via the high-dimensional saddle argument.
 He relates this to distillation: a large model can infer 'good structure', possibly via the good luck of having a very large parameter space; a small model can learn these features with fewer parameters, and hopefully there will be fewer 'nuisance' dimensions in the distilled data.
 discussion on Hacker News
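The saddle sub-argument above can be poked at numerically. This is a toy sketch under my own assumption (not the blog post's) that the Hessian at a critical point behaves like a random symmetric Gaussian matrix; `critical_point_stats` is a made-up helper. It counts how often such a matrix has all-positive eigenvalues (a minimum) and what fraction of eigenvalues are positive on average.

```python
import numpy as np

def critical_point_stats(dim, trials=2000, seed=0):
    """Return (fraction of random Hessians that are minima, mean fraction of positive eigenvalues)."""
    rng = np.random.default_rng(seed)
    n_minima = 0
    frac_positive = 0.0
    for _ in range(trials):
        A = rng.normal(size=(dim, dim))
        H = (A + A.T) / 2.0            # random symmetric stand-in for a Hessian
        eig = np.linalg.eigvalsh(H)    # real eigenvalues = curvature along principal directions
        n_minima += int(np.all(eig > 0))
        frac_positive += np.mean(eig > 0)
    return n_minima / trials, frac_positive / trials

for dim in (1, 2, 4, 8):
    p_min, frac_pos = critical_point_stats(dim)
    print(dim, p_min, round(frac_pos, 3))
```

Under this assumption the fraction of minima collapses quickly with dimension while roughly half the directions stay positively curved, which is the saddle picture. Real loss-surface Hessians are not random matrices, so treat this as intuition only.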
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
 This paper is well written & insightful.
 Core idea: train up a network, not necessarily to completion or zero test error.
 Prune away the smallest ~90% of the weights. Pruning is not at all a new idea.
 For larger networks, they propose iterative pruning: train for a while, prune away connections that don't matter, continue.
 Does this sound like human neural development? Yes!
 Restart training from the initial weights, with most of the network pruned away. This network will train up faster, to equivalent accuracy, compared to the original full network.
 This seems to work well for MNIST and CIFAR10.
 From this, they hypothesize that within a large network there is a 'lottery ticket' sub-network that can be trained to represent the training / test dataset well.
 "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective"
 However, either pruning the network (setting the weights to zero) before training, or reinitializing the weights in the trained network from the initialization distribution does not work.
 "Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket"
 The blessing of dimensionality!

 Complementary with dropout, at least the iterative pruning regime.
 But only with a slow learning rate (?) or learning rate warmup for deeper nets.
 Very complete appendix, as necessitated by the submission to ICLR. Within it there is a little 'The truth wears off' effect (or: more caves of complexity)
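The train / prune / rewind loop described above can be sketched on a toy problem. This is only an illustration of the procedure's shape, not the paper's code: the "network" is a linear model trained by gradient descent, and `train`, `find_winning_ticket`, `prune_frac` and `rounds` are names I made up. For a convex linear model the initial weights matter far less than in a deep net, so this shows the mechanics, not the lottery effect itself.

```python
import numpy as np

def train(w0, mask, X, y, lr=0.1, steps=500):
    """Gradient descent on mean squared error; pruned weights are held at zero."""
    w = w0 * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def find_winning_ticket(w0, X, y, prune_frac=0.2, rounds=5):
    """Iterative magnitude pruning, rewinding to the initial weights w0 each round."""
    mask = np.ones_like(w0)
    for _ in range(rounds):
        w = train(w0, mask, X, y)                # always restart from the initial weights
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        # Drop the surviving weights with the smallest trained magnitude.
        drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
        mask[drop] = 0.0
    return mask, train(w0, mask, X, y)
```

On data generated from a sparse linear model, the mask that survives should retain the features that actually carry signal.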
Stabilizing the lottery ticket hypothesis
 With deeper neural networks, you can't prune away most of the weights before at least some training has occurred.
 Instead, train the network partly, then do iterative magnitude pruning to select important weights.
 Even with early training, this works well up to 80% sparsity on ImageNet.
 Given the previous results, this doesn't seem so surprising.
OpenAI Deep Double Descent
 Original phenomenon discovered in Reconciling modern machine-learning practice and the bias-variance trade-off
 Why is bigger always better?
 Another well-written and easily understood post.
 At the interpolation threshold, there are relatively few models that fit the training data well, and label noise can easily mess up their global structure; beyond this interpolation threshold, there are many good models, and SGD somehow has implicit bias (??) to select models that are parsimonious and hence generalize well.
 This despite the fact that classical statistics would suggest that the models are very overparameterized.
 Maybe it's the noise (the S in SGD) which acts as a regularizer? That plus the fact that the networks imperfectly represent structure in the data?
 When there is near-zero training error, what does SGD do ??
 Understanding deep double descent
 Quote: but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?
 Alternate hypothesis, from lesser wrong: ensembling improves generalization. "Which is something we've known for a long time".
 the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam's Razor works...
 That's exactly correct. You can prove it via the Laplace approximation: the "width" of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue $\lambda_i$ contributes $-\frac{1}{2}\log(\lambda_i)$ to the marginal log likelihood $\log P[\text{data}|\text{model}]$. So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by $\frac{1}{2}\log(2)$, or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.
 Ensembling does not explain the lottery ticket hypothesis.
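For reference, the Laplace-approximation argument quoted above, written out. This is the standard form as I remember it, with my own symbols: $\theta^*$ is the mode of the posterior, $k$ the number of parameters, and $\lambda_i$ the eigenvalues of the Hessian of $-\log P[\text{data}, \theta]$ at $\theta^*$. It follows the quote's convention that halving an eigenvalue is what "twice as wide" means.

```latex
\log P[\text{data} \mid \text{model}]
  \approx \log P[\text{data} \mid \theta^*] + \log P[\theta^*]
        + \frac{k}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{k}\log\lambda_i

% Halving one eigenvalue, \lambda_j \to \lambda_j / 2, changes the last term by
-\frac{1}{2}\log\frac{\lambda_j}{2} + \frac{1}{2}\log\lambda_j = \frac{1}{2}\log 2
% which is half a bit when the logarithm is taken base 2.
```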
 Critical learning periods in deep neural networks
 Per above, it also does not explain this result: that the trace of the Fisher Information Matrix goes up then down with training; the SGD consolidates the weights so that 'fewer matter'.

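 FIM, reminding myself: the expected value [ of the outer product [ of the derivative [ of the log-likelihood function, f(data; parameters)]]] , which is all a function of the parameters. (The expected derivative by itself is zero; under the usual regularity conditions the FIM is also minus the expected second derivative.)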
 Expected value is taken over the data.
 Derivative is with respect to the parameters. partial derivative = score; high score = data has a high local dependence on parameters, or equivalently, the parameters should be easier to estimate.
 log-likelihood because that's the way it is; or: probabilities are best understood in decibels.
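To make the FIM reminder concrete, here is a small sketch for logistic regression (my choice of model, not anything from the paper; the function names are invented). It computes the FIM two ways: as the expected outer product of the score, with the expectation over the label taken exactly under the model, and as minus the expected Hessian of the log-likelihood. The two should agree, and the trace is the quantity tracked in the critical-periods result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_from_scores(X, w):
    """FIM as the expected outer product of the score, d/dw log p(y | x, w)."""
    p = sigmoid(X @ w)
    F = np.zeros((X.shape[1], X.shape[1]))
    for x_i, p_i in zip(X, p):
        # Exact expectation over the binary label y under the model.
        for y, prob in ((1.0, p_i), (0.0, 1.0 - p_i)):
            score = (y - p_i) * x_i          # gradient of the log-likelihood w.r.t. w
            F += prob * np.outer(score, score)
    return F / len(X)

def fisher_from_curvature(X, w):
    """Same matrix via minus the expected Hessian: X^T diag(p(1-p)) X / n."""
    p = sigmoid(X @ w)
    return (X * (p * (1.0 - p))[:, None]).T @ X / len(X)
```

`np.trace(fisher_from_scores(X, w))` then gives a scalar summary of how sharply the data constrain the parameters at `w`.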
 Understanding deep learning requires rethinking generalization
 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
 state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data
 95.2% accuracy is still very surprising for a million random labels from 1000 categories.
 Training time increases by a small scalar factor with random labels.
 Regularization via weight decay, dropout, and data augmentation does not eliminate fitting of random labels.
 Works even when the true images are replaced by random noise too.
 Depth two neural networks have perfect sample expressivity as soon as parameters > data points.
 The second bit of meat in the paper is section 5, Implicit regularization: an appeal to linear models.
 Have n data points $\{x_i, y_i\}$ where $x_i$ are d-dimensional feature vectors, and $y_i$ are the labels.
 If we want to solve the fitting problem, $\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \mathrm{loss}(w^T x_i, y_i)$, this is just linear regression, and if d > n, can fit exactly.
 The Hessian of this function is degenerate: the curvature is meaningless, and does not inform generalization.
 With SGD, $w_{t+1} = w_t - \eta e_t x_{i_t}$ where $e_t$ is the prediction error.
 If we start at w=0, $w = \sum_{i=1}^{n} \alpha_i x_i$ for some coefficients $\alpha$.
 Hence, $w = X^T \alpha$, i.e. the weights are in the span of the data points.
 If we interpolate perfectly, then $X w = y$
 Substitute, and get $X X^T \alpha = y$
 This is the "kernel trick" (Schölkopf et al 2001)
 Depends only on all the dot-products between all the datapoints; it's an n*n linear system that can be solved exactly for small sets. (not pseudoinverse!)
 On MNIST, this results in a 1.2% test error (!)
 With Gabor wavelet preprocessing, the error is 0.6% !
 Out of all models, SGD will converge to the model with the minimum norm (without weight decay)
 Norm is only a small part of the generalization puzzle.
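The section 5 argument above is easy to check numerically. A minimal sketch, assuming squared loss and random Gaussian data (my choices; the function names are made up): solve the n*n system $X X^T \alpha = y$, set $w = X^T \alpha$, and compare against per-sample SGD started from w=0.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Solve X X^T alpha = y, then w = X^T alpha (weights lie in the span of the data)."""
    alpha = np.linalg.solve(X @ X.T, y)
    return X.T @ alpha

def sgd_from_zero(X, y, lr=0.01, epochs=500, seed=0):
    """Per-sample SGD on squared loss, starting at w = 0."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            e = X[i] @ w - y[i]        # prediction error e_t
            w -= lr * e * X[i]         # w_{t+1} = w_t - eta * e_t * x_{i_t}
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))          # n = 10 points, d = 50 > n
y = rng.normal(size=10)
w_kernel = min_norm_interpolator(X, y)
w_sgd = sgd_from_zero(X, y)
print(np.max(np.abs(w_sgd - w_kernel)))
```

Every SGD update adds a multiple of some $x_i$, so the iterate never leaves the span of the data, and among all interpolating solutions the one in that span is the minimum-norm one. `np.linalg.lstsq(X, y, rcond=None)[0]` returns the same vector for an underdetermined system, which is a handy cross-check.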
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited
Random deep neural networks are biased towards simple functions
Reconciling modern machine learning practice and the bias-variance trade-off

Title: Error signals in the cortex and basal ganglia.
Abstract: Numerous studies have found correlations between measures of neural activity, from single-unit recordings to aggregate measures such as EEG, to motor behavior. Two general themes have emerged from this research: neurons are generally broadly tuned and are often arrayed in spatial maps. It is hypothesized that these are two features of a larger hierarchical structure of spatial and temporal transforms that allow mappings to procure complex behaviors from abstract goals, or similarly, complex sensory information to produce simple percepts. Much theoretical work has proved the suitability of this organization to both generate behavior and extract relevant information from the world. It is generally agreed that most transforms enacted by the cortex and basal ganglia are learned rather than genetically encoded. Therefore, it is the characterization of the learning process that describes the computational nature of the brain; the descriptions of the basis functions themselves are more descriptive of the brain's environment. Here we hypothesize that learning in the mammalian brain is a stochastic maximization of reward and transform predictability, and a minimization of transform complexity and latency. It is probable that the optimizations employed in learning include both components of gradient descent and competitive elimination, which are two large classes of algorithms explored extensively in the field of machine learning. The former method requires the existence of a vector error signal, while the latter is less restrictive, and requires at least a scalar evaluator. We will look for the existence of candidate error or evaluator signals in the cortex and basal ganglia during force-field learning where the motor error is task-relevant and explicitly provided to the subject.
By simultaneously recording large populations of neurons from multiple brain areas we can probe the existence of error or evaluator signals by measuring the stochastic relationship and predictive ability of neural activity to the provided error signal. From this data we will also be able to track dependence of neural tuning trajectory on trial-by-trial success; if the cortex operates under minimization principles, then tuning change will have a temporal relationship to reward. The overarching goal of this research is to look for one aspect of motor learning, the error signal, with the hope of using this data to better understand the normal function of the cortex and basal ganglia, and how this normal function is related to the symptoms caused by disease and lesions of the brain.