Why deep learning works even though it shouldn't, instigated a fun thread thinking about "complexity of model" vs "complexity of solution".
 The blog post starts from the position that modern deep learning should not work because the models are much too complex for the datasets they are trained on  they should not generalize well.
 Quote" why do models get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller."
 Argument: in highdimensional spaces, all solutions are about the same distance from each other. This means that high dimensional spaces are very well connected. (Seems handwavy?)
 Subargument: with bilions of dimensions, it is exponentially unlikely that all gradients will be positive, e.g. you are in a local minimum. Much more likely that about half are positive, half are negative > saddle.
 This is of course looking at it in terms of gradient descent, which is not probably how biological systems build complexity. See also the saddle paper.
 Claim: Early stopping is better regularization than any handpicked a priori regularization, including implicit regularization like model size.
 Well, maybe; stopping early of course is normally thought to prevent overfitting or overmemorization of the dataset; but see also Double Descent, below.
 Also: "that weight distributions are highly nonindependent even after only a few hundred iterations" abstract of The Early phase of Neural Network Training
 Or: "We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. "
 Claim: SGD, ADAM, etc does not train to a minimum.
 I think this is broadly supportable via the highdimensional saddle argument.
 He relates this to distillation: a large model can infer 'good structure', possibly via the good luck of having a very large parameter space; a small model can learn these features with fewer parameters, and hopefully there will be less 'nuisance' dimensions in the distilled data.
 discussion on Hacker News
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
 This paper, well written & insightful.
 Core idea: train up a network, not necessarily to completion or zero test error.
 Prune away the smallest ~90% of the weights. Pruning is not at all a new idea.
 For larger networks, they propose iterative pruning: train for a while, prune away connections that don't matter, continue.
 Does this sound like human neural development? Yes!
 Restart training from the initial weights, with most of the network pruned away. This network will train up faster, to equivalent accuracy, compared to orinial full network.
 This seems to work well for MNIST and CIFAR10.
 From this, they hypothesize that within a large network there is a 'lottery ticket' subnetwork that can be trained well to represent the training / test dataset well.
 "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective"
 However, either pruning the network (setting the weights to zero) before training, or reinitializing the weights in the trained network from the initialization distribution does not work.
 "Dense, randomlyinitialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket"
 The blessing of dimensionality!

 Complementary with dropout, at least the iterative pruning regime.
 But only with a slow learning rate (?) or learning rate warmup for deeper nets.
 Very complete appendix, as neccessitated by the submission to ICLR. Within it there is a little The truth wears off effect (or: more caves of complexity)
Stabilizing the lottery ticket hypothesis
 With deeper neural networks, you can't prune away most of the weights before at least some training has occurred.
 Instead, train the network partly, then do iterative magnitude pruning to select important weights.
 Even with early training, this works well up to 80% sparsity on Imagenet.
 Given the previous results, this doesn't seem so surprising..
OpenAI Deep Double Descent
 Original phenomena discovered in Reconciling modern machine learning practice and the biasvariance tradeoff
 Why is bigger always better?
 Another wellwritten and easily understood post.
 At the interpolation threshold, there are relatively few models that fit the training data well, and label noise can easily mess up their global structure; beyond this interpolation threshold, there are many good models, and SGD somehow has implicit bias (??) to select models that are parsimonious and hence generalize well.
 This despite the fact that classical statistics would suggest that the models are very overparameterized.
 Maybe it's the noise (the S in SGD) which acts as a regularizer? That plus the fact that the networks imperfectly represent structure in the data?
 When there is nearzero training error, what does SGD do ??
 Understanding deep double descent
 Quote: but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?
 Alternate hypothesis, from lesser wrong: ensembling improves generalization. "Which is something we've known for a long time".
 the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...
 That’s exactly correct. You can prove it via the Laplace approximation: the “width” of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue $\lambda_i$ contributes $\frac{ 1}{ 2}log(\lambda_i)$ to the marginal log likelihood $log P[datamodel]$ . So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by $\frac{ 1}{ 2}log(2)$ , or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the doubledescent curve), this is the main term of interest in the marginal log likelihood.
 Ensembling does not explain the lottery ticket hypothesis.
 Critical learning periods in deep neural networks
 Per above, it also does not explain this result  that the trace of the Fisher Information Matrix goes up then down with training; the SGD consolidates the weights so that 'fewer matter'.

 FIM, reminding myself: the expected value [ of the derivative [ of the loglikelihood function, f(data; parameters)]] , which is all a function of the parameters.
 Expected value is taken over the data.
 Derivative is with respect to the parameters. partial derivative = score; high score = data has a high local dependence on parameters, or equivalently, the parameters should be easier to estimate.
 loglikelihood because that's the way it is; or: probabilities are best understood in decibels.
 Understanding deeplearning requires rethinking generalization
 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
 stateoftheart convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data
 95.2% accuracy is still very surprising for a million random labels from 1000 categories.
 Training time increases by a small scalar factor with random labels.
 Regularization via weight decay, dropout, and data augmentation do not eliminate fitting of random labels.
 Works even when the true images are replaced by random noise too.
 Depth two neural networks have perfect sample expressivity as soon as parameters > data points.
 The second bit of meat in the paper is section 5, Implicit regularization: an appeal to linear models.
 Have n data points ${x_i,y_i}$ where $x_i$ are ddimensional feature vectors, and $y_i$ are the labels.
 if we want to solve the fitting problem, $min_{w^T \elem R^d} \Sigma_{i=1}^{n} loss(w^T x_i,y_i)$  this is just linear regression, and if d > n, can fit exactly.
 The hessian of this function is degenerate  the curvature is meaningless, and does not inform generalization.
 With SGD, $w_{t+1} = w_t  \eta e_t x_{i_t}$ where $e_t$ is the prediction error.
 If we start at w=0, $w = \Sigma_{i=1}^{n} \alpha_i x_i$ for some coefficients $\alpha$ .
 Hence, $w = X^T \alpha$  the weights are in the span of the data points.
 If we interpolate perfectly, then $X w = y$
 Substitute, and get $X X^T \alpha = y$
 This is the "kernel trick" (Scholkopf et al 2001)
 Depends only on all the dotproducts between all the datapoints  it's a n*n linear system that can be solved exactly for small sets. (not pseudoinverse!)
 On mnist, this results in a 1.2% test error (!)
 With gabor wavelet preprocessing, the the error is 0.6% !
 Out of all models, SGD will converge to the model with the minimum norm (without weight decay)
 Norm is only a small part of the generalization puzzle.
Identifying and attacking the saddle point problem in highdimensional nonconvex optimization
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited
Random deep neural networks are biased towards simple functions
Reconciling modern machine learning practice and the biasvariance tradeoff
