{1525} revision 6 modified: 10-26-2020 03:23 gmt

Why deep learning works even though it shouldn't, instigated a fun thread thinking about "complexity of model" vs "complexity of solution".

  • The blog post starts from the position that modern deep learning should not work because the models are much too complex for the datasets they are trained on -- they should not generalize well.
    • Quote" why do models get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller."
  • Argument: in high-dimensional spaces, all solutions are about the same distance from each other. This means that high dimensional spaces are very well connected. (Seems hand-wavy?)
    • Sub-argument: with bilions of dimensions, it is exponentially unlikely that all gradients will be positive, e.g. you are in a local minimum. Much more likely that about half are positive, half are negative -> saddle.
    • This is of course looking at it in terms of gradient descent, which is not probably how biological systems build complexity. See also the saddle paper.
  • Claim: Early stopping is better regularization than any hand-picked a priori regularization, including implicit regularization like model size.
    • Well, maybe; stopping early of course is normally thought to prevent over-fitting or over-memorization of the dataset; but see also Double Descent, below.
    • Also: "that weight distributions are highly non-independent even after only a few hundred iterations" abstract of The Early phase of Neural Network Training
    • Or: "We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. "
  • Claim: SGD, ADAM, etc does not train to a minimum.
    • I think this is broadly supportable via the high-dimensional saddle argument.
    • He relates this to distillation: a large model can infer 'good structure', possibly via the good luck of having a very large parameter space; a small model can learn these features with fewer parameters, and hopefully there will be less 'nuisance' dimensions in the distilled data.
  • discussion on Hacker News

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  • This paper, well written & insightful.
  • Core idea: train up a network, not necessarily to completion or zero test error.
  • Prune away the smallest ~90% of the weights. Pruning is not at all a new idea.
    • For larger networks, they propose iterative pruning: train for a while, prune away connections that don't matter, continue.
      • Does this sound like human neural development? Yes!
  • Re-start training from the initial weights, with most of the network pruned away. This network will train up faster, to equivalent accuracy, compared to orinial full network.
  • This seems to work well for MNIST and CIFAR10.
  • From this, they hypothesize that within a large network there is a 'lottery ticket' sub-network that can be trained well to represent the training / test dataset well.
    • "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective"
  • However, either pruning the network (setting the weights to zero) before training, or re-initializing the weights in the trained network from the initialization distribution does not work.
  • "Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket"
    • The blessing of dimensionality!
  • Complementary with dropout, at least the iterative pruning regime.
  • But only with a slow learning rate (?) or learning rate warmup for deeper nets.
  • Very complete appendix, as neccessitated by the submission to ICLR. Within it there is a little The truth wears off effect (or: more caves of complexity)

Stabilizing the lottery ticket hypothesis

  • With deeper neural networks, you can't prune away most of the weights before at least some training has occurred.
  • Instead, train the network partly, then do iterative magnitude pruning to select important weights.
  • Even with early training, this works well up to 80% sparsity on Imagenet.
  • Given the previous results, this doesn't seem so surprising..

OpenAI Deep Double Descent

  • Original phenomena discovered in Reconciling modern machine learning practice and the bias-variance trade-off
  • Why is bigger always better?
  • Another well-written and easily understood post.
  • At the interpolation threshold, there are relatively few models that fit the training data well, and label noise can easily mess up their global structure; beyond this interpolation threshold, there are many good models, and SGD somehow has implicit bias (??) to select models that are parsimonious and hence generalize well.
    • This despite the fact that classical statistics would suggest that the models are very over-parameterized.
    • Maybe it's the noise (the S in SGD) which acts as a regularizer? That plus the fact that the networks imperfectly represent structure in the data?
      • When there is near-zero training error, what does SGD do ??
  • Understanding deep double descent
    • Quote: but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?
  • Alternate hypothesis, from lesser wrong: ensembling improves generalization. "Which is something we've known for a long time".
    • the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...
      • That’s exactly correct. You can prove it via the Laplace approximation: the “width” of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue λ i\lambda_i contributes 12log(λ i)-\frac{ 1}{ 2}log(\lambda_i) to the marginal log likelihood logP[data|model]log P[data|model] . So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by 12log(2)\frac{ 1}{ 2}log(2) , or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.
      • Ensembling does not explain the lottery ticket hypothesis.

  • Critical learning periods in deep neural networks
    • Per above, it also does not explain this result -- that the trace of the Fisher Information Matrix goes up then down with training; the SGD consolidates the weights so that 'fewer matter'.
    • FIM, reminding myself: the expected value [ of the derivative [ of the log-likelihood function, f(data; parameters)]] , which is all a function of the parameters.
      • Expected value is taken over the data.
      • Derivative is with respect to the parameters. partial derivative = score; high score = data has a high local dependence on parameters, or equivalently, the parameters should be easier to estimate.
      • log-likelihood because that's the way it is; or: probabilities are best understood in decibels.

  • Understanding deep-learning requires re-thinking generalization
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
  • state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data
    • 95.2% accuracy is still very surprising for a million random labels from 1000 categories.
    • Training time increases by a small scalar factor with random labels.
    • Regularization via weight decay, dropout, and data augmentation do not eliminate fitting of random labels.
  • Works even when the true images are replaced by random noise too.
  • Depth two neural networks have perfect sample expressivity as soon as parameters > data points.
  • The second bit of meat in the paper is section 5, Implicit regularization: an appeal to linear models.
  • Have n data points x i,y i{x_i,y_i} where x ix_i are d-dimensional feature vectors, and y iy_i are the labels.
    • if we want to solve the fitting problem, min w TelemR dΣ i=1 nloss(w Tx i,y i) min_{w^T \elem R^d} \Sigma_{i=1}^{n} loss(w^T x_i,y_i) -- this is just linear regression, and if d > n, can fit exactly.
    • The hessian of this function is degenerate -- the curvature is meaningless, and does not inform generalization.
  • With SGD, w t+1=w tηe tx i tw_{t+1} = w_t - \eta e_t x_{i_t} where e te_t is the prediction error.
  • If we start at w=0, w=Σ i=1 nα ix iw = \Sigma_{i=1}^{n} \alpha_i x_i for some coefficients α\alpha .
  • Hence, w=X Tαw = X^T \alpha -- the weights are in the span of the data points.
  • If we interpolate perfectly, then Xw=y X w = y
  • Substitute, and get XX Tα=y X X^T \alpha = y
    • This is the "kernel trick" (Scholkopf et al 2001)
    • Depends only on all the dot-products between all the datapoints -- it's a n*n linear system that can be solved exactly for small sets. (not pseudo-inverse!)
    • On mnist, this results in a 1.2% test error (!)
    • With gabor wavelet pre-processing, the the error is 0.6% !
  • Out of all models, SGD will converge to the model with the minimum norm (without weight decay)
    • Norm is only a small part of the generalization puzzle.

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

Random deep neural networks are biased towards simple functions

Reconciling modern machine learning practice and the bias-variance trade-off