use https for features.
text: sort by
tags: modified
type: chronology
hide / / print
ref: -0 tags: double descent complexity construction gradient descent date: 10-26-2020 03:23 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Why deep learning works even though it shouldn't, instigated a fun thread thinking about "complexity of model" vs "complexity of solution".

  • The blog post starts from the position that modern deep learning should not work because the models are much too complex for the datasets they are trained on -- they should not generalize well.
    • Quote" why do models get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller."
  • Argument: in high-dimensional spaces, all solutions are about the same distance from each other. This means that high dimensional spaces are very well connected. (Seems hand-wavy?)
    • Sub-argument: with bilions of dimensions, it is exponentially unlikely that all gradients will be positive, e.g. you are in a local minimum. Much more likely that about half are positive, half are negative -> saddle.
    • This is of course looking at it in terms of gradient descent, which is not probably how biological systems build complexity. See also the saddle paper.
  • Claim: Early stopping is better regularization than any hand-picked a priori regularization, including implicit regularization like model size.
    • Well, maybe; stopping early of course is normally thought to prevent over-fitting or over-memorization of the dataset; but see also Double Descent, below.
    • Also: "that weight distributions are highly non-independent even after only a few hundred iterations" abstract of The Early phase of Neural Network Training
    • Or: "We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. "
  • Claim: SGD, ADAM, etc does not train to a minimum.
    • I think this is broadly supportable via the high-dimensional saddle argument.
    • He relates this to distillation: a large model can infer 'good structure', possibly via the good luck of having a very large parameter space; a small model can learn these features with fewer parameters, and hopefully there will be less 'nuisance' dimensions in the distilled data.
  • discussion on Hacker News

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  • This paper, well written & insightful.
  • Core idea: train up a network, not necessarily to completion or zero test error.
  • Prune away the smallest ~90% of the weights. Pruning is not at all a new idea.
    • For larger networks, they propose iterative pruning: train for a while, prune away connections that don't matter, continue.
      • Does this sound like human neural development? Yes!
  • Re-start training from the initial weights, with most of the network pruned away. This network will train up faster, to equivalent accuracy, compared to orinial full network.
  • This seems to work well for MNIST and CIFAR10.
  • From this, they hypothesize that within a large network there is a 'lottery ticket' sub-network that can be trained well to represent the training / test dataset well.
    • "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective"
  • However, either pruning the network (setting the weights to zero) before training, or re-initializing the weights in the trained network from the initialization distribution does not work.
  • "Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket"
    • The blessing of dimensionality!
  • Complementary with dropout, at least the iterative pruning regime.
  • But only with a slow learning rate (?) or learning rate warmup for deeper nets.
  • Very complete appendix, as neccessitated by the submission to ICLR. Within it there is a little The truth wears off effect (or: more caves of complexity)

Stabilizing the lottery ticket hypothesis

  • With deeper neural networks, you can't prune away most of the weights before at least some training has occurred.
  • Instead, train the network partly, then do iterative magnitude pruning to select important weights.
  • Even with early training, this works well up to 80% sparsity on Imagenet.
  • Given the previous results, this doesn't seem so surprising..

OpenAI Deep Double Descent

  • Original phenomena discovered in Reconciling modern machine learning practice and the bias-variance trade-off
  • Why is bigger always better?
  • Another well-written and easily understood post.
  • At the interpolation threshold, there are relatively few models that fit the training data well, and label noise can easily mess up their global structure; beyond this interpolation threshold, there are many good models, and SGD somehow has implicit bias (??) to select models that are parsimonious and hence generalize well.
    • This despite the fact that classical statistics would suggest that the models are very over-parameterized.
    • Maybe it's the noise (the S in SGD) which acts as a regularizer? That plus the fact that the networks imperfectly represent structure in the data?
      • When there is near-zero training error, what does SGD do ??
  • Understanding deep double descent
    • Quote: but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?
  • Alternate hypothesis, from lesser wrong: ensembling improves generalization. "Which is something we've known for a long time".
    • the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...
      • That’s exactly correct. You can prove it via the Laplace approximation: the “width” of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue λ i\lambda_i contributes 12log(λ i)-\frac{ 1}{ 2}log(\lambda_i) to the marginal log likelihood logP[data|model]log P[data|model] . So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by 12log(2)\frac{ 1}{ 2}log(2) , or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.
      • Ensembling does not explain the lottery ticket hypothesis.

  • Critical learning periods in deep neural networks
    • Per above, it also does not explain this result -- that the trace of the Fisher Information Matrix goes up then down with training; the SGD consolidates the weights so that 'fewer matter'.
    • FIM, reminding myself: the expected value [ of the derivative [ of the log-likelihood function, f(data; parameters)]] , which is all a function of the parameters.
      • Expected value is taken over the data.
      • Derivative is with respect to the parameters. partial derivative = score; high score = data has a high local dependence on parameters, or equivalently, the parameters should be easier to estimate.
      • log-likelihood because that's the way it is; or: probabilities are best understood in decibels.

  • Understanding deep-learning requires re-thinking generalization
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
  • state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data
    • 95.2% accuracy is still very surprising for a million random labels from 1000 categories.
    • Training time increases by a small scalar factor with random labels.
    • Regularization via weight decay, dropout, and data augmentation do not eliminate fitting of random labels.
  • Works even when the true images are replaced by random noise too.
  • Depth two neural networks have perfect sample expressivity as soon as parameters > data points.
  • The second bit of meat in the paper is section 5, Implicit regularization: an appeal to linear models.
  • Have n data points x i,y i{x_i,y_i} where x ix_i are d-dimensional feature vectors, and y iy_i are the labels.
    • if we want to solve the fitting problem, min w TelemR dΣ i=1 nloss(w Tx i,y i) min_{w^T \elem R^d} \Sigma_{i=1}^{n} loss(w^T x_i,y_i) -- this is just linear regression, and if d > n, can fit exactly.
    • The hessian of this function is degenerate -- the curvature is meaningless, and does not inform generalization.
  • With SGD, w t+1=w tηe tx i tw_{t+1} = w_t - \eta e_t x_{i_t} where e te_t is the prediction error.
  • If we start at w=0, w=Σ i=1 nα ix iw = \Sigma_{i=1}^{n} \alpha_i x_i for some coefficients α\alpha .
  • Hence, w=X Tαw = X^T \alpha -- the weights are in the span of the data points.
  • If we interpolate perfectly, then Xw=y X w = y
  • Substitute, and get XX Tα=y X X^T \alpha = y
    • This is the "kernel trick" (Scholkopf et al 2001)
    • Depends only on all the dot-products between all the datapoints -- it's a n*n linear system that can be solved exactly for small sets. (not pseudo-inverse!)
    • On mnist, this results in a 1.2% test error (!)
    • With gabor wavelet pre-processing, the the error is 0.6% !
  • Out of all models, SGD will converge to the model with the minimum norm (without weight decay)
    • Norm is only a small part of the generalization puzzle.

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

Random deep neural networks are biased towards simple functions

Reconciling modern machine learning practice and the bias-variance trade-off

hide / / print
ref: -2005 tags: dimensionality reduction contrastive gradient descent date: 09-13-2020 02:49 gmt revision:2 [1] [0] [head]

Dimensionality reduction by learning and invariant mapping

  • Raia Hadsell, Sumit Chopra, Yann LeCun
  • Central idea: learn and invariant mapping of the input by minimizing mapped distance (e.g. the distance between outputs) when the samples are categorized as the same (same numbers in MNIST eg), and maximizing mapped distance when the samples are categorized as distant.
    • Two loss functions for same vs different.
  • This is an attraction-repulsion spring analogy.
  • Use gradient descent to change the weights to satisfy these two competing losses.
  • Resulting constitutional neural nets can extract camera pose information from the NORB dataset.
  • Surprising how simple analogies like this, when iterated across a great many samples, pull out intuitively correct invariances.

hide / / print
ref: -0 tags: perl directory descent script remove date: 01-10-2014 06:12 gmt revision:0 [head]

Simple perl scrip for removing duplicate files within sub-directories of a known depth:

#!/usr/bin/perl -w

@files = <*>;
foreach $file (@files) {
	@files2 = <$file/*>;
	foreach $file2 (@files2) {
		print $file2 . "\n";
		`rm -rf $file2/*_1.jpg`; 
		`rm -rf $file2/*_2.jpg`; 

hide / / print
ref: abstract-0 tags: tlh24 error signals in the cortex and basal ganglia reinforcement_learning gradient_descent motor_learning date: 0-0-2006 0:0 revision:0 [head]

Title: Error signals in the cortex and basal ganglia.

Abstract: Numerous studies have found correlations between measures of neural activity, from single unit recordings to aggregate measures such as EEG, to motor behavior. Two general themes have emerged from this research: neurons are generally broadly tuned and are often arrayed in spatial maps. It is hypothesized that these are two features of a larger hierarchal structure of spatial and temporal transforms that allow mappings to procure complex behaviors from abstract goals, or similarly, complex sensory information to produce simple percepts. Much theoretical work has proved the suitability of this organization to both generate behavior and extract relevant information from the world. It is generally agreed that most transforms enacted by the cortex and basal ganglia are learned rather than genetically encoded. Therefore, it is the characterization of the learning process that describes the computational nature of the brain; the descriptions of the basis functions themselves are more descriptive of the brain’s environment. Here we hypothesize that learning in the mammalian brain is a stochastic maximization of reward and transform predictability, and a minimization of transform complexity and latency. It is probable that the optimizations employed in learning include both components of gradient descent and competitive elimination, which are two large classes of algorithms explored extensively in the field of machine learning. The former method requires the existence of a vectoral error signal, while the latter is less restrictive, and requires at least a scalar evaluator. We will look for the existence of candidate error or evaluator signals in the cortex and basal ganglia during force-field learning where the motor error is task-relevant and explicitly provided to the subject. By simultaneously recording large populations of neurons from multiple brain areas we can probe the existence of error or evaluator signals by measuring the stochastic relationship and predictive ability of neural activity to the provided error signal. From this data we will also be able to track dependence of neural tuning trajectory on trial-by-trial success; if the cortex operates under minimization principles, then tuning change will have a temporal relationship to reward. The overarching goal of this research is to look for one aspect of motor learning – the error signal – with the hope of using this data to better understand the normal function of the cortex and basal ganglia, and how this normal function is related to the symptoms caused by disease and lesions of the brain.