Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution
 Central hypothesis is that simplicity and symmetry arrive not through natural selection, but because these form are overwhelmingly represented in the genotypephenotype map
 Experimental example here was "polyominoes", where there are N=16 tiles, each with a 4 numbers (encoded with e.g. 6bit binary numbers). The edge numbers determine how the tiles irreversibly bind, e.g. 1 <> 2, 3 <> 4 etc, with 4 and 2^61 binding to nothing.
 These tiles are allowed to 'randomly' selfassemble. Some don't terminate (e.g. they form continuous polymers); these are discarded; others do terminate (no more available binding sites).
 They assessed the complexity of both polyominoes selected for a particular size, eg 16 tiles, or those not selected at all, other than terminating.
 In both complexity was assessed based on how many actual interactions were needed to make the observed structure. That is, they removed tile edge numbers and kept it if it affected the nmer formation.
 Result was this nice loglog plot:

 Showed that this same trend holds for proteinprotein complexes (weaker result, imho)
 As well as RNA secondary structure
 And metabolic timeseries in a ODE modeled on yeast metabolism (even weaker result..)
The paper features a excellent set of references, including:
Letter to a friend following her article Machine learning in evolutionary studies comes of age
Read your PNAS article last night, super interesting that you can get statistical purchase on longlost evolutionary 'sweeps' via GANs and other neural network models. I feel like there is some sort of statistical power issue there? DNNs are almost always overparameterized... slightly suspicious.
This morning I was sleepily mulling things over & thought about a walking conversation that we had a long time ago in the woods of NC: Why is evolution so effective? Why does it seem to evolve to evolve? Thinking more  and having years more perspective  it seems almost obvious in retrospect: it's a consequence of Bayes' rule. Evolution finds solutions in spaces that have overwhelming prevalence of working solutions. The prior has an extremely strong effect. These representational / structural spaces by definition have many nearby & associated solutions, hence appear posthoc 'evolvable'. (You probably already know this.)
I think proteins very much fall into this category: AA were added to the translation machinery based on ones that happened to solve a particular problem... but because of the 'generalization prior' (to use NN parlance), they were useful for many other things. This does not explain the humanengineeringlike modularity of mature evolved systems, but maybe that is due to the strong simplicity prior [1]
Very very interesting to me is how the science of evolution and neural networks are drawing together, vis a vis the lottery ticket hypothesis. Both evince a continuum of representational spaces, too, from highdimensional vectoral (how all modern deep learning systems work) to lowdimensional modular, specific, and general (phenomenological human cognition). I suspect that evolution uses a form of this continuum, as seen in the human highdimensional longrange gene regulatory / enhancer network (= a structure designed to evolve). Not sure how selection works here, though; it's hard to search a highdimensional space. The brain has an almost identical problem: it's hard to do 'credit assignment' in a billionslarge, deep and recurrent network. Finding which set of synapses caused a good / bad behaviior takes a lot of bits. 
Why deep learning works even though it shouldn't, instigated a fun thread thinking about "complexity of model" vs "complexity of solution".
 The blog post starts from the position that modern deep learning should not work because the models are much too complex for the datasets they are trained on  they should not generalize well.
 Quote" why do models get better when they are bigger and deeper, even when the amount of data they consume stays the same or gets smaller."
 Argument: in highdimensional spaces, all solutions are about the same distance from each other. This means that high dimensional spaces are very well connected. (Seems handwavy?)
 Subargument: with bilions of dimensions, it is exponentially unlikely that all gradients will be positive, e.g. you are in a local minimum. Much more likely that about half are positive, half are negative > saddle.
 This is of course looking at it in terms of gradient descent, which is not probably how biological systems build complexity. See also the saddle paper.
 Claim: Early stopping is better regularization than any handpicked a priori regularization, including implicit regularization like model size.
 Well, maybe; stopping early of course is normally thought to prevent overfitting or overmemorization of the dataset; but see also Double Descent, below.
 Also: "that weight distributions are highly nonindependent even after only a few hundred iterations" abstract of The Early phase of Neural Network Training
 Or: "We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. "
 Claim: SGD, ADAM, etc does not train to a minimum.
 I think this is broadly supportable via the highdimensional saddle argument.
 He relates this to distillation: a large model can infer 'good structure', possibly via the good luck of having a very large parameter space; a small model can learn these features with fewer parameters, and hopefully there will be less 'nuisance' dimensions in the distilled data.
 discussion on Hacker News
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
 This paper, well written & insightful.
 Core idea: train up a network, not necessarily to completion or zero test error.
 Prune away the smallest ~90% of the weights. Pruning is not at all a new idea.
 For larger networks, they propose iterative pruning: train for a while, prune away connections that don't matter, continue.
 Does this sound like human neural development? Yes!
 Restart training from the initial weights, with most of the network pruned away. This network will train up faster, to equivalent accuracy, compared to orinial full network.
 This seems to work well for MNIST and CIFAR10.
 From this, they hypothesize that within a large network there is a 'lottery ticket' subnetwork that can be trained well to represent the training / test dataset well.
 "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective"
 However, either pruning the network (setting the weights to zero) before training, or reinitializing the weights in the trained network from the initialization distribution does not work.
 "Dense, randomlyinitialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket"
 The blessing of dimensionality!

 Complementary with dropout, at least the iterative pruning regime.
 But only with a slow learning rate (?) or learning rate warmup for deeper nets.
 Very complete appendix, as neccessitated by the submission to ICLR. Within it there is a little The truth wears off effect (or: more caves of complexity)
Stabilizing the lottery ticket hypothesis
 With deeper neural networks, you can't prune away most of the weights before at least some training has occurred.
 Instead, train the network partly, then do iterative magnitude pruning to select important weights.
 Even with early training, this works well up to 80% sparsity on Imagenet.
 Given the previous results, this doesn't seem so surprising..
OpenAI Deep Double Descent
 Original phenomena discovered in Reconciling modern machine learning practice and the biasvariance tradeoff
 Why is bigger always better?
 Another wellwritten and easily understood post.
 At the interpolation threshold, there are relatively few models that fit the training data well, and label noise can easily mess up their global structure; beyond this interpolation threshold, there are many good models, and SGD somehow has implicit bias (??) to select models that are parsimonious and hence generalize well.
 This despite the fact that classical statistics would suggest that the models are very overparameterized.
 Maybe it's the noise (the S in SGD) which acts as a regularizer? That plus the fact that the networks imperfectly represent structure in the data?
 When there is nearzero training error, what does SGD do ??
 Understanding deep double descent
 Quote: but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?
 Alternate hypothesis, from lesser wrong: ensembling improves generalization. "Which is something we've known for a long time".
 the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occamâ€™s Razor works...
 Thatâ€™s exactly correct. You can prove it via the Laplace approximation: the â€œwidthâ€ of the peak in each principal direction is the inverse of an eigenvalue of the Hessian, and each eigenvalue $\lambda_i$ contributes $\frac{ 1}{ 2}log(\lambda_i)$ to the marginal log likelihood $log P[datamodel]$ . So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by $\frac{ 1}{ 2}log(2)$ , or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the doubledescent curve), this is the main term of interest in the marginal log likelihood.
 Ensembling does not explain the lottery ticket hypothesis.
 Critical learning periods in deep neural networks
 Per above, it also does not explain this result  that the trace of the Fisher Information Matrix goes up then down with training; the SGD consolidates the weights so that 'fewer matter'.

 FIM, reminding myself: the expected value [ of the derivative [ of the loglikelihood function, f(data; parameters)]] , which is all a function of the parameters.
 Expected value is taken over the data.
 Derivative is with respect to the parameters. partial derivative = score; high score = data has a high local dependence on parameters, or equivalently, the parameters should be easier to estimate.
 loglikelihood because that's the way it is; or: probabilities are best understood in decibels.
 Understanding deeplearning requires rethinking generalization
 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
 stateoftheart convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data
 95.2% accuracy is still very surprising for a million random labels from 1000 categories.
 Training time increases by a small scalar factor with random labels.
 Regularization via weight decay, dropout, and data augmentation do not eliminate fitting of random labels.
 Works even when the true images are replaced by random noise too.
 Depth two neural networks have perfect sample expressivity as soon as parameters > data points.
 The second bit of meat in the paper is section 5, Implicit regularization: an appeal to linear models.
 Have n data points ${x_i,y_i}$ where $x_i$ are ddimensional feature vectors, and $y_i$ are the labels.
 if we want to solve the fitting problem, $min_{w^T \elem R^d} \Sigma_{i=1}^{n} loss(w^T x_i,y_i)$  this is just linear regression, and if d > n, can fit exactly.
 The hessian of this function is degenerate  the curvature is meaningless, and does not inform generalization.
 With SGD, $w_{t+1} = w_t  \eta e_t x_{i_t}$ where $e_t$ is the prediction error.
 If we start at w=0, $w = \Sigma_{i=1}^{n} \alpha_i x_i$ for some coefficients $\alpha$ .
 Hence, $w = X^T \alpha$  the weights are in the span of the data points.
 If we interpolate perfectly, then $X w = y$
 Substitute, and get $X X^T \alpha = y$
 This is the "kernel trick" (Scholkopf et al 2001)
 Depends only on all the dotproducts between all the datapoints  it's a n*n linear system that can be solved exactly for small sets. (not pseudoinverse!)
 On mnist, this results in a 1.2% test error (!)
 With gabor wavelet preprocessing, the the error is 0.6% !
 Out of all models, SGD will converge to the model with the minimum norm (without weight decay)
 Norm is only a small part of the generalization puzzle.
Identifying and attacking the saddle point problem in highdimensional nonconvex optimization
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited
Random deep neural networks are biased towards simple functions
Reconciling modern machine learning practice and the biasvariance tradeoff

* Watch the [http://homes.cs.washington.edu/~todorov/index.php?video=MordatchSIGGRAPH12&paper=Mordatch,%20SIGGRAPH%202012 movies!
Discovery of complex behaviors through contactinvariant optimization]
 Complex movements tend to have phases within which the set of active contacts (hands, feet) remains invariant (hence can exert forces on the objects they are contacting, or vice versa).
 Discovering suitable contact sets is the central goal of optimization in our approach.
 Once this is done, optimizing the remaining aspects of the movement tends to be relatively straightforward.
 They do this through axillary scalar variables which indicate whether the a contact is active or not, hence whether to enable contact forces.
 Allows the optimizer to 'realize' that movements should have phases.
 Also "shapes the energy landscape to be smoother and better behaved"
 Initial attempts to make these contact axillary variables discrete  when and where  which was easy for humans to specify, but made optimization intractable.
 Motion between contacts was modeled as a continuous feedback system.
 Instead, the contact variables $c_i$ have to be continuous.
 Contact forces are active only when $c_i$ is 'large'.
 Hence all potential contacts have to be enumerated in advance.
 Then, parameterize the end effector (position) and use inverse kinematics to figure out joint angles.
 Optimization:
 Break the movement up into a predefined number of phases, equal duration.
 Interpolate endeffector with splines
 Physics constraints are 'soft'  helps the optimizer : 'powerful continuation methods'
 That is, weight different terms differently in phases of the optimization process.
 Likewise, appendages are allowed to stretch and intersect, with a smooth cost.
 Contactinvariant cost penalizes distortion and slip (difference between endpoint and surface, measured normal, and velocity relative to contact point)
 Contact point is also 'soft' and smooth via distancenormalized weighting.
 All contact forces are merged into a $f \in \mathbb{R}^6$ vector, which includes both forces and torques. Hence contact force origin can move within the contact patch, which again makes the optimization smoother.
 Set $\tau(q, \dot{q}, \ddot{q}) = J(q)^T f + B u$ where $J(q)^T$ maps generalize (endpoint) velocities to contactpoint velocities, and f above are the contactforces. $B$ is to map control forces $u$ to the full space.
 $\tau(q, \dot{q}, \ddot{q}) = M(q)\dot{q} + C(q, \dot{q})\dot{q} + G(q)$  M is inertia matrix, C is Coriolis matrix, g is gravity.
 This means: forces need to add to zero. (friction $f$ + control $u$ = inertia + coriolis + gravity)
 Hence need to optimize $f$ and $u$ .
 Use frictioncone approximation for nongrab (feet) contact forces.
 These are optimized within a quadratic programming framework.
 LBFGS algo.
 Squared terms for friction and control, squared penalization for penetrating and slipping on a surface.
 Phases of optimization (continuation method):
 $L(s) = L_{CI}(s) + L_{physics}(s) + L_{task}(s) + L_{hint}(s)$
 task term only: wishful thinking.
 all 4 terms, physcics lessened  gradually add constraints.
 all terms, no hint, full physics.
 Total time to simulate 210 minutes per clip (only!)
 The equations of the paper seem incomplete  not clear how QP eq fits in with the $L(s)$ , and how $c_i$ fits in with $J(q)^T f + B u$
