m8ta
{1571}
ref: -2022 tags: language learning symbolic regression Fleet meta search date: 06-04-2022 02:28 gmt revision:4 [3] [2] [1] [0] [head]

One model for the learning of language

  • Yuan Yang and Steven T. Piantadosi
  • Idea: Given a restricted compositional 'mentalese' programming language / substrate, construct a set of grammatical rules ('hypotheses') from a small number of examples of an (abstract) language.
    • Pinker's argument that there is too little stimulus ("poverty of the stimulus") for children to discern grammatical rules, hence grammar must be innate, is thereby refuted.
      • This is not the only refutation.
      • An argument was made on Twitter that large language models also refute the poverty-of-stimulus hypothesis. Meh, this paper does it far better -- the data used to train transformers is hardly small.
  • Hypotheses are sampled from the substrate using MCMC, and selected based on a smoothed Bayesian likelihood.
    • This likelihood takes into account partial hits -- results that are within an edit distance of one of the desired set of strings (I think).
  • They use parallel tempering to search the space of programs.
    • Roughly: keep alive many different hypotheses, and vary the temperatures of each lineage to avoid getting stuck in local minima. (A minimal sketch follows this list.)
    • But there are other search heuristics; see https://codedocs.xyz/piantado/Fleet/
  • Execution is on the CPU, across multiple cores / threads, possibly across multiple servers.
  • Larger hypotheses took up to 7 days to find (!)
    • These aren't very complicated grammars.
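
A minimal sketch of parallel tempering over discrete hypotheses (my own toy code, not Fleet's implementation; log_posterior, propose, and init are assumed to be supplied by the problem):

import numpy as np

def parallel_tempering(log_posterior, propose, init, temps=(1.0, 2.0, 4.0, 8.0),
                       n_steps=10000, swap_every=50, rng=None):
    # One Metropolis chain per temperature; hotter chains accept worse hypotheses
    # more readily, and adjacent chains occasionally swap states.
    rng = rng or np.random.default_rng()
    chains = [init() for _ in temps]
    scores = [log_posterior(h) for h in chains]
    best = max(zip(scores, chains), key=lambda sc: sc[0])
    for step in range(n_steps):
        for i, T in enumerate(temps):
            cand = propose(chains[i], rng)
            s = log_posterior(cand)
            if np.log(rng.random()) < (s - scores[i]) / T:   # Metropolis accept at temperature T
                chains[i], scores[i] = cand, s
                if s > best[0]:
                    best = (s, cand)
        if step % swap_every == 0:
            for i in range(len(temps) - 1):                  # try swapping adjacent temperatures
                a = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (scores[i + 1] - scores[i])
                if np.log(rng.random()) < a:
                    chains[i], chains[i + 1] = chains[i + 1], chains[i]
                    scores[i], scores[i + 1] = scores[i + 1], scores[i]
    return best   # (best log posterior, best hypothesis)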

  • This is very similar to {842}, only on grammars rather than continuous signals from MoCap.
  • Proves once again that:
    1. Many domains of the world can be adequately described by relatively simple computational structures (It's a low-D, compositional world out there)
      1. Or, the Johnson-Lindenstrauss lemma
    2. You can find those hypotheses through brute-force + heuristic search. (At least to the point that you run into the curse of dimensionality)

A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer -- in this case, directly taken from Vaswani 2017 (8-head, 8-layer QKV w/ a latent dimension of 512) -- to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception -- both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly -- it also needs post-processing search. Also, the transformer algorithm has search (in the form of softmax) baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs w/ 32 GB of memory. So, 4k GPU-hours x ~10 TFlops/sec ≈ 1.4e20 Flops. Compare this with the grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec per core is ~6e16 ops. Much, much smaller compute. (Quick calculation below.)
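
The arithmetic, spelled out (same assumptions as in the text above):

transformer_flops = 250 * 1 * 3600 * 16 * 10e12    # 250 epochs x 1 hr x 16 GPUs x ~10 TFlops/sec
grammar_ops       = 7 * 24 * 3600 * 32 * 3e9        # 7 days x 32 cores x ~3 Gops/sec per core
print(f"{transformer_flops:.1e} vs {grammar_ops:.1e} -> {transformer_flops / grammar_ops:.0f}x")
# ~1.4e20 vs ~5.8e16: roughly a 2500x difference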

All of this is to suggest a central theme of computer science: a continuum between search and memorization.

  • The language paper does fast search, but does not learn from the process (bootstrap), and maintains little state/memory.
  • The symbolic regression paper does moderate amounts of search, but continually learns from the process, and stores a great deal of heuristics for the problem domain.

Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where on these axes (search, heuristic, memory) visual perception is. Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating E-M?

{1570}
ref: -0 tags: Balduzzi backprop biologically plausible red-tape date: 05-31-2022 20:48 gmt revision:1 [0] [head]

Kickback cuts Backprop's red-tape: Biologically plausible credit assignment in neural networks

Bit of a meh -- idea is, rather than propagating error signals backwards through a hierarchy, you propagate only one layer + use a signed global reward signal. This works by keeping the network ‘coherent’ -- positive neurons have positive input weights, and negative neurons have negative weights, such that the overall effect of a weight change does not change sign when propagated forward through the network.
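
A toy sketch of how I read the coherence idea (my paraphrase, not the paper's actual update rule; with only two layers this collapses to the delta rule, so the point shown here is just the sign constraint, not the one-layer shortcut itself):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
unit_sign = np.where(rng.random(n_hid) < 0.5, 1.0, -1.0)     # each hidden unit gets a fixed sign
W1 = unit_sign * np.abs(rng.normal(size=(n_in, n_hid)))       # 'coherent' input weights
w2 = unit_sign * np.abs(rng.normal(size=n_hid))               # 'coherent' readout weights

def step(x, target, lr=1e-2):
    global W1, w2
    h = np.maximum(W1.T @ x, 0.0)               # ReLU hidden layer
    err = target - (w2 @ h)                     # signed, global scalar error
    w2 += lr * err * h                          # local delta rule at the output
    W1 += lr * np.outer(x, err * w2 * (h > 0))  # one layer of propagation, nothing deeper
    # enforce coherence: clip each unit's weights back to its assigned sign
    W1[:] = unit_sign * np.clip(unit_sign * W1, 0.0, None)
    w2[:] = unit_sign * np.clip(unit_sign * w2, 0.0, None)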

This is kind of a lame shortcut, imho, as it limits the types of functions that the network can model & the computational structure of the network, which is already quite limited by the common dot-product-rectifier structure (as is used here). Much more interesting, and possibly necessary (given much deeper architectures now), is to allow units to change sign. (Open question as to whether they actually frequently do!) As such, the model is in the vein of "how do we make backprop biologically plausible by removing features / communication" rather than "what sorts of signals and changes does the brain use to perceive and generate behavior".

This is also related to the literature on what ResNets do; what are the skip connections for? Anthropic has some interesting analyses for Transformer architectures, but checking the literature on other resnets is for another time.

{1569}
ref: -2022 tags: symbolic regression facebook AI transformer date: 05-17-2022 20:25 gmt revision:0 [head]

Deep symbolic regression for recurrent sequences

Surprisingly, they do not do any network structure changes; it's Vaswani 2017 w/ an 8-head, 8-layer transformer (sequence to sequence, not decoder only) with a latent dimension of 512. Significant work was in feature / representation engineering (e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers; both involve a vocabulary size of ~10k -- amazing still that this works) + the significant training regimen they worked with (16 Turing GPUs, 32 GB ea). Note that they do perform a bit of beam-search over the symbolic regressions by checking how well each node fits to the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work.)
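
As an aside, the base-10k integer tokenization is simple; a sketch (the token names here are made up):

def encode_int(n, base=10_000):
    # sign token followed by base-10000 'digits', most significant first
    sign = "+" if n >= 0 else "-"
    n = abs(n)
    digits = []
    while True:
        digits.append(f"D{n % base}")
        n //= base
        if n == 0:
            break
    return [sign] + digits[::-1]

print(encode_int(-1234567890))   # ['-', 'D12', 'D3456', 'D7890']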

The paper does both symbolic (estimate the algebraic recurence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

Analysis of how the transformers work for these problems is weak; only one figure showing that the embeddings of the integers follow some meandering but continuous path in t-SNE space. Still, the trained transformer is usually able to best the hand-coded sequence inference engine(s) in Mathematica, and does so without memorizing all of the training data. Very impressive and important result, enough to convince that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and more effort.

It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.

{1568}
ref: -2021 tags: burst bio plausible gradient learning credit assignment richards apical dendrites date: 05-05-2022 15:44 gmt revision:2 [1] [0] [head]

Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

  • Roughly, single-events indicate the normal feature responses of neurons, while multiple-spike bursts indicate error signals.
  • Bursts are triggered by depolarizing currents to the apical dendrites, which can be uncoupled from bottom-up event rate, which arises from perisomatic inputs / basal dendrites.
  • The fact that the two are imperfectly multiplexed is OK, as in backprop the magnitude of the error signal is modulated by the activity of the feature detector.
  • "For credit assignment in hierarchical networks, connections should obey four constraints:
    • Feedback must steer the magnitude and sign of plasticity
    • Feedback signals from higher-order areas must be multiplexed with feedforward signals from lower-order areas so that credit assignment can percolate down the hierarchy with minimal effect on sensory information
    • There should be some form of alignment between feedforward and feedback connections
    • Integration of credit-carrying signals should be nearly linear to avoid saturation"
      • Seems it's easy to saturate the burst probability within a window of background event rate, e.g. the window is all bursts to no bursts.
  • Perisomatic inputs were short-term depressing, whereas apical dendrite synapses were short-term facilitating.
    • This is a form of filtering on burst rates? E.g. they propagate better down than up?
  • They experiment with a series of models, one for solving the XOR task, and subsequent for MNIST and CIFAR.
  • The latter, larger models are mean-field models, rather than biophysical neuron models, and have a few extra features:
    • Interneurons, presumably SOM neurons, are used to keep bursting within a linear regime via a 'simple' (supplementary) learning rule.
    • Feedback alignment occurs by adjusting both the feedforward and feedback weights with the same propagated error signal + weight decay.
  • The credit assignment problem, or in the case of unsupervised learning, the coordination problem, is very real: how do you change a middle-feature to improve representations in higher (and lower) levels of the hierarchy?
    • They mention that using REINFORCE on the same network was unable to find a solution.
    • Put another way: usually you need to coordinate the weight changes in a network; changing weights individually based on a global error signal (or objective function) does not readily work...
      • Though evolution seems to be quite productive at getting the settings of (very) large sets of interdependent coefficients all to be 'correct' and (sometimes) beautiful.
      • How? Why? Friston's free energy principle? Lol.
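
A toy sketch of the flavor of the rule as I read these notes (events carry features, bursts carry errors; this is my paraphrase, not the paper's exact equations):

import numpy as np

def burst_plasticity_step(w, pre_event, post_event, post_burst, p_bar, lr=0.01, tau=100.0):
    # pre_event / post_event / post_burst are 0/1 indicators for this time bin;
    # p_bar is a running estimate of the postsynaptic burst probability.
    if post_event:
        p_bar += (post_burst - p_bar) / tau          # track the expected burst fraction
        if pre_event:
            w += lr * (post_burst - p_bar)           # burst above baseline -> LTP; below -> LTD
    return w, p_bar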

{1567}
ref: -0 tags: evolution simplicity symmetry kolmogorov complexity polyominoes protein interactions date: 04-21-2022 18:22 gmt revision:5 [4] [3] [2] [1] [0] [head]

Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution

  • Central hypothesis is that simplicity and symmetry arise not through natural selection, but because these forms are overwhelmingly represented in the genotype-phenotype map
  • Experimental example here was "polyominoes", where there are N=16 tiles, each with 4 edge numbers (encoded with e.g. 6-bit binary numbers). The edge numbers determine how the tiles irreversibly bind, e.g. 1 <-> 2, 3 <-> 4 etc, with 4 and 2^6-1 binding to nothing.
  • These tiles are allowed to 'randomly' self-assemble. Some don't terminate (e.g. they form continuous polymers); these are discarded; others do terminate (no more available binding sites).
  • They assessed the complexity of both polyominoes selected for a particular size, e.g. 16 tiles, and those not selected at all, other than terminating.
  • In both cases, complexity was assessed based on how many interactions were actually needed to make the observed structure: they removed tile edge numbers and kept each one only if its removal affected the n-mer formation.
  • Result was a nice log-log plot (figure not reproduced here).
  • Showed that this same trend holds for protein-protein complexes (weaker result, imho)
  • As well as RNA secondary structure
  • And metabolic time-series in an ODE model of yeast metabolism (even weaker result..)

The paper features an excellent set of references.

Letter to a friend, following her article Machine learning in evolutionary studies comes of age

Read your PNAS article last night, super interesting that you can get statistical purchase on long-lost evolutionary 'sweeps' via GANs and other neural network models.  I feel like there is some sort of statistical power issue there?  DNNs are almost always over-parameterized... slightly suspicious.

This morning I was sleepily mulling things over & thought about a walking conversation that we had a long time ago in the woods of NC:  Why is evolution so effective?  Why does it seem to evolve to evolve?  Thinking more -- and having years more perspective -- it seems almost obvious in retrospect: it's a consequence of Bayes' rule.  Evolution finds solutions in spaces that have overwhelming prevalence of working solutions.  The prior has an extremely strong effect.  These representational / structural spaces by definition have many nearby & associated solutions, hence appear post-hoc 'evolvable'.  (You probably already know this.)

I think proteins very much fall into this category: amino acids were added to the translation machinery based on which ones happened to solve a particular problem... but because of the 'generalization prior' (to use NN parlance), they were useful for many other things. This does not explain the human-engineering-like modularity of mature evolved systems, but maybe that is due to the strong simplicity prior [1].

Very very interesting to me is how the science of evolution and neural networks are drawing together, vis-à-vis the lottery ticket hypothesis. Both evince a continuum of representational spaces, too, from high-dimensional vectoral (how all modern deep learning systems work) to low-dimensional modular, specific, and general (phenomenological human cognition). I suspect that evolution uses a form of this continuum, as seen in the human high-dimensional long-range gene regulatory / enhancer network (= a structure designed to evolve). Not sure how selection works here, though; it's hard to search a high-dimensional space. The brain has an almost identical problem: it's hard to do 'credit assignment' in a billions-large, deep and recurrent network. Finding which set of synapses caused a good / bad behavior takes a lot of bits.

{1566}
ref: -1992 tags: evolution baldwin effect ackley artificial life date: 03-21-2022 23:20 gmt revision:0 [head]

Interactions between learning and evolution

  • Ran simulated evolution and learning on a population of agents over ~100k lifetimes.
  • Each agent can last several hundred timesteps within a gridworld-like environment.
  • Said gridworld environment has plants (food), trees (shelter), carnivores, and other agents (for mating)
  • Agent behavior is parameterized by an action network and an evaluation network.
    • The action network transforms sensory input into actions
    • The evaluation network sets the valence (positive or negative) of the sensory signals
      • This evaluation network modifies the weights of the action network using a gradient-based RL algorithm called CRBP (complementary reinforcement back-propagation), which reinforces based on the temporal derivative, and applies complementary (negative) reinforcement when an action does not increase reward, with some ε-greedy exploration.
        • It's not perfect, but as they astutely say, any reinforcement learning algorithm involves some search, so generally heuristics are required to select new actions in the face of uncertainty.
      • Observe that it seems easier to make a good evaluation network than action network (evaluation network is lower dimensional -- one output!)
    • Networks are implemented as one-layer perceptrons (boring, but they had limited computational resources back then)
  • Showed (roughly) that in winner populations you get:
    • When learning is an option, the population will learn, and with time this will grow to anticipation / avoidance
    • This will transition to the Baldwin effect; learned behavior becomes instinctive
      • But, interestingly, only when the problem is incompletely solved!
      • If it's completely solved by learning (eg super fast), then there is no selective leverage on innate behavior over many generations.
      • Likewise, the survival problem to be solved needs to be stationary and consistent for long enough for the Baldwin effect to occur.
    • Avoidance is a form of shielding, and learning no longer matters for this behavior
    • Even longer term, shielding leads to goal regression: avoidance instincts allow the evaluation network to do something else, set new goals.
      • In their study this included goals such as approaching predators (!).

Altogether (historically) interesting, but some of these ideas might well have been anticipated by some simple hand calculations.

{1565}
ref: -0 tags: nvidia gpuburn date: 02-03-2022 20:27 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Compiling a list of saturated matrix-matrix gflops for various Nvidia GPUs.

  1. GTX 1050 Mobile in a Lenovo Yoga 720
    1. ?? W
    2. 1640 GFlops/sec (float)
    3. 65 GFlops/sec (double)
    4. 2 GB ram, 640 cores, ?? clock (64C)
  2. T2000 in a Lenovo P1 Gen 3
    1. 34 W
    2. 2259 GFlops/sec (float)
    3. 4 GB ram, 1024 cores, clock 1185 / 7000 MHz
  3. GTX 1650 Max-Q in a Lenovo X1 Extreme Gen 3
    1. 35 W
    2. 2580 GFlops/sec (float)
    3. 116 GFlops/sec (double)
    4. 4 GB ram, 1024 cores, clock 1335 (float) 1860 (double) / 10000 MHz (56C)
  4. RTX 3080 in an MSI Creator 17
    1. 80 W
    2. 5400 GFlops/sec (float)
    3. 284 GFlops/sec (double)
    4. 16 GB ram, 6144 cores, clock 855 (float) 1755 (double) / 12000 MHz (68C)
      1. Notable power / thermal throttling on this laptop.
  5. EVGA RTX 2080 Ti
    1. 260 W
    2. 11800 GFlops/sec (float)
    3. 469 GFlops/sec (double)
    4. 11 GB ram, 4352 cores, clock 1620 (float) 1905 (double) / 13600 MHz (74C)

{1564}
ref: -2008 tags: t-SNE dimensionality reduction embedding Hinton date: 01-25-2022 20:39 gmt revision:2 [1] [0] [head]

“Visualizing data using t-SNE”

  • Laurens van der Maaten, Geoffrey Hinton.
  • SNE: stochastic neighbor embedding, Hinton 2002.
  • Idea: model the data conditional pairwise distribution as a Gaussian, with one variance per data point, $p(x_i | x_j)$
  • In the mapped data, this pairwise distribution is modeled as a fixed-variance Gaussian, too, $q(y_i | y_j)$
  • Goal is to minimize the Kullback-Leibler divergence $\Sigma_i KL(p_i || q_i)$ (summed over all data points)
  • Per-data-point variance is found via binary search to match a user-specified perplexity. This amounts to setting a number of nearest neighbors; somewhere between 5 and 50 works OK.
  • Cost function is minimized via gradient descent, starting with a random distribution of points $y_i$, with plenty of momentum to speed up convergence, and noise to effect simulated annealing.
  • The gradient of the cost function is remarkably simple: $\frac{\delta C}{\delta y_i} = 2 \Sigma_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$ (a numpy sketch follows this list)
  • t-SNE differs from SNE (above) in that it addresses difficulty in optimizing the cost function, and crowding.
    • Uses a simplified symmetric cost function (symmetric conditional probability, rather than joint probability) with simpler gradients
    • Uses the student’s t-distribution in the low-dimensional map q to reduce crowding problem.
  • The crowding problem is roughly resultant from the fact that, in high-dimensional spaces, the volume of the local neighborhood scales as $r^m$, whereas in 2D, it's just $r^2$. Hence there is cost-incentive to pushing all the points together in the map -- points are volumetrically closer together in high dimensions than they can be in 2D.
    • This can be alleviated by using a one-DOF student distribution, which is the same as a Cauchy distribution, to model the probabilities in map space.
  • Smart -- they plot the topology of the gradients to gain insight into modeling / convergence behavior.
  • Don’t need simulated annealing due to balanced attractive and repulsive effects (see figure).
  • Enhance the algorithm further by keeping it compact at the beginning, so that clusters can move through each other.
  • Look up: d-bits parity task by Bengio 2007
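
The numpy sketch promised above -- just the gradient as written, with P[i, j] = p_{j|i} and Q[i, j] = q_{j|i} (building P and Q from the data / map is omitted; compute_Q below is an assumed callable):

import numpy as np

def sne_grad(P, Q, Y):
    # dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)
    n = Y.shape[0]
    grad = np.zeros_like(Y)
    for i in range(n):
        coeff = P[i, :] - Q[i, :] + P[:, i] - Q[:, i]
        grad[i] = 2.0 * ((Y[i] - Y) * coeff[:, None]).sum(axis=0)
    return grad

def descend(P, compute_Q, n_points, n_iter=500, lr=100.0, momentum=0.8, rng=None):
    # gradient descent on the map points with momentum, starting from a small random cloud
    rng = rng or np.random.default_rng()
    Y = rng.normal(scale=1e-4, size=(n_points, 2))
    V = np.zeros_like(Y)
    for _ in range(n_iter):
        Q = compute_Q(Y)                 # recompute map-space similarities each step
        V = momentum * V - lr * sne_grad(P, Q, Y)
        Y = Y + V
    return Y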

{1563}
ref: -0 tags: date: 01-09-2022 19:04 gmt revision:1 [0] [head]

The Sony Xperia XZ1 compact is a better phone than an Apple iPhone 12 mini

I don't normally write any personal opinions here -- just half-finished paper notes riddled with typos (haha) -- but this one has been bothering me for a while.

November 2020 I purchased an iPhone 12 mini to replace my aging Sony Xperia XZ1 compact. (Thinking of staying with Android, I tried out a Samsung S10e as well, but didn't like it.) Having owned and used the iPhone for a year and change, I still prefer the Sony. Here is why:

  • Touch screen
    • The iPhone is MUCH more sensitive to sweat than the Sony
    • This is the biggest problem, since I like to move (hike, bike, kayak etc), it lives in my pocket, and inevitably gets a bit of condensation or water on it.
    • The iPhone screen is rendered frustrating to use by even an imperceptible bit of moisture on it.
      • Do iPhone users not sweat?
      • Frequently I can't even select the camera app! Or switch to maps!
        • A halfway fix is to turn the screen off then on again. Halfway.
    • The Sony, in comparison, is relatively robust, and works even with droplets of water on it.
  • Size
    • They are both about the same size with a case, Sony is 129 x 65 x 9.3 mm ; iPhone mini is 131.5 x 64.2 x 7.4mm.
    • This size is absolutely perfect and manufacturers need to make more phones with these dimensions!
    • If anything, the iPhone is better here -- the rounded corners are nice.
  • Battery
    • Hands down, the Sony. Lasts >=2x as long as the iPhone.
  • Processor
    • Both are fast enough.
  • Software
    • Apple is not an ecosystem. No. It's a walled garden where a select few plants may grow. You do what Apple wants you to do.
      • E.g. want to use any Google apps on iPhone? No problem! Want to use any Apple apps on Android or web or PC? Nope, sorry, you have to buy a $$$ MacBook pro.
    • Ok, the privacy on an iPhone is nice. Modulo that bit where they scanned our photos.
      • As well as the ability to manage notifications & basically turn them all off :)
    • There are many more apps on Android, and they are much less restricted in what they can do.
      • For example, recently we were in the desert & wanted a map of where the cell signal was strong, for remote-working. This is easy on Android (there is an app for it).
        • This is impossible on iPhone (the apps don't have access to the information).
      • Second example, you can ssh into an Android and use that to download large files (e.g. packages, datasets) to avoid using limited tethering data.
        • This is also impossible on iPhone.
    • Why does iMessage make all texts from Android users yucky green? Why is there zero option to change this?
    • Why does iMessage send very low resolution photos to my friends and family using Android? It sends beautiful full-res photos to other Apple phones.
    • Why is there no web interface to iMessage?
      • Ugh, this iPhone is such an elitist snob.
    • You can double-tap on the square in Android to quickly switch between apps, which is great.
    • Apple noticeably auto-corrects to a smaller vocabulary than desired. Android is less invasive in this respect.
  • Cell signal
    • They are similarly unreliable, though the iPhone has 5G & many more wireless bands, which is great.
    • Still, frequently I'll have one-two bars of connectivity & yet Google Maps will say "you are offline". This is much less frequent on the Sony.
  • Screen
    • iPhone screen is better.
  • Camera
    • iPhone camera is very very much better.
  • Speaker
    • iPhone speaker much better. But it sure burns the battery.
  • Wifi
    • iPhone will periodically disconnect from Wifi when on Facetime calls. Sony doesn't do this.
      • Facetime only works with Apple devices.
  • Price
    • Sony wins
  • Unlock
    • Face unlock is a cool idea, but we all wear masks now.
    • The Sony has a fingerprint sensor, which is better.
      • In the case where I'm moving (and possibly sweaty), Android is smart enough to allow quick unlock, for access to the camera app or maps. Great feature.

Summary: I'll try to get my money's worth out of the iPhone; when it dies, will buy the smallest waterproof Android phone that supports my carrier's bands.

{1561}
ref: -0 tags: date: 01-09-2022 19:03 gmt revision:1 [0] [head]

Cortical response selectivity derives from strength in numbers of synapses

  • Benjamin Scholl, Connon I. Thomas, Melissa A. Ryan, Naomi Kamasawa & David Fitzpatrick
  • "Using electron microscopy reconstruction of individual synapses as a metric of strength, we find no evidence that strong synapses have a predominant role in the selectivity of cortical neuron responses to visual stimuli. Instead, selectivity appears to arise from the total number of synapses activated by different stimuli."
  • "Our results challenge the role of Hebbian mechanisms in shaping neuronal selectivity in cortical circuits, and suggest that selectivity reflects the co-activation of large populations of presynaptic neurons with similar properties and a mixture of strengths. "
    • Interesting -- so this is consistent with ANNs / feature detectors / vector hypothesis.
    • It would imply that the mapping is dense rather than sparse -- but to see this, you'd need to record the activity of all these synapses in realtime.
      • Which is possible (e.g. light beads, fast axial focusing), just rather difficult for now.
  • To draw really firm conclusions, would need a thorough stimulus battery, not just drifting gratings.
    • It may change this result: "Surprisingly, the strength of individual synapses was uncorrelated with functional similarity to the somatic output (that is, absolute orientation preference difference)"

{842}
ref: work-0 tags: distilling free-form natural laws from experimental data Schmidt Cornell automatic programming genetic algorithms date: 12-30-2021 05:11 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Distilling free-form natural laws from experimental data

  • The critical step was to use the full set of all pairs of partial derivatives ($\delta x / \delta y$) to evaluate the search for invariants. (A sketch of this scoring follows this list.)
  • The selection of which partial derivatives are held to be independent / which variables are dependent is a bit of a trick too -- see the supplemental information.
    • Even yet, with a 4D data set the search for natural laws took ~ 30 hours.
  • This was via a genetic algorithm, distributed among 'islands' on different CPUs, with mutation and single-point crossover.
  • Not sure what the IL is, but it appears to be floating-point assembly.
  • Timeseries data is smoothed with Loess smoothing, which fits local low-order polynomials to the data, and hence allows for smoother / more analytic derivative calculation.
    • Then again, how long did it take humans to figure out these invariants? (Went about it in a decidedly different way..)
    • Further, how long did it take for biology to discover similar 'design equations'?
      • The same algorithm has been applied to biological data - a metabolic pathway - with some success pub 2011.
      • Of course evolution had to explore a much larger space - proteins and regulatory pathways, not simpler mathematical expressions / linkages.
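
A sketch of the derivative-pair scoring as I understand it (the paper's exact fitness differs in detail): a candidate invariant f(x, y) predicts δx/δy = -(∂f/∂y)/(∂f/∂x), which is compared against numerical derivatives of the (smoothed) time series.

import numpy as np

def pair_derivative_score(f_dx, f_dy, x, y, t):
    # f_dx, f_dy: callables for the candidate invariant's partial derivatives
    # x, y: (smoothed) time series sampled at times t
    dxdt = np.gradient(x, t)
    dydt = np.gradient(y, t)
    observed  = dxdt / dydt                       # measured dx/dy along the trajectory
    predicted = -f_dy(x, y) / f_dx(x, y)          # dx/dy implied by f(x, y) = const
    # higher is better: penalize log-disagreement between the two ratios
    return -np.mean(np.abs(np.log(np.abs(observed / (predicted + 1e-12)) + 1e-12)))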


Since his PhD, Michael Schmidt has gone on to found Nutonian, which produced the Eureqa software, apparently without dramatic new features other than being able to use the cloud for equation search. (Probably he improved many other detailed facets of the software.) Nutonian received $4M in seed funding, according to Crunchbase.

In 2017, Nutonian was acquired by Data Robot (for an undisclosed amount), where Michael has worked since, rising to the title of CTO.

Always interesting to follow up on the authors of these classic papers!

{1562}
ref: -0 tags: SAT solver blog post date: 12-30-2021 00:29 gmt revision:0 [head]

Modern SAT solvers: fast, neat and underused (part 1 of N)

A set of posts that are worth re-reading.

{1560}
ref: -2021 tags: synaptic imaging weights 2p oregon markov date: 12-29-2021 23:30 gmt revision:2 [1] [0] [head]

Distinct in vivo dynamics of excitatory synapses onto cortical pyramidal neurons and parvalbumin-positive interneurons

  • Joshua B.Melander, Aran Nayebi, Bart C.Jongbloets, Dale A.Fortin, Maozhen Qin, Surya Ganguli, Tianyi Mao, Haining Zhong
  • Cre-dependent mVenus-labeled PSD-95, in both excitatory pyramidal neurons & inhibitory PV interneurons.
  • morphology labeled with tdTomato
  • Longitudinal imaging of individual excitatory post-synaptic densities; estimated weight from fluorescence; examined spine appearance and disappearance
  • PV synapses were more stable over the 24-day period than synapses on pyramidal neurons.
  • Likewise, large synapses were more likely to remain over the imaging period.
  • Both followed log-normal distributions in 'strengths'
  • Changes were well modeled by a Markov process, which puts high probability on small changes.
  • But these changes are multiplicative (+ additive component in PV cells)
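
A toy simulation of that claim (arbitrary parameters): small multiplicative fluctuations, plus a small additive term, keep the log-weights roughly Gaussian, i.e. the weights roughly log-normal.

import numpy as np

rng = np.random.default_rng(1)
w = np.full(10_000, 0.5)                            # synaptic 'weights'
for _ in range(2_000):
    mult = rng.normal(0.0, 0.05, size=w.size)       # small multiplicative change
    add  = rng.normal(0.0, 0.002, size=w.size)      # small additive component (PV-like)
    w = np.clip(w * np.exp(mult) + add, 1e-4, None)

logw = np.log(w)
print(f"log-weight mean {logw.mean():.2f}, std {logw.std():.2f}")   # ~Gaussian in log space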

{1559}
ref: -0 tags: diffusion models image generation OpenAI date: 12-24-2021 05:50 gmt revision:0 [head]

Some investigations into denoising models & their intellectual lineage:

Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015

  • Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
  • Starting derivation of using diffusion models for training.
  • Verrry roughly, the idea is to destroy the structure in an image using diagonal Gaussian per-pixel, and train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
  • Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the conditional probability $p(x_{t-1}|x_t) \propto N(0, I)$
    • The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)

Generative Modeling by Estimating Gradients of the Data Distribution July 2019

  • Yang Song, Stefano Ermon

Denoising Diffusion Probabilistic Models June 2020

  • Jonathan Ho, Ajay Jain, Pieter Abbeel
  • A diffusion model that can output 'realistic' images (low FID / low NLL)

Improved Denoising Diffusion Probabilistic Models Feb 2021

  • Alex Nichol, Prafulla Dhariwal
  • This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
  • The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
  • That is, the neural network model attempts, given $x_t$, to estimate the noise which corrupted it, which then can be used to produce $x_{t-1}$
    • Simplicity. Satisfying.
  • They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
    • I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
  • There are some good crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent a Brownian / random-walk blow-up.
    • Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
  • Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, $\beta$; this paper improves the likelihood by adjusting the noise variance, mostly at the last steps, via $\tilde{\beta}_t$, and then further allows the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
    • TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
  • Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
  • Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it only for 50 steps and get nearly as good images.
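
For concreteness, a sketch of the cosine ᾱ_t schedule and the closed-form forward-noising sample it implies (following my reading of the paper; s is their small offset):

import numpy as np

def cosine_alpha_bar(T=4000, s=0.008):
    # cos^2 schedule for the cumulative signal fraction alpha_bar_t
    t = np.arange(T + 1)
    f = np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def noised_sample(x0, t, alpha_bar, rng):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

# The 'simple' training loss above is then just the MSE between eps and the
# network's estimate of eps given (x_t, t).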

Diffusion Models Beat GANs on Image Synthesis May 2021

  • Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels) including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017

  • Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

which is an improvement (e.g. adds self-attention layers) on

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu

Most recently,

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen

Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!

{1558}
ref: -2021 tags: hippocampal behavior scale plasticity Magee Romani Bittner date: 12-20-2021 22:39 gmt revision:0 [head]

Bidirectional synaptic plasticity rapidly modifies hippocampal representations

  • Normal Hebbian plasticity depends on pre and post synaptic activity & their time course.
  • Three-factor plasticity depends on pre, post, and neuromodulatory activity, typically formalized as an eligibility trace (ET) and instructive signal (IS).
  • Here they show that dendritic-plateau-dependent hippocampal place field generation, in particular LTD, is not (quite so) dependent on postsynaptic activity.
  • Instead, it appears to be a 'register update' operation, where a new pattern is remembered (through LTP) and an old pattern is forgotten (through LTD).
    • That is, the synapses are updating information, not accumulating information.
  • The eq for a single synapse: $\Delta W / \delta t = (W_{max} - W) \, k^+ q^+(ET \cdot IS) - W \, k^- q^-(ET \cdot IS)$
    • Where the k are the learning rates, and the q are the nonlinear functions regulating potentiation / depression based on the eligibility trace and instructive signal.
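
A minimal numerical sketch of that single-synapse equation (the nonlinearities q+ / q- are not specified in these notes, so they are left as identity placeholders here):

import numpy as np

def synapse_step(W, ET, IS, dt=1.0, W_max=1.0, k_plus=0.1, k_minus=0.1,
                 q_plus=lambda u: u, q_minus=lambda u: u):
    # dW/dt = (W_max - W) k+ q+(ET*IS)  -  W k- q-(ET*IS)
    # potentiation saturates at W_max, depression saturates at 0
    drive = ET * IS
    dW = (W_max - W) * k_plus * q_plus(drive) - W * k_minus * q_minus(drive)
    return W + dt * dW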

I'm still not 100% sure that this excludes any influence on presynaptic activity ... they didn't control for that. But certainly LTD in their model does not require postsynaptic activity; indeed, it may only require net-synaptic homeostasis.

{1557}
ref: -0 tags: SVD vocabulary english latent vector space Plato date: 12-20-2021 22:27 gmt revision:1 [0] [head]

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge

  • A whole lot of verbiage here for an old, important, but relatively straightforward result:
    • Take ~30k encyclopedia articles.
    • From them, make a vocabulary of ~ 60k words.
    • Form a sparse matrix with rows being the vocabulary word, and columns being the encyclopedia article.
    • Perform large, sparse SVD on this matrix.
      • How? He doesn't say.
    • Take the top 300 singular values & associated V vectors, and use these as an embedding space for vocabulary.
  • The 300-dim embedding can then be used to perform analysis to solve TOEFL synonym problems
    • Map the cue and the multiple choice query words to 300-dim space, and select the one with the highest cosine similarity.

The fact that SVD works at all, and pulls out some structure, is interesting! Not nearly as good as word2vec.
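
A sketch of the pipeline (truncated sparse SVD via scipy; the original presumably also applied some term weighting before the SVD, which is omitted here):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def lsa_embeddings(counts, k=300):
    # counts: (n_words, n_docs) term-by-document count matrix (rows = vocabulary words)
    X = csr_matrix(counts, dtype=float)
    U, S, Vt = svds(X, k=k)          # truncated SVD of the sparse matrix
    return U * S                      # k-dim embedding per vocabulary word

def synonym_choice(cue, choices, emb, vocab):
    # TOEFL-style synonym test: pick the choice with highest cosine similarity to the cue
    def unit(word):
        v = emb[vocab[word]]
        return v / (np.linalg.norm(v) + 1e-12)
    return max(choices, key=lambda w: float(unit(cue) @ unit(w)))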

{1556}
ref: -0 tags: concept net NLP transformers graph representation knowledge date: 11-04-2021 17:48 gmt revision:0 [head]

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  • From a team at the University of Washington / Allen Institute for Artificial Intelligence.
  • Courtesy of Yannic Kilcher's youtube channel.
  • General idea: use GPT-3 as a completion source given a set of prompts, like:
    • X starts running
      • So, X gets in shape
    • X and Y engage in an argument
      • So, X wants to avoid Y.
  • There are only 7 linkage atoms (edges, so to speak) in these queries, but of course many actions / direct objects.
    • These prompts are generated from the Atomic 20-20 human-authored dataset.
    • The prompts are fed into 175B parameter DaVinci model, resulting in 165k examples in the 7 linkages after cleaning.
    • In turn the 165k are fed into a smaller version of GPT-3, Curie, that generates 6.5M text examples, aka Atomic 10x.
  • Then filter the results via a second critic model, based on fine-tuned RoBERTa & human supervision to determine if a generated sentence is 'good' or not.
  • By throwing away 62% of Atomic 10x, they get a student accuracy of 96.4%, much better than the human-designed knowledge graph.
    • They suggest that one way this works is by removing degenerate outputs from GPT-3.

Human-designed knowledge graphs are described here: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge

And employed for profit here: https://www.luminoso.com/

{1549}
ref: -0 tags: gtk.css scrollbar resize linux qt5 date: 10-28-2021 18:47 gmt revision:3 [2] [1] [0] [head]

Put this in ~/.config/gtk-3.0/gtk.css to make scrollbars larger on high-DPI screens. ref

.scrollbar {
  -GtkScrollbar-has-backward-stepper: 1;
  -GtkScrollbar-has-forward-stepper: 1;
  -GtkRange-slider-width: 16;
  -GtkRange-stepper-size: 16;
}
scrollbar slider {
    /* Size of the slider */
    min-width: 16px;
    min-height: 16px;
    border-radius: 16px;

    /* Padding around the slider */
    border: 2px solid transparent;
}

.scrollbar.vertical slider,
scrollbar.vertical slider {
    min-height: 16px;
    min-width: 16px;
}

.scrollbar.horizontal.slider,
scrollbar.horizontal slider {
min-width: 16px;
min-height: 16px;
}

/* Scrollbar trough squeezes when cursor hovers over it. Disabling that
 */

.scrollbar.vertical:hover:dir(ltr),
.scrollbar.vertical.dragging:dir(ltr) {
    margin-left: 0px;
}

.scrollbar.vertical:hover:dir(rtl),
.scrollbar.vertical.dragging:dir(rtl) {
    margin-right: 0px;
}

.scrollbar.horizontal:hover,
.scrollbar.horizontal.dragging,
.scrollbar.horizontal.slider:hover,
.scrollbar.horizontal.slider.dragging {
    margin-top: 0px;
}
undershoot.top, undershoot.right, undershoot.bottom, undershoot.left { background-image: none; }

Also add:

export GTK_OVERLAY_SCROLLING=0 
to your ~/.bashrc

To make the scrollbars a bit easier to see in QT5 applications, run qt5ct (after apt-getting it), and add in a new style sheet, /usr/share/qt5ct/qss/scrollbar-simple-backup.qss

/* SCROLLBARS (NOTE: Changing 1 subcontrol means you have to change all of them)*/
QScrollBar{
  background: palette(alternate-base);
}
QScrollBar:horizontal{
  margin: 0px 0px 0px 0px;
}
QScrollBar:vertical{
  margin: 0px 0px 0px 0px;
}
QScrollBar::handle{
  background: #816891;
  border: 1px solid transparent;
  border-radius: 1px;
}
QScrollBar::handle:hover, QScrollBar::add-line:hover, QScrollBar::sub-line:hover{
  background: palette(highlight);
}
QScrollBar::add-line{
subcontrol-origin: none;
}
QScrollBar::add-line:vertical, QScrollBar::sub-line:vertical{
height: 0px;
}
QScrollBar::add-line:horizontal, QScrollBar::sub-line:horizontal{
width: 0px;
}
QScrollBar::sub-line{
subcontrol-origin: none;
}

{1555}
ref: -0 tags: adaptive optics two photon microscopy date: 10-26-2021 18:17 gmt revision:1 [0] [head]

Recently I've been underwhelmed by the performance of adaptive optics (AO) for imaging head-fixed cranial-window mice. There hasn't been much of an improvement, despite significant optimization effort. This raises the question: where are AO microscopes actually used?

When the purpose of a paper is to explain and qualify a novel AO approach, the improvement is always good, >> 2x. Yet, in the one paper (first below) where the purpose was neuroscience, not optics, the results are less inspiring. Are the results from the optics papers cherry-picked?

Thalamus provides layer 4 of primary visual cortex with orientation- and direction-tuned inputs Wenzhi Sun, Zhongchao Tan, Brett D Mensh & Na Ji 2016 https://www.nature.com/articles/nn.4196

  • This is the primary (only?) paper where AO was used, but the focus was biology: measuring the tuning properties of thalamic boutons in mouse visual cortex. Which they did, well!
  • Surprisingly, the largest improvement was not from using AO, but rather from thinning the cranial window from 340um to 170um.
  • "With a 340-μm-thick cranial window, 70% of all boutons appeared to be non-responsive to visual stimuli and only 7% satisfied OS criteria. With a thinner cranial window of 170-μm thickness, we found that 31% of boutons satisfied OS criteria (of total n = 1,302, 5 mice), which was still substantially fewer than 48% OS boutons as determined when the same boutons (n = 1,477, 5 mice) were imaged after aberration correction by adaptive optics"

Direct wavefront sensing for high-resolution in vivo imaging in scattering tissue Kai Wang, Wenzhi Sun, Christopher T. Richie, Brandon K. Harvey, Eric Betzig & Na Ji, 2015 https://www.nature.com/articles/ncomms8276

  • Direct wavefront sensing using indocyanine green + an Andor iXon 897 EMCCD Shack-Hartmann wavefront sensor (read: expensive).
  • Alpao DM97-15, basically the same as ours.
  • Fairly local wavefront corrections, see figure 2.
  • Also note that these wavefront corrections seem low-order, hence should be correctable via a DM

Multiplexed aberration measurement for deep tissue imaging in vivo Chen Wang, Rui Liu, Daniel E Milkie, Wenzhi Sun, Zhongchao Tan, Aaron Kerlin, Tsai-Wen Chen, Douglas S Kim & Na Ji 2014 https://www.nature.com/articles/nmeth.3068

  • Use a DMD (including a dispersion pre-compensator) to amplitude modulate phase ramps on a wavefront-modulating SLM. Each phase-ramp segment of the SLM was modulated at a different frequency, allowing for the optimal phase to be pulled out later through a Fourier transform.
  • Again, very good performance at depth in the mouse brain.

{1554}
ref: -2021 tags: FIBSEM electron microscopy presynaptic plasticity activity Funke date: 10-12-2021 17:03 gmt revision:0 [head]

Ultrastructural readout of in vivo synaptic activity for functional connectomics

  • Anna Simon, Arnd Roth, Arlo Sheridan, Mehmet Fişek, Vincenzo Marra, Claudia Racca, Jan Funke, Kevin Staras, Michael Häusser
  • Did FIB-SEM on FM1-43 dye labeled synapses, then segmented the cells using machine learning, as Jan has pioneered.
    • FM1-43FX is membrane impermeable, and labels only synaptic vesicles that have been recycled after dye loading. (Invented in 1992!)
    • FM1-43FX is also able to photoconvert diaminobenzidine (DAB) into an amorphous, highly conjugated polymer with high affinity for osmium tetroxide
  • This allows for a snapshot of ultrastructural presynaptic plasticity / activity.
  • N=84 boutons, but n=7 pairs / triples of boutons from the same axon.
    • These boutons have the same presynaptic spiking activity, and hence are expected to have the same release probability, and hence the same photoconversion (PC) labeling.
      • But they don't! The ratio of PC+ vesicle numbers between boutons on the same neuron is low, mean < 0.4, which suggests some boutons have high neurotransmitter release and recycling, others have low...
  • Quote in the abstract: We also demonstrate that neighboring boutons of the same axon, which share the same spiking activity, can differ greatly in their presynaptic release probability.
    • Well, sorta, the data here is a bit weak. It might all be lognormal fluctuations, as has been well demonstrated.
    • When I read it I was excited to think of the influence of presynaptic inhibition / modulation, which has not been measured here, but is likely to be important.