m8ta
{1568}
ref: -2021 tags: burst bio plausible gradient learning credit assignment richards apical dendrites date: 05-05-2022 15:44 gmt revision:2 [1] [0] [head]

Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits

  • Roughly, single spike events indicate the normal feature responses of neurons, while multiple-spike bursts indicate error signals (see the toy sketch after this list).
  • Bursts are triggered by depolarizing currents to the apical dendrites, which can be uncoupled from bottom-up event rate, which arises from perisomatic inputs / basal dendrites.
  • The fact that the two are imperfectly multiplexed is OK, as in backprop the magnitude of the error signal is modulated by the activity of the feature detector.
  • "For credit assignment in hierarchical networks, connections should obey four constraints:
    • Feedback must steer the magnitude and sign of plasticity
    • Feedback signals from higher-order areas must be multiplexed with feedforward signals from lower-order areas so that credit assignment can percolate down the hierarchy with minimal effect on sensory information
    • There should be some form of alignment between feedforward and feedback connections
    • Integration of credit-carrying signals should be nearly linear to avoid saturation
      • Seems it's easy to saturate the burst probability within a window of background event rate, e.g. the window is all bursts to no bursts.
  • Perisomatic inputs were short-term depressing, whereas apical dendrite synapses were short-term facilitating.
    • This is a form of filtering on burst rates? E.g. they propagate better down than up?
  • They experiment with a series of models, one for solving the XOR task, and subsequent ones for MNIST and CIFAR.
  • The latter, larger models are mean-field models, rather than biophysical neuron models, and have a few extra features:
    • Interneurons, presumably SOM neurons, are used to keep bursting within a linear regime via a 'simple' (supplementary) learning rule.
    • Feedback alignment occurs by adjusting both the feedforward and feedback weights with the same propagated error signal + weight decay.
  • The credit assignment problem, or in the case of unsupervised learning, the coordination problem, is very real: how do you change a middle-feature to improve representations in higher (and lower) levels of the hierarchy?
    • They mention that using REINFORCE on the same network was unable to find a solution.
    • Put another way: usually you need to coordinate the weight changes in a network; changing weights individually based on a global error signal (or objective function) does not readily work...
      • Though evolution seems to be quite productive at getting the settings of (very) large sets of interdependent coefficients all to be 'correct' and (sometimes) beautiful.
      • How? Why? Friston's free energy principle? Lol.
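
A toy sketch of the multiplexing idea above (illustrative only, not the paper's actual rule; all constants are made up): presynaptic event rates carry the feedforward feature signal, while deviations of the burst probability from its baseline, driven by apical input, steer the sign and magnitude of plasticity, delta-rule style.

import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0])   # hypothetical 'teacher' weights
w = np.zeros(5)                                  # feedforward weights being learned
lr, p0 = 0.1, 0.2                                # learning rate, baseline burst probability

for _ in range(5000):
    x = rng.random(5)                            # presynaptic event rates (features)
    e = x @ w                                    # bottom-up somatic event rate
    target = x @ w_true                          # top-down apical drive (teaching signal)
    p = p0 + 0.1 * (target - e)                  # burst probability encodes the error
    w += lr * (p - p0) * x                       # burst deviation steers plasticity
print(np.round(w, 2))                            # approaches w_true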

{1567}
ref: -0 tags: evolution simplicity symmetry kolmogorov complexity polyominoes protein interactions date: 04-21-2022 18:22 gmt revision:5 [4] [3] [2] [1] [0] [head]

Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution

  • Central hypothesis is that simplicity and symmetry arise not through natural selection, but because these forms are overwhelmingly represented in the genotype-phenotype map
  • Experimental example here was "polyominoes", where there are N=16 tiles, each with 4 edge numbers (encoded as e.g. 6-bit binary numbers). The edge numbers determine how the tiles irreversibly bind, e.g. 1 <-> 2, 3 <-> 4 etc, with 4 and 2^6-1 binding to nothing.
  • These tiles are allowed to 'randomly' self-assemble. Some don't terminate (e.g. they form continuous polymers); these are discarded; others do terminate (no more available binding sites).
  • They assessed the complexity of polyominoes either selected for a particular size, e.g. 16 tiles, or not selected at all, other than requiring termination.
  • In both cases, complexity was assessed by how many actual interactions were needed to make the observed structure. That is, they removed tile edge numbers and counted an interaction as necessary if its removal affected the n-mer formed.
  • Result was a nice log-log plot.
  • Showed that this same trend holds for protein-protein complexes (weaker result, imho)
  • As well as RNA secondary structure
  • And metabolic time-series in an ODE model of yeast metabolism (even weaker result..)

The paper features an excellent set of references.
Letter to a friend following her article Machine learning in evolutionary studies comes of age

Read your PNAS article last night, super interesting that you can get statistical purchase on long-lost evolutionary 'sweeps' via GANs and other neural network models.  I feel like there is some sort of statistical power issue there?  DNNs are almost always over-parameterized... slightly suspicious.

This morning I was sleepily mulling things over & thought about a walking conversation that we had a long time ago in the woods of NC:  Why is evolution so effective?  Why does it seem to evolve to evolve?  Thinking more -- and having years more perspective -- it seems almost obvious in retrospect: it's a consequence of Bayes' rule.  Evolution finds solutions in spaces that have overwhelming prevalence of working solutions.  The prior has an extremely strong effect.  These representational / structural spaces by definition have many nearby & associated solutions, hence appear post-hoc 'evolvable'.  (You probably already know this.)

I think proteins very much fall into this category: AA were added to the translation machinery based on ones that happened to solve a particular problem... but because of the 'generalization prior' (to use NN parlance), they were useful for many other things.  This does not explain the human-engineering-like modularity of mature evolved systems, but maybe that is due to the strong simplicity prior [1]

Very very interesting to me is how the science of evolution and neural networks are drawing together, vis a vis the lottery ticket hypothesis.  Both evince a continuum of representational spaces, too, from high-dimensional vectoral (how all modern deep learning systems work) to low-dimensional modular, specific, and general (phenomenological human cognition).  I suspect that evolution uses a form of this continuum, as seen in the human high-dimensional long-range gene regulatory / enhancer network (= a structure designed to evolve).  Not sure how selection works here, though; it's hard to search a high-dimensional space.  The brain has an almost identical problem: it's hard to do 'credit assignment' in a billions-large, deep and recurrent network.  Finding which set of synapses caused a good / bad behavior takes a lot of bits.

{1566}
ref: -1992 tags: evolution baldwin effect ackley artificial life date: 03-21-2022 23:20 gmt revision:0 [head]

Interactions between learning and evolution

  • Ran simulated evolution and learning on a population of agents over ~100k lifetimes.
  • Each agent lasts several hundred timesteps within a gridworld-like environment.
  • Said gridworld environment has plants (food), trees (shelter), carnivores, and other agents (for mating)
  • Agent behavior is parameterized by an action network and an evaluation network.
    • The action network transforms sensory input into actions
    • The evaluation network sets the valence (positive or negative) of the sensory signals
      • This evaluation network modifies the weights of the action network using a gradient-based RL algorithm called CRBP (complementary reinforcement back-propagation), which reinforces the selected action when it increases the (temporally differenced) evaluation, and reinforces the complementary action when it does not, with some ε-greedy exploration (see the sketch after this list).
        • It's not perfect, but as they astutely say, any reinforcement learning algorithm involves some search, so generally heuristics are required to select new actions in the face of uncertainty.
      • Observe that it seems easier to make a good evaluation network than action network (evaluation network is lower dimensional -- one output!)
    • Networks are implemented as one-layer perceptrons (boring, but they had limited computational resources back then)
  • Showed (roughly) that in winner populations you get:
    • When learning is an option, the population will learn, and with time this will grow to anticipation / avoidance
    • This will transition to the Baldwin effect; learned behavior becomes instinctive
      • But, interestingly, only when the problem is incompletely solved!
      • If it's completely solved by learning (eg super fast), then there is no selective leverage on innate behavior over many generations.
      • Likewise, the survival problem to be solved needs to be stationary and consistent for long enough for the Baldwin effect to occur.
    • Avoidance is a form of shielding, and learning no longer matters for this behavior
    • Even longer term, shielding leads to goal regression: avoidance instincts allow the evaluation network to do something else, set new goals.
      • In their study this included goals such as approaching predators (!).
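
A rough sketch of the CRBP idea as summarized above (a one-layer sigmoid net with stochastic binary actions; the learning rate and the exact update in the paper differ):

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, size=(4, 8))      # 8 sensory inputs -> 4 binary action units

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def crbp_step(x, reward_positive, lr=0.05):
    """One complementary-reinforcement update (illustrative)."""
    global W
    p = sigmoid(W @ x)                             # action probabilities
    a = (rng.random(p.shape) < p).astype(float)    # stochastic binary action
    # Positive reinforcement: make the chosen action more likely.
    # Negative reinforcement: push toward the complementary action instead.
    target = a if reward_positive else 1.0 - a
    W += lr * np.outer((target - p) * p * (1.0 - p), x)   # sigmoid delta rule
    return a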

Altogether (historically) interesting, but some of these ideas might well have been anticipated by some simple hand calculations.

{1565}
ref: -0 tags: nvidia gpuburn date: 02-03-2022 20:27 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

Compiling a list of saturated matrix-matrix gflops for various Nvidia GPUs.

  1. GTX 1050 Mobile in a Lenovo Yoga 720
    1. ?? W
    2. 1640 Gflops/sec (float)
    3. 65 Gflops/sec (double)
    4. 2 GB ram, 640 cores, ?? clock / (64C)
  2. T2000 in a Lenovo P1 Gen 3
    1. 34 W
    2. 2259 GFlops/sec (float)
    3. 4 Gb ram, 1024 cores, clock 1185 / 7000 MHz
  3. GTX 1650 Max-Q in a Lenovo X1 extreme Gen 3
    1. 35 W
    2. 2580 GFlops/sec (float)
    3. 116 GFlops/sec (double)
    4. 4 Gb ram, 1024 cores, clock 1335 (float) 1860 (double) / 10000 MHz (56C)
  4. RTX 3080 in a MSI Creator 17
    1. 80 W
    2. 5400 GFlops/sec (float)
    3. 284 GFlops/sec (double)
    4. 16 Gb ram, 6144 cores, clock 855 (float) 1755 (double) / 12000 MHz (68C)
      1. Notable power / thermal throttling on this laptop.
  5. EVGA RTX 2080Ti
    1. 260 W
    2. 11800 GFlops / sec (float)
    3. 469 GFlops / sec (double)
    4. 11 Gb ram, 4352 cores, clock 1620 (float) 1905 (double) / 13600 MHz (74C)
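
The numbers above were presumably measured with gpu_burn; a quick way to get comparable saturated matmul figures (an assumption, not the same tool) is a timed PyTorch matrix multiply, counting roughly 2·n³ floating-point operations per multiply:

import time
import torch

def matmul_gflops(n=8192, iters=20, dtype=torch.float32):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b                                    # result discarded; we only time it
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.time() - t0) / 1e9   # GFLOP/s

print(f"float32: {matmul_gflops():.0f} GFLOP/s")
print(f"float64: {matmul_gflops(n=4096, dtype=torch.float64):.0f} GFLOP/s")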

{1564}
ref: -2008 tags: t-SNE dimensionality reduction embedding Hinton date: 01-25-2022 20:39 gmt revision:2 [1] [0] [head]

“Visualizing data using t-SNE”

  • Laurens van der Maaten, Geoffrey Hinton.
  • SNE: stochastic neighbor embedding, Hinton 2002.
  • Idea: model the data conditional pairwise distribution as a Gaussian, with one variance per data point, $p(x_i | x_j)$
  • In the mapped data, this pairwise distribution is modeled as a fixed-variance Gaussian, too, $q(y_i | y_j)$
  • Goal is to minimize the Kullback-Leibler divergence $\Sigma_i KL(p_i || q_i)$ (summed over all data points)
  • Per-data-point variance is found via binary search to match a user-specified perplexity. This amounts to setting an effective number of nearest neighbors; values somewhere between 5 and 50 work OK.
  • Cost function is minimized via gradient descent, starting with a random distribution of points $y_i$, with plenty of momentum to speed up convergence, and noise to effect simulated annealing.
  • The gradient update is remarkably simple: $\frac{\delta C}{\delta y_i} = 2 \Sigma_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$ (see the sketch after this list)
  • t-SNE differs from SNE (above) in that it addresses difficulty in optimizing the cost function, and crowding.
    • Uses a simplified symmetric cost function (symmetric conditional probability, rather than joint probability) with simpler gradients
    • Uses the student’s t-distribution in the low-dimensional map q to reduce crowding problem.
  • The crowding problem is roughly resultant from the fact that, in high-dimensional spaces, the volume of the local neighborhood scales as $r^m$, whereas in 2D, it's just $r^2$. Hence there is cost-incentive to pushing all the points together in the map -- points are volumetrically closer together in high dimensions than they can be in 2D.
    • This can be alleviated by using a one-DOF student distribution, which is the same as a Cauchy distribution, to model the probabilities in map space.
  • Smart -- they plot the topology of the gradients to gain insight into modeling / convergence behavior.
  • Don’t need simulated annealing due to balanced attractive and repulsive effects (see figure).
  • Enhance the algorithm further by keeping it compact at the beginning, so that clusters can move through each other.
  • Look up: d-bits parity task by Bengio 2007
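
A minimal sketch of one t-SNE gradient/momentum step (assuming the perplexity-calibrated, symmetrized affinities P are already computed; the reference implementation adds early exaggeration, adaptive gains, etc.):

import numpy as np

def tsne_step(Y, P, lr=100.0, momentum=0.8, Y_prev=None):
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)   # pairwise map distances
    num = 1.0 / (1.0 + d2)                                   # Student-t (Cauchy) kernel
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                                      # low-dim joint probabilities
    # gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)
    PQ = (P - Q) * num
    grad = 4.0 * ((np.diag(PQ.sum(1)) - PQ) @ Y)
    Y_prev = Y.copy() if Y_prev is None else Y_prev
    Y_new = Y - lr * grad + momentum * (Y - Y_prev)
    return Y_new, Y

# usage: initialize Y randomly, then loop: Y, Yp = tsne_step(Y, P, Y_prev=Yp)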

{1563}
ref: -0 tags: date: 01-09-2022 19:04 gmt revision:1 [0] [head]

The Sony Xperia XZ1 compact is a better phone than an Apple iPhone 12 mini

I don't normally write any personal opinions here -- just half-finished paper notes riddled with typos (haha) -- but this one has been bothering me for a while.

November 2020 I purchased an iPhone 12 mini to replace my aging Sony Xperia XZ1 compact. (Thinking of staying with Android, I tried out a Samsung S10e as well, but didn't like it.) Having owned and used the iPhone for a year and change, I still prefer the Sony. Here is why:

  • Touch screen
    • The iPhone is MUCH more sensitive to sweat than the Sony
    • This is the biggest problem, since I like to move (hike, bike, kayak etc), it lives in my pocket, and inevitably gets a bit of condensation or water on it.
    • The iPhone screen is rendered frustrating to use with even an imperceptible bit of moisture on it.
      • Do iPhone users not sweat?
      • Frequently I can't even select the camera app! Or switch to maps!
        • A halfway fix is to turn the screen off then on again. Halfway.
    • The Sony, in comparison, is relatively robust, and works even with droplets of water on it.
  • Size
    • They are both about the same size with a case, Sony is 129 x 65 x 9.3 mm ; iPhone mini is 131.5 x 64.2 x 7.4mm.
    • This size is absolutely perfect and manufacturers need to make more phones with these dimensions!
    • If anything, the iPhone is better here -- the rounded corners are nice.
  • Battery
    • Hands down, the Sony. Lasts >=2x as long as the iPhone.
  • Processor
    • Both are fast enough.
  • Software
    • Apple is not an ecosystem. No. It's a walled garden where a select few plants may grow. You do what Apple wants you to do.
      • E.g. want to use any Google apps on iPhone? No problem! Want to use any Apple apps on Android or web or PC? Nope, sorry, you have to buy a $$$ MacBook pro.
    • Ok, the privacy on an iPhone is nice. Modulo that bit where they scanned our photos.
      • As well as the ability to manage notifications & basically turn them all off :)
    • There are many more apps on Android, and they are much less restricted in what they can do.
      • For example, recently we were in the desert & wanted a map of where the cell signal was strong, for remote-working. This is easy on Android (there is an app for it).
        • This is impossible on iPhone (the apps don't have access to the information).
      • Second example, you can ssh into an Android and use that to download large files (e.g. packages, datasets) to avoid using limited tethering data.
        • This is also impossible on iPhone.
    • Why does iMessage make all texts from Android users yucky green? Why is there zero option to change this?
    • Why does iMessage send very low resolution photos to my friends and family using Android? It sends beautiful full-res photos to other Apple phones.
    • Why is there no web interface to iMessage?
      • Ugh, this iPhone is such an elitist snob.
    • You can double-tap on the square in Android to quickly switch between apps, which is great.
    • Apple noticeably auto-corrects to a smaller vocabulary than desired. Android is less invasive in this respect.
  • Cell signal
    • They are similarly unreliable, though the iPhone has 5G & many more wireless bands, which is great.
    • Still, frequently I'll have one-two bars of connectivity & yet Google Maps will say "you are offline". This is much less frequent on the Sony.
  • Screen
    • iPhone screen is better.
  • Camera
    • iPhone camera is very very much better.
  • Speaker
    • iPhone speaker much better. But it sure burns the battery.
  • Wifi
    • iPhone will periodically disconnect from Wifi when on Facetime calls. Sony doesn't do this.
      • Facetime only works with Apple devices.
  • Price
    • Sony wins
  • Unlock
    • Face unlock is a cool idea, but we all wear masks now.
    • The Sony has a fingerprint sensor, which is better.
      • In the case where I'm moving (and possibly sweaty), Android is smart enough to allow quick unlock, for access to the camera app or maps. Great feature.

Summary: I'll try to get my money's worth out of the iPhone; when it dies, I'll buy the smallest waterproof Android phone that supports my carrier's bands.

{1561}
ref: -0 tags: date: 01-09-2022 19:03 gmt revision:1 [0] [head]

Cortical response selectivity derives from strength in numbers of synapses

  • Benjamin Scholl, Connon I. Thomas, Melissa A. Ryan, Naomi Kamasawa & David Fitzpatrick
  • "Using electron microscopy reconstruction of individual synapses as a metric of strength, we find no evidence that strong synapses have a predominant role in the selectivity of cortical neuron responses to visual stimuli. Instead, selectivity appears to arise from the total number of synapses activated by different stimuli."
  • "Our results challenge the role of Hebbian mechanisms in shaping neuronal selectivity in cortical circuits, and suggest that selectivity reflects the co-activation of large populations of presynaptic neurons with similar properties and a mixture of strengths. "
    • Interesting -- so this is consistent with ANNs / feature detectors / vector hypothesis.
    • It would imply that the mapping is dense rather than sparse -- but to see this, you'd need to record the activity of all these synapses in realtime.
      • Which is possible (e.g. light beads, fast axial focusing), just rather difficult for now.
  • To draw really firm conclusions, would need a thorough stimulus battery, not just drifting gratings.
    • It may change this result: "Surprisingly, the strength of individual synapses was uncorrelated with functional similarity to the somatic output (that is, absolute orientation preference difference)"

{842}
ref: work-0 tags: distilling free-form natural laws from experimental data Schmidt Cornell automatic programming genetic algorithms date: 12-30-2021 05:11 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Distilling free-form natural laws from experimental data

  • The critical step was to use the full set of all pairs of partial derivatives ($\delta x / \delta y$) to evaluate the search for invariants (see the sketch after this list).
  • The selection of which partial derivatives are held to be independent / which variables are dependent is a bit of a trick too -- see the supplemental information.
    • Even so, with a 4D data set the search for natural laws took ~ 30 hours.
  • This was via a genetic algorithm, distributed among 'islands' on different CPUs, with mutation and single-point crossover.
  • Not sure what the IL is, but it appears to be floating-point assembly.
  • Timeseries data is smoothed with Loess smoothing, which fits a polynomial to the data, and hence allows for smoother / more analytic derivative calculation.
    • Then again, how long did it take humans to figure out these invariants? (Went about it in a decidedly different way..)
    • Further, how long did it take for biology to discover similar 'design equations'?
      • The same algorithm has been applied to biological data - a metabolic pathway - with some success (published in 2011).
      • Of course evolution had to explore a much larger space - proteins and regulatory pathways, not simpler mathematical expressions / linkages.
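
A toy reconstruction of that partial-derivative criterion (the paper's exact error measure and search procedure differ; the harmonic-oscillator data and thresholds here are assumptions): if f(x, y) is conserved along a trajectory, then df/dt = 0 implies dx/dt / dy/dt = -(df/dy)/(df/dx), so a candidate can be scored by how well its implied derivative ratio matches the ratio of numerical derivatives of the (smoothed) data.

import numpy as np

def score_invariant(f_dx, f_dy, x, y, dt):
    """Score a candidate conserved quantity by matching derivative ratios."""
    xdot, ydot = np.gradient(x, dt), np.gradient(y, dt)
    ok = np.abs(ydot) > 0.05 * np.abs(ydot).max()        # avoid near-zero denominators
    data_ratio = xdot[ok] / ydot[ok]
    model_ratio = -f_dy(x, y)[ok] / f_dx(x, y)[ok]
    return -np.mean(np.log(1.0 + np.abs(data_ratio - model_ratio)))

# Toy data: a harmonic oscillator, x = position, y = velocity.
t = np.linspace(0.0, 20.0, 2000)
x, y = 2.0 * np.sin(t), 2.0 * np.cos(t)
dt = t[1] - t[0]

print(score_invariant(lambda x, y: x, lambda x, y: y, x, y, dt))   # partials of (x^2+y^2)/2: good (near 0)
print(score_invariant(lambda x, y: np.ones_like(x), lambda x, y: x, x, y, dt))   # mismatched candidate: worse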


Since his PhD, Michael Schmidt has gone on to found Nutonian, which produced the Eureqa software, apparently without dramatic new features other than being able to use the cloud for equation search. (Probably he improved many other detailed facets of the software.) Nutonian received $4M in seed funding, according to Crunchbase.

In 2017, Nutonian was acquired by Data Robot (for an undisclosed amount), where Michael has worked since, rising to the title of CTO.

Always interesting to follow up on the authors of these classic papers!

{1562}
ref: -0 tags: SAT solver blog post date: 12-30-2021 00:29 gmt revision:0 [head]

Modern SAT solvers: fast, neat and underused (part 1 of N)

A set of posts that are worth re-reading.

{1560}
ref: -2021 tags: synaptic imaging weights 2p oregon markov date: 12-29-2021 23:30 gmt revision:2 [1] [0] [head]

Distinct in vivo dynamics of excitatory synapses onto cortical pyramidal neurons and parvalbumin-positive interneurons

  • Joshua B.Melander, Aran Nayebi, Bart C.Jongbloets, Dale A.Fortin, Maozhen Qin, Surya Ganguli, Tianyi Mao, Haining Zhong
  • Cre-dependent mVenus-labeled PSD-95, in both excitatory pyramidal neurons & inhibitory PV interneurons.
  • Morphology labeled with tdTomato.
  • Longitudinal imaging of individual excitatory post-synaptic densities; estimated synaptic weight from fluorescence; examined spine appearance and disappearance.
  • PV synapses were more stable over the 24-day period than synapses on pyramidal neurons.
  • Likewise, large synapses were more likely to remain over the imaging period.
  • Both followed log-normal distributions in 'strengths'
  • Changes were well modeled by a Markov process, which puts high probability on small changes.
  • But these changes are multiplicative (+ additive component in PV cells)
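
A toy numerical sketch of that last point (parameters are made up): small multiplicative kicks to weights produce a log-normal distribution of strengths, matching the observation above.

import numpy as np

rng = np.random.default_rng(0)
w = np.full(10_000, 1.0)                        # initial synaptic weights
for _ in range(24):                             # ~daily steps over the imaging period
    w *= np.exp(rng.normal(0.0, 0.1, w.size))   # multiplicative Markov kicks
    # w += rng.normal(0.0, 0.01, w.size)        # optional small additive component (PV-like)

logw = np.log(w)
skew = np.mean((logw - logw.mean()) ** 3) / logw.std() ** 3
print(f"log-weight mean {logw.mean():.2f}, std {logw.std():.2f}, skew {skew:.2f}")  # skew ~ 0 -> log-normal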

{1559}
ref: -0 tags: diffusion models image generation OpenAI date: 12-24-2021 05:50 gmt revision:0 [head]

Some investigations into denoising models & their intellectual lineage:

Deep Unsupervised Learning using Nonequilibrium Thermodynamics 2015

  • Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
  • Starting derivation of using diffusion models for training.
  • Very roughly, the idea is to destroy the structure in an image with per-pixel diagonal Gaussian noise, and train an inverse-diffusion model to remove the noise at each step. Then start with Gaussian noise and reverse-diffuse an image.
  • Diffusion can take 100s - 1000s of steps; steps are made small to preserve the assumption that the conditional probability $p(x_{t-1}|x_t) \propto N(0, I)$
    • The time variable here goes from 0 (uncorrupted data) to T (fully corrupted / Gaussian noise)

Generative Modeling by Estimating Gradients of the Data Distribution July 2019

  • Yang Song, Stefano Ermon

Denoising Diffusion Probabilistic Models June 2020

  • Jonathan Ho, Ajay Jain, Pieter Abbeel
  • A diffusion model that can output 'realistic' images (low FID / low log-likelihood )

Improved Denoising Diffusion Probabilistic Models Feb 2021

  • Alex Nichol, Prafulla Dhariwal
  • This is directly based on Ho 2020 and Sohl-Dickstein 2015, but with tweaks
  • The objective is no longer the log-likelihood of the data given the parameters (per pixel); it's now mostly the MSE between the corrupting noise (which is known) and the estimated noise.
  • That is, the neural network model attempts, given $x_t$, to estimate the noise which corrupted it, which then can be used to produce $x_{t-1}$
    • Simplicity. Satisfying.
  • They also include a reweighted version of the log-likelihood loss, which puts more emphasis on the first few steps of noising. These steps are more important for NLL; reweighting also smooths the loss.
    • I think that, per Ho above, the simple MSE loss is sufficient to generate good images, but the reweighted LL improves the likelihood of the parameters.
  • There are some good, crunchy mathematical details on how exactly the mean and variance of the estimated Gaussian distributions are handled -- at each noising step, you need to scale the mean down to prevent a Brownian / random-walk drift.
    • Taking these further, you can estimate an image at any point t in the forward diffusion chain. They use this fact to optimize the function approximator (a neural network; more later) using a (random but re-weighted/scheduled) t and the LL loss + simple loss.
  • Ho 2020 above treats the variance of the noising Gaussian as fixed -- that is, $\beta$; this paper improves the likelihood by adjusting the noise variance mostly at the last steps via $\tilde{\beta}_t$, and then further allowing the function approximator to tune the variance (a multiplicative factor) per inverse-diffusion timestep.
    • TBH I'm still slightly foggy on how you go from estimating noise (this seems like samples, concrete) to then estimating variance (which is variational?). hmm.
  • Finally, they schedule the forward noising with a cosine^2, rather than a linear ramp. This makes the last phases of corruption more useful.
  • Because they have an explicit parameterization of the noise variance, they can run the inverse diffusion (e.g. image generation) faster -- rather than 4000 steps, which can take a few minutes on a GPU, they can step up the variance and run it for only 50 steps and get nearly as good images.
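
A minimal sketch of the pieces described above -- the closed-form forward noising, the 'simple' noise-prediction MSE loss, and the cosine alpha-bar schedule. The noise-prediction network signature model(xt, t), the batch of images x0 with shape (B, C, H, W), and the constants are assumptions, not the papers' exact configuration.

import torch

def cosine_alpha_bar(T=4000, s=0.008):
    # cosine schedule: alpha_bar(t) follows a squared cosine, so the last
    # noising steps destroy information more gradually than a linear ramp.
    t = torch.linspace(0, T, T + 1) / T
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    return f / f[0]                                       # alpha_bar_t, shape (T+1,)

def simple_loss(model, x0, alpha_bar):
    T = alpha_bar.shape[0] - 1
    t = torch.randint(1, T + 1, (x0.shape[0],))           # random timestep per example
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)                            # the known corrupting noise
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps           # closed-form forward diffusion
    return torch.nn.functional.mse_loss(model(xt, t), eps)   # predict the noise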

Diffusion Models Beat GANs on Image Synthesis May 2021

  • Prafulla Dhariwal, Alex Nichol

In all of the above, it seems that the inverse-diffusion function approximator is a minor player in the paper -- but of course, it's vitally important to making the system work. In some sense, this 'diffusion model' is as much a means of training the neural network as it is a (rather inefficient, compared to GANs) way of sampling from the data distribution. In Nichol & Dhariwal Feb 2021, they use a U-net convolutional network (e.g. start with few channels, downsample and double the channels until there are 128-256 channels, then upsample x2 and halve the channels), including multi-headed attention. Ho 2020 used single-headed attention only at the 16x16 level. Ho 2020 in turn was based on PixelCNN++

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Jan 2017

  • Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

which is an improvement (e.g. adds self-attention layers) on

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu

Most recently,

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen

Added text-conditional generation + many more parameters + much more compute to yield very impressive image results + in-painting. This last effect is enabled by the fact that it's a full generative denoising probabilistic model -- you can condition on other parts of the image!

{1558}
ref: -2021 tags: hippocampal behavior scale plasticity Magee Romani Bittner date: 12-20-2021 22:39 gmt revision:0 [head]

Bidirectional synaptic plasticity rapidly modifies hippocampal representations

  • Normal Hebbian plasticity depends on pre and post synaptic activity & their time course.
  • Three-factor plasticity depends on pre, post, and neuromodulatory activity, typically formalized as an eligibility trace (ET) and instructive signal (IS).
  • Here they show that dendritic-plateau dependent hippocampal place field generation, in particular LTD, is not (quite so) dependent on postsynaptic activity.
  • Instead, it appears to be a 'register update' operation, where a new pattern is remembered (through LTP) and an old pattern is forgotten (through LTD).
    • That is, the synapses are updating information, not accumulating information.
  • The eq for a single synapse: $\Delta W / \delta t = (W_{max} - W) \, k^+ q^+(ET \cdot IS) - W \, k^- q^-(ET \cdot IS)$ (see the numerical sketch after this list)
    • Where k are the learning rates, and q are the nonlinear functions regulating potentiation / depression based on eligibility trace and instructive signal.
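
A numerical sketch of that update rule (the k's and the saturating q(·) functions here are illustrative assumptions; the paper fits them to data). The point it illustrates: the weight relaxes toward a fixed point set by the ratio of potentiation to depression drive, i.e. an overwrite rather than an accumulation.

import numpy as np

def dw_dt(W, ET, IS, W_max=2.0, k_plus=1.0, k_minus=0.5):
    drive = ET * IS                                # overlap of eligibility trace and instructive signal
    q_plus = drive / (drive + 0.5)                 # assumed saturating potentiation nonlinearity
    q_minus = drive / (drive + 0.1)                # assumed saturating depression nonlinearity
    return (W_max - W) * k_plus * q_plus - W * k_minus * q_minus

# Euler-integrate one plateau event: weights starting high or low converge to
# the same value, i.e. the update overwrites rather than accumulates.
W, dt = np.array([0.1, 1.0, 1.8]), 0.01
for _ in range(1000):
    W = W + dt * dw_dt(W, ET=1.0, IS=1.0)
print(np.round(W, 2))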

I'm still not 100% sure that this excludes any influence of presynaptic activity ... they didn't control for that. But certainly LTD in their model does not require postsynaptic activity; indeed, it may only require net-synaptic homeostasis.

{1557}
ref: -0 tags: SVD vocabulary english latent vector space Plato date: 12-20-2021 22:27 gmt revision:1 [0] [head]

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge

  • A whole lot of verbiage here for an old, important, but relatively straightforward result:
    • Take ~30k encyclopedia articles.
    • From them, make a vocabulary of ~ 60k words.
    • Form a sparse matrix with rows being the vocabulary word, and columns being the encyclopedia article.
    • Perform large, sparse SVD on this matrix.
      • How? He doesn't say.
    • Take the top 300 singular values & associated V vectors, and use these as an embedding space for vocabulary.
  • The 300-dim embedding can then be used to perform analysis to solve TOEFL synonym problems
    • Map the cue and the multiple choice query words to 300-dim space, and select the one with the highest cosine similarity.

The fact that SVD works at all, and pulls out some structure, is interesting! Not nearly as good as word2vec.
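
The recipe above, sketched with a sparse truncated SVD (scipy's Lanczos-based svds is an assumption about the 'how' left unstated in the paper; the random count matrix is a stand-in for real word-by-document counts, and is smaller than the paper's 60k x 30k for speed):

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# rows = vocabulary words, columns = articles (placeholder random counts)
X = sparse_random(6000, 3000, density=1e-3, format="csr", random_state=0)
U, S, Vt = svds(X, k=300)              # truncated SVD of the sparse matrix
word_vecs = U * S                      # ~300-dim word embeddings (left singular vectors)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# TOEFL-style synonym question: pick the candidate closest to the cue word.
cue, candidates = 0, [10, 20, 30, 40]  # hypothetical vocabulary indices
best = max(candidates, key=lambda i: cosine(word_vecs[cue], word_vecs[i]))
print(best)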

{1556}
ref: -0 tags: concept net NLP transformers graph representation knowledge date: 11-04-2021 17:48 gmt revision:0 [head]

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

  • From a team at the University of Washington / Allen Institute for Artificial Intelligence.
  • Courtesy of Yannic Kilcher's youtube channel.
  • General idea: use GPT-3 as a completion source given a set of prompts, like:
    • X starts running
      • So, X gets in shape
    • X and Y engage in an argument
      • So, X wants to avoid Y.
  • There are only 7 linkage atoms (edges, so to speak) in these queries, but of course many actions / direct objects.
    • These prompts are generated from the Atomic 20-20 human-authored dataset.
    • The prompts are fed into 175B parameter DaVinci model, resulting in 165k examples in the 7 linkages after cleaning.
    • In turn the 165k are fed into a smaller version of GPT-3, Curie, that generates 6.5M text examples, aka Atomic 10x.
  • Then filter the results via a second critic model, based on fine-tuned RoBERTa & human supervision to determine if a generated sentence is 'good' or not.
  • By throwing away 62% of Atomic 10x, they get a student accuracy of 96.4%, much better than the human-designed knowledge graph.
    • They suggest that one way this works is by removing degenerate outputs from GPT-3.

Human-designed knowledge graphs are described here: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge

And employed for profit here: https://www.luminoso.com/

{1549}
ref: -0 tags: gtk.css scrollbar resize linux qt5 date: 10-28-2021 18:47 gmt revision:3 [2] [1] [0] [head]

Put this in ~/.config/gtk-3.0/gtk.css to make scrollbars larger on high-DPI screens. ref

.scrollbar {
  -GtkScrollbar-has-backward-stepper: 1;
  -GtkScrollbar-has-forward-stepper: 1;
  -GtkRange-slider-width: 16;
  -GtkRange-stepper-size: 16;
}
scrollbar slider {
    /* Size of the slider */
    min-width: 16px;
    min-height: 16px;
    border-radius: 16px;

    /* Padding around the slider */
    border: 2px solid transparent;
}

.scrollbar.vertical slider,
scrollbar.vertical slider {
    min-height: 16px;
    min-width: 16px;
}

.scrollbar.horizontal.slider,
scrollbar.horizontal slider {
    min-width: 16px;
    min-height: 16px;
}

/* Scrollbar trough squeezes when cursor hovers over it. Disabling that
 */

.scrollbar.vertical:hover:dir(ltr),
.scrollbar.vertical.dragging:dir(ltr) {
    margin-left: 0px;
}

.scrollbar.vertical:hover:dir(rtl),
.scrollbar.vertical.dragging:dir(rtl) {
    margin-right: 0px;
}

.scrollbar.horizontal:hover,
.scrollbar.horizontal.dragging,
.scrollbar.horizontal.slider:hover,
.scrollbar.horizontal.slider.dragging {
    margin-top: 0px;
}
undershoot.top, undershoot.right, undershoot.bottom, undershoot.left { background-image: none; }

Also add:

export GTK_OVERLAY_SCROLLING=0 
to your ~/.bashrc

To make the scrollbars a bit easier to see in QT5 applications, run qt5ct (after apt-getting it), and add in a new style sheet, /usr/share/qt5ct/qss/scrollbar-simple-backup.qss

/* SCROLLBARS (NOTE: Changing 1 subcontrol means you have to change all of them)*/
QScrollBar{
  background: palette(alternate-base);
}
QScrollBar:horizontal{
  margin: 0px 0px 0px 0px;
}
QScrollBar:vertical{
  margin: 0px 0px 0px 0px;
}
QScrollBar::handle{
  background: #816891;
  border: 1px solid transparent;
  border-radius: 1px;
}
QScrollBar::handle:hover, QScrollBar::add-line:hover, QScrollBar::sub-line:hover{
  background: palette(highlight);
}
QScrollBar::add-line{
subcontrol-origin: none;
}
QScrollBar::add-line:vertical, QScrollBar::sub-line:vertical{
height: 0px;
}
QScrollBar::add-line:horizontal, QScrollBar::sub-line:horizontal{
width: 0px;
}
QScrollBar::sub-line{
subcontrol-origin: none;
}

{1555}
ref: -0 tags: adaptive optics two photon microscopy date: 10-26-2021 18:17 gmt revision:1 [0] [head]

Recently I've been underwhelmed by the performance of adaptive optics (AO) for imaging head-fixed cranial-window mice. There hasn't been much of an improvement, despite significant optimization effort. This raises the question: where are AO microscopes used?

When the purpose of a paper is to explain and qualify a novel AO approach, the improvement is always good, >> 2x. Yet in the one paper (first below) where the purpose was neuroscience, not optics, the results are less inspiring. Are the results in the optics papers cherry-picked?

Thalamus provides layer 4 of primary visual cortex with orientation- and direction-tuned inputs Wenzhi Sun, Zhongchao Tan, Brett D Mensh & Na Ji 2016 https://www.nature.com/articles/nn.4196

  • This is the primary (only?) paper where AO was used, but the focus was biology: measuring the tuning properties of thalamic boutons in mouse visual cortex. Which they did, well!
  • Surprisingly, the largest improvement was not from using AO, but rather from thinning the cranial window from 340um to 170um.
  • "With a 340-μm-thick cranial window, 70% of all boutons appeared to be non-responsive to visual stimuli and only 7% satisfied OS criteria. With a thinner cranial window of 170-μm thickness, we found that 31% of boutons satisfied OS criteria (of total n = 1,302, 5 mice), which was still substantially fewer than 48% OS boutons as determined when the same boutons (n = 1,477, 5 mice) were imaged after aberration correction by adaptive optics"

Direct wavefront sensing for high-resolution in vivo imaging in scattering tissue Kai Wang, Wenzhi Sun, Christopher T. Richie, Brandon K. Harvey, Eric Betzig & Na Ji, 2015 https://www.nature.com/articles/ncomms8276

  • Direct wavefront sensing using indocyanine green + Andor iXon 897 EMCCD Shack-Hartmann wavefront sensor (read: expensive).
  • Alpao DM97-15, basically the same as ours.
  • Fairly local wavefront corrections, see figure 2.
  • Also note that these wavefront corrections seem low-order, hence should be correctable via a DM

Multiplexed aberration measurement for deep tissue imaging in vivo Chen Wang, Rui Liu, Daniel E Milkie, Wenzhi Sun, Zhongchao Tan, Aaron Kerlin, Tsai-Wen Chen, Douglas S Kim & Na Ji 2014 https://www.nature.com/articles/nmeth.3068

  • Use a DMD (including a dispersion pre-compensator) to amplitude modulate phase ramps on a wavefront-modulating SLM. Each phase-ramp segment of the SLM was modulated at a different frequency, allowing for the optimal phase to be pulled out later through a Fourier transform.
  • Again, very good performance at depth in the mouse brain.

{1554}
ref: -2021 tags: FIBSEM electron microscopy presynaptic plasticity activity Funke date: 10-12-2021 17:03 gmt revision:0 [head]

Ultrastructural readout of in vivo synaptic activity for functional connectomics

  • Anna Simon, Arnd Roth, Arlo Sheridan, Mehmet Fişek, Vincenzo Marra, Claudia Racca, Jan Funke, Kevin Staras, Michael Häusser
  • Did FIB-SEM on FM1-43 dye labeled synapses, then segmented the cells using machine learning, as Jan has pioneered.
    • FM1-43FX is membrane impermeable, and labels only synaptic vesicles that have been recycled after dye loading. (Invented in 1992!)
    • FM1-43FX is also able to photoconvert diaminobenzidine (DAB) into an amorphous, highly conjugated polymer with high affinity for osmium tetroxide
  • This allows for a snapshot of ultrastructural presynaptic plasticity / activity.
  • N=84 boutons, but n=7 pairs / triples of boutons from the same axon.
    • These boutons have the same presynaptic spiking activity, and hence are expected to have the same release probability, and hence the same photoconversion (PC) labeling.
      • But they don't! The ratio of PC+ vesicle numbers between boutons on the same neuron is low, mean < 0.4, which suggests some boutons have high neurotransmitter release and recycling, others have low...
  • Quote in the abstract: We also demonstrate that neighboring boutons of the same axon, which share the same spiking activity, can differ greatly in their presynaptic release probability.
    • Well, sorta, the data here is a bit weak. It might all be lognormal fluctuations, as has been well demonstrated.
    • When I read it I was excited to think of the influence of presynaptic inhibition / modulation, which has not been measured here, but is likely to be important.

{1529}
ref: -2020 tags: dreamcoder ellis program induction ai tenenbaum date: 10-10-2021 17:32 gmt revision:2 [1] [0] [head]

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

  • Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum

This paper describes a system for adaptively finding programs which succinctly and accurately produce desired output. These desired outputs are provided by the user / test system, and come from a number of domains:

  • list (as in lisp) processing,
  • text editing,
  • regular expressions,
  • line graphics,
  • 2d lego block stacking,
  • symbolic regression (ish),
  • functional programming,
  • and physical laws.
Some of these domains are naturally toy-like, e.g. the text processing, but others are deeply impressive: the system was able to "re-derive" basic physical laws of vector calculus in the process of looking for S-expression forms of cheat-sheet physics equations. These advancements result from a long lineage of work, perhaps starting from the Helmholtz machine PMID-7584891 introduced by Peter Dayan, Geoff Hinton and others, where one model is trained to generate patterns given context (e.g.) while a second recognition module is trained to invert this model: derive context from the patterns. The two work simultaneously to allow model-exploration in high dimensions.

Also in the lineage is the EC2 algorithm, which most of the same authors above published in 2018. EC2 centers around the idea of "explore - compress": explore solutions to your program induction problem during the 'wake' phase, then compress the observed programs into a library by extracting/factoring out commonalities during the 'sleep' phase. This of course is one of the core algorithms of human learning: explore options, keep track of both what worked and what didn't, search for commonalities among the options & their effects, and use these inferred laws or heuristics to further guide search and goal-setting, thereby building a buffer against the curse of dimensionality. Making the inferred laws themselves functions in a programming library allows hierarchically factoring the search task, making exploration of unbounded spaces possible. This advantage is unique to the program synthesis approach.
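
A cartoon of the explore-compress loop on a toy integer-function DSL (my illustration of the idea, not EC2's or DreamCoder's actual algorithm): wake enumerates small compositions of library primitives that solve the tasks; sleep factors the most common sub-composition out into a new library primitive, so later searches are shallower.

from itertools import product
from collections import Counter

library = {"inc": lambda x: x + 1, "dbl": lambda x: 2 * x, "sq": lambda x: x * x}
tasks = {"f1": lambda x: 2 * x + 2, "f2": lambda x: (2 * x + 2) ** 2}   # target behaviors

def run(seq, x):
    for name in seq:
        x = library[name](x)
    return x

def solves(seq, task, probes=range(5)):
    return all(run(seq, x) == task(x) for x in probes)

# Wake: brute-force search for programs (compositions) that solve each task.
solutions = {}
for depth in range(1, 4):
    for seq in product(library, repeat=depth):
        for tname, task in tasks.items():
            if tname not in solutions and solves(seq, task):
                solutions[tname] = seq

# Sleep (compression): add the most common adjacent pair as a new primitive.
pairs = Counter(s[i:i + 2] for s in solutions.values() for i in range(len(s) - 1))
(a, b), _ = pairs.most_common(1)[0]
library[f"{a}>{b}"] = lambda x, a=a, b=b: library[b](library[a](x))
print(solutions, "new primitive:", f"{a}>{b}")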

This much is said in the introduction, though perhaps with more clarity. DreamCoder is an improved, more-accessible version of EC2, though the underlying ideas are the same. It differs in that the method for constructing libraries has improved through the addition of a powerful version space for enumerating and evaluating refactors of the solutions generated during the wake phase. (I will admit that I don't much understand the version space system.) This version space allows DreamCoder to collapse the search space for re-factorings by many orders of magnitude, and seems to be a clear advancement. Furthermore, DreamCoder incorporates a second phase of sleep: "dreaming", hence the moniker. During dreaming the library is used to create 'dreams' consisting of combinations of the library primitives, which are then executed with training data as input. These dreams are then used to train up a neural network to predict which library and atomic objects to use in given contexts. Context in this case is where in the parse tree a given object has been inserted (its parent and which argument number it sits in); how the data-context is incorporated to make this decision is not clear to me (???).

This dream- and replay-trained neural network is either a GRU recurrent net with 64 hidden states, or a convolutional network feeding into a RNN. The final stage is a linear ReLU (???), which again is not clear how it feeds into the prediction of "which unit to use when". The authors clearly demonstrate that the network, or the probabilistic context-free grammar that it controls (?), is capable of straightforward optimizations, like breaking symmetries due to commutativity, avoiding adding zero, avoiding multiplying by one, etc. Beyond this, they do demonstrate via an ablation study that the presence of the neural network affords significant algorithmic leverage in all of the problem domains tested. The network also seems to learn a reasonable representation of the sub-type of task encountered -- but a thorough investigation of how it works, or how it might be made to work better, remains desired.

I've spent a little time looking around the code, which is a mix of Python high-level experimental control code, and lower-level OCaml code responsible for running (emulating) the lisp-like DSL, inferring types in its polymorphic system / reconciling types in evaluated program instances, maintaining the library, and recompressing it using aforementioned version spaces. The code, like many things experimental, is clearly a work in progress, with some old or unused code scattered about, glue to run the many experiments & record / analyze the data, and personal notes from the first author for making his job talks (! :). The description in the supplemental materials, which is satisfyingly thorough (if again impenetrable wrt version spaces), is readily understandable, suggesting that one (presumably the first) author has a clear understanding of the system. It doesn't appear that much is being hidden or glossed over, which is not the case for all scientific papers.


With the caveat that I don't claim to understand the system to completion, there are some clear areas where the existing system could be augmented further. The 'recognition' or perceptual module, which guides actual synthesis of candidate programs, realistically could use as much information as is available in DreamCoder: full lexical and semantic scope, full input-output specifications, type information, possibly runtime binding of variables when filling holes. This is motivated by the way that humans solve problems, at least as observed by introspection:
  • Examine problem, specification; extract patterns (via perceptual modules)
  • Compare patterns with existing library (memory) of compositionally-factored 'useful solutions' (this is identical to the library in DreamCoder)
  • Do something like beam-search or quasi-stochastic search on selected useful solutions. This is the same as DreamCoder; however, human engineers make decisions progressively, at runtime so-to-speak: you fill not one hole per cycle, but many holes. The addition of recursion to DreamCoder, provided a wider breadth of input information, could support this functionality.
  • Run the program to observe input-output .. but also observe the inner workings of the program, eg. dataflow patterns. These dataflow patterns are useful to human engineers when both debugging and when learning-by-inspection what library elements do. DreamCoder does not really have this facility.
  • Compare the current program results to the desired program output. Make a stochastic decision whether to try to fix it, or to try another beam in the search. Since this would be on a computer, this could be in parallel (as DreamCoder is); the ability to 'fix' or change a DUT is directly absent from DreamCoder. As a 'deeply philosophical' aside, this loop itself might be the effect of running a language-of-thought program, as was suggested by pioneers in AI (ref). The loop itself is subject to modification and replacement based on goal-seeking success in the domain of interest, in a deeply-satisfying and deeply recursive manner ...
At each stage in the pipeline, the perceptual modules would have access to relevant variables in the current problem-solving context. This is modeled on Jacques Pitrat's work. Humans of course are even more flexible than that -- context includes roughly the whole brain, and if anything we're mushy on which level of the hierarchy we are working.

Critical to making this work is to have, as I've written in my notes many years ago, a 'self compressing and factorizing memory'. The version space magic + library could be considered a working example of this. In the realm of ANNs, per recent OpenAI results with CLIP and Dall-E, really big transformers also seem to have strong compositional abilities, with the caveat that they need to be trained on segments of the whole web. (This wouldn't be an issue here, as Dreamcoder generates a lot of its own training data via dreams). Despite the data-inefficiency of DNN / transformers, they should be sufficient for making something in the spirit of above work, with a lot of compute, at least until more efficient models are available (which they should be shortly; see AlphaZero vs MuZero).

{1553}
ref: -2020 tags: excitatory inhibitory balance E-I synapses date: 10-06-2021 17:50 gmt revision:1 [0] [head]

Whole-Neuron Synaptic Mapping Reveals Spatially Precise Excitatory/Inhibitory Balance Limiting Dendritic and Somatic Spiking

We mapped over 90,000 E and I synapses across twelve L2/3 PNs and uncovered structured organization of E and I synapses across dendritic domains as well as within individual dendritic segments. Despite significant domain-specific variation in the absolute density of E and I synapses, their ratio is strikingly balanced locally across dendritic segments. Computational modeling indicates that this spatially precise E/I balance dampens dendritic voltage fluctuations and strongly impacts neuronal firing output.

I think this would be tenuous on its own, but they did do patch-clamp recordings to back it up, and it's vitally interesting from a structural standpoint. Plus, this is an enjoyable, well-written paper :-)

{1544}
ref: -2019 tags: HSIC information bottleneck deep learning backprop gaussian kernel date: 10-06-2021 17:23 gmt revision:5 [4] [3] [2] [1] [0] [head]

The HSIC Bottleneck: Deep learning without Back-propagation

In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure.

The information bottleneck was proposed by Tishby, Pereira & Bialek in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input:

$min_{P_{T_i | X}} \; I(X; T_i) - \beta I(T_i; Y)$

Where $T_i$ is the hidden representation at layer i (later, the output), $X$ is the layer input, and $Y$ are the labels. By replacing $I(\cdot)$ with the HSIC, and some derivation (?), they show that

$HSIC(D) = (m-1)^{-2} \, tr(K_X H K_Y H)$

Where $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, it's the kernel function applied to all pairs of (vectoral) input variables. $H$ is the centering matrix. The kernel is simply a Gaussian kernel, $k(x, y) = \exp(-\frac{1}{2} ||x - y||^2 / \sigma^2)$. So, if all the x and y are on average independent, then the inner-product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and negative, i think). In practice they use three different widths for their kernel, and they also center the kernel matrices.
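
A direct numeric sketch of that estimator (the kernel width sigma and the toy data are placeholders; as noted above, the paper actually uses three kernel widths):

import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    d2 = np.square(Z[:, None, :] - Z[None, :, :]).sum(-1)
    return np.exp(-0.5 * d2 / sigma**2)

def hsic(X, Y, sigma=1.0):
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m                 # centering matrix
    Kx, Ky = gaussian_kernel(X, sigma), gaussian_kernel(Y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (m - 1) ** 2     # biased HSIC estimate

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
print(hsic(X, rng.normal(size=(256, 2))))               # small: independent
print(hsic(X, X ** 2))                                   # larger: dependent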

But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this...

For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks albeit in a much less intelligible way.


Robust Learning with the Hilbert-Schmidt Independence Criterion

Is another, later, paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given above nomenclature, $E_X( P_{T_i | X} I(X; T_i) ) = 0$ (I'm not totally sure about the weighting, but might be required given the definition of the HSIC.)

As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)