ref: -2020 tags: replay hippocampus variational autoencoder date: 10-11-2020 04:09 gmt revision:1 [0] [head]

Brain-inspired replay for continual learning with artificial neural networks

  • Gido M. van de Ven, Hava Siegelmann, Andreas Tolias
  • In the real world, samples are not replayed in shuffled order -- they occur in a sequence, typically only a few times. Hence, for training an ANN (or any NN), you need to 'replay' samples.
    • Perhaps, to get at hidden structure not obvious on first pass through the sequence.
    • In the brain, reactivation / replay likely to stabilize memories.
      • Strong evidence that this occurs through sharp-wave ripples (or the underlying activity associated with this).
  • Replay is also used to combat a common problem in training ANNs - catastrophic forgetting.
    • Generally you just re-sample from your database (easy), though in real-time applications, this is not possible.
      • It might also take a lot of memory (though that is cheap these days) or violate privacy (though again who cares about that)

  • They study two different classification problems:
    • Task incremental learning (Task-IL)
      • Agent has to serially learn distinct tasks
      • OK for Atari, doesn't make sense for classification
    • Class incremental learning (Class-IL)
      • Agent has to learn one task incrementally, one/few classes at a time.
        • Like learning 2 digits at a time in MNIST
        • But is tested on all digits shown so far.
  • Solved via Generative Replay (GR, ~2017)
  • Use a recursive formulation: 'old' generative model is used to generate samples, which are then classified and fed, interleaved with the new samples, to the new network being trained.
    • 'Old' samples can be infrequent -- it's easier to reinforce an existing memory rather than create a new one.
    • Generative model is a VAE.
  • Compared with some existing solutions to catastrophic forgetting:
    • Methods to protect parameters in the network important for previous tasks
      • Elastic weight consolidation (EWC)
      • Synaptic intelligence (SI)
        • Both methods maintain estimates of how influential parameters were for previous tasks, and penalize changes accordingly.
        • "metaplasticity"
        • Synaptic intelligence: measure the loss change relative to the individual weights.
        • $\delta L = \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \delta t$ ; converted into discrete time / SGD: $L = \Sigma_k \omega_k = \Sigma \int \frac{\delta L}{\delta \theta} \frac{\delta \theta}{\delta t} \delta t$
        • $\omega_k$ are then the weightings for how much each parameter's change contributed to the training improvement.
        • Use this as a per-parameter regularization strength, scaled by one over the square of 'how far it moved'.
        • This is added to the loss, so that the network is penalized for moving important weights.
    • Context-dependent gating (XdG)
      • To reduce interference between tasks, a random subset of neurons is gated off (inhibition), depending on the task.
    • Learning without forgetting (LwF)
      • Method replays current task input after labeling them (incorrectly?) using the model trained on the previous tasks.
  • Generative replay works on Class-IL!
  • And is robust -- not too many samples or hidden units needed (for MNIST)
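The synaptic-intelligence bookkeeping described above can be sketched in a few lines. This is a minimal numpy version: function names are mine, and the damping term xi and scaling c follow the published formulation only loosely.

```python
import numpy as np

def si_update(omega, grad, delta_theta):
    """Accumulate each parameter's contribution to the loss decrease:
    omega_k += -g_k * dtheta_k (discrete-time path integral of
    (dL/dtheta)(dtheta/dt) dt)."""
    return omega - grad * delta_theta

def si_importance(omega, total_change, xi=1e-3):
    """Per-parameter regularization strength: accumulated contribution,
    scaled by one over the squared total displacement (plus damping xi)."""
    return omega / (total_change ** 2 + xi)

def si_penalty(theta, theta_star, importance, c=0.1):
    """Surrogate loss term penalizing movement of important weights
    away from their post-task values theta_star."""
    return c * np.sum(importance * (theta - theta_star) ** 2)
```

`si_update` runs every SGD step; at a task boundary the accumulated omega is converted to a per-parameter penalty that is added to the next task's loss.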

  • Yet the generative replay system does not scale to CIFAR or permuted MNIST.
  • E.g. if you take the MNIST pixels, permute them based on a 'task', and ask a network to still learn the character identities, it can't do it ... though synaptic intelligence can.
  • Their solution is to make 'brain-inspired' modifications to the network:
    • RtF, Replay-through-feedback: the classifier and generator network are fused. Latent vector is the hippocampus. Cortex is the VAE / classifier.
    • Con, Conditional replay: normal prior for the VAE is replaced with multivariate class-conditional Gaussian.
      • Not sure how they sample from this, check the methods.
    • Gat, Gating based on internal context.
      • Gating is only applied to the feedback layers, since for classification ... you don't a priori know the class!
    • Int, Internal replay. This is maybe the most interesting: rather than generating pixels, feedback generates hidden layer activations.
      • First layer of a network is convolutional, dependent on visual feature statistics, and should not change much.
        • Indeed, for CIFAR, they use pre-trained layers.
      • Internal replay proved to be very important!
    • Dist, Soft target labeling of the generated targets; cross-entropy loss when training the classifier on generated samples. Aka distillation.
  • Results suggest that regularization / metaplasticity (keeping memories in parameter space) and replay (keeping memories in function space) are complementary strategies,
    • And that the brain uses both to create and protect memories.
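Conditional replay ('Con' above) swaps the VAE's standard-normal prior for class-conditional Gaussians. A sketch of one plausible reading, assuming the per-class parameters are estimated from latent codes (the paper's actual fitting is inside the VAE objective; names here are mine):

```python
import numpy as np

def fit_class_conditional_prior(latents, labels):
    """Estimate a diagonal Gaussian (mu_c, sigma_c) per class from the
    latent codes of that class, replacing the unconditional N(0, I) prior."""
    priors = {}
    for c in np.unique(labels):
        z = latents[labels == c]
        priors[c] = (z.mean(axis=0), z.std(axis=0) + 1e-6)
    return priors

def sample_replay_latents(priors, class_id, n, rng):
    """Draw latent vectors for a requested class; decoding these yields
    class-specific replay samples instead of unconditional ones."""
    mu, sigma = priors[class_id]
    return mu + sigma * rng.standard_normal((n, mu.shape[0]))
```

The point is that replay can then be directed at particular old classes, rather than hoping the unconditional generator covers them.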

  • When I first read this paper, it came across as a great story -- well thought out, well explained, a good level of detail, and sufficiently supported by data / lesioning experiments.
  • However, looking at the first author's publication record, it seems that he's been at this for >2-3 years ... things take time to do & publish.
  • Folding in of the VAE is satisfying -- taking one function approximator and using it to provide memory for another function approximator.
  • Also satisfying are the neurological inspirations -- and that full feedback to the pixel level was not required!
    • Maybe the hippocampus does work like this, providing high-level feature vectors to the cortex.
    • And it's likely that the cortex has some features of a VAE, e.g. able to perceive and imagine through the same nodes, just run in different directions.
      • The fact that both concepts led to an engineering solution is icing on the cake!

ref: -0 tags: tennenbaum compositional learning character recognition one-shot learning date: 09-29-2020 03:44 gmt revision:1 [0] [head]

One-shot learning by inverting a compositional causal process

  • Brenden Lake, Russ Salakhutdinov, Josh Tenenbaum
  • This is the paper that preceded the 2015 Science publication "Human-level concept learning through probabilistic program induction"
  • Because it's a NIPS paper, and not a science paper, this one is a bit more accessible: the logic to the details and developments is apparent.
  • General idea: build up a fully probabilistic model of multi-language (omniglot corpus) characters / tokens. This model includes things like character type / alphabet, number of strokes, curvature of strokes (parameterized via bezier splines), where strokes attach to others
(spatial relations), stroke scale, and character scale. The model (won't repeat the formal definition) is factorized to be both compositional and causal, though all the details of the conditional probs are likely left to the supplemental material.
  • They fit the complete model to the Omniglot data using gradient descent + image-space noising, e.g. tweak the free parameters of the model to generate images that look like the human-created characters. (This too is in the supplement).
  • Because the model is high-dimensional and hard to invert, they generate a perceptual model by winnowing down the image into a skeleton, then breaking this into a variable number of strokes.
    • The probabilistic model then assigns a log-likelihood to each of the parses.
    • They then use the model with Metropolis-Hastings MCMC to sample a region in parameter space around each parse -- but they sample $\psi$ (the character type) to get a greater weighted diversity of types.
      • Surprisingly, they don't estimate the image likelihood - which is expensive - they here just re-do the parsing based on aggregate info embedded in the statistical model. Clever.
  • $\psi$ is the character type (a, b, c, ...); $\psi = \{ \kappa, S, R \}$, where $\kappa$ is the number of strokes, $S$ is a set of parameterized strokes, and $R$ are the relations between strokes.
  • $\theta$ are the per-token stroke parameters.
  • $I$ is the image, obvi.
  • Classification task: one image of a new character (c) vs 20 new characters from the same alphabet (test, (t)). Among the 20 there is one character of the same type -- the task is to find it.
  • With 'hierarchical bayesian program learning', they not only anneal the type to the parameters (with MCMC, above) for the test image, but they also fit the parameters using gradient descent to the image.
    • Subsequently parses the test image onto the class image (c)
      • Hence the best classification is the one where both are in the best agreement: $\underset{c}{argmax} \frac{P(c|t)}{P(c)} P(t|c)$, where $P(c)$ is approximated as the parse weights.
      • Again, this is clever as it allows significant information leakage between (c) and (t) ...
      • The other models (Affine, Deep Boltzman Machines, Hierarchical Deep Model) have nothing like this -- they are feed-forward.
  • No wonder HBPL performs better. It's a better model of the data, that has a bidirectional fitting routine.
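The MCMC step used to sample around each parse can be sketched with a generic random-walk Metropolis-Hastings routine; `log_prob` stands in for the model's parse log-posterior, and the Gaussian proposal is my assumption (the paper's actual proposal is in the supplement):

```python
import numpy as np

def metropolis_hastings(log_prob, theta0, n_steps, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings: wander the posterior around an
    initial point theta0, accepting proposals by the likelihood ratio."""
    if rng is None:
        rng = np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    lp = log_prob(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_prob(proposal)
        # accept with probability min(1, exp(lp_prop - lp))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
```

Run once per candidate parse, this yields the weighted neighborhood of parameter settings that the classification score then aggregates over.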

  • As i read the paper, had a few vague 'hedons':
    • Model building is essential. But unidirectional models are insufficient; if the models include the mechanism for their own inversion many fitting and inference problems are solved. (Such is my intuition)
      • As a corollary of this, having both forward and backward tags (links) can be used to neatly solve the binding problem. This should be easy in a computer w/ pointers, though in the brain I'm not sure how it might work (?!) without some sort of combinatorial explosion?
    • The fitting process has to be multi-pass or at least re-entrant. Both this paper and the Vicarious CAPTCHA paper feature statistical message passing to infer or estimate hidden explanatory variables. Seems correct.
    • The model here includes relations that are conditional on stroke parameters that occurred / were parsed beforehand; this is very appealing in that the model/generator/AI needs to be flexibly re-entrant to support hierarchical planning ...

ref: -2017 tags: schema networks reinforcement learning atari breakout vicarious date: 09-29-2020 02:32 gmt revision:2 [1] [0] [head]

Schema networks: zero-shot transfer with a generative causal model of intuitive physics

  • Like a lot of papers, the title has more flash than the actual results.
  • Results which would be state of the art (as of 2017) in playing Atari breakout, then transferring performance to modifications of the game (paddle moved up a bit, wall added in the middle of the bricks, brick respawning, juggling).
  • Schema network is based on 'entities' (objects) which have binary 'attributes'. These attributes can include continuous-valued signals, in which case each binary variable is like a place field (I think).
    • This is clever and interesting -- rather than just low-level features pointing to high-level features, this means that high-level entities can have records of low-level features -- an arrow pointing in the opposite direction, one which can (also) be learned.
    • The same idea is present in other Vicarious work, including the CAPTCHA paper and more-recent (and less good) Bio-RNN paper.
  • Entities and attributes are propagated forward in time based on 'ungrounded schemas' -- basically free-floating transition matrices. The grounded schemas are entities and action groups that have evidence in observation.
    • There doesn't seem to be much math describing exactly how this works; only exposition. Or maybe it's all hand-waving over the actual, much simpler math.
      • Get the impression that the authors are reaching for a level of formalism when in fact they just made something that works for the breakout task... I infer Dileep prefers the empirical to the formal, so this is likely primarily the first author.
  • There are no perceptual modules here -- game state is fed to the network directly as entities and attributes (and, to be fair, to the A3C model).
  • Entity-attribute vectors are concatenated into a column vector of length $NT$, where $N$ is the number of entities and $T$ the number of time slices.
    • For each entity of $N$ over time $T$, a row-vector is made of length $MR$, where $M$ is the number of attributes (fixed per task) and $R-1$ is the number of neighbors in a fixed radius. That is, each entity is related to its neighbors' attributes over time.
    • This is a (large, sparse) binary matrix, $X$.
  • $y$ is the vector of actions; the task is to predict actions from $X$.
    • How is X learned?? Very unclear in the paper vs. figure 2.
  • The solution is approximated as $y = X W \bar{1}$, where $W$ is a binary weight matrix.
    • Minimize the solution based on an objective function on the error and the complexity of $W$.
    • This is found via linear programming relaxation. "This procedure monotonically decreases the prediction error of the overall schema network, while increasing its complexity".
      • As it's an issue of binary conjunctions, this seems like a SAT problem!
    • Note that it's not probabilistic: "For this algorithm to work, no contradictions can exist in the input data" -- they instead remove them!
  • Actual behavior includes maximum-product belief propagation, to look for series of transitions that set the reward variable without setting the fail variable.
    • Because the network is loopy, this has to occur several times to set entity variables, and it includes backtracking.
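A minimal sketch of the prediction rule $y = X W \bar{1}$, reading each column of the binary matrix W as one schema (a conjunction of required features) and the overall prediction as an OR over schemas; the firing rule below is my reading of the exposition:

```python
import numpy as np

def schema_predict(X, W):
    """Each column of binary W is one schema: a conjunction of input
    features. A schema fires on a row of binary X when every feature it
    requires is present; the prediction is an OR over all schemas."""
    missing = (1 - X) @ W   # required-but-absent feature count, per schema
    fires = (missing == 0) & (W.sum(axis=0) > 0)  # conjunction satisfied
    return fires.any(axis=1).astype(int)
```

The LP-relaxation step in the paper is the hard part: choosing which sparse conjunctions to add to W so prediction error drops while complexity stays low.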

  • Have there been any further papers exploring schema networks? What happened to this?
  • The later paper from Vicarious on zero-shot task transfer are rather less interesting (to me) than this.

ref: -2005 tags: dimensionality reduction contrastive gradient descent date: 09-13-2020 02:49 gmt revision:2 [1] [0] [head]

Dimensionality reduction by learning an invariant mapping

  • Raia Hadsell, Sumit Chopra, Yann LeCun
  • Central idea: learn an invariant mapping of the input by minimizing mapped distance (e.g. the distance between outputs) when the samples are categorized as the same (same numbers in MNIST, e.g.), and maximizing mapped distance when the samples are categorized as different.
    • Two loss functions for same vs different.
  • This is an attraction-repulsion spring analogy.
  • Use gradient descent to change the weights to satisfy these two competing losses.
  • Resulting convolutional neural nets can extract camera pose information from the NORB dataset.
  • Surprising how simple analogies like this, when iterated across a great many samples, pull out intuitively correct invariances.
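The attraction-repulsion pair of losses is compact enough to write out directly (a sketch of the contrastive loss; the margin value is arbitrary here):

```python
import numpy as np

def contrastive_loss(z1, z2, same, margin=1.0):
    """Spring analogy on mapped outputs z1, z2: attract same-class pairs
    (loss = D^2 / 2), repel different-class pairs inside a margin
    (loss = max(0, margin - D)^2 / 2); zero once they are pushed apart."""
    d = np.linalg.norm(z1 - z2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Gradient descent on the mapping's weights under these two losses is the whole training procedure.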

ref: -2004 tags: neural synchrony binding robot date: 09-13-2020 02:00 gmt revision:0 [head]

PMID-15142952 Visual binding through reentrant connectivity and dynamic synchronization in a brain-based device

  • Controlled a robot with a complete (for the time) model of the occipital-inferotemporal visual pathway (V1 V2 V4 IT), auditory cortex, colliculus, 'value cortex'.
  • Synapses had a timing-dependent associative BCM learning rule
  • Robot had reflexes to orient toward preferred auditory stimuli
  • Subsequently, robot 'learned' to orient toward a preferred stimulus (e.g. one that caused orientation).
  • Visual stimuli were either diamonds or squares, either red or green.
    • Discrimination task could have been carried out by (it seems) one perceptron layer.
  • This was 16 years ago, and the results look quaint compared to the modern deep-learning revolution. That said, 'the binding problem' is imho still outstanding or at least interesting. Actual human perception is far more compositional than a deep CNN can support.

ref: -2020 tags: Neuralink commentary BMI pigs date: 08-31-2020 18:01 gmt revision:1 [0] [head]

Neuralink progress update August 28 2020

Some commentary.

The good:

  • Ian hit the nail on the head @ 1:05:47. That is not a side-benefit -- that was the original and true purpose. Thank you.
  • The electronics, amplify / record / sort / stim ASIC, as well as interconnect all advance the state of the art in density, power efficiency, and capability. (I always liked higher sampling rates, but w/e)
  • Puck is an ideal form factor, again SOTA. 25mm diameter craniotomy should give plenty of space for 32 x 32-channel depth electrodes (say).
  • I would estimate that the high-density per electrode feed-through is also SOTA, but it might also be a non-hermetic pass-through via the thin-film (e.g. some water vapor diffusion along the length of the polyimide (if that polymer is being used)).
  • Robot looks nice dressed in those fancy robes. Also looks like there is a revolute joint along the coronal axis.
  • Stim on every channel is cool.
  • Pigs seem like an ethical substitute for monkeys.

The mixed:

  • Neurons are not wires.
  • $2000 outpatient neurosurgery?! Will need to address the ~3% complication rate for most neurosurgery.
  • Where is the monkey data? Does it not work in monkeys? Insufficient longevity or yield? Was it strategic to not mention any monkeys, to avoid bad PR or the wrath of PETA?
    • I can't imagine getting into humans without demonstrating both safety and effectiveness on monkeys. Pigs are fine for the safety part, but monkeys are the present standard for efficacy.
  • How long do the electrodes last in pigs? What is the recording quality? How stable are the traces?
    • Judging from the commentary, assume this is an electrode material problem? What does Neuralink do if they are not significantly different in yield and longevity than the Utah array? (The other problems might well be easier than this one.)
      • That said, a thousand channels of EMG should be sufficient for some of the intended applications (below).
    • It really remains to be seen how well the brain tolerates these somewhat-large somewhat-thin electrodes, what percentage of the brain is disrupted in the process of insertion, and how much of the disruption is transient / how much is irrecoverable.
    • Pig-snout somatosensory cortex is an unusual recording location, making comparison difficult, but what was shown seemed rather correlated (?) We'd have to read an actual scientific publication to evaluate.
  • This slide is deceptive, as not all the applications are equally .. applicable. You don't need an extracellular ephys device to solve these problems that "almost everyone" will encounter over the course of their lives.
    • Memory loss -- Probably better dealt with via cellular / biological therapies, or treating the causes (stroke, infection, inflammation, neuroendocrine or neuromodulatory dysregulation)
    • Hearing loss -- Reasonable. Nice complement to improved cochlear implants too. (Maybe the Neuralink ASIC could be used for that, too).
      • With this and the other reasonable applications, best to keep in context that stereo EEG, which is fairly disruptive w/ large probes, is well tolerated in epilepsy patients. (It has unclear effect on IQ or memory, but still, the sewing machine should be less invasive.)
    • Blindness -- Reasonable. Mating the puck to a Second Sight style thin film would improve channel count dramatically, and be less invasive. Otherwise you have to sew into the calcarine fissure, destroying a fair bit of cortex in the process & possibly hitting an artery or sulcal vein.
    • Paralysis -- Absolutely. This application is well demonstrated, and the Neuralink device should be able to help SCI patients. Presumably this will occupy them for the next five years; other applications would be a distraction.
      • Being able to sew flexible electrodes into the spinal cord is a great application.
    • Depression -- Need deeper targets for this. Research to treat depression via basal ganglia stim is ongoing; no reason it could not be mated to the Neuralink puck + long electrodes.
    • Insomnia -- I guess?
    • Extreme pain -- Simpler approaches are likely better, but sure?
    • Seizures -- Yes, but note that Neuropace burned through $250M and wasn't significantly better than sham surgery. Again, likely better dealt with biologically: recombinant ion channels, glial or interneuron stem cell therapy.
    • Anxiety -- maybe? Designer drugs seem safer. Or drugs + CBT. Elon likes root causes: spotlight on the structural ills of our society.
    • Addiction -- Yes. It seems possible to rewire the brain with the right record / stim strategy, via for example a combination of DBS and cortical recording. Social restructuring is again a better root-cause fix.
    • Strokes -- No, despite best efforts, the robot causes (small) strokes.
    • Brain Damage -- Insertion of electrodes causes brain damage. Again, better dealt with via cellular (e.g. stem cells) or biological approaches.
      • This, of course, will take time as our understanding of brain development is limited; the good thing is that sufficient guidance signals remain in the adult brain, so AFAIK it's possible. From his comments, seems Alan's attitude is more aligned with this.
    • Not really bad per se, but the right panel could be better. I assume this was a design decision trade-off between working distance, NA, illumination, and mechanical constraints.
    • Despite Elon's claims, there is always bleeding when you poke electrodes that large into the cortex; the capillary bed is too dense. Let's assume Elon meant 'macro' bleeding, which is true. At least the robot avoids visible vessels.
    • Predicting joint angles for cyclical behavior is not challenging; can be done with EMG or microphonic noise correlated to some part of the gait. Hence the request for monkey BMI data.
  • Given the risk, pretty much any of the "sci-fi" applications mentioned in response to dorky twitter comments can be better provided to neurologically normal people through electronics, without the risk of brain surgery.
  • Regarding sci-fi application linguistic telepathy:
    • First, agreed, clarifying thoughts into language takes effort. This is a mostly unavoidable and largely good task. Interfacing with the external world is a vital part of cognition; shortcutting it, in my estimation, will just lead to sloppy & half-formed ideas not worth communicating. The compression of thoughts into words (as lossy as it may be) is the primary way to make them discrete enough to be meaningful to both other people and yourself.
    • Secondly: speech (or again any of the many other forms of communication) is not that much slower than cognition. If it were, we'd have much larger vocabularies, much more complicated and meaning-conveying grammar, etc. (Like Latin?) The limit is the average person's cognition and memory. I disagree with Elon's conceit.
  • Regarding visual telepathy, with sufficient recording capabilities, I see no reason why you couldn't have a video-out port on the brain. Difficult given the currently mostly unknown representation of higher-level visual cortices, but as Ian says, once you have a good oscilloscope, this can be deduced.
  • Regarding AI symbiosis @1:09:19; this logic is not entirely clear to me. AI is a tool that will automate & facilitate the production and translation of knowledge much the same way electricity etc automated & facilitated the production and transportation of physical goods. We will necessarily need to interface with it, but to the point that we are thoroughly modifying our own development & biology, those interfaces will likely be based on presently extant computer interfaces.
    • If we do start modifying the biological wiring structure of our brains, I can't imagine that there will many limits! (Outside hard metabolic limits that brain vasculature takes pains to allocate and optimize.)
    • So, I guess the central tenet might be vaguely ok if you allow that humans are presently symbiotic with cell phones. (A more realistic interpretation is that cell phones are tools, and maybe Google etc are the symbionts / parasites). This is arguably contributing to current political existential crises -- no need to look further. If you do look further, it's not clear that stabbing the brains of healthy individuals will help.
    • I find the MC to be slightly unctuous and ingratiating in a way appropriate for a video game company, but not for a medical device company. That, of course, is a judgement call & matter of taste. Yet, as this was partly a recruiting event ... you will find who you set the table for.

ref: -0 tags: synaptic plasticity 2-photon imaging inhibition excitation spines dendrites synapses 2p date: 08-14-2020 01:35 gmt revision:3 [2] [1] [0] [head]

PMID-22542188 Clustered dynamics of inhibitory synapses and dendritic spines in the adult neocortex.

  • Cre-recombinase-dependent labeling of postsynaptic scaffolding via a Gephyrin-Teal fluorophore fusion.
  • Also added Cre-eYFP to label the neurons
  • Electroporated in utero into e16 mice.
    • Low concentration of Cre, high concentrations of the Gephyrin-Teal and Cre-eYFP constructs to attain sparse labeling.
  • Located the same dendrite imaged in-vivo in fixed tissue - !! - using serial-section electron microscopy.
  • 2230 dendritic spines and 1211 inhibitory synapses from 83 dendritic segments in 14 cells of 6 animals.
  • Some spines had inhibitory synapses on them -- 0.7 / 10um, vs 4.4 / 10um dendrite for excitatory spines. ~ 1.7 inhibitory
  • Suggest that the data support the idea that inhibitory inputs maybe gating excitation.
  • Furthermore, co-innervated spines are stable, both during normal experience and during monocular deprivation.
  • Monocular deprivation induces a pronounced loss of inhibitory synapses in binocular cortex.

ref: -2013 tags: 2p two photon STED super resolution microscope synapse synaptic plasticity date: 08-14-2020 01:34 gmt revision:3 [2] [1] [0] [head]

PMID-23442956 Two-Photon Excitation STED Microscopy in Two Colors in Acute Brain Slices

  • Plenty of details on how they set up the microscope.
  • Mice: Thy1-eYFP (some excitatory cells in the hippocampus and cortex) and CX3CR1-eGFP (GFP in microglia). Crossbred the two strains for two-color imaging.
  • Animals were 21-40 days old at slicing.

PMID-29932052 Chronic 2P-STED imaging reveals high turnover of spines in the hippocampus in vivo

  • As above, Thy1-GFP / Thy1-YFP labeling; hence this was a structural study (for which the high resolution of STED was necessary).
  • Might just as well gone with synaptic labels, e.g. tdTomato-Synapsin.

ref: -0 tags: synaptic plasticity LTP LTD synapses NMDA glutamate uncaging date: 08-11-2020 22:40 gmt revision:0 [head]

PMID-31780899 Single Synapse LTP: A matter of context?

  • Not a great name for a thorough and reasonably well-written review of glutamate uncaging studies as related to LTP (and to a lesser extent LTD).
  • Lots of references from many familiar names. Nice to have them all in one place!
  • I'm left wondering, between CaMKII, PKA, PKC, Ras, other GTP dependent molecules -- how much of the regulatory network in synapse is known? E.g. if you pull down all proteins in the synaptosome & their interacting partners, how many are unknown, or have an unknown function? I know something like this has been done for flies, but in mammals - ?

ref: -0 tags: GEVI review voltage sensor date: 08-10-2020 22:22 gmt revision:24 [23] [22] [21] [20] [19] [18] [head]

Various GEVIs invented and evolved:

Ace-FRET sensors

  • PMID-26586188 Ace-mNeonGreen, an opsin-FRET sensor, might still be better in terms of SNR, but it's green.
    • Negative $\Delta F / F$ with depolarization.
    • Fast enough to resolve spikes.
    • Rational design; little or no screening.
    • Ace is about six times as fast as Mac, and mNeonGreen has a ~50% higher extinction coefficient than mCitrine and nearly threefold better photostability (12)

  • PMID-31685893 A High-speed, red fluorescent voltage sensor to detect neural activity
    • Fusion of Ace2N + short linker + mScarlet, a bright (if not the brightest; highest QY) monomeric red fluorescent protein.
    • Almost as good SNR as Ace2N-mNeonGreen.
    • Also a FRET sensor; negative delta F with depolarization.
    • Ace2N-mNeon is not sensitive under two-photon illumination; presumably this is true of all eFRET sensors?
    • Ace2N drives almost no photocurrent.
    • Sought to maximize SNR: dF/F_0 × sqrt(F_0); screened 'only' 18 linkers to see what worked the best. Yet - it's better than VARNAM.
    • ~ 14% dF/F per 100mV depolarization.
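The screening objective dF/F_0 × sqrt(F_0) is just the shot-noise-limited SNR; a one-liner (function name mine) makes the brightness-vs-sensitivity trade explicit:

```python
import math

def shot_noise_snr(dff, f0):
    """Shot-noise-limited SNR of a fluorescence transient: signal is
    dff * f0 photons, noise is ~sqrt(f0) photons, so SNR = dff * sqrt(f0)."""
    return dff * math.sqrt(f0)
```

Halving dF/F while quadrupling the photon budget leaves the detection SNR unchanged -- which is why a dimmer-but-more-sensitive probe can lose to a brighter one.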

Arch and Mac rhodopsin sensors

  • PMID-22120467 Optical recording of action potentials in mammalian neurons using a microbial rhodopsin Arch 2011
    • Endogenous fluorescence of the retinal (+ environment) of microbial rhodopsin protein Archaerhodopsin 3 (Arch) from Halorubrum sodomense.
    • Proton pump without proton pumping capabilities also showed voltage dependence, but slower kinetics.
      • This required one mutation, D95N.
    • Requires fairly intense illumination, as the QY of the fluorophore is low (9 x 10-4). Still, photobleaching rate was relatively low.
    • Arch is mainly used for neuronal inhibition.

  • PMID-25222271 Archaerhodopsin Variants with Enhanced Voltage Sensitive Fluorescence in Mammalian and Caenorhabditis elegans Neurons Archer1 2014
    • Capable of voltage sensing under red light, and inhibition (via proton pumping) under green light.
    • Note The high laser power used to excite Arch (above) fluorescence causes significant autofluorescence in intact tissue and limits its accessibility for widespread use.
    • Archers have 3-5x the fluorescence of WT Arch -- so, QY of ~3.6e-3. Still very dim.
    • Archer1 dF/F_0 85%; Archer2 dF/F_0 60% @ 100mV depolarization (positive sense).
    • Screened the proton pump of Gloeobacter violaceus rhodopsin; found mutations were then transferred to Arch.
      • Maybe they were planning on using the Gloeobacter rhodopsin, but it didn't work for some reason, so they transferred to Arch.
    • TS and ER export domains for localization.

  • PMID-24755708 Imaging neural spiking in brain tissue using FRET-opsin protein voltage sensors MacQ-mOrange and MacQ-mCitrine.
    • L. maculans (Mac) rhodopsin (faster than Arch) + FP mCitrine, FRET sensor + ER/TS.
    • Four-fold faster kinetics and 2-4x brighter than ArcLight.
      • No directed evolution to optimize sensitivity or brightness. Just kept the linker short & trimmed residues based on crystal structure.
    • ~5% delta F/F, can resolve spikes up to 10Hz.
    • Spectroscopic studies of the proton pumping photocycle in bacteriorhodopsin and Archaerhodopsin (Arch) have revealed that proton translocation through the retinal Schiff base changes chromophore absorption [24-26]
    • Used rational design to abolish the proton current (D139N and D139Q aka MacQ) ; screens to adjust the voltage sensing kinetics.
    • Still has photocurrents.
    • Seems that slice / in vivo is consistently worse than cultured neurons... in purkinje neurons, dF/F 1.2%, even though in vitro response was ~ 15% to a 100mV depolarization.
    • Imaging intensity 30mw/mm^2. (3W/cm^2)

  • PMID-24952910 All-optical electrophysiology in mammalian neurons using engineered microbial rhodopsins QuasAr1 and QuasAr2 2014
    • Directed evolution approach to improve the brightness and speed of Arch D95N.
      • Improved the fluorescence QY by 19 and 10x. (1 and 2, respectively -- Quasar2 has higher sensitivity).
    • Also developed a low-intensity channelrhodopsin, CheRiff, which can be activated by blue light (lambda max = 460 nm) dim enough to not affect QuasAr.
    • They call the two of them 'Optopatch 2'.
    • Incident light intensity 1kW / cm^2 (!)

  • PMID-29483642 A robotic multidimensional directed evolution approach applied to fluorescent voltage reporters. Archon1 2018
    • Started with QuasAr2 (above), which was evolved from Arch. Intrinsic fluorescence of retinal in rhodopsin.
    • Expressed in HEK293T cells; then FACS, robotic cell picking, whole genome amplification, PCR, cloning.
    • Also evolved miRFP, deep red fluorescent protein based on bacteriophytochrome.
    • delta F/F of 80 and 20% with a 100mV depolarization.
    • We investigated the contribution of specific point mutations to changes in localization, brightness, voltage sensitivity and kinetics and found the patterns that emerged to be complex (Supplementary Table 6), with a given mutation often improving one parameter but worsening another.
    • If the original QY of Arch was 9e-4, and Quasar2 improved this by 10, and Archon1 improved this by 2.3x, then the QY of Archon1 is 0.02. Given the molar extinction coefficient is ~ 50000 for retinal, this means the brightness of the fluorescent probe is low, 1. (good fluorescent proteins and synthetic dyes have a brightness of ~90).
  • Imaged using 637nm laser light at 800mW/mm2 for Archon1 and Archon2; emission filtered through 664LP
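The back-of-envelope brightness estimate above can be written out explicitly. All numbers are the ones quoted in the notes (QY of Arch ~9e-4, ~10x from QuasAr2, ~2.3x more from Archon1, molar extinction ~50,000 M^-1 cm^-1 for retinal); this is a sanity check of the arithmetic, not data from the paper.

```python
# Back-of-envelope: quantum yield and brightness of Archon1,
# using the estimates quoted in the notes above.
qy_arch = 9e-4                   # intrinsic QY of Arch retinal fluorescence
qy_quasar2 = qy_arch * 10        # QuasAr2: ~10x improvement
qy_archon1 = qy_quasar2 * 2.3    # Archon1: further ~2.3x
extinction = 50_000              # M^-1 cm^-1, typical for retinal

# Brightness = QY * extinction coefficient, here in mM^-1 cm^-1
brightness = qy_archon1 * extinction / 1000
print(f"QY(Archon1) ~ {qy_archon1:.3f}, brightness ~ {brightness:.2f} mM^-1 cm^-1")
# Per the notes, good fluorescent proteins / synthetic dyes are ~90 in these units.
```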

VSD - FP sensors

  • PMID-28811673 Improving a genetically encoded voltage indicator by modifying the cytoplasmic charge composition Bongwoori 2017
    • ArcLight derivative.
    • Arginine (positive charge) scanning mutagenesis of the linker region improved the signal size of the GEVI, Bongwoori, yielding fluorescent signals as high as 20% ΔF/F during the firing of action potentials.
    • Used the mutagenesis to shift the threshold for fluorescence change more negative, ~ -30mV.
    • Like ArcLight, it's slow.
    • Strong baseline shift due to the acidification of the neuron during AP firing (!)

  • Attenuation of synaptic potentials in dendritic spines
    • Found that SNR / dF/F_0 is limited by intracellular localization of the sensor.
      • This is true even though ArcLight is supposed to be in a dark state at the lower pH of intracellular organelles... a problem worth considering.
      • Makes negative-going GEVIs more practical, as those not in the membrane are dark @ 0mV.

  • Fast two-photon volumetric imaging of an improved voltage indicator reveals electrical activity in deeply located neurons in the awake brain ASAP3 2018
    • Opsin-based GEVIs have been used in vivo with 1p excitation to report electrical activity of superficial neurons, but their responsivity is attenuated for 2p excitation. (!)
    • Site-directed evolution in HEK cells.
    • Expressed linear PCR products directly in the HEK cells, with no assembly / ligation required! (Saves lots of time: normally need to amplify, assemble into a plasmid, transfect, culture, measure, purify the plasmid, digest, EP PCR, etc).
    • Screened in a motorized 384-well conductive plate; an electroporation electrode sequentially stimulates each well on an upright microscope.
    • 46% improvement over ASAP2 R414Q
    • Ace2N-4aa-mNeon is not responsive under 2p illumination; nor are Archon1 or QuasAr2/3
    • ULOVE = AOD-based fast local-scanning two-photon random-access scope.

  • Bright and tunable far-red chemigenetic indicators
    • GgVSD (same as ASAP above) + cp HaloTag + Si-Rhodamine JF635
    • ~ 4% dF/F_0 during APs.
    • Found one mutation, R476G, in the linker between cpHaloTag and S4 of the VSD, which doubled the sensitivity of HASAP.
    • Also tested an ArcLight-type structure, CiVSD fused to HaloTag.
      • HArcLight had a negative dF/F_0 and ~3% change in response to APs.
    • No voltage sensitivity when the synthetic dye was largely in the zwitterionic form, e.g. tetramethylrhodamine.

hide / / print
ref: -2015 tags: spiking neural networks causality inference demixing date: 07-22-2020 18:13 gmt revision:1 [0] [head]

PMID-26621426 Causal Inference and Explaining Away in a Spiking Network

  • Rubén Moreno-Bote & Jan Drugowitsch
  • Use linear non-negative mixing plus noise to generate a series of sensory stimuli.
  • Pass these through a one-layer spiking or non-spiking neural network with adaptive global inhibition and adaptive reset voltage to solve this quadratic programming problem with non-negative constraints.
  • N causes, one observation: $\mu = \sum_{i=1}^{N} u_i r_i + \epsilon$
    • $r_i \geq 0$ -- causes can be present or not present, but not negative.
    • cause coefficients drawn from a truncated (positive only) Gaussian.
  • linear spiking network with symmetric weight matrix $J = -U^T U - \beta I$ (see figure above)
    • That is ... J looks like a correlation matrix!
    • $U$ is M x N; columns are the mixing vectors.
    • U is known beforehand and not learned
      • That said, as a quasi-correlation matrix, it might not be so hard to learn. See ref [44].
  • Can solve this problem by minimizing the negative log-posterior function: $$ L(\mu, r) = \frac{1}{2}(\mu - Ur)^T(\mu - Ur) + \alpha1^Tr + \frac{\beta}{2}r^Tr $$
    • That is, want to maximize the joint probability of the data and observations given the probabilistic model $p(\mu, r) \propto \exp(-L(\mu, r)) \Pi_{i=1}^{N} H(r_i)$
    • First term quadratically penalizes difference between prediction and measurement.
    • Second term: $\alpha$ is an L1 regularization weight; third term: $\beta$ weights an L2 regularization.
  • The negative log-likelihood is then converted to an energy function (linear algebra): with $W = U^T U$ and $h = U^T \mu$, $E(r) = 0.5 r^T W r - r^T h + \alpha 1^T r + 0.5 \beta r^T r$
    • This is where they get the weight matrix $J = -W - \beta I$. If the columns of U are linearly independent, J is negative definite.
  • The dynamics of individual neurons w/ global inhibition and variable reset voltage serves to minimize this energy -- hence, solve the problem. (They gloss over this derivation in the main text).
  • Next, show that a spike-based network can similarly 'relax' or descend the objective gradient to arrive at the quadratic programming solution.
    • Network is N leaky integrate and fire neurons, with variable synaptic integration kernels.
    • $\alpha$ then translates to global inhibition, and $\beta$ to a lowered reset voltage.
  • Yes, it can solve the problem .. and do so in the presence of firing noise in a finite period of time .. but a little bit meh, because the problem is not that hard, and there is no learning in the network.
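A minimal sketch of the optimization the network dynamics perform: projected gradient descent on the energy above, with the non-negativity constraint enforced by clipping. Pure Python; the mixing matrix U, alpha, beta, and learning rate here are toy values, not the paper's, and gradient descent stands in for the leaky integrate-and-fire dynamics.

```python
# Toy non-negative quadratic program of the paper's form:
#   minimize E(r) = 0.5||mu - U r||^2 + alpha*sum(r) + 0.5*beta*||r||^2,  r >= 0
# solved by projected gradient descent (a stand-in for the network dynamics).

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def solve_nnqp(U, mu, alpha=0.1, beta=0.01, lr=0.05, steps=2000):
    n = len(U[0])
    Ut = [[U[i][j] for i in range(len(U))] for j in range(n)]  # transpose
    r = [0.0] * n
    for _ in range(steps):
        resid = [p - m for p, m in zip(matvec(U, r), mu)]      # U r - mu
        grad = [g + alpha + beta * ri for g, ri in zip(matvec(Ut, resid), r)]
        r = [max(0.0, ri - lr * gi) for ri, gi in zip(r, grad)]  # project r >= 0
    return r

# Two causes, two observations; only cause 0 is actually present.
U = [[1.0, 0.5],
     [0.0, 1.0]]
mu = matvec(U, [2.0, 0.0])   # observation generated by r_true = [2, 0]
r_hat = solve_nnqp(U, mu)
print(r_hat)  # r_hat[0] close to 2 (shrunk slightly by the regularizers); r_hat[1] pinned at 0
```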

hide / / print
ref: -2017 tags: GraphSAGE graph neural network GNN date: 07-16-2020 15:49 gmt revision:2 [1] [0] [head]

Inductive representation learning on large graphs

  • William L. Hamilton, Rex Ying, Jure Leskovec
  • Problem: given a graph where each node has a set of (possibly varied) attributes, create an 'embedding' vector at each node that describes both the node and the network that surrounds it.
  • To this point (2017) there were two ways of doing this -- through matrix factorization methods, and through graph convolutional networks.
    • The matrix factorization methods or spectral methods (similar to multi-dimensional scaling, where points are projected onto a plane to preserve a distance metric) are transductive : they work entirely within-data, and don't directly generalize to new data.
      • This is parsimonious in some sense, but doesn't work well in the real world, where datasets are constantly changing and frequently growing.
  • Their approach is similar to graph convolutional networks, where (I think) the convolution is indexed by node distances.
  • General idea: each node starts out with an embedding vector = its attribute or feature vector.
  • Then, all neighboring nodes are aggregated by sampling a fixed number of the nearest neighbors (fixed for computational reasons).
    • Aggregation can be mean aggregation, LSTM aggregation (on random permutations of the neighbor nodes), or MLP -> nonlinearity -> max-pooling. Pooling has the most wins, though all seem to work...
  • The aggregated vector is concatenated with the current node feature vector, and this is fed through a learned weighting matrix and nonlinearity to output the feature vector for the current pass.
  • Passes proceed from out-in... I think.
  • Algorithm is inspired by the Weisfeiler-Lehman Isomorphism Test, which updates neighbor counts per node to estimate if graphs are isomorphic. They do a similar thing here, only with vectors not scalars, and similarly take into account the local graph structure.
    • All the aggregator functions, and of course the nonlinearities and weighting matrices, are differentiable -- so the structure is trained in a supervised way with SGD.
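The aggregate-concatenate-transform step above can be sketched in a few lines. Pure Python; an identity "weight matrix" plus ReLU stands in for the trained parameters, and the fixed-size neighbor sampling is the only other moving part.

```python
import random

# One GraphSAGE-style pass with mean aggregation.
# h[v] starts as node v's feature vector; each pass mixes in the mean
# of a fixed-size sample of the neighbors' vectors. The learned weight
# matrix and nonlinearity are replaced by identity + ReLU here.

def mean_agg(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sage_pass(graph, h, sample_size=2, seed=0):
    rng = random.Random(seed)
    h_new = {}
    for v, nbrs in graph.items():
        sampled = rng.sample(nbrs, min(sample_size, len(nbrs)))
        agg = mean_agg([h[u] for u in sampled])
        concat = h[v] + agg                       # concatenate self + aggregate
        h_new[v] = [max(0.0, x) for x in concat]  # stand-in for ReLU(W @ concat)
    return h_new

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
h = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
h1 = sage_pass(graph, h)
print(h1["b"])  # b's own features followed by the mean of its sampled neighbors'
```

Stacking k such passes gives each node a receptive field k hops out, which is the "graph convolution" analogy in the paper.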

This is a well-put together paper, with some proofs of convergence etc -- but it still feels only lightly tested. As with many of these papers, could benefit from a positive control, where the generating function is known & you can see how well the algorithm discovers it.

Otherwise, the structure / algorithm feels rather intuitive; surprising to me that it was not developed before the matrix factorization methods.

Worth comparing this to word2vec embeddings, where local words are used to predict the current word & the resulting vector in the neck-down of the NN is the representation.

hide / / print
ref: -0 tags: bleaching STED dye phosphorus japan date: 07-16-2020 14:06 gmt revision:1 [0] [head]

Super-Photostable Phosphole-Based Dye for Multiple-Acquisition Stimulated Emission Depletion Imaging

  • Use the electron withdrawing ability of a phosphole group (P = O) to reduce photobleaching
  • Derived from another photostable dye, C-Naphox, only with a different mechanism of fluorescence -- pi-pi* transfer rather than intramolecular charge transfer (ICT).
  • Much more stable than Alexa 488 (aka sulfonated fluorescein, which is not the most stable dye..)
  • Suitable for multiple STED images, unlike the other dyes. (Note!)

hide / / print
ref: -2011 tags: two photon cross section fluorescent protein photobleaching Drobizhev date: 07-10-2020 21:09 gmt revision:8 [7] [6] [5] [4] [3] [2] [head]

PMID-21527931 Two-photon absorption properties of fluorescent proteins

  • Significant 2-photon cross section of red fluorescent proteins (same chromophore as DsRed) in the 700 - 770nm range, accessible to Ti:sapphire lasers ...
    • This corresponds to a $S_0 \rightarrow S_n$ transition
    • But photobleaching is an order of magnitude slower when excited by the direct $S_0 \rightarrow S_1$ transition (though the fluorophores can be significantly less bright in this regime).
      • Quote: the photobleaching of DsRed slows down by an order of magnitude when the excitation wavelength is shifted to the red, from 750 to 950 nm (32).
    • See also PMID-18027924
  • Further work by same authors: Absolute Two-Photon Absorption Spectra and Two-Photon Brightness of Orange and Red Fluorescent Proteins
    • " TagRFP possesses the highest two-photon cross section, σ2 = 315 GM, and brightness, σ2φ = 130 GM, where φ is the fluorescence quantum yield. At longer wavelengths, 1000–1100 nm, tdTomato has the largest values, σ2 = 216 GM and σ2φ = 120 GM, per protein chain. Compared to the benchmark EGFP, these proteins present 3–4 times improvement in two-photon brightness."
    • "Single-photon properties of the FPs are poor predictors of which fluorescent proteins will be optimal in two-photon applications. It follows that additional mutagenesis efforts to improve two-photon cross section will benefit the field."
  • 2P cross-section in both the 700-800nm and 1000-1100 nm range corresponds to the chromophore polarizability, and is not related to 1p cross section.
  • This can be useful for multicolor imaging: excitation of the higher S0 → Sn transition of TagRFP simultaneously with the first, S0 → S1, transition of mKalama1 makes dual-color two-photon imaging possible with a single excitation laser wavelength (13)
  • Why are red GECIs based on mApple (rGECO1) or mRuby (RCaMP)? dsRed2 or TagRFP are much better .. but maybe they don't have CP variants.
  • from https://elifesciences.org/articles/12727
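The quoted cross sections and two-photon brightnesses pin down the quantum yields, since brightness = σ2·φ. A quick check with the numbers quoted above (values from the quoted abstract, solved for φ here):

```python
# Two-photon brightness = sigma2 * phi (fluorescence quantum yield), in GM.
# Solving for phi from the values quoted above:
sigma2_tagrfp, brightness_tagrfp = 315.0, 130.0   # GM
sigma2_tdtom, brightness_tdtom = 216.0, 120.0     # GM, per protein chain

phi_tagrfp = brightness_tagrfp / sigma2_tagrfp    # ~0.41
phi_tdtom = brightness_tdtom / sigma2_tdtom       # ~0.56
print(round(phi_tagrfp, 2), round(phi_tdtom, 2))
```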

hide / / print
ref: -0 tags: constitutional law supreme court date: 06-03-2020 01:40 gmt revision:0 [head]

Spent a while this evening reading about Qualified Immunity -- the law that permits government officials (e.g. police officers) immunity when 'doing their jobs'. It's perhaps one root of the George Floyd / racism protests, as it has set a precedent that US police can be violent and get away with it. (This is also related to police unions and collective liability loops... anyway)

The supreme court has the option to take cases challenging the constitutionality of Qualified Immunity, which many on both sides of the political spectrum want them to do.

It 'got' this power via Marbury vs. Madison. M v. M is self-referential genius:

  • They ruled the original action (blocking an appointment) was illegal
  • but the court does not have the power to make these decisions
  • because the congressional law that gave the Supreme Court that power was unconstitutional.
  • Instead, the supreme court has the power to decide if laws (in this case, those governing its jurisdiction) are constitutional.
  • E.g. SCOTUS initiated judicial review & expansion of its jurisdiction over Congressional law by striking down a law from Congress that expanded its jurisdiction.
  • This was also done while threading the needle to satisfy then-present political pressure (Thomas Jefferson, who wanted the original appointment blocked), so that those in power were aligned with the increase in court power, and the precedent could persist.

As a person curious how systems gain complexity and feedback loops ... so much nerdgasm.

hide / / print
ref: -0 tags: rutherford journal computational theory neumann complexity wolfram date: 05-05-2020 18:15 gmt revision:0 [head]

The Structures for Computation and the Mathematical Structure of Nature

  • Broad, long, historical.

hide / / print
ref: -2017 tags: google deepmind compositional variational autoencoder date: 04-08-2020 01:16 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

SCAN: learning hierarchical compositional concepts

  • From DeepMind, first version Jul 2017 / v3 June 2018.
  • Starts broad and strong:
      • "The seemingly infinite diversity of the natural world [arises] from a relatively small set of coherent rules"
      • Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details..
    • "We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
    • "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
    • "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication."
    • This addresses the limitations of deep learning, which is overly data-hungry (low sample efficiency), tends to overfit the data, and requires human supervision.
  • Approach:
    • Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
    • Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the beta-VAE) that the examples have in common.
      • E.g. this is purely associative learning, with a finite one-layer association matrix.
    • Test in both the image-to-symbol and symbol-to-image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper..)
    • Add in a third module, which allows learning of compositions of the features, ala set notation: AND ($\cup$), IN-COMMON ($\cap$) & IGNORE ($\setminus$ or '-'). This is via a low-parameter convolutional model.
  • Notation:
    • $q_{\phi}(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
    • $p_{\theta}(x|z_x)$ is the decoder model. $x \propto p_{\theta}(x|z_x)$, $\theta$ are the decoder parameters. $x$ is now the reconstructed scene.
  • From this, the loss function of the beta-VAE is:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x) || p(z_x))$ where $\beta \gt 1$
      • That is, maximize the auto-encoder fit (the expectation of the decoder over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$
        • $p(z) \propto \mathcal{N}(0, I)$ -- diagonal normal.
        • $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
        • $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x)||p(z)) \lt \epsilon$ where D is the domain of images etc.
      • Claim that this loss function tips the scale too far away from accurate reconstruction when there is sufficient visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the denoising autoencoder approach, which uses the feature L2 norm instead of the pixel log-likelihood:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_{\phi}(z_x|x)}||J(\hat{x}) - J(x)||_2^2 - \beta D_{KL} (q_{\phi}(z_x|x) || p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
      • This $J(x)$ is from another neural network (transfer learning) which learns features beforehand.
      • It's a multilayer perceptron denoising autoencoder [Vincent 2010].
  • The SCAN architecture includes an additional element, another VAE which is trained simultaneously on the labeled inputs $y$ and the latent outputs $z_x$ from the encoder given $x$.
  • In this way, they can present a description $y$ to the network, which is encoded into $z_y$, which then produces an image $\hat{x}$.
    • The whole network is trained by minimizing:
    • $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$
      • 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y} (y|z_y)]$, the log-likelihood of the decoded symbols given encoded latents $z_y$
      • 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) || p(z_y))$, a weighted KL divergence between the encoded latents and the diagonal normal prior.
      • 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) || q_{\phi_y}(z_y|y))$, a weighted KL divergence between the latents from the images and the latents from the description $y$.
        • They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
  • Final element! A convolutional recombination element, implemented as a tensor product between $z_{y_1}$ and $z_{y_2}$ that outputs a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix.
    • I don't think this is great shakes. Could have done this with a small function; no need for a neural network.
    • Trained with very similar loss function as SCAN or the beta-VAE.
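The KL terms in the losses above have a closed form when both distributions are diagonal Gaussians, which is why the $\beta$-weighted penalty is cheap to compute. A quick pure-Python sketch of $D_{KL}(q(z|x) || \mathcal{N}(0, I))$ (the latent values and $\beta$ below are toy numbers, not from the paper):

```python
import math

# Closed-form KL divergence between a diagonal Gaussian q = N(mu, diag(sigma^2))
# and the standard normal prior p = N(0, I), as appears in the beta-VAE loss:
#   D_KL = 0.5 * sum_i (mu_i^2 + sigma_i^2 - log(sigma_i^2) - 1)

def kl_to_standard_normal(mu, sigma):
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

# q matching the prior exactly gives zero KL; beta > 1 upweights this
# penalty, pressuring the encoder toward prior-like (disentangled) latents.
beta = 4.0
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))        # exactly 0.0
print(beta * kl_to_standard_normal([1.0, 0.0], [1.0, 0.5]))
```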

  • Testing:
  • They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implemented easily, e.g. with a single-pixel detector for the wall color. Quite disappointing.
  • This is marginally more interesting -- the network learns to eliminate latent factors as it's exposed to examples (just like perhaps a Bayesian network.)
  • Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

hide / / print
ref: -2020 tags: evolution neutral drift networks random walk entropy population date: 04-08-2020 00:48 gmt revision:0 [head]

Localization of neutral evolution: selection for mutational robustness and the maximal entropy random walk

  • The take-away of the paper is that, with larger populations, random mutation and recombination make areas of the graph that take several steps to reach (in the figure, Maynard Smith's four-letter mutation word game) less likely to be visited.
  • This is because the recombination serves to make the population adhere more closely to the 'giant' component. In Maynard Smith's game, this is the 2268 words (of 2405 meaningful words) that can be reached from one another by successive single-letter changes.
  • The author extends this to van Nimwegen's 1999 paper on the RNA genotype / secondary-structure map. It's not as bad as Maynard Smith's game, but the giant component still has much lower graph-theoretic entropy than the actual population.
    • He suggests that if the entropic size of the giant component is much smaller than its dictionary size, then populations are likely to be trapped there.

  • Interesting, but I'd prefer to have an expert peer-review it first :)

hide / / print
ref: -0 tags: asymmetric locality sensitive hash maximum inner product search sparsity date: 03-30-2020 02:17 gmt revision:5 [4] [3] [2] [1] [0] [head]

Improved asymmetric locality sensitive hashing for maximum inner product search

  • Like many other papers, this one is based on a long lineage of locality-sensitive hashing papers.
  • Key innovation, in [23] The power of asymmetry in binary hashing, was the development of asymmetric hashing -- the hash function of the query is different than the hash function used for storage. Roughly, this allows additional degrees of freedom since the similarity-function is (in the non-normalized case) non-symmetric.
    • For example, take query Q = [1 1] with keys A = [1 -1] and B = [3 3]. The nearest neighbor is A (distance 2), whereas the maximum inner product is B (inner product 6).
    • Alternately: self-inner product for Q and A is 2, whereas for B it's 18. Self-similarity is not the highest with inner products.
    • Norm of the query does not have an effect on the arg max of the search, though. Hence, for the paper assume that the query has been normalized for MIPS.
  • In this paper instead they convert MIPS into approximate cosine similarity search (which is like normalized MIPS), which can be efficiently solved with signed random projections.
  • (Established): LSH-L2 distance:
    • Sample a random vector a, iid normal N(0,1)
    • Sample a random offset b uniformly between 0 and r
      • r is the window size / radius (a free parameter)
    • Hash function is then $\lfloor (a^T x + b) / r \rfloor$
      • I'm not sure about how the floor op is converted to bits of the actual hash -- ?
  • (Established): LSH-correlation, signed random projections $h^{sign}$ :
    • Hash is the sign of the inner product of the input vector and a uniform random vector a.
    • This is a two-bit random projection [13][14].
  • (New) Asymmetric-LSH-L2:
    • $P(x) = [x; ||x||^2_2; ||x||^4_2; ...; ||x||^{2^m}_2]$ -- this is the pre-processing hashing of the 'keys'.
      • Requires that the norm of these keys satisfy $||x||_2 \lt U \lt 1$
      • $m \geq 3$
    • $Q(x) = [x; 1/2; 1/2; ...; 1/2]$ -- hashing of the queries.
    • See the mathematical explanation in the paper, but roughly: transformations P and Q, when norms are less than 1, provide a correction to the L2 distance $||Q(p) - P(x_i)||_2$ , making its rank correlate with the un-normalized inner product.
  • They then change the augmentation to:
    • $P(x) = [x; 1/2 - ||x||^2_2; 1/2 - ||x||^4_2; ...; 1/2 - ||x||^{2^m}_2]$
    • $Q(x) = [x; 0; ...; 0]$
    • This allows use of signed nearest-neighbor search to be used in the MIPS problem. (e.g. the hash is the sign of P and Q, per above; I assume this is still a 2-bit operation?)
  • Then they expand the U, m compromise function $\rho$ to allow for non-normalized queries. U depends on m and c (m is the codeword extension, and c is the ratio between on-target and off-target hash hits).
  • Tested on the Movielens and Netflix databases, using SVD preprocessing on the user-item matrix (a full-rank matrix of every user's rating on every movie -- mostly zeros!) to get the latent vectors.
  • In the above plots, recall (hah) that precision is the number of true positives / the number of draws k, and recall is the number of true positives / the total number of relevant items, as the number of draws k increases.
    • Clearly, the curve bends up and to the right when there are a lot of hash tables K.
    • Example datapoint: 50% precision at 40% recall, top 5. So on average you get 2 correct hits in 4 draws. Or: 40% precision, 20% recall, top 10: 2 hits in 5 draws. 20/40: 4 hits in 20 draws. (hit: correctly within the top-N)
    • So ... it's not that great.
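The second P/Q scheme above can be sketched end-to-end with signed random projections. Pure Python; m = 3 and the norm rescaling U = 0.83 are illustrative choices here (treat them as toy values), and the A/B/Q vectors are the toy example from the notes.

```python
import math, random

# Asymmetric LSH for MIPS: augment keys with P(x) and queries with Q(x),
# then hash both with the same signed random projections.
# m = 3 and U = 0.83 are illustrative choices, not tuned values.

def P(x, m=3):   # key transform: [x; 1/2-||x||^2; 1/2-||x||^4; 1/2-||x||^8]
    n2 = sum(v * v for v in x)
    return list(x) + [0.5 - n2 ** (2 ** (i - 1)) for i in range(1, m + 1)]

def Q(x, m=3):   # query transform: pad with zeros (query assumed normalized)
    return list(x) + [0.0] * m

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

# Keys from the Q/A/B example above, rescaled so all key norms are < U < 1.
U_scale = 0.83 / math.hypot(3, 3)
A = [1 * U_scale, -1 * U_scale]
B = [3 * U_scale, 3 * U_scale]
q = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # normalized query

# The MIPS winner is B; cosine similarity of the augmented vectors agrees.
print(cos(Q(q), P(A)), cos(Q(q), P(B)))

# Signed random projections: collision fraction rises with that cosine.
rng = random.Random(0)
planes = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(500)]  # dim 2 + m
def sign_bits(v):
    return [sum(a * b for a, b in zip(p, v)) >= 0 for p in planes]
qa = sum(b1 == b2 for b1, b2 in zip(sign_bits(Q(q)), sign_bits(P(A))))
qb = sum(b1 == b2 for b1, b2 in zip(sign_bits(Q(q)), sign_bits(P(B))))
print(qa, qb)  # B collides with the query more often than A does
```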

Use case: Capsule: a camera based positioning system using learning
  • Uses 512 SIFT features as keys and queries to LSH. Hashing is computed via sparse addition / subtraction algorithm, with K bits per hash table (not quite random projections) and L hash tables. K = 22 and L = 24. ~ 1000 training images.
  • Best matching image is used as the location of the current image.

hide / / print
ref: -0 tags: reinforcement learning distribution DQN Deepmind dopamine date: 03-30-2020 02:14 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-31942076 A distributional code for value in dopamine based reinforcement learning

  • Synopsis is staggeringly simple: dopamine neurons encode / learn to encode a distribution of reward expectations, not just the mean (aka the expected value) of the reward at a given state-action pair.
  • This is almost obvious neurally -- of course dopamine neurons in the striatum represent different levels of reward expectation; there is population diversity in nearly everything in neuroscience. The new interpretation is that neurons have different slopes for their susceptibility to positive and negative rewards (or rather, reward predictions), which results in different inflection points where the neurons are neutral about a reward.
    • This constitutes more optimistic and pessimistic neurons.
  • There is already substantial evidence that such a distributional representation enhances performance in DQN (Deep q-networks) from circa 2017; the innovation here is that it has been extended to experiments from 2015 where mice learned to anticipate water rewards with varying volume, or varying probability of arrival.
  • The model predicts a diversity of asymmetries below and above the reversal point.
  • Also predicts that the distribution of reward responses should be decoded by neural activity ... which it is ... but it is not surprising that a bespoke decoder can find this information in the neural firing rates. (Have not examined in depth the decoding methods)
  • Still, this is a clear and well-written, well-thought out paper; glad to see new parsimonious theories about dopamine out there.
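The optimistic/pessimistic asymmetry can be sketched as an expectile-style update: each 'neuron' scales positive and negative prediction errors differently, so its value estimate settles at a different point of the reward distribution. This is a toy illustration of the idea, not the paper's model or data; the bimodal rewards and learning rates are made up.

```python
import random

# Asymmetric TD-like update: each unit weights positive prediction errors
# by tau and negative ones by (1 - tau); its estimate converges toward the
# tau-expectile of the reward distribution. Optimistic units (tau near 1)
# settle high; pessimistic units (tau near 0) settle low.

def learn_expectile(rewards, tau, lr=0.02, epochs=40):
    v = 0.0
    for _ in range(epochs):
        for r in rewards:
            delta = r - v
            v += lr * (tau if delta > 0 else (1 - tau)) * delta
    return v

rng = random.Random(1)
rewards = [rng.choice([1.0, 10.0]) for _ in range(500)]  # toy bimodal rewards

pessimist = learn_expectile(rewards, tau=0.1)
neutral   = learn_expectile(rewards, tau=0.5)
optimist  = learn_expectile(rewards, tau=0.9)
print(round(pessimist, 1), round(neutral, 1), round(optimist, 1))
```

Reading the set of converged values across many such units with different tau is what lets the distribution of rewards, not just its mean, be decoded from the population.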