m8ta
{1417}
ref: -0 tags: synaptic plasticity 2-photon imaging inhibition excitation spines dendrites synapses 2p date: 08-14-2020 01:35 gmt revision:3 [2] [1] [0] [head]

PMID-22542188 Clustered dynamics of inhibitory synapses and dendritic spines in the adult neocortex.

  • Cre-recombinase-dependent labeling of postsynaptic scaffolding via a Gephyrin-Teal fluorophore fusion.
  • Also added Cre-eYFP to label the neurons
  • Electroporated in utero at E16 in mice.
    • Low concentration of Cre, high concentrations of the Gephyrin-Teal and Cre-eYFP constructs to attain sparse labeling.
  • Located the same dendrite imaged in-vivo in fixed tissue - !! - using serial-section electron microscopy.
  • 2230 dendritic spines and 1211 inhibitory synapses from 83 dendritic segments in 14 cells of 6 animals.
  • Some spines had inhibitory synapses on them -- 0.7 / 10um, vs 4.4 / 10um dendrite for excitatory spines. ~ 1.7 inhibitory
  • Suggest that the data support the idea that inhibitory inputs may be gating excitation.
  • Furthermore, co-innervated spines are stable, both during normal experience and during monocular deprivation.
  • Monocular deprivation induces a pronounced loss of inhibitory synapses in binocular cortex.

{1478}
ref: -2013 tags: 2p two photon STED super resolution microscope synapse synaptic plasticity date: 08-14-2020 01:34 gmt revision:3 [2] [1] [0] [head]

PMID-23442956 Two-Photon Excitation STED Microscopy in Two Colors in Acute Brain Slices

  • Plenty of details on how they set up the microscope.
  • Mice: Thy1-eYFP (some excitatory cells in the hippocampus and cortex) and CX3CR1-eGFP (GFP in microglia). Crossbred the two strains for two-color imaging.
  • Animals were 21-40 days old at slicing.

PMID-29932052 Chronic 2P-STED imaging reveals high turnover of spines in the hippocampus in vivo

  • As above, Thy1-GFP / Thy1-YFP labeling; hence this was a structural study (for which the high resolution of STED was necessary).
  • Might just as well have gone with synaptic labels, e.g. tdTomato-Synapsin.

{1518}
ref: -0 tags: synaptic plasticity LTP LTD synapses NMDA glutamate uncaging date: 08-11-2020 22:40 gmt revision:0 [head]

PMID-31780899 Single Synapse LTP: A matter of context?

  • Not a great name for a thorough and reasonably well-written review of glutamate uncaging studies as related to LTP (and to a lesser extent LTD).
  • Lots of references from many familiar names. Nice to have them all in one place!
  • I'm left wondering, between CaMKII, PKA, PKC, Ras, and other GTP-dependent molecules -- how much of the regulatory network in the synapse is known? E.g. if you pull down all proteins in the synaptosome & their interacting partners, how many are unknown, or have an unknown function? I know something like this has been done for flies, but in mammals - ?

{1504}
ref: -0 tags: GEVI review voltage sensor date: 08-10-2020 22:22 gmt revision:24 [23] [22] [21] [20] [19] [18] [head]

Various GEVIs invented and evolved:

Ace-FRET sensors

  • PMID-26586188 Ace-mNeonGreen, an opsin-FRET sensor, might still be better in terms of SNR, but it's green.
    • Negative ΔF/F with depolarization.
    • Fast enough to resolve spikes.
    • Rational design; little or no screening.
    • Ace is about six times as fast as Mac, and mNeonGreen has a ~50% higher extinction coefficient than mCitrine and nearly threefold better photostability (12)

  • PMID-31685893 A High-speed, red fluorescent voltage sensor to detect neural activity
    • Fusion of Ace2N + short linker + mScarlet, a bright (if not the brightest; highest QY) monomeric red fluorescent protein.
    • Almost as good SNR as Ace2N-mNeonGreen.
    • Also a FRET sensor; negative delta F with depolarization.
    • Ace2N-mNeon is not sensitive under two-photon illumination; presumably this is true of all eFRET sensors?
    • Ace2N drives almost no photocurrent.
    • Sought to maximize SNR: dF/F_0 × sqrt(F_0); screened 'only' 18 linkers to see what worked best. Yet it's better than VARNAM.
    • ~ 14% dF/F per 100mV depolarization.

Arch and Mac rhodopsin sensors

  • PMID-22120467 Optical recording of action potentials in mammalian neurons using a microbial rhodopsin Arch 2011
    • Endogenous fluorescence of the retinal (+ environment) of microbial rhodopsin protein Archaerhodopsin 3 (Arch) from Halorubrum sodomense.
    • A variant of the pump without proton-pumping capability also showed voltage dependence, but with slower kinetics.
      • This required one mutation, D95N.
    • Requires fairly intense illumination, as the QY of the fluorophore is low (9 × 10⁻⁴). Still, the photobleaching rate was relatively low.
    • Arch is mainly used for neuronal inhibition.

  • PMID-25222271 Archaerhodopsin Variants with Enhanced Voltage Sensitive Fluorescence in Mammalian and Caenorhabditis elegans Neurons Archer1 2014
    • Capable of voltage sensing under red light, and inhibition (via proton pumping) under green light.
    • Note: "The high laser power used to excite Arch (above) fluorescence causes significant autofluorescence in intact tissue and limits its accessibility for widespread use."
    • Archers have 3-5x the fluorescence of WT Arch -- so, QY of ~3.6e-3. Still very dim.
    • Archer1 dF/F_0 85%; Archer2 dF/F_0 60% @ 100mV depolarization (positive sense).
    • Screened the proton pump of Gloeobacter violaceus rhodopsin; found mutations were then transferred to Arch.
      • Maybe they were planning on using the Gloeobacter rhodopsin, but it didn't work for some reason, so they transferred the mutations to Arch.
    • TS and ER export domains for localization.

  • PMID-24755708 Imaging neural spiking in brain tissue using FRET-opsin protein voltage sensors MacQ-mOrange and MacQ-mCitrine.
    • L. maculans (Mac) rhodopsin (faster than Arch) + FP mCitrine, FRET sensor + ER/TS.
    • Four-fold faster kinetics and 2-4x brighter than ArcLight.
      • No directed evolution to optimize sensitivity or brightness. Just kept the linker short & trimmed residues based on crystal structure.
    • ~5% ΔF/F; can resolve spikes up to 10 Hz.
    • Spectroscopic studies of the proton pumping photocycle in bacteriorhodopsin and Archaerhodopsin (Arch) have revealed that proton translocation through the retinal Schiff base changes chromophore absorption [24-26]
    • Used rational design to abolish the proton current (D139N and D139Q aka MacQ) ; screens to adjust the voltage sensing kinetics.
    • Still has photocurrents.
    • Seems that slice / in vivo performance is consistently worse than in cultured neurons... in Purkinje neurons, dF/F was 1.2%, even though the in vitro response was ~15% to a 100 mV depolarization.
    • Imaging intensity 30 mW/mm^2 (3 W/cm^2).

  • PMID-24952910 All-optical electrophysiology in mammalian neurons using engineered microbial rhodopsins QuasAr1 and QuasAr2 2014
    • Directed evolution approach to improve the brightness and speed of Arch D95N.
      • Improved the fluorescence QY by 19x and 10x (QuasAr1 and QuasAr2, respectively -- QuasAr2 has the higher voltage sensitivity).
    • Also developed a sensitive channelrhodopsin, CheRiff, which can be activated by blue light (λmax = 460 nm) dim enough to not affect the QuasArs.
    • They call the two of them 'Optopatch 2'.
    • Incident light intensity 1kW / cm^2 (!)

  • PMID-29483642 A robotic multidimensional directed evolution approach applied to fluorescent voltage reporters. Archon1 2018
    • Started with QuasAr2 (above), which was evolved from Arch. Intrinsic fluorescence of retinal in rhodopsin.
    • Expressed in HEK293T cells; then FACS, robotic cell picking, whole genome amplification, PCR, cloning.
    • Also evolved miRFP, deep red fluorescent protein based on bacteriophytochrome.
    • ΔF/F of 80% and 20% with a 100 mV depolarization (Archon1 and Archon2, respectively).
    • "We investigated the contribution of specific point mutations to changes in localization, brightness, voltage sensitivity and kinetics and found the patterns that emerged to be complex (Supplementary Table 6), with a given mutation often improving one parameter but worsening another."
    • If the original QY of Arch was 9e-4, QuasAr2 improved this by 10x, and Archon1 improved that by a further 2.3x, then the QY of Archon1 is ~0.02. Given that the molar extinction coefficient is ~50,000 for retinal, the brightness of the fluorescent probe is low, ~1 (good fluorescent proteins and synthetic dyes have a brightness of ~90); see the worked estimate below.
  • Imaged using 637 nm laser light at 800 mW/mm^2 for Archon1 and Archon2; emission filtered through a 664LP filter.
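
A quick check of that brightness estimate, assuming the common convention of brightness ≈ extinction coefficient × quantum yield / 1000 (which puts EGFP around 34 and bright synthetic dyes near 90):

$$ \text{brightness} \approx \frac{\epsilon \Phi}{1000} = \frac{50000 \times 0.02}{1000} \approx 1 $$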

VSD - FP sensors

  • PMID-28811673 Improving a genetically encoded voltage indicator by modifying the cytoplasmic charge composition Bongwoori 2017
    • ArcLight derivative.
    • Arginine (positive charge) scanning mutagenesis of the linker region improved the signal size of the GEVI, Bongwoori, yielding fluorescent signals as high as 20% ΔF/F during the firing of action potentials.
    • Used the mutagenesis to shift the threshold for fluorescence change more negative, ~ -30mV.
    • Like ArcLight, it's slow.
    • Strong baseline shift due to the acidification of the neuron during AP firing (!)

  • Attenuation of synaptic potentials in dendritic spines
    • Found that SNR / dF/F_0 is limited by intracellular localization of the sensor.
      • This is true even though ArcLight is supposed to be in a dark state at the lower pH of intracellular organelles... a problem worth considering.
      • Makes negative-going GEVIs more practical, as those not in the membrane are dark @ 0 mV.

  • Fast two-photon volumetric imaging of an improved voltage indicator reveals electrical activity in deeply located neurons in the awake brain ASAP3 2018
    • Opsin-based GEVIs have been used in vivo with 1p excitation to report electrical activity of superficial neurons, but their responsivity is attenuated for 2p excitation. (!)
    • Site-directed evolution in HEK cells.
    • Expressed linear PCR products directly in the HEK cells, with no assembly / ligation required! (Saves lots of time: normally you need to amplify, assemble into a plasmid, transfect, culture, measure, purify the plasmid, digest, EP PCR, etc.)
    • Screened in a motorized 384-well conductive plate, electroporation electrode sequentially activates each on an upright microscope.
    • 46% improvement over ASAP2 R414Q
    • Ace2N-4aa-mNeon is not responsive under 2p illumination; nor are Archon1 or QuasAr2/3.
    • ULOVE = AOD based fast local scanning 2-p random access scope.

  • Bright and tunable far-red chemigenetic indicators
    • GgVSD (same as ASAP above) + cp HaloTag + Si-Rhodamine JF635
    • ~ 4% dF/F_0 during APs.
    • Found one mutation, R476G in the linker between cp Halotag and S4 of the VSD, which doubled the sensitivity of HASAP.
    • Also tested a ArcLight type structure, CiVSD fused to Halotag.
      • HArcLight had a negative dF/F_0 and ~3% change in response to APs.
    • No voltage sensitivity when the synthetic dye was largely in the zwitterionic form, e.g. tetramethylrhodamine.

{1517}
ref: -2015 tags: spiking neural networks causality inference demixing date: 07-22-2020 18:13 gmt revision:1 [0] [head]

PMID-26621426 Causal Inference and Explaining Away in a Spiking Network

  • Rubén Moreno-Bote & Jan Drugowitsch
  • Use linear non-negative mixing plus noise to generate a series of sensory stimuli.
  • Pass these through a one-layer spiking or non-spiking neural network with adaptive global inhibition and adaptive reset voltage to solve this quadratic programming problem with non-negative constraints.
  • N causes, one observation: $\mu = \sum_{i=1}^{N} u_i r_i + \epsilon$,
    • $r_i \geq 0$ -- causes can be present or not present, but not negative.
    • cause coefficients drawn from a truncated (positive only) Gaussian.
  • Linear spiking network with symmetric weight matrix $J = -U^T U - \beta I$ (see the figure in the paper)
    • That is ... J looks like a correlation matrix!
    • $U$ is M x N; columns are the mixing vectors.
    • U is known beforehand and not learned
      • That said, as a quasi-correlation matrix, it might not be so hard to learn. See ref [44].
  • Can solve this problem by minimizing the negative log-posterior function: $$ L(\mu, r) = \frac{1}{2}(\mu - Ur)^T(\mu - Ur) + \alpha1^Tr + \frac{\beta}{2}r^Tr $$
    • That is, want to maximize the joint probability of the data and observations given the probabilistic model $p(\mu, r) \propto \exp(-L(\mu, r)) \prod_{i=1}^{N} H(r_i)$
    • First term quadratically penalizes difference between prediction and measurement.
    • Second term: $\alpha$ is an L1 regularization weight; third term: $\beta$ weights an L2 regularization.
  • The negative log-likelihood is then converted to an energy function (linear algebra): $W = -U^T U$, $h = U^T \mu$, then $E(r) = 0.5 r^T W r - r^T h + \alpha 1^T r + 0.5 \beta r^T r$
    • This is where they get the weight matrix J or W. If the vectors U are linearly independent, then it is negative semidefinite.
  • The dynamics of individual neurons w/ global inhibition and a variable reset voltage serve to minimize this energy -- hence, solve the problem. (They gloss over this derivation in the main text; a minimal numerical sketch of the optimization follows this list.)
  • Next, show that a spike-based network can similarly 'relax' or descend the objective gradient to arrive at the quadratic programming solution.
    • Network is N leaky integrate and fire neurons, with variable synaptic integration kernels.
    • $\alpha$ then translates to global inhibition, and $\beta$ to a lowered reset voltage.
  • Yes, it can solve the problem .. and do so in the presence of firing noise in a finite period of time .. but a little bit meh, because the problem is not that hard, and there is no learning in the network.
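
A minimal numpy sketch of the underlying optimization (not the spiking dynamics): projected gradient descent on the negative log-posterior $L(\mu, r)$ above, with made-up dimensions, regularization weights, and a random non-negative mixing matrix U.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 10, 5                                   # observation dim, number of causes
U = np.abs(rng.normal(size=(M, N)))            # known mixing vectors (columns)
r_true = np.maximum(rng.normal(size=N), 0)     # non-negative ground-truth causes
mu = U @ r_true + 0.05 * rng.normal(size=M)    # noisy observation

alpha, beta = 0.1, 0.1                         # L1 and L2 regularization weights
r, eta = np.zeros(N), 0.01                     # estimate and gradient step size

for _ in range(5000):
    grad = U.T @ (U @ r - mu) + alpha + beta * r   # dL/dr
    r = np.maximum(r - eta * grad, 0.0)            # project onto r >= 0

print("true causes:", np.round(r_true, 2))
print("inferred   :", np.round(r, 2))
```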

{1516}
ref: -2017 tags: GraphSAGE graph neural network GNN date: 07-16-2020 15:49 gmt revision:2 [1] [0] [head]

Inductive representation learning on large graphs

  • William L. Hamilton, Rex Ying, Jure Leskovec
  • Problem: given a graph where each node has a set of (possibly varied) attributes, create an 'embedding' vector at each node that describes both the node and the network that surrounds it.
  • To this point (2017) there were two ways of doing this -- through matrix factorization methods, and through graph convolutional networks.
    • The matrix factorization methods or spectral methods (similar to multi-dimensional scaling, where points are projected onto a plane to preserve a distance metric) are transductive : they work entirely within-data, and don't directly generalize to new data.
      • This is parsimonious in some sense, but doesn't work well in the real world, where datasets are constantly changing and frequently growing.
  • Their approach is similar to graph convolutional networks, where (I think) the convolution is indexed by node distances.
  • General idea: each node starts out with an embedding vector = its attribute or feature vector.
  • Then, all neighboring nodes are aggregated by sampling a fixed number of the nearest neighbors (fixed for computational reasons).
    • Aggregation can be mean aggregation, LSTM aggregation (on random permutations of the neighbor nodes), or MLP -> nonlinearity -> max-pooling. Pooling has the most wins, though all seem to work... (a minimal sketch of the mean-aggregation variant follows this list).
  • The aggregated vector is concatenated with the current node feature vector, and this is fed through a learned weighting matrix and nonlinearity to output the feature vector for the current pass.
  • Passes proceed from out-in... I think.
  • Algorithm is inspired by the Weisfeiler-Lehman Isomorphism Test, which updates neighbor counts per node to estimate if graphs are isomorphic. They do a similar thing here, only with vectors not scalars, and similarly take into account the local graph structure.
    • All the aggregator functions, and of course the nonlinearities and weighting matrices, are differentiable -- so the structure is trained in a supervised way with SGD.
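
A minimal numpy sketch of one GraphSAGE layer with mean aggregation, under my reading of the algorithm above; the graph, feature dimensions, weight matrix, and neighbor-sample size are all made up.

```python
import numpy as np

rng = np.random.default_rng(1)
num_nodes, d_in, d_out, n_samples = 6, 8, 4, 3
features = rng.normal(size=(num_nodes, d_in))          # initial embeddings = node attributes
neighbors = {i: [j for j in range(num_nodes) if j != i] for i in range(num_nodes)}
W = rng.normal(size=(2 * d_in, d_out)) * 0.1           # learned weight matrix (random here)

def sage_layer(h, neighbors, W):
    h_next = np.zeros((h.shape[0], W.shape[1]))
    for v in range(h.shape[0]):
        # sample a fixed number of neighbors and average their embeddings
        sampled = rng.choice(neighbors[v], size=n_samples, replace=True)
        agg = h[sampled].mean(axis=0)
        # concatenate self embedding with the aggregate, project, apply nonlinearity
        z = np.concatenate([h[v], agg]) @ W
        h_next[v] = np.maximum(z, 0)                   # ReLU
    # L2-normalize the output embeddings, as in the paper
    return h_next / (np.linalg.norm(h_next, axis=1, keepdims=True) + 1e-8)

h1 = sage_layer(features, neighbors, W)
print(h1.shape)   # (6, 4)
```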

This is a well-put together paper, with some proofs of convergence etc -- but it still feels only lightly tested. As with many of these papers, could benefit from a positive control, where the generating function is known & you can see how well the algorithm discovers it.

Otherwise, the structure / algorithm feels rather intuitive; surprising to me that it was not developed before the matrix factorization methods.

Worth comparing this to word2vec embeddings, where local words are used to predict the current word & the resulting vector in the neck-down of the NN is the representation.

{1515}
ref: -0 tags: bleaching STED dye phosphorus japan date: 07-16-2020 14:06 gmt revision:1 [0] [head]

Super-Photostable Phosphole-Based Dye for Multiple-Acquisition Stimulated Emission Depletion Imaging

  • Use the electron withdrawing ability of a phosphole group (P = O) to reduce photobleaching
  • Derived from another photostable dye, C-Naphox, only with a different mechanism of fluorescence -- pi-pi* transfer rather than intramolecular charge transfer (ICT).
  • Much more stable than Alexa 488 (aka sulfonated fluorescein, which is not the most stable dye..)
  • Suitable for multiple STED images, unlike the other dyes. (Note!)

{1490}
ref: -2011 tags: two photon cross section fluorescent protein photobleaching Drobizhev date: 07-10-2020 21:09 gmt revision:8 [7] [6] [5] [4] [3] [2] [head]

PMID-21527931 Two-photon absorption properties of fluorescent proteins

  • Significant 2-photon cross section of red fluorescent proteins (same chromophore as DsRed) in the 700 - 770nm range, accessible to Ti:sapphire lasers ...
    • This corresponds to a $S_0 \rightarrow S_n$ transition.
    • But photobleaching is an order of magnitude slower when excited via the direct $S_0 \rightarrow S_1$ transition (though the fluorophores can be significantly less bright in this regime).
      • Quote: the photobleaching of DsRed slows down by an order of magnitude when the excitation wavelength is shifted to the red, from 750 to 950 nm (32).
    • See also PMID-18027924
  • Further work by same authors: Absolute Two-Photon Absorption Spectra and Two-Photon Brightness of Orange and Red Fluorescent Proteins
    • " TagRFP possesses the highest two-photon cross section, σ2 = 315 GM, and brightness, σ2φ = 130 GM, where φ is the fluorescence quantum yield. At longer wavelengths, 1000–1100 nm, tdTomato has the largest values, σ2 = 216 GM and σ2φ = 120 GM, per protein chain. Compared to the benchmark EGFP, these proteins present 3–4 times improvement in two-photon brightness."
    • "Single-photon properties of the FPs are poor predictors of which fluorescent proteins will be optimal in two-photon applications. It follows that additional mutagenesis efforts to improve two-photon cross section will benefit the field."
  • 2P cross-section in both the 700-800nm and 1000-1100 nm range corresponds to the chromophore polarizability, and is not related to 1p cross section.
  • This can be useful for multicolor imaging: excitation of the higher S0 → Sn transition of TagRFP simultaneously with the first, S0 → S1, transition of mKalama1 makes dual-color two-photon imaging possible with a single excitation laser wavelength (13).
  • Why are red GECIs based on mApple (rGECO1) or mRuby (RCaMP)? dsRed2 or TagRFP are much better .. but maybe they don't have CP variants.
  • from https://elifesciences.org/articles/12727

{1513}
ref: -0 tags: constitutional law supreme court date: 06-03-2020 01:40 gmt revision:0 [head]

Spent a while this evening reading about Qualified Immunity -- the law that permits government officials (e.g. police officers) immunity when 'doing their jobs'. It's perhaps one root of the George Floyd / racism protests, as it has set a precedent that US police can be violent and get away with it. (This is also related to police unions and collective liability loops... anyway)

The supreme court has the option to take cases challenging the constitutionality of Qualified Immunity, which many on both sides of the political spectrum want them to do.

It 'got' this power via Marbury vs. Madison. M v. M is self-referential genius:

  • They ruled the original action (blocking an appointment) was illegal,
  • but that the court did not have the power to remedy it,
  • because the congressional law that gave the Supreme Court that power was unconstitutional.
  • Instead, the supreme court has the power to decide if laws (in this case, those governing its jurisdiction) are constitutional.
  • E.g. SCOTUS initiated judicial review & expansion of its jurisdiction over Congressional law by striking down a law from Congress that had expanded its jurisdiction.
  • This was also done while threading the needle to satisfy then-present political pressure (Thomas Jefferson's administration, which wanted the original appointment blocked), so that those in power were aligned with the increase in the court's power, and the precedent could persist.

As a person curious how systems gain complexity and feedback loops ... so much nerdgasm.

{1512}
ref: -0 tags: rutherford journal computational theory neumann complexity wolfram date: 05-05-2020 18:15 gmt revision:0 [head]

The Structures for Computation and the Mathematical Structure of Nature

  • Broad, long, historical.

{1510}
ref: -2017 tags: google deepmind compositional variational autoencoder date: 04-08-2020 01:16 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

SCAN: learning hierarchical compositional concepts

  • From DeepMind, first version Jul 2017 / v3 June 2018.
  • Starts broad and strong:
    • "The seemingly infinite diversity of the natural world from a relatively small set of coherent rules"
      • Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details..
    • "We conjecture that these rules dive rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
    • "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
    • "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication.
    • This addresses the limitations of deep learning, which are overly data hungry (low sample efficiency), tend to overfit the data, and require human supervision.
  • Approach:
    • Factorize the visual world with a β-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
    • Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the β-VAE) that the examples have in common.
      • E.g. this is purely associative learning, with a finite one-layer association matrix.
    • Test in both the image-to-symbol and symbol-to-image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper..).
    • Add in a third module, which allows learning of compositions of the features, ala set notation: AND ($\cup$), IN-COMMON ($\cap$) & IGNORE ($\setminus$ or '-'). This is via a low-parameter convolutional model.
  • Notation:
    • $q_{\phi}(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
    • $p_{\theta}(x|z_x)$ is the decoder model. $x \propto p_{\theta}(x|z_x)$, $\theta$ are the decoder parameters. $x$ is now the reconstructed scene.
  • From this, the loss function of the beta-VAE is:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL} (q_{\phi}(z_x|x) || p(z_x))$ where $\beta \gt 1$ (a minimal code sketch of this objective follows this bullet list).
      • That is, maximize the auto-encoder fit (the expectation of the decoder, over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$
        • $p(z) \propto \mathcal{N}(0, I)$ -- diagonal normal, unit variance.
        • $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
        • $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x)||p(z)) \lt \epsilon$ where D is the domain of images etc.
      • Claim that this loss function tips the scale too far away from accurate reconstruction with sufficient visual de-tangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising auto-encoder ref, which uses the feature L2 norm instead of the pixel log-likelihood:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_{\phi}(z_x|x)}||J(\hat{x}) - J(x)||_2^2 - \beta D_{KL} (q_{\phi}(z_x|x) || p(z_x))$ where $J : \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
      • This $J(x)$ is from another neural network (transfer learning) which learns features beforehand.
      • It's a multilayer perceptron denoising autoencoder [Vincent 2010].
  • The SCAN architecture includes an additional element, another VAE which is trained simultaneously on the labeled inputs $y$ and the latent outputs from the encoder $z_x$ given $x$.
  • In this way, they can present a description $y$ to the network, which is then recomposed into $z_y$, that then produces an image $\hat{x}$.
    • The whole network is trained by minimizing:
    • $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$
      • 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y} (y|z_y)]$, the log-likelihood of the decoded symbols given the encoded latents $z_y$.
      • 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) || p(z_y))$, a weighted KL divergence between the encoded latents and the diagonal normal prior.
      • 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) || q_{\phi_y}(z_y|y))$, a weighted KL divergence between the latents from the images and the latents from the description $y$.
        • They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
  • Final element! A convolutional recombination element, implemented as a tensor product between $z_{y1}$ and $z_{y2}$ that outputs a one-hot encoding of the set-operation, which is fed to a (hardcoded?) transformation matrix.
    • I don't think this is great shakes. Could have done this with a small function; no need for a neural network.
    • Trained with very similar loss function as SCAN or the beta-VAE.
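
A minimal numpy sketch of the β-VAE objective referenced above, for a diagonal-Gaussian encoder and a Gaussian (MSE) pixel likelihood; the encoder and decoder themselves are stubbed out with random numbers, so this only shows how the two loss terms combine.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, latent_dim, pixels = 16, 10, 64 * 64
beta = 4.0                                            # beta > 1 pushes toward disentangled latents

# stand-ins for encoder outputs and decoder reconstruction
mu     = rng.normal(size=(batch, latent_dim))         # encoder means
logvar = rng.normal(size=(batch, latent_dim)) * 0.1   # encoder log-variances
x      = rng.random(size=(batch, pixels))             # input images (flattened)
x_hat  = x + 0.05 * rng.normal(size=(batch, pixels))  # decoder output

# reconstruction term: (negative) pixel log-likelihood, here as MSE
recon = ((x - x_hat) ** 2).sum(axis=1).mean()

# KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, in closed form
kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(axis=1).mean()

loss = recon + beta * kl      # minimize this (= maximize the objective above, up to sign)
print(recon, kl, loss)
```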

  • Testing:
  • They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implemented easily, e.g. a single-pixel detector for the wall color. Quite disappointing.
  • This is marginally more interesting -- the network learns to eliminate latent factors as it's exposed to examples (just like perhaps a Bayesian network.)
  • Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

{1511}
ref: -2020 tags: evolution neutral drift networks random walk entropy population date: 04-08-2020 00:48 gmt revision:0 [head]

Localization of neutral evolution: selection for mutational robustness and the maximal entropy random walk

  • The take-away of the paper is that, with larger populations, random mutation and recombination make areas of the graph that take several steps to reach (in the figure, this is Maynard Smith's four-letter mutation word game) less likely to be visited.
  • This is because the recombination serves to make the population adhere more closely to the 'giant' mode. In Maynard's game, this is 2268 words of 2405 meaningful words that can be reached by successive letter changes.
  • The author extends it to van Nimwegen's 1999 paper / RNA genotype-secondary structure. It's not as bad as Maynard's game, but still has much lower graph-theoretic entropy than the actual population.
    • He suggests that if the entropic size of the giant component is much smaller than its dictionary size, then populations are likely to be trapped there.

  • Interesting, but I'd prefer to have an expert peer-review it first :)

{1506}
ref: -0 tags: asymmetric locality sensitive hash maximum inner product search sparsity date: 03-30-2020 02:17 gmt revision:5 [4] [3] [2] [1] [0] [head]

Improved asymmetric locality sensitive hashing for maximum inner product search

  • Like many other papers, this one is based on a long lineage of locality-sensitive hashing papers.
  • Key innovation, in [23] The power of asymmetry in binary hashing, was the development of asymmetric hashing -- the hash function of the query is different than the hash function used for storage. Roughly, this allows additional degrees of freedom since the similarity-function is (in the non-normalized case) non-symmetric.
    • For example, take query Q = [1 1] with keys A = [1 -1] and B = [3 3]. The nearest neighbor is A (distance 2), whereas the maximum inner product is B (inner product 6).
    • Alternately: self-inner product for Q and A is 2, whereas for B it's 18. Self-similarity is not the highest with inner products.
    • Norm of the query does not have an effect on the arg max of the search, though. Hence, for the paper assume that the query has been normalized for MIPS.
  • In this paper instead they convert MIPS into approximate cosine similarity search (which is like normalized MIPS), which can be efficiently solved with signed random projections.
  • (Established): LSH-L2 distance:
    • Sample a random vector a, iid normal N(0,1)
    • Sample a random offset b uniformly between 0 and r
      • r is the window size / radius (a free parameter)
    • The hash function is then $h(x) = \lfloor (a^T x + b) / r \rfloor$.
      • I'm not sure about how the floor op is converted to bits of the actual hash -- ?
  • (Established): LSH-correlation, signed random projections $h^{sign}$:
    • Hash is the sign of the inner product of the input vector and a uniform random vector a.
    • This is a two-bit random projection [13][14].
  • (New) Asymmetric-LSH-L2:
    • $P(x) = [x; ||x||^2_2; ||x||^4_2; ...; ||x||^{2^m}_2]$ -- this is the pre-processing transformation of the 'keys'.
      • Requires that the norm of these keys is bounded: $||x||_2 \lt U \lt 1$
      • $m \geq 3$
    • $Q(x) = [x; 1/2; 1/2; ...; 1/2]$ -- transformation of the queries.
    • See the mathematical explanation in the paper, but roughly: transformations P and Q, when norms are less than 1, provide a correction to the L2 distance $||Q(p) - P(x_i)||_2$, making its rank correlate with the un-normalized inner product.
  • They then change the augmentation to:
    • $P(x) = [x; 1/2 - ||x||^2_2; 1/2 - ||x||^4_2; ...; 1/2 - ||x||^{2^m}_2]$
    • $Q(x) = [x; 0; ...; 0]$
    • This allows signed (cosine-similarity) nearest-neighbor search to be used for the MIPS problem (e.g. the hash is the sign of a random projection of P(x) / Q(q), per above; I assume this is still a 2-bit operation?). A minimal sketch of these transformations follows this list.
  • They then expand the U, m trade-off function $\rho$ to allow for non-normalized queries. U depends on m and c (m is the codeword extension, and c is the ratio between on-target and off-target hash hits).
  • Tested on Movielens and Netflix databases, this using SVD preprocessing on the user-item matrix (full rank matrix indicating every user rating on every movie (mostly zeros!)) to get at the latent vectors.
  • In the paper's plots, recall (hah) that precision is the number of true positives / the number of draws k (as k increases); recall is the number of true positives / N, the size of the true top-N set.
    • Clearly, the curve bends up and to the right when there are a lot of hash tables K.
    • Example datapoint: 50% precision at 40% recall, top 5. So on average you get 2 correct hits in 4 draws. Or: 40% precision, 20% recall, top 10: 2 hits in 5 draws. 20/40: 4 hits in 20 draws. (hit: correctly within the top-N)
    • So ... it's not that great.
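
A minimal numpy sketch of the second set of transformations above: after scaling the keys so $||x||_2 \leq U \lt 1$, the cosine similarity (and hence signed-random-projection agreement) between $Q(q)$ and $P(x)$ tracks the un-normalized inner product. Dimensions, data, and U = 0.83 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, U = 8, 100, 3, 0.83
X = rng.normal(size=(n, d))                      # keys
q = rng.normal(size=d)                           # query (its norm doesn't affect the argmax)
X = X * (U / np.linalg.norm(X, axis=1).max())    # scale keys so max norm is U < 1

def P(x, m):   # applied to keys
    norms = [np.linalg.norm(x) ** (2 ** i) for i in range(1, m + 1)]
    return np.concatenate([x, 0.5 - np.array(norms)])

def Q(x, m):   # applied to queries
    return np.concatenate([x, np.zeros(m)])

PX, Qq = np.stack([P(x, m) for x in X]), Q(q, m)

mips     = np.argmax(X @ q)                                          # true max inner product
cosines  = (PX @ Qq) / (np.linalg.norm(PX, axis=1) * np.linalg.norm(Qq))
cos_best = np.argmax(cosines)                                        # best under transformed cosine

# signed random projections approximate the cosine; more bits -> better estimate
R = rng.normal(size=(2048, d + m))
agree = ((R @ PX.T > 0) == (R @ Qq > 0)[:, None]).mean(axis=0)
srp_best = np.argmax(agree)

print(mips, cos_best, srp_best)   # these typically agree
```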

Use case: Capsule: a camera based positioning system using learning
  • Uses 512 SIFT features as keys and queries to LSH. Hashing is computed via sparse addition / subtraction algorithm, with K bits per hash table (not quite random projections) and L hash tables. K = 22 and L = 24. ~ 1000 training images.
  • Best matching image is used as the location of the current image.

{1500}
ref: -0 tags: reinforcement learning distribution DQN Deepmind dopamine date: 03-30-2020 02:14 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-31942076 A distributional code for value in dopamine based reinforcement learning

  • Synopsis is staggeringly simple: dopamine neurons encode / learn to encode a distribution of reward expectations, not just the mean (aka the expected value) of the reward at a given state-action pair.
  • This is almost obvious neurally -- of course dopamine neurons (projecting to the striatum) represent different levels of reward expectation; there is population diversity in nearly everything in neuroscience. The new interpretation is that neurons have different slopes for their susceptibility to positive and negative reward prediction errors, which results in different inflection points where the neurons are neutral about a reward.
    • This constitutes more optimistic and more pessimistic neurons (a toy sketch of the asymmetric update rule follows this list).
  • There is already substantial evidence that such a distributional representation enhances performance in DQN (Deep q-networks) from circa 2017; the innovation here is that it has been extended to experiments from 2015 where mice learned to anticipate water rewards with varying volume, or varying probability of arrival.
  • The model predicts a diversity of asymmetry below and above the reversal point.
  • Also predicts that the distribution of reward responses should be decodable from neural activity ... which it is ... but it is not surprising that a bespoke decoder can find this information in the neural firing rates. (Have not examined the decoding methods in depth.)
  • Still, this is a clear and well-written, well-thought out paper; glad to see new parsimonious theories about dopamine out there.
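
A toy numpy sketch of the distributional-TD idea described above: each value channel ('neuron') has its own asymmetric learning rates for positive vs. negative prediction errors, so its estimate converges to a different expectile of the reward distribution. The channel count, learning rates, and reward magnitudes are illustrative, not taken from the paper's fits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_trials = 7, 20000

# asymmetry parameter tau: near 1 -> optimistic channel, near 0 -> pessimistic channel
taus      = np.linspace(0.1, 0.9, n_channels)
alpha_pos = 0.01 * taus          # learning rate for positive prediction errors
alpha_neg = 0.01 * (1 - taus)    # learning rate for negative prediction errors

V = np.zeros(n_channels)         # per-channel value estimates (single state, no discounting)
for _ in range(n_trials):
    r = rng.choice([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])    # variable reward magnitudes
    delta = r - V                                            # per-channel prediction errors
    V += np.where(delta > 0, alpha_pos, alpha_neg) * delta   # asymmetric update

print(np.round(V, 2))   # spans pessimistic (low) to optimistic (high) reward expectations
```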

{1508}
ref: -0 tags: date: 03-30-2020 02:09 gmt revision:1 [0] [head]

SLIDE: in defense of smart algorithms over hardware acceleration for large-scale deep learning systems

  • Modeled directly on {1505} - Scalable and sustainable deep learning via randomized hashing.
  • Much emphasis in the paper on performance and tuning rather than theoretical or computational advances.
    • This is with explicitly wide and sparse classification datasets.
  • Free parameters:
    • L -- number of hash tables per layer.
    • K -- size of hash code, in bits, per table.
  • Definitely much faster -- but why can't you port the sparse LSH algorithms to a GPU & increase speed further?
  • Architecture follows Loss Decomposition for Fast Learning in Large Output Spaces
    • Fully connected neural network with one hidden layer and a batch size of 128 for Delicious and 256 for Amazon-670k.
    • I don't think they performed the decomposition of the weight matrix, which requires the message-passing iteration to set the backprop loss-function weights. No mention of such message-passing. (??)
  • Two hash functions: Simhash and densified winner-take-all (DWTA). DWTA is based on the observation that if the input data is very sparse, then the hash functions will not hash well (a rough sketch of a WTA-style hash follows this list).
    • Delicious-200k used Simhash, K = 9, L = 50;
    • Amazon-670 used DWTA hash, K = 8, L = 50;
  • "It should be noted that if we compute activation for s << 1 fraction of neurons in each layer on average, the fraction of weights that needs to be updated is only s 2s^2 ." (Since the only weights that are updated are the intersection of active pre and post.)
  • Updates are performed in a HOGWILD manner, where some overlap in weight updates (which are all computed in parallel) is tolerable for convergence.
  • Updates to the hash tables, however, are not computed every SGD iteration. Instead, they are scheduled with an exponential decay term -- e.g. the time between updates increases as the network converges. This is because the weight changes in the beginning are smaller than those at the end. Initial hash update is every 50 gradient updates.
    • For Simhash, which uses inner product with random vectors of {+1, 0, -1} (so that you don't need multiplies, only addition and subtraction), savings can be further extended to only re-compute the hashes with the changed weights. As noted above, the high level of unit sparsity makes these weight changes quadratically sparse.
  • Test their hashing-optimized deep learning algorithm on Delicious-200k and Amazon-670k, both forms of extreme classification with a very wide output layer. Authors suggest that most of the computational expense is in this last layer, same as 'loss decomp for fast learning in large output spaces'.
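
A rough numpy sketch of a winner-take-all style hash of the kind DWTA builds on: each hash value looks at a small random subset of coordinates and reports the argmax, so for sparse inputs it depends only on which large coordinates are present. The densification step that handles near-empty subsets is omitted, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K, L = 128, 8, 50      # input dim, hash values per table, number of tables

# each hash value inspects 6 randomly chosen coordinates of the input
subsets = rng.integers(0, dim, size=(L, K, 6))

def wta_hash(x):
    # returns L fingerprints, each a tuple of K small integers (bucket id components per table)
    return [tuple(int(np.argmax(x[idx])) for idx in subsets[l]) for l in range(L)]

x = np.zeros(dim)
x[rng.choice(dim, size=10, replace=False)] = rng.random(10)   # a sparse input
print(wta_hash(x)[:2])
```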

{1505}
ref: -2016 tags: locality sensitive hash deep learning regularization date: 03-30-2020 02:07 gmt revision:5 [4] [3] [2] [1] [0] [head]

Scalable and sustainable deep learning via randomized hashing

  • Central idea: replace dropout, adaptive dropout, or winner-take-all with a fast (sublinear time) hash based selection of active nodes based on approximate MIPS (maximum inner product search) using asymmetric locality-sensitive hashing.
    • This avoids a lot of the expensive inner-product multiply-accumulate work & energy associated with nodes that will either be completely off due to the ReLU or other nonlinearity -- or just not important for the algorithm + current input.
    • The result shows that you don't need very many neurons active in a given layer for successful training.
  • C.f. adaptive dropout, which chooses the nodes based on their activations: a few nodes are sampled from the network probabilistically, based on the node activations given the current input.
    • Adaptive dropouts demonstrate better performance than vanilla dropout [44]
    • It is possible to drop significantly more nodes adaptively than without while retaining superior performance.
  • WTA is an extreme form of adaptive dropout that uses mini-batch statistics to enforce a sparsity constraint. [28] {1507} Winner take all autoencoders
  • Their approach uses the insight that selecting a very sparse set of hidden nodes with the highest activations can be reformulated as dynamic approximate query processing, solvable with LSH.
    • LSH can be sub-linear time; normal processing involves the inner product.
    • LSH maps similar vectors into the same bucket with high probability. That is, it maps vectors into integers (bucket number)
  • Similar approach: Hashed nets [6], which aimed to decrease the number of parameters in a network by using a universal random hash function to tie weights. Compressing neural networks with the Hashing trick
    • "HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value."
  • Ref [38] shows how asymmetric hash functions allow LSH to be converted to a sub-linear time algorithm for maximum inner product search (MIPS).
  • Used multi-probe LSH: rather than having a large number of hash tables (L), which increases hash time and memory use, they probe close-by buckets in the hash tables. That is, they probe the bucket at B_j(Q) and those for slightly perturbed versions of the query Q. See ref [26].
  • See reference [2] for theory...
  • Following ref [42], use K randomized hash functions to generate the K data bits per vector. Each bit is the sign of the asymmetric random projection. Buckets contain a pointer to the node (neuron); only active buckets are kept around.
    • The K hash functions serve to increase the precision of the fingerprint -- found nodes are more expected to be active.
    • Have L hash tables for each hidden layer; these are used to increase the probability of finding useful / active nodes due to the randomness of the hash function.
    • Hash is asymmetric in the sense that the query and collection data are hashed independently.
  • In every layer during SGD, compute K x L hashes of the input, probe about 10 L buckets, and take their union. Experiments: K = 6 and L = 5. (A minimal sketch of this selection step follows this list.)
  • See ref [30] where authors show around 500x reduction in computations for image search following different algorithmic and systems choices. Capsule: a camera based positioning system using learning {1506}
  • Use relatively small test data sets -- MNIST 8M, NORB, Convex, Rectangles -- each resized to have small-ish input vectors.
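
A minimal numpy sketch of the selection step, using plain signed random projections in place of the asymmetric MIPS hash from ref [38] and probing only the exact bucket (so this is a simplification): hidden units whose weight vectors land in the same buckets as the layer input are the ones activated and updated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_hidden, K, L = 64, 1024, 6, 5

W = rng.normal(size=(n_hidden, d_in))        # hidden-layer weight vectors (one per unit)
planes = rng.normal(size=(L, K, d_in))       # K random hyperplanes per table, L tables

def fingerprints(v):
    # K sign bits per table, packed into an integer bucket id; shape (L,)
    bits = (np.einsum('lkd,d->lk', planes, v) > 0).astype(int)
    return (bits * (2 ** np.arange(K))).sum(axis=1)

# build the tables once (in training they would be re-built occasionally as weights drift)
tables = [dict() for _ in range(L)]
for unit in range(n_hidden):
    for l, b in enumerate(fingerprints(W[unit])):
        tables[l].setdefault(int(b), []).append(unit)

def active_units(x):
    # union of the buckets the input falls into, across the L tables
    act = set()
    for l, b in enumerate(fingerprints(x)):
        act.update(tables[l].get(int(b), []))
    return act

x = rng.normal(size=d_in)
print(len(active_units(x)), "of", n_hidden, "units selected")   # typically a small fraction
```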

  • Really want more analysis of what exactly is going on here -- what happens when you change the hashing function, for example? How much is the training dependent on suitable ROC or precision/recall on the activation?
    • For example, they could have calculated the actual real activation & WTA selection, and compared it to the results from the hash function; how correlated are they?

{1509}
ref: -2002 tags: hashing frequent items count sketch algorithm google date: 03-30-2020 02:04 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Finding frequent items in data streams

  • Notation:
    • S is a data stream, $S = q_1, q_2, ..., q_n$, of length n.
    • Each object $q_i \in O = \{o_1, ..., o_m\}$. That is, there are m total possible objects (e.g. English words).
    • Object $o_i$ occurs $n_i$ times in S. The $o_i$ are ordered so that $n_1 \geq n_2 \geq ... \geq n_m$.
  • Task:
    • Given an input stream S, integer k, and real $\epsilon$
    • Output a list of k elements from S such that each element has $n_i \gt (1-\epsilon)n_k$.
      • That is, if the ordering is perfect, $n_i \geq n_k$, with equality on the last element.
  • Algorithm:
    • $h_1, ..., h_t$ hash from object q to buckets $\{1, ..., b\}$
    • $s_1, ..., s_t$ hash from object q to $\{-1, +1\}$
    • For each symbol, add it to the 2D hash array by hashing first with $h_i$, then incrementing that counter by $s_i$.
      • The double-hashing is to reduce the effect of collisions with high-frequency items.
    • When querying for the frequency of an object, hash it like the others, and take the median over i of $h_i[q] \cdot s_i[q]$ (i.e. the counter selected by $h_i$, multiplied by the sign $s_i$). A minimal sketch follows this list.
    • $t = O(\log(\frac{n}{\delta}))$, where the algorithm fails with probability at most $\delta$
  • Demonstrate proof of convergence / function with Zipfian distributions with varying exponent. (I did not read through this).
  • Also showed that it's possible to compare these hash-counts directly to see what's changed, or, importantly, if the documents are different.
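
A minimal Python count-sketch, following the update and median-query rules above; the row count t, bucket count b, and the use of Python's built-in hash as a stand-in hash family are all illustrative choices.

```python
import numpy as np

class CountSketch:
    def __init__(self, t=5, b=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.t, self.b = t, b
        self.salts = rng.integers(0, 2**31, size=t)      # one salt per row -> t different hashes
        self.C = np.zeros((t, b), dtype=np.int64)

    def _hashes(self, q):
        hs, ss = [], []
        for salt in self.salts:
            h = hash((int(salt), q))                     # stand-in for a proper hash family
            hs.append(h % self.b)                        # bucket index h_i(q)
            ss.append(1 if (h >> 33) & 1 else -1)        # sign s_i(q)
        return hs, ss

    def add(self, q, count=1):
        for i, (h, s) in enumerate(zip(*self._hashes(q))):
            self.C[i, h] += s * count

    def estimate(self, q):
        hs, ss = self._hashes(q)
        return int(np.median([self.C[i, h] * s for i, (h, s) in enumerate(zip(hs, ss))]))

cs = CountSketch()
stream = ["the"] * 1000 + ["cat"] * 100 + ["sat"] * 10
for w in stream:
    cs.add(w)
print(cs.estimate("the"), cs.estimate("cat"), cs.estimate("sat"))   # approx 1000, 100, 10
```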


Mission: Ultra large-scale feature selection using Count-Sketches
  • Task:
    • Given a labeled dataset $(X_i, y_i)$ for $i \in \{1, 2, ..., n\}$ with $X_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$
    • Find the k-sparse feature vector / linear regression for the mean-squares problem $\min_{||\beta||_0=k} ||y - X\beta||_2$
      • $||\beta||_0 = k$ counts the non-zero elements in the feature vector.
    • The number of features $p$ is so large that a dense $\beta$ cannot be stored in memory. (X is of course sparse.)
  • Such data may be from ad click-throughs, or from genomic analyses ...
  • Use the count-sketch algorithm (above) for capturing & continually updating the features for gradient update.
    • That is, treat the stream of gradient updates, in the normal form $g_i = 2\lambda (y_i - X_i \beta_i X^t)^t X_i$, as the semi-continuous time series $S$ used above.
    • Compare this with greedy thresholding / iterative hard thresholding (IHT), e.g. throwing away gradient information after each batch.
    • This discards small gradients which may be useful for the regression problem.
  • Works better than IHT, but not necessarily better than straight feature hashing (FH).
  • Meh.

{1507}
ref: -2015 tags: winner take all sparsity artificial neural networks date: 03-28-2020 01:15 gmt revision:0 [head]

Winner-take-all Autoencoders

  • During training of fully connected layers, they enforce a winner-take all lifetime sparsity constraint.
    • That is: when training using mini-batches, they keep the k percent largest activations of a given hidden unit across all samples presented in the mini-batch. The remainder of the activations are set to zero. The units are not competing with each other; they are competing with themselves. (A minimal sketch follows this list.)
    • The rest of the network is a stack of ReLU layers (upon which the sparsity constraint is applied) followed by a linear decoding layer (which makes interpretation simple).
    • They stack them via sequential training: train one layer from the output of another & not backprop the errors.
  • Works, with lower sparsity targets, also for RBMs.
  • Extended the result to WTA convnets -- here they enforce both spatial and temporal (mini-batch) sparsity.
    • Spatial sparsity involves selecting the single largest hidden unit activity within each feature map. The other activities and derivatives are set to zero.
    • At test time, this sparsity constraint is released, and instead they use a 4 x 4 max-pooling layer & use that for classification or deconvolution.
  • To apply both spatial and temporal sparsity, select the highest spatial response (e.g. one unit in a 2d plane of convolutions; all have the same weights) for each feature map. Do this for every image in a mini-batch, and then apply the temporal sparsity: each feature map gets to be active exactly once, and in that time only one hidden unit (or really, one location of the input and common weights (depending on stride)) undergoes SGD.
    • Seems like it might train very slowly. Authors didn't note how many epochs were required.
  • This, too can be stacked.
  • To train on larger image sets, they first extract 48 x 48 patches & again stack...
  • Test on MNIST, SVHN, CIFAR-10 -- works ok, and well even with few labeled examples (which is consistent with their goals)
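
A minimal numpy sketch of the lifetime-sparsity constraint described above: per hidden unit, only its top k% activations across the mini-batch survive, the rest are zeroed. Batch size, layer width, and k are illustrative.

```python
import numpy as np

def lifetime_sparsity(acts, k_percent=5.0):
    """acts: (batch, n_hidden) ReLU activations. Keep, for each hidden unit (column),
    only its top k% activations across the batch; zero everything else."""
    batch = acts.shape[0]
    n_keep = max(1, int(np.ceil(batch * k_percent / 100.0)))
    # per-column threshold = the n_keep-th largest activation of that unit in this mini-batch
    thresh = np.sort(acts, axis=0)[-n_keep, :]
    return np.where(acts >= thresh, acts, 0.0)

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(128, 16)), 0)    # fake ReLU activations
sparse = lifetime_sparsity(acts, k_percent=5.0)
print((sparse > 0).mean(axis=0))                    # roughly 0.05 of the batch active per unit
```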

{1428}
ref: -0 tags: VARNUM GEVI genetically encoded voltage indicators FRET Ace date: 03-18-2020 17:12 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-30420685 Fast in-vivo voltage imaging using a red fluorescent indicator

  • Kannan M, Vasan G, Huang C, Haziza S, Li JZ, Inan H, Schnitzer MJ, Pieribone VA.
  • Other genetically encoded voltage indicators (GEVI):
    • PMID-22958819 ArcLight (Pieribone also last author); sign of ΔF/F is negative, but large, 35%! Slow, though? (link: improvement in speed)
    • ASAP3: ΔF/F large, τ = 3 ms.
    • PMID-26586188 Ace-mNeon FRET based, Acetabularia opsin, fast kinetics + brightness of mNeonGreen.
    • Archon1 -- fast and sensitive, found (like VARNUM) using a robotic directed evolution or direct search strategy.
  • VARNAM is based on Acetabularia (Ace) + mRuby3, also FRET based, found via high-throughput voltage screen.
  • Archaerhodopsins require 1-12 W/mm^2 of illumination, vs. 50 mW/mm^2 for GFP-based probes. Lots of light!
  • Systematic optimization of voltage sensor function: both the linker region (288 mutants), which affects FRET efficiency, as well as the opsin fluorophore region (768 mutants), which affects the wavelength of absorption / emission.
  • Some intracellular clumping (which will negatively affect sensitivity), but mostly localized to the membrane.
  • Sensitivity is still imperfect -- 4% in-vivo cortical neurons, though it’s fast enough to resolve 100 Hz spiking.
  • Can resolve post-synaptic EPSCs, but < 1% ΔF/F.
  • Tested all-optical ephys using VARNAM + the blueshifted channelrhodopsin CheRiff, both sparsely and in a PV-targeted transgenic model. Both work, but this is a technique paper; no real results.
  • Tested TEMPO fiber-optic recording in freely behaving mice (ish) -- induced ketamine waves, 0.5-4Hz.
  • And odor-induced activity in flies, using split-Gal4 expression tools. So many experiments.

{1501}
ref: -2019 tags: Vale photostability bioarxiv DNA oragami photobleaching date: 03-10-2020 21:59 gmt revision:5 [4] [3] [2] [1] [0] [head]

A 6-nm ultra-photostable DNA Fluorocube for fluorescence imaging

  • Cy3n = sulfonated version of Cy3.
  • JF549 = azetidine modified version of tetramethyl rhodamine.

Also including some correspondence with the authors:

Me

Nice work and nice paper, thanks for sharing .. and not at all what I had expected from Ron's comments! Below are some comments ... would love your opinion.

I'd expect that the molar absorption coefficients for the fluorocubes should be ~6x larger than for the free dyes and the single-dye cubes (measured?), yet the photon yields for all except Cy3N are maybe around the yield for one dye molecule. So the quantum yield must be decreased by ~6x?

This in turn might be from a middling FRET which reduces lifetime, thereby the probability of ISC, photoelectron transfer, and hence photobleaching.

I wonder if, in the case of ATTO 647N, Cy5, and Cy3, the DNA is partly shielding the fluorophores from solvent (ala ethidium bromide), which also helps with stability, just like in fluorescent proteins. ATTO 647N generates a lot of singlet oxygen; who knows what it's doing to DNA.

Can you do a log-log autocorrelation of the blinking timeseries of the constructs? This may reveal different rate constants controlling dark/light states (though, for 6 coupled objects, might not be interpretable!)

Also, given the effect of DNA shielding, have you compared to free dyes to single-dye cubes other than supp fig 10? The fact that sulfonation made such a huge effect in brightness is suggestive.

Again, these are super interesting & exciting results!

Author

I haven't directly looked at the molar absorption coefficient but judging from the data that I collected for the absorption spectra, there is certainly an increase for the fluorocubes compared to single dyes. I agree that this would be an interesting experiment and I am planning to collect data to measure the molar absorption coefficient. I would also expect a ~6-fold increase for the Fluorocubes.

Yes, we suspect homo FRET to help reduce photobleaching. So far we only measured lifetimes in bulk but are planning to obtain lifetime data on the single-molecule level soon.

We also wondered if the DNA is providing some kind of shield for the fluorophores but could not design an experiment to directly test this hypothesis. If you have a suggestion, that would be wonderful.

The log-log autocorrelation of blinking events is indeed difficult to interpret. Already individual intensity traces of fluorocubes are difficult to analyze as many of them get brighter before they bleach. We are also wondering if some fluorocubes are emitting two photons simultaneously. We will hopefully be able to measure this soon.