ref: -0 tags: rutherford journal computational theory neumann complexity wolfram date: 05-05-2020 18:15 gmt revision:0 [head]

The Structures for Computation and the Mathematical Structure of Nature

  • Broad, long, historical.

ref: -2017 tags: google deepmind compositional variational autoencoder date: 04-08-2020 01:16 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

SCAN: learning hierarchical compositional concepts

  • From DeepMind, first version Jul 2017 / v3 June 2018.
  • Starts broad and strong:
    • "The seemingly infinite diversity of the natural world from a relatively small set of coherent rules"
      • Relative to what? What's the order of magnitude here? In personal experience, each domain involves a large pile of relevant details.
    • "We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts"
    • "If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts."
    • "Compositionality is at the core of such human abilities as creativity, imagination, and language-based communication."
    • This addresses the limitations of deep learning, which is overly data-hungry (low sample efficiency), tends to overfit the data, and requires human supervision.
  • Approach:
    • Factorize the visual world with a $\beta$-VAE to learn a set of representational primitives through unsupervised exposure to visual data.
    • Expose SCAN (or rather, a module of it) to a small number of symbol-image pairs, from which the algorithm identifies the set of visual primitives (features from the $\beta$-VAE) that the examples have in common.
      • That is, purely associative learning, with a finite one-layer association matrix.
    • Test in both the image → symbol and symbol → image directions. For the latter, allow irrelevant attributes to be filled in from the priors (this is important later in the paper).
    • Add in a third module, which allows learning of compositions of the features, a la set notation: AND ($\cup$), IN-COMMON ($\cap$) & IGNORE ($\setminus$ or '-'). This is via a low-parameter convolutional model.
  • Notation:
    • $q_{\phi}(z_x|x)$ is the encoder model. $\phi$ are the encoder parameters, $x$ is the visual input, $z_x$ are the latent parameters inferred from the scene.
    • $p_{\theta}(x|z_x)$ is the decoder model: $x \propto p_{\theta}(x|z_x)$. $\theta$ are the decoder parameters; $x$ is now the reconstructed scene.
  • From this, the loss function of the beta-VAE is:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = \mathbb{E}_{q_{\phi}(z_x|x)} [\log p_{\theta}(x|z_x)] - \beta D_{KL}(q_{\phi}(z_x|x) || p(z_x))$ where $\beta > 1$
      • That is, maximize the auto-encoder fit (the expectation of the decoder over the encoder output -- aka the pixel log-likelihood) minus the KL divergence between the encoder distribution and $p(z_x)$
        • $p(z) \propto \mathcal{N}(0, I)$ -- a normal with diagonal (identity) covariance.
        • $\beta$ comes from the Lagrangian solution to the constrained optimization problem:
        • $\max_{\phi,\theta} \mathbb{E}_{x \sim D} [\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]$ subject to $D_{KL}(q_{\phi}(z|x) || p(z)) < \epsilon$, where D is the domain of images etc.
      • They claim this loss function tips the scales too far away from accurate reconstruction when there is sufficient visual disentangling (that is: if significant features correspond to small details in pixel space, they are likely to be ignored); instead they adopt the approach of the denoising auto-encoder ref, which uses the feature L2 norm instead of the pixel log-likelihood:
    • $\mathbb{L}(\theta, \phi; x, z_x, \beta) = -\mathbb{E}_{q_{\phi}(z_x|x)}||J(\hat{x}) - J(x)||_2^2 - \beta D_{KL}(q_{\phi}(z_x|x) || p(z_x))$ where $J: \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^N$ maps from images to high-level features.
      • This J(x)J(x) is from another neural network (transfer learning) which learns features beforehand.
      • It's a multilayer perceptron denoising autoencoder [Vincent 2010].
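
As a concrete sketch of the loss (my own minimal numpy version, not the authors' code; `beta_vae_loss` and all names here are made up): with a diagonal Gaussian posterior, the KL term has a closed form.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Negative ELBO of a beta-VAE, to be minimized.

    Squared pixel error stands in for the reconstruction log-likelihood;
    the KL term is the closed form for
    KL( N(mu, diag(exp(log_var))) || N(0, I) ), weighted by beta > 1.
    """
    recon = np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl
```

For the feature-space variant above, the pixel error would be replaced by $||J(\hat{x}) - J(x)||_2^2$ computed through the pretrained denoising autoencoder.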
  • The SCAN architecture includes an additional element, another VAE which is trained simultaneously on the labeled inputs $y$ and the latent outputs $z_x$ from the encoder given $x$.
  • In this way, they can present a description $y$ to the network, which is then recomposed into $z_y$, which then produces an image $\hat{x}$.
    • The whole network is trained by minimizing:
    • $\mathbb{L}_y(\theta_y, \phi_y; y, x, z_y, \beta, \lambda) = 1^{st} - 2^{nd} - 3^{rd}$, with:
      • 1st term: $\mathbb{E}_{q_{\phi_y}(z_y|y)}[\log p_{\theta_y}(y|z_y)]$ -- log-likelihood of the decoded symbols given the encoded latents $z_y$
      • 2nd term: $\beta D_{KL}(q_{\phi_y}(z_y|y) || p(z_y))$ -- weighted KL divergence between the encoded latents and the diagonal normal prior.
      • 3rd term: $\lambda D_{KL}(q_{\phi_x}(z_x|x) || q_{\phi_y}(z_y|y))$ -- weighted KL divergence between the latents from the images and the latents from the description $y$.
        • They note that the direction of the divergence matters; I suspect it took some experimentation to see what's right.
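
That direction-dependence is easy to check numerically with the closed-form KL between diagonal Gaussians (a sketch, not the paper's code):

```python
import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    """KL( N(m1, diag(v1)) || N(m2, diag(v2)) ), closed form."""
    m1, v1, m2, v2 = map(np.asarray, (m1, v1, m2, v2))
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
```

`kl_diag_gauss(a, b)` and `kl_diag_gauss(b, a)` generally differ, so which distribution sits in which slot of the third term is a real modeling choice.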
  • Final element! A convolutional recombination element, implemented as a tensor product between $z_{y_1}$ and $z_{y_2}$, that outputs a one-hot encoding of the set operation, which is fed to a (hardcoded?) transformation matrix.
    • I don't think this is great shakes. Could have done this with a small function; no need for a neural network.
    • Trained with very similar loss function as SCAN or the beta-VAE.

  • Testing:
  • They seem to have used a very limited subset of "DeepMind Lab" -- all of the concept or class labels could have been implemented easily, e.g. with a single-pixel detector for the wall color. Quite disappointing.
  • This is marginally more interesting -- the network learns to eliminate latent factors as it's exposed to examples (just like perhaps a Bayesian network.)
  • Similarly, the CelebA tests are meh ... not a clear improvement over the existing VAEs.

ref: -2020 tags: evolution neutral drift networks random walk entropy population date: 04-08-2020 00:48 gmt revision:0 [head]

Localization of neutral evolution: selection for mutational robustness and the maximal entropy random walk

  • The take-away of the paper is that, with larger populations, random mutation and recombination make areas of the graph that take several steps to reach (in the figure, this is Maynard Smith's four-letter mutation word game) less likely to be visited.
  • This is because the recombination serves to make the population adhere more closely to the 'giant' mode. In Maynard's game, this is 2268 of the 2405 meaningful words that can be reached by successive letter changes.
  • The author extends it to van Nimwegen's 1999 paper / RNA genotype-secondary structure. It's not as bad as Maynard's game, but still has much lower graph-theoretic entropy than the actual population.
    • He suggests that if the entropic size of the giant component is much smaller than its dictionary size, then populations are likely to be trapped there.

  • Interesting, but I'd prefer to have an expert peer-review it first :)

ref: -0 tags: asymmetric locality sensitive hash maximum inner product search sparsity date: 03-30-2020 02:17 gmt revision:5 [4] [3] [2] [1] [0] [head]

Improved asymmetric locality sensitive hashing for maximum inner product search

  • Like many other papers, this one is based on a long lineage of locality-sensitive hashing papers.
  • Key innovation, in [23] The power of asymmetry in binary hashing, was the development of asymmetric hashing -- the hash function of the query is different than the hash function used for storage. Roughly, this allows additional degrees of freedom since the similarity-function is (in the non-normalized case) non-symmetric.
    • For example, take query Q = [1 1] with keys A = [1 -1] and B = [3 3]. The nearest neighbor is A (distance 2), whereas the maximum inner product is B (inner product 6).
    • Alternately: self-inner product for Q and A is 2, whereas for B it's 18. Self-similarity is not the highest with inner products.
    • Norm of the query does not have an effect on the arg max of the search, though. Hence, for the paper assume that the query has been normalized for MIPS.
  • In this paper instead they convert MIPS into approximate cosine similarity search (which is like normalized MIPS), which can be efficiently solved with signed random projections.
  • (Established): LSH-L2 distance:
    • Sample a random vector a, iid normal N(0,1)
    • Sample a random uniform b between 0 and r
      • r is the window size / radius (a free parameter?)
    • The hash function is then $h^{L2}(x) = \lfloor (a \cdot x + b) / r \rfloor$
      • I'm not sure how the floor op is converted to bits of the actual hash -- ?
  • (Established): LSH-correlation, signed random projections $h^{sign}$:
    • The hash is the sign of the inner product of the input vector and a uniform random vector a.
    • This is a two-bit random projection [13][14].
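
A quick sketch of generic signed random projections (my own toy code, not the paper's implementation): hash-bit agreement between two vectors estimates their angle, since P(bits agree) = 1 - θ/π.

```python
import numpy as np

def srp_hash(x, planes):
    """One hash bit per random hyperplane: sign of the inner product."""
    return (planes @ x) >= 0

rng = np.random.default_rng(0)
planes = rng.standard_normal((10000, 2))   # 10k bits for a tight estimate

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])                   # orthogonal: theta = pi / 2
agree = np.mean(srp_hash(a, planes) == srp_hash(b, planes))
# agreement should be close to 1 - theta/pi = 0.5 for orthogonal vectors
```

Note the hash depends only on direction, not norm, which is why the MIPS reduction below needs the asymmetric transforms.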
  • (New) Asymmetric-LSH-L2:
    • $P(x) = [x; ||x||^2_2; ||x||^4_2; \ldots; ||x||^{2^m}_2]$ -- this is the pre-processing transform of the 'keys'.
      • Requires that the norm of these keys $||x||_2 < U < 1$
      • $m \geq 3$
    • $Q(x) = [x; 1/2; 1/2; \ldots; 1/2]$ -- transform of the queries.
    • See the mathematical explanation in the paper, but roughly: transformations P and Q, when norms are less than 1, provide a correction to the L2 distance $||Q(p) - P(x_i)||_2$, making its rank correlate with the un-normalized inner product.
  • They then change the augmentation to:
    • $P(x) = [x; 1/2 - ||x||^2_2; 1/2 - ||x||^4_2; \ldots; 1/2 - ||x||^{2^m}_2]$
    • $Q(x) = [x; 0; \ldots; 0]$
    • This allows signed random projection nearest-neighbor search to be used on the MIPS problem. (e.g. the hash is the sign of the inner product with P and Q, per above; I assume this is still a 2-bit operation?)
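
A sketch of these transforms (my reading of the equations above; variable names are mine). Zero-padding Q preserves the inner product exactly, while the appended terms push $||P(x)||^2$ toward the constant $m/4$, so cosine similarity on the transformed vectors ranks keys by un-normalized inner product.

```python
import numpy as np

def P(x, m=3):
    # key transform; requires ||x||_2 < 1 (scale keys by U first)
    ext = [0.5 - np.linalg.norm(x) ** (2 ** i) for i in range(1, m + 1)]
    return np.concatenate([x, ext])

def Q(q, m=3):
    # query transform: zero padding, so Q(q) . P(x) = q . x exactly
    return np.concatenate([q, np.zeros(m)])
```

One can check that $||P(x)||^2 = m/4 + ||x||^{2^{m+1}}$, and the last term vanishes quickly for $||x|| < 1$, so the key norms become nearly constant.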
  • Then they extend the U, m compromise function $\rho$ to allow for non-normalized queries. U depends on m and c (m is the codeword extension, and c is the ratio between on-target and off-target hash hits).
  • Tested on the Movielens and Netflix databases, using SVD preprocessing on the user-item matrix (a full-rank matrix indicating every user's rating of every movie (mostly zeros!)) to get at the latent vectors.
  • In the above plots, recall (hah) that precision is the number of true positives / the number of draws k, as k increases; recall is the number of true positives / the size of the true top-N set.
    • Clearly, the curve bends up and to the right when there are a lot of hash tables K.
    • Example datapoint: 50% precision at 40% recall, top 5. So on average you get 2 correct hits in 4 draws. Or: 40% precision, 20% recall, top 10: 2 hits in 5 draws. 20/40: 4 hits in 20 draws. (hit: correctly within the top-N)
    • So ... it's not that great.

Use case: Capsule: a camera based positioning system using learning
  • Uses 512 SIFT features as keys and queries to LSH. Hashing is computed via sparse addition / subtraction algorithm, with K bits per hash table (not quite random projections) and L hash tables. K = 22 and L = 24. ~ 1000 training images.
  • Best matching image is used as the location of the current image.

ref: -0 tags: reinforcement learning distribution DQN Deepmind dopamine date: 03-30-2020 02:14 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-31942076 A distributional code for value in dopamine based reinforcement learning

  • Synopsis is staggeringly simple: dopamine neurons encode / learn to encode a distribution of reward expectations, not just the mean (aka the expected value) of the reward at a given state-action pair.
  • This is almost obvious neurally -- of course dopamine neurons in the striatum represent different levels of reward expectation; there is population diversity in nearly everything in neuroscience. The new interpretation is that neurons have different slopes for their susceptibility to positive and negative rewards (or rather, reward predictions), which results in different inflection points where the neurons are neutral about a reward.
    • This constitutes more optimistic and pessimistic neurons.
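
A toy simulation of this idea (illustrative only, not the paper's model): a value updated with asymmetric learning rates converges to an expectile of the reward distribution, set by $\alpha^+ / (\alpha^+ + \alpha^-)$.

```python
import numpy as np

def asymmetric_value(rewards, alpha_pos, alpha_neg, iters=40000, seed=0):
    """Scalar TD-like updates with different learning rates for positive
    vs. negative prediction errors; returns the learned value."""
    rng = np.random.default_rng(seed)
    v = 0.0
    for r in rng.choice(rewards, size=iters):
        delta = r - v
        v += (alpha_pos if delta > 0 else alpha_neg) * delta
    return v

rewards = [0.0, 10.0]                                # 50/50, mean = 5
optimist = asymmetric_value(rewards, 0.02, 0.005)    # tau = 0.8 -> v near 8
pessimist = asymmetric_value(rewards, 0.005, 0.02)   # tau = 0.2 -> v near 2
```

The fixed point solves $\alpha^+ \mathbb{E}[(r-v)^+] = \alpha^- \mathbb{E}[(v-r)^+]$, giving a spread of reversal points across units with different slope ratios.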
  • There is already substantial evidence that such a distributional representation enhances performance in DQN (Deep q-networks) from circa 2017; the innovation here is that it has been extended to experiments from 2015 where mice learned to anticipate water rewards with varying volume, or varying probability of arrival.
  • The model predicts a diversity of asymmetries below and above the reversal point.
  • Also predicts that the distribution of reward responses should be decoded by neural activity ... which it is ... but it is not surprising that a bespoke decoder can find this information in the neural firing rates. (Have not examined in depth the decoding methods)
  • Still, this is a clear and well-written, well-thought out paper; glad to see new parsimonious theories about dopamine out there.

ref: -0 tags: date: 03-30-2020 02:09 gmt revision:1 [0] [head]

SLIDE: in defense of smart algorithms over hardware acceleration for large-scale deep learning systems

  • Modeled directly on {1505} - Scalable and sustainable deep learning via randomized hashing.
  • Much emphasis in the paper on performance and tuning rather than theoretical or computational advances.
    • This is with explicitly wide and sparse classification datasets.
  • Free parameters:
    • L -- number of hash tables per layer.
    • K -- size of hash code, in bits, per table.
  • Definitely much faster -- but why can't you port the sparse LSH algorithms to a GPU & increase speed further?
  • Architecture follows Loss Decomposition for Fast Learning in Large Output Spaces
    • Fully connected neural network with one hidden layer and a batch size of 128 for Delicious and 256 for Amazon-670k.
    • I don't think they performed the decomposition of the weight matrix, which requires the message-passing iteration to set the backprop loss-function weights. No mention of such message-passing. (??)
  • Two hash functions: Simhash and densified winner take all (DWTA). DWTA is based on the observation that if the input data is very sparse, then the hash functions will not hash well.
    • Delicious-200k used Simhash, K = 9, L = 50;
    • Amazon-670 used DWTA hash, K = 8, L = 50;
  • "It should be noted that if we compute activation for s << 1 fraction of neurons in each layer on average, the fraction of weights that needs to be updated is only $s^2$." (Since the only weights that are updated are the intersection of active pre and post.)
  • Updates are performed in a HOGWILD manner, where some overlap in weight updates (which are all computed in parallel) is tolerable for convergence.
  • Updates to the hash tables, however, are not computed every SGD iteration. Instead, they are scheduled with an exponential decay term -- e.g. the time between updates increases as the network converges. This is because the weight changes in the beginning are smaller than those at the end. Initial hash update is every 50 gradient updates.
    • For Simhash, which uses inner product with random vectors of {+1, 0, -1} (so that you don't need multiplies, only addition and subtraction), savings can be further extended to only re-compute the hashes with the changed weights. As noted above, the high level of unit sparsity makes these weight changes quadratically sparse.
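
A minimal simhash sketch (generic simhash with the ternary planes described above; names are mine):

```python
import numpy as np

def simhash_bits(x, planes):
    """K-bit fingerprint: sign of x against K random {-1, 0, +1} planes.
    With ternary planes the projections need only adds and subtracts."""
    return tuple(bool(b) for b in (planes @ x) >= 0)

rng = np.random.default_rng(0)
planes = rng.choice([-1, 0, 1], size=(9, 64))   # K = 9 bits, input dim 64

x = rng.standard_normal(64)
key = simhash_bits(x, planes)                   # bucket key for one table
```

Since only the signs matter, rescaling a neuron's input leaves its bucket unchanged, and only sums involving changed weights need recomputing.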
  • Test their hashing-optimized deep learning algorithm on Delicious-200k and Amazon-670k, both forms of extreme classification with a very wide output layer. Authors suggest that most of the computational expense is in this last layer, same as 'loss decomp for fast learning in large output spaces'.

ref: -2016 tags: locality sensitive hash deep learning regularization date: 03-30-2020 02:07 gmt revision:5 [4] [3] [2] [1] [0] [head]

Scalable and sustainable deep learning via randomized hashing

  • Central idea: replace dropout, adaptive dropout, or winner-take-all with a fast (sublinear time) hash based selection of active nodes based on approximate MIPS (maximum inner product search) using asymmetric locality-sensitive hashing.
    • This avoids a lot of the expensive inner-product multiply-accumulate work & energy associated with nodes that will either be completely off due to the ReLU or other nonlinearity -- or just not important for the algorithm + current input.
    • The result shows that you don't need very many neurons active in a given layer for successful training.
  • C.f: adaptive dropout adaptively chooses the nodes based on their activations. A few nodes are sampled from the network probabilistically based on the node activations dependent on their current input.
    • Adaptive dropouts demonstrate better performance than vanilla dropout [44]
    • It is possible to drop significantly more nodes adaptively than without while retaining superior performance.
  • WTA is an extreme form of adaptive dropout that uses mini-batch statistics to enforce a sparsity constraint. [28] {1507} Winner take all autoencoders
  • Our approach uses the insight that selecting a very sparse set of hidden nodes with the highest activations can be reformulated as dynamic approximate query processing, solvable with LSH.
    • LSH can be sub-linear time; normal processing involves the inner product.
    • LSH maps similar vectors into the same bucket with high probability. That is, it maps vectors into integers (bucket number)
  • Similar approach: Hashed nets [6], which aimed to decrease the number of parameters in a network by using a universal random hash function to tie weights. Compressing neural networks with the Hashing trick
    • "HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value."
  • Ref [38] shows how asymmetric hash functions allow LSH to be converted to a sub-linear time algorithm for maximum inner product search (MIPS).
  • Used multi-probe LSH: rather than having a large number of hash tables (L) which increases hash time and memory use, they probe close-by buckets in the hash tables. That is, they probe bucket at B_j(Q) and those for slightly perturbed query Q. See ref [26].
  • See reference [2] for theory...
  • Following ref [42], use K randomized hash functions to generate the K data bits per vector. Each bit is the sign of the asymmetric random projection. Buckets contain a pointer to the node (neuron); only active buckets are kept around.
    • The K hash functions serve to increase the precision of the fingerprint -- found nodes are more expected to be active.
    • Have L hash tables for each hidden layer; these are used to increase the probability of finding useful / active nodes due to the randomness of the hash function.
    • Hash is asymmetric in the sense that the query and collection data are hashed independently.
  • In every layer during SGD, compute K x L hashes of the input, probe about 10 L buckets, and take their union. Experiments: K = 6 and L = 5.
  • See ref [30] where authors show around 500x reduction in computations for image search following different algorithmic and systems choices. Capsule: a camera based positioning system using learning {1506}
  • Use relatively small test data sets -- MNIST 8M, NORB, Convex, Rectangles -- each resized to have small-ish input vectors.

  • Really want more analysis of what exactly is going on here -- what happens when you change the hashing function, for example? How much is the training dependent on suitable ROC or precision/recall on the activation?
    • For example, they could have calculated the actual real activation & WTA selection, and compared it to the results from the hash function; how correlated are they?

ref: -2002 tags: hashing frequent items count sketch algorithm google date: 03-30-2020 02:04 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

Finding frequent items in data streams

  • Notation:
    • $S$ is a data stream, $S = q_1, q_2, \ldots, q_n$, of length n.
    • Each object $q_i \in O = \{o_1, \ldots, o_m\}$. That is, there are m total possible objects (e.g. English words).
    • Object $o_i$ occurs $n_i$ times in S. The $o_i$ are ordered so that $n_1 \geq n_2 \geq \ldots \geq n_m$.
  • Task:
    • Given an input stream S, an integer k, and real $\epsilon$
    • Output a list of k elements from S such that each element has $n_i > (1-\epsilon)n_k$.
      • That is, if the ordering is perfect, $n_i \geq n_k$, with equality on the last element.
  • Algorithm:
    • $h_1, \ldots, h_t$ hash from object q to buckets $\{1, \ldots, b\}$
    • $s_1, \ldots, s_t$ hash from object q to $\{-1, +1\}$
    • For each symbol, add it to the 2D hash array by hashing first with $h_i$, then incrementing that counter by $s_i$.
      • The double-hashing is to reduce the effect of collisions with high-frequency items.
    • When querying the frequency of an object, hash it as above, and take the median over i of $s_i[q] \cdot count[i][h_i[q]]$
    • $t = O(\log(\frac{n}{\delta}))$, where the algorithm fails with probability at most $\delta$
  • Demonstrate proof of convergence / function with Zipfian distributions with varying exponent. (I did not read through this).
  • Also showed that it's possible to compare these hash-counts directly to see what's changed, or importantly, if the documents are different.
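
The whole algorithm fits in a few lines (a toy sketch using Python's builtin hash for the $h_i$ and $s_i$ families; illustration only, not the paper's code):

```python
import numpy as np

class CountSketch:
    """t x b array of counters; a query returns the median of the t
    signed-counter estimates, as described above."""
    def __init__(self, t=5, b=1024, seed=0):
        self.t, self.b = t, b
        self.counts = np.zeros((t, b))
        self.seeds = list(range(seed, seed + t))

    def _h(self, i, q):        # bucket hash h_i -> {0, ..., b-1}
        return hash((self.seeds[i], 'h', q)) % self.b

    def _s(self, i, q):        # sign hash s_i -> {-1, +1}
        return 1 if hash((self.seeds[i], 's', q)) % 2 else -1

    def add(self, q):
        for i in range(self.t):
            self.counts[i, self._h(i, q)] += self._s(i, q)

    def query(self, q):
        return np.median([self._s(i, q) * self.counts[i, self._h(i, q)]
                          for i in range(self.t)])
```

With few items relative to b, collisions are rare and the median recovers the exact counts; under collision pressure the signed counters keep the estimate unbiased.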

Mission: Ultra large-scale feature selection using Count-Sketches
  • Task:
    • Given a labeled dataset $(X_i, y_i)$ for $i \in \{1, 2, \ldots, n\}$, with $X_i \in \mathbb{R}^p, y_i \in \mathbb{R}$
    • Find the k-sparse feature vector / linear regression for the least-squares problem $\min_{||\beta||_0 = k} ||y - X\beta||_2$
      • $||\beta||_0 = k$ counts the non-zero elements in the feature vector.
    • The number of features $p$ is so large that a dense $\beta$ cannot be stored in memory. (X is of course sparse.)
  • Such data may be from ad click-throughs, or from genomic analyses ...
  • Use the count-sketch algorithm (above) for capturing & continually updating the features for gradient update.
    • That is, treat the stream of gradient updates, in the normal form $g_i = 2\lambda (y_i - X_i \beta_i)^t X_i$, as the semi-continuous time series used above as $S$
  • Compare this with greedy thresholding, Iterative hard thresholding (IHT) e.g. throw away gradient information after each batch.
    • This discards small gradients which may be useful for the regression problem.
  • Works better, but not necessarily better than straight feature hashing (FH).
  • Meh.

ref: -2015 tags: winner take all sparsity artificial neural networks date: 03-28-2020 01:15 gmt revision:0 [head]

Winner-take-all Autoencoders

  • During training of fully connected layers, they enforce a winner-take all lifetime sparsity constraint.
    • That is: when training using mini-batches, they keep the k percent largest activations of a given hidden unit across all samples presented in the mini-batch. The remainder of the activations are set to zero. The units are not competing with each other; they are competing with themselves.
    • The rest of the network is a stack of ReLU layers (upon which the sparsity constraint is applied) followed by a linear decoding layer (which makes interpretation simple).
    • They stack them via sequential training: train one layer from the output of another & not backprop the errors.
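
The lifetime-sparsity step itself can be sketched as follows (assuming a batch × units activation matrix; my own toy code, not the authors'):

```python
import numpy as np

def lifetime_sparsity(acts, k_pct):
    """For each hidden unit (column), keep only its top k% of activations
    across the mini-batch (rows); zero the rest."""
    batch = acts.shape[0]
    k = max(1, int(round(batch * k_pct / 100.0)))
    out = np.zeros_like(acts)
    top = np.argsort(acts, axis=0)[-k:, :]   # row indices of the k largest per column
    cols = np.arange(acts.shape[1])
    out[top, cols] = acts[top, cols]
    return out
```

Gradients then flow only through the surviving activations, so each unit is trained on the inputs it responds to most strongly.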
  • Works, with lower sparsity targets, also for RBMs.
  • Extended the result to WTA convnets -- here enforce both spatial and temporal (mini-batch) sparsity.
    • Spatial sparsity involves selecting the single largest hidden unit activity within each feature map. The other activities and derivatives are set to zero.
    • At test time, this sparsity constraint is released, and instead they use a 4 x 4 max-pooling layer & use that for classification or deconvolution.
  • To apply both spatial and temporal sparsity, select the highest spatial response (e.g. one unit in a 2d plane of convolutions; all have the same weights) for each feature map. Do this for every image in a mini-batch, and then apply the temporal sparsity: each feature map gets to be active exactly once, and in that time only one hidden unit (or really, one location of the input and common weights (depending on stride)) undergoes SGD.
    • Seems like it might train very slowly. Authors didn't note how many epochs were required.
  • This, too can be stacked.
  • To train on larger image sets, they first extract 48 x 48 patches & again stack...
  • Test on MNIST, SVHN, CIFAR-10 -- works ok, and well even with few labeled examples (which is consistent with their goals)

ref: -0 tags: GEVI review voltage sensor date: 03-18-2020 17:43 gmt revision:22 [21] [20] [19] [18] [17] [16] [head]

Various GEVIs invented and evolved:

Ace-FRET sensors

  • PMID-26586188 Ace-mNeonGreen, an opsin-FRET sensor, might still be better in terms of SNR, but it's green.
    • Negative $\Delta F/F$ with depolarization.
    • Fast enough to resolve spikes.
    • Rational design; little or no screening.
    • Ace is about six times as fast as Mac, and mNeonGreen has a ~50% higher extinction coefficient than mCitrine and nearly threefold better photostability (12)

  • PMID-31685893 A High-speed, red fluorescent voltage sensor to detect neural activity
    • Fusion of Ace2N + short linker + mScarlet, a bright (if not the brightest; highest QY) monomeric red fluorescent protein.
    • Almost as good SNR as Ace2N-mNeonGreen.
    • Also a FRET sensor; negative delta F with depolarization.
    • Ace2N-mNeon is not sensitive under two-photon illumination; presumably this is true of all eFRET sensors?
    • Ace2N drives almost no photocurrent.
    • Sought to maximize SNR: $dF/F_0 \times \sqrt{F_0}$; screened 'only' 18 linkers to see what worked the best. Yet - it's better than VARNAM.
    • ~ 14% dF/F per 100mV depolarization.

Arch and Mac rhodopsin sensors

  • PMID-22120467 Optical recording of action potentials in mammalian neurons using a microbial rhodopsin Arch 2011
    • Endogenous fluorescence of the retinal (+ environment) of microbial rhodopsin protein Archaerhodopsin 3 (Arch) from Halorubrum sodomense.
    • Proton pump without proton pumping capabilities also showed voltage dependence, but slower kinetics.
      • This required one mutation, D95N.
    • Requires fairly intense illumination, as the QY of the fluorophore is low (9 × 10^-4). Still, photobleaching rate was relatively low.
    • Arch is mainly used for neuronal inhibition.

  • PMID-25222271 Archaerhodopsin Variants with Enhanced Voltage Sensitive Fluorescence in Mammalian and Caenorhabditis elegans Neurons Archer1 2014
    • Capable of voltage sensing under red light, and inhibition (via proton pumping) under green light.
    • Note The high laser power used to excite Arch (above) fluorescence causes significant autofluorescence in intact tissue and limits its accessibility for widespread use.
    • Archers have 3-5x the fluorescence of WT Arch -- so, QY of ~3.6e-3. Still very dim.
    • Archer1 dF/F_0 85%; Archer2 dF/F_0 60% @ 100mV depolarization (positive sense).
    • Screened the proton pump of Gloeobacter violaceus rhodopsin; found mutations were then transferred to Arch.
      • Maybe they were planning on using the Gloeobacter rhodopsin, but it didn't work for some reason, so they transferred the mutations to Arch.
    • TS and ER export domains for localization.

  • PMID-24755708 Imaging neural spiking in brain tissue using FRET-opsin protein voltage sensors MacQ-mOrange and MacQ-mCitrine.
    • L. maculans (Mac) rhodopsin (faster than Arch) + FP mCitrine, FRET sensor + ER/TS.
    • Four-fold faster kinetics and 2-4x brighter than ArcLight.
      • No directed evolution to optimize sensitivity or brightness. Just kept the linker short & trimmed residues based on crystal structure.
    • ~5% delta F/F, can resolve spikes up to 10Hz.
    • Spectroscopic studies of the proton pumping photocycle in bacteriorhodopsin and Archaerhodopsin (Arch) have revealed that proton translocation through the retinal Schiff base changes chromophore absorption [24-26]
    • Used rational design to abolish the proton current (D139N and D139Q aka MacQ) ; screens to adjust the voltage sensing kinetics.
    • Still has photocurrents.
    • Seems that slice / in vivo is consistently worse than cultured neurons... in purkinje neurons, dF/F 1.2%, even though in vitro response was ~ 15% to a 100mV depolarization.
    • Imaging intensity 30mw/mm^2. (3W/cm^2)

  • PMID-24952910 All-optical electrophysiology in mammalian neurons using engineered microbial rhodopsins QuasAr1 and QuasAr2 2014
    • Directed evolution approach to improve the brightness and speed of Arch D95N.
      • Improved the fluorescence QY by 19 and 10x. (1 and 2, respectively -- Quasar2 has higher sensitivity).
    • Also developed a low-intensity channelrhodopsin, CheRiff, which can be activated by blue light (lambda max = 460 nm), dim enough to not affect QuasAr.
    • They call the two of them 'Optopatch 2'.
    • Incident light intensity 1kW / cm^2 (!)

  • PMID-29483642 A robotic multidimensional directed evolution approach applied to fluorescent voltage reporters. Archon1 2018
    • Started with QuasAr2 (above), which was evolved from Arch. Intrinsic fluorescence of retinal in rhodopsin.
    • Expressed in HEK293T cells; then FACS, robotic cell picking, whole genome amplification, PCR, cloning.
    • Also evolved miRFP, deep red fluorescent protein based on bacteriophytochrome.
    • delta F/F of 80 and 20% with a 100mV depolarization.
    • We investigated the contribution of specific point mutations to changes in localization, brightness, voltage sensitivity and kinetics and found the patterns that emerged to be complex (Supplementary Table 6), with a given mutation often improving one parameter but worsening another.
    • If the original QY of Arch was 9e-4, and Quasar2 improved this by 10, and Archon1 improved this by 2.3x, then the QY of Archon1 is 0.02. Given the molar extinction coefficient is ~ 50000 for retinal, this means the brightness of the fluorescent probe is low, 1. (good fluorescent proteins and synthetic dyes have a brightness of ~90).
    • Big paper, moderate improvement.
    • SomArchon1 and SomCheriff serve as the basis of Optopatch4, e.g. All-optical electrophysiology reveals excitation, inhibition, and neuromodulation in cortical layer 1
    • Slow photobleaching, consistent with other Arch based GEVIs.

VSD - FP sensors

  • PMID-28811673 Improving a genetically encoded voltage indicator by modifying the cytoplasmic charge composition Bongwoori 2017
    • ArcLight derivative.
    • Arginine (positive charge) scanning mutagenesis of the linker region improved the signal size of the GEVI, Bongwoori, yielding fluorescent signals as high as 20% $\Delta F/F$ during the firing of action potentials.
    • Used the mutagenesis to shift the threshold for fluorescence change more negative, ~ -30mV.
    • Like ArcLight, it's slow.
    • Strong baseline shift due to the acidification of the neuron during AP firing (!)

  • Attenuation of synaptic potentials in dendritic spines
    • Found that SNR and ΔF/F_0 are limited by intracellular localization of the sensor.
      • This is true even though ArcLight is supposed to be in a dark state in the lower pH of intracellular organelles.. a problem worth considering.
      • Makes negative-going GEVI's more practical, as those not in the membrane are dark.

  • Fast two-photon volumetric imaging of an improved voltage indicator reveals electrical activity in deeply located neurons in the awake brain ASAP3 2018
    • Opsin-based GEVIs have been used in vivo with 1p excitation to report electrical activity of superficial neurons, but their responsivity is attenuated for 2p excitation. (!)
    • Site-directed evolution in HEK cells.
    • Expressed linear PCR products directly in the HEK cells, with no assembly / ligation required! (Saves lots of time: normally need to amplify, assemble into a plasmid, transfect, culture, measure, purify the plasmid, digest, EP PCR, etc.)
    • Screened in a motorized 384-well plate with a conductive bottom; an electroporation electrode sequentially stimulates each well on an upright microscope.
    • 46% improvement over ASAP2 R414Q
    • Ace2N-4aa-mNeon is not responsive under 2p illumination; nor are Archon1 or QuasAr2/3.
    • ULOVE = AOD based fast local scanning 2-p random access scope.

  • Bright and tunable far-red chemigenetic indicators
    • GgVSD (same as ASAP above) + cp HaloTag + Si-Rhodamine JF635
    • ~ 4% dF/F_0 during APs.
    • Found one mutation, R476G in the linker between cp Halotag and S4 of the VSD, which doubled the sensitivity of HASAP.
    • Also tested an ArcLight-type structure, CiVSD fused to HaloTag.
      • HArcLight had a negative dF/F_0 and ~3% change in response to APs.
    • No voltage sensitivity when the synthetic dye was largely in the zwitterionic form, e.g. tetramethylrhodamine.

hide / / print
ref: -0 tags: VARNUM GEVI genetically encoded voltage indicators FRET Ace date: 03-18-2020 17:12 gmt revision:5 [4] [3] [2] [1] [0] [head]

PMID-30420685 Fast in-vivo voltage imaging using a red fluorescent indicator

  • Kannan M, Vasan G, Huang C, Haziza S, Li JZ, Inan H, Schnitzer MJ, Pieribone VA.
  • Other genetically encoded voltage indicators (GEVI):
    • PMID-22958819 ArcLight (Pieribone also last author); sign of ΔF/F negative, but large, 35%! Slow, though later variants improved the speed.
    • ASAP3: ΔF/F large, τ = 3 ms.
    • PMID-26586188 Ace-mNeon FRET based, Acetabularia opsin, fast kinetics + brightness of mNeonGreen.
    • Archon1 -- fast and sensitive, found (like VARNUM) using a robotic directed evolution or direct search strategy.
  • VARNAM is based on Acetabularia (Ace) + mRuby3, also FRET based, found via high-throughput voltage screen.
  • Archaerhodopsins require 1-12 W/mm^2 of illumination, vs. 50 mW/mm^2 for GFP-based probes. Lots of light!
  • Systematic optimization of voltage sensor function: both the linker region (288 mutants), which affects FRET efficiency, as well as the opsin fluorophore region (768 mutants), which affects the wavelength of absorption / emission.
  • Some intracellular clumping (which will negatively affect sensitivity), but mostly localized to the membrane.
  • Sensitivity is still imperfect -- ~4% ΔF/F in vivo in cortical neurons, though it's fast enough to resolve 100 Hz spiking.
  • Can resolve post-synaptic EPSCs, but < 1% ΔF/F.
  • Tested all-optical ephys using VARNAM + the blueshifted channelrhodopsin CheRiff, both sparsely and in a PV-targeted transgenic model. Both work, but this is a technique paper; no real results.
  • Tested TEMPO fiber-optic recording in freely behaving mice (ish) -- induced ketamine waves, 0.5-4Hz.
  • And odor-induced activity in flies, using split-Gal4 expression tools. So many experiments.

hide / / print
ref: -2019 tags: Vale photostability bioarxiv DNA oragami photobleaching date: 03-10-2020 21:59 gmt revision:5 [4] [3] [2] [1] [0] [head]

A 6-nm ultra-photostable DNA Fluorocube for fluorescence imaging

  • Cy3n = sulfonated version of Cy3.
  • JF549 = azetidine modified version of tetramethyl rhodamine.

Also including some correspondence with the authors:


Nice work and nice paper, thanks for sharing .. and not at all what I had expected from Ron's comments! Below are some comments ... would love your opinion.

I'd expect that the molar absorption coefficients for the fluorocubes should be ~6x larger than for the free dyes and the single-dye cubes (measured?), yet the photon yields for all except perhaps Cy3N are around the yield for one dye molecule. So the quantum yield must be decreased by ~6x?

This in turn might be from a middling FRET, which reduces the excited-state lifetime and thereby the probability of ISC and photoelectron transfer, and hence photobleaching.

I wonder if in the case of ATTO 647N, Cy5, and Cy3, the DNA is partly shielding the fluorophores from solvent (a la ethidium bromide), which also helps with stability, just like in fluorescent proteins. ATTO 647N generates a lot of singlet oxygen; who knows what it's doing to the DNA.

Can you do a log-log autocorrelation of the blinking timeseries of the constructs? This may reveal different rate constants controlling dark/light states (though, for 6 coupled objects, might not be interpretable!)

Also, given the effect of DNA shielding, have you compared free dyes to single-dye cubes other than in supp fig 10? The fact that sulfonation made such a huge difference in brightness is suggestive.

Again, these are super interesting & exciting results!


I haven't directly looked at the molar absorption coefficient, but judging from the absorption spectra data that I collected, there is certainly an increase for the fluorocubes compared to single dyes. I agree that this would be an interesting experiment, and I am planning to collect data to measure the molar absorption coefficient. I would also expect a ~6-fold increase for the fluorocubes.

Yes, we suspect homo FRET to help reduce photobleaching. So far we only measured lifetimes in bulk but are planning to obtain lifetime data on the single-molecule level soon.

We also wondered if the DNA is providing some kind of shield for the fluorophores but could not design an experiment to directly test this hypothesis. If you have a suggestion, that would be wonderful.

The log-log autocorrelation of blinking events is indeed difficult to interpret. Already individual intensity traces of fluorocubes are difficult to analyze as many of them get brighter before they bleach. We are also wondering if some fluorocubes are emitting two photons simultaneously. We will hopefully be able to measure this soon.

hide / / print
ref: -0 tags: Na Ji 2p two photon fluorescent imaging pulse splitting damage bleaching date: 03-10-2020 21:44 gmt revision:6 [5] [4] [3] [2] [1] [0] [head]

PMID-18204458 High-speed, low-photodamage nonlinear imaging using passive pulse splitters

  • Core idea: take a single pulse and spread it out to N = 2^k pulses using reflections and delay lines.
  • Assume two optical processes: signal S ∝ I^α and photobleaching/damage D ∝ I^β, with β > α > 1.
  • Then an N-pulse splitter requires N^(1 − 1/α) greater average power but reduces the damage by a factor of N^(1 − β/α).
  • At constant signal, the same N-pulse splitter requires √N more average power, consistent with two-photon excitation (signal proportional to the square of the intensity): N pulses each at 1/√N of the original intensity give 1/N of the fluorescence per pulse, summing to the same overall fluorescence.
  • This allows for shorter dwell times, higher power at the sample, lower damage, slower photobleaching, and better SNR for fluorescently labeled slices.
  • Examine the list of references too, e.g. "Multiphoton multifocal microscopy exploiting a diffractive optical element" (2003)

  • In practice, a pulse picker is useful when power is limited and bleaching is not a problem (as with GCaMP6x)
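The scaling claims above are easy to verify numerically. A quick check, assuming two-photon excitation (α = 2) and an illustrative damage exponent β = 2.5:

```python
# Numerical check of the pulse-splitter scaling: signal S ~ I^alpha,
# damage D ~ I^beta, beta > alpha > 1. beta = 2.5 is illustrative only.
alpha, beta = 2.0, 2.5
N = 8  # an N = 2^k splitter with k = 3

# Hold total signal constant: each of the N pulses has intensity
# x = I * N**(-1/alpha) (take I = 1), so that N * x**alpha == 1.
x = N ** (-1.0 / alpha)
signal = N * x ** alpha
avg_power = N * x        # = N**(1 - 1/alpha); sqrt(N) for alpha = 2
damage = N * x ** beta   # = N**(1 - beta/alpha) < 1 since beta > alpha

print(signal)     # 1.0 -- unchanged by construction
print(avg_power)  # ~2.83 = sqrt(8): more average power needed
print(damage)     # ~0.59 = 8**-0.25: damage falls despite the extra power
```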

hide / / print
ref: -2011 tags: two photon cross section fluorescent protein photobleaching Drobizhev date: 03-10-2020 21:10 gmt revision:7 [6] [5] [4] [3] [2] [1] [head]

PMID-21527931 Two-photon absorption properties of fluorescent proteins

  • Significant 2-photon cross section of red fluorescent proteins (same chromophore as DsRed) in the 700 - 770nm range, accessible to Ti:sapphire lasers ...
    • This corresponds to a S_0 → S_n transition
      • But photobleaching is an order of magnitude slower when excited via the direct S_0 → S_1 transition (though the fluorophores can be significantly less bright in this regime).
      • Quote: the photobleaching of DsRed slows down by an order of magnitude when the excitation wavelength is shifted to the red, from 750 to 950 nm (32).
    • See also PMID-18027924
  • The 2P cross-section in both the 700-800 nm and 1000-1100 nm ranges corresponds to the chromophore polarizability, and is not related to the 1p cross-section.
  • This can be useful for multicolor imaging: excitation of the higher S0 → Sn transition of TagRFP simultaneously with the first, S0 → S1, transition of mKalama1 makes dual-color two-photon imaging possible with a single excitation laser wavelength (13)
  • Why are red GECIs based on mApple (rGECO1) or mRuby (RCaMP)? dsRed2 or TagRFP are much better .. but maybe they don't have CP variants.
  • from https://elifesciences.org/articles/12727

hide / / print
ref: -0 tags: DNA paint FRET tag superresolution imaging oligos date: 02-20-2020 16:28 gmt revision:1 [0] [head]

Accelerated FRET-PAINT Microscopy

  • Well isn't that smart -- they use a FRET donor, which is free to associate with and dissociate from a host DNA strand, and a more permanently attached DNA acceptor, which blinks due to FRET, for superresolution imaging.
  • As FRET acceptors aren't subject to bleaching (or, perhaps, are much less subject to bleaching), this eliminates that problem...
  • However, the light levels used, ~1 kW/cm^2, damage the short DNA oligos, which interferes with reversible association.
  • Interestingly, CF488 donor showed very little photobleaching; DNA damage was instead the limiting problem.
    • Are dyes that bleach more slowly better at exporting their singlet oxygen (?) or aberrant excited states (?) to neighboring molecules?

hide / / print
ref: -0 tags: rhodamine derivatives imidazole bacterial resistance date: 02-19-2020 19:10 gmt revision:2 [1] [0] [head]

A diversity-oriented rhodamine library for wide-spectrum bactericidal agents with low inducible resistance against resistant pathogens

  • Tested a wide number of rhodamine derivatives, which were synthesized with a 'mild' route. This includes all sorts of substitutions on the carbon opposite the oxygen.
  • Tested the fluorescence properties ... many if not all are fluorescent. Supplementary information lists the abs/em spectra, which is kind of a goldmine (if it can be trusted).
  • No mention of light or dark in the paper. I suspect that these rhodamine derivatives are killing via singlet oxygen production. (Then again, I only skimmed the paper..)
    • Yes but: "Rhodamine dyes mainly adopted the ring-close forms exhibit no antibacterial activity against ATCC43300 or ATCC19606"
    • That's because they are colorless and can't emit any singlet oxygen!

hide / / print
ref: -0 tags: two photon scanning microscope mirror relay date: 01-31-2020 02:46 gmt revision:1 [0] [head]

PMID-24877017 Optimal lens design and use in laser-scanning microscopy

  • Detail careful design of a scanning two-photon microscope, with custom scan lens, tube lens, and standard 25x objective.
  • Near diffraction limited performance for both the scan and tube lenses across a broad excitation range -- 690 to 1400nm.
  • Interestingly, use a parabolic mirror relay to conjugate the two galvos to each other; seems like a good idea, why has this not been done elsewhere?

hide / / print
ref: -0 tags: lavis jf dyes fluorine zwitterion lactone date: 01-22-2020 20:06 gmt revision:0 [head]

Optimization and functionalization of red-shifted rhodamine dyes

  • Zwitterion form is fluorescent and colored; lactone form is not and colorless.
  • Lactone form is lipophilic; some mix of the two seems more bioavailable and also results in fluorogenic dyes.
  • Good many experiments with either putting fluorine on the azetidines or on the benzyl ring.
  • Fluorine on the azetidines pushes the equilibrium K_{Z-L} toward the lactone form; fluorine on the benzyl ring pushes it toward the zwitterion.
  • Si-rhodamine and P-rhodamine adopt the lactone form, and adding appropriate fluorines can make them fluorescent again, which makes for good red-shifted dyes, a la JF669.
  • N-CH3 can be substituted in the oxygen position too, resulting in blue-shifted dye which is a good stand-in for EGFP.
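A toy two-state model makes the K_{Z-L} tuning above concrete: only the zwitterion form is colored and fluorescent, so the fluorescent fraction is 1/(1 + K_{Z-L}) if K_{Z-L} = [lactone]/[zwitterion]. The K values below are made up for illustration, not measured:

```python
# Toy two-state model of the lactone <-> zwitterion equilibrium.
# K_ZL = [lactone] / [zwitterion]; only the zwitterion absorbs/fluoresces.
def zwitterion_fraction(k_zl: float) -> float:
    """Fraction of dye in the colored, fluorescent zwitterion form."""
    return 1.0 / (1.0 + k_zl)

# Illustrative K values only:
for label, k in [("mostly zwitterion (fluorinated benzyl ring)", 0.1),
                 ("balanced / fluorogenic", 1.0),
                 ("mostly lactone (fluorinated azetidines)", 10.0)]:
    print(f"{label}: {zwitterion_fraction(k):.2f}")
```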

hide / / print
ref: -0 tags: multifactor synaptic learning rules date: 01-22-2020 01:45 gmt revision:9 [8] [7] [6] [5] [4] [3] [head]

Why multifactor?

  • Take a simple MLP. Let x be the layer activation. X^0 is the input, X^1 is the second layer (first hidden layer). These are vectors, indexed like x^a_i.
  • Then X^1 = W X^0, or x^1_j = φ(Σ_{i=1}^N w_{ij} x^0_i). φ is the nonlinear activation function (ReLU, sigmoid, etc.)
  • In standard STDP the learning rule follows Δw ∝ f(x_pre(t), x_post(t)), or, if the layer number is a, Δw^{a+1} ∝ f(x^a(t), x^{a+1}(t))
    • (but of course nobody thinks there are 'numbers' on the 'layers' of the brain -- this is just referring to pre- and post-synaptic).
  • In an artificial neural network, Δw^a ∝ -∂E/∂w^a_{ij} ∝ -δ^a_j x_i (intuitively: the weight change is proportional to the error propagated from higher layers times the input activity), where δ^a_j = (Σ_{k=1}^N w_{jk} δ^{a+1}_k) ∂φ, and ∂φ is the derivative of the nonlinear activation function, evaluated at the given activation.
  • f(i, j) → [x, y, θ, φ]
  • k = 13.165
  • x = round(i/k)
  • y = round(j/k)
  • θ = a(i/k − x) + b(i/k − x)^2
  • φ = a(j/k − y) + b(j/k − y)^2
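A minimal numpy sketch of the backprop update described above, to contrast it with the two-factor STDP rule; the shapes and the stand-in error signal are illustrative, not from the note:

```python
# Sketch of the backprop delta rule: delta^a_j = (sum_k w_jk delta^{a+1}_k) * phi',
# dw^a_ij proportional to -delta^a_j * x^a_i. Names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
phi = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
dphi = lambda z: phi(z) * (1.0 - phi(z))   # its derivative

x0 = rng.normal(size=4)       # input X^0
W1 = rng.normal(size=(4, 3))  # weights into the first hidden layer
z1 = x0 @ W1                  # pre-activations
x1 = phi(z1)                  # X^1

# Error arriving from the layer above (stand-in for sum_k w_jk delta^{a+1}_k);
# in a real network this would be computed from layer 2's deltas.
delta_above = rng.normal(size=3)

delta1 = delta_above * dphi(z1)   # delta^1_j
dW1 = -np.outer(x0, delta1)       # weight update: -delta^a_j * x^a_i

# Note the third factor: unlike two-factor STDP (pre x post activity alone),
# this rule requires the nonlocal error term delta_above.
print(dW1.shape)  # (4, 3)
```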

hide / / print
ref: -2017 tags: human level concept learning through probabalistic program induction date: 01-20-2020 15:45 gmt revision:0 [head]

PMID-26659050 Human-level concept learning through probabilistic program induction

  • Preface:
    • How do people learn new concepts from just one or a few examples?
    • And how do people learn such abstract, rich, and flexible representations?
    • How can learning succeed from such a sparse dataset and also produce such rich representations?
    • For any theory of learning, fitting a more complicated model requires more data, not less, to achieve good generalization, usually measured as the gap in performance between new and old examples.
  • Learning proceeds by constructing programs that best explain the observations under a Bayesian criterion, and the model 'learns to learn' by developing hierarchical priors that allow previous experience with related concepts to ease learning of new concepts.
  • These priors represent a learned inductive bias that abstracts the key regularities and dimensions of variation holding across both types of concepts and across instances.
  • BPL can construct new programs by reusing pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.
  • Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image.
    • Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses.
    • That is, they reduce the character to a set of lines (series of line segments), then simplify the intersections of those lines, and run a series of parses to estimate the generation of those lines, with heuristic criteria to encourage continuity (e.g. no sharp angles, penalty for abruptly changing direction, etc).
    • The most promising candidates are refined by using continuous optimization and local search, forming a discrete approximation to the posterior distribution P(program, parameters | image).