m8ta

{1569} 
ref: 2022
tags: symbolic regression facebook AI transformer
date: 05172022 20:25 gmt
revision:0
[head]


Deep symbolic regression for recurrent sequences
Surprisingly, they do not make any network structure changes; it's Vaswani 2017 w/ an 8-head, 8-layer transformer (sequence-to-sequence, not decoder-only) with a latent dimension of 512. Significant work went into feature / representation engineering (e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers; both of these involve a vocabulary size of ~10k ... amazing still that this works) + the significant training regimen they worked with (16 Turing GPUs, 32 GB ea). Note that they do perform a bit of beam search over the symbolic regressions by checking how well each node fits the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work.) The paper does both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise! Analysis of how the transformers work on these problems is weak; only one figure shows that the embeddings of the integers follow a meandering but continuous path in t-SNE space. Still, the trained transformer is usually able to best the hand-coded sequence-inference engine(s) in Mathematica, and does so without memorizing all of the training data. Very impressive and important result, enough to convince me that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and more effort. It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.  
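The base-10k integer representation is easy to sketch. A minimal reconstruction (my own, not the authors' code; the 'D…' token names are made up for illustration): each integer becomes a sign token followed by one token per base-10000 digit, most-significant first, so the digit vocabulary is ~10k tokens.

```python
def encode_int(n, base=10000):
    # Sign token, then base-10000 digits, most-significant first.
    tokens = ['+'] if n >= 0 else ['-']
    n = abs(n)
    digits = []
    while True:
        digits.append('D%d' % (n % base))
        n //= base
        if n == 0:
            break
    return tokens + digits[::-1]

def decode_int(tokens, base=10000):
    # Invert encode_int: fold the digit tokens back into an integer.
    sign = 1 if tokens[0] == '+' else -1
    n = 0
    for t in tokens[1:]:
        n = n * base + int(t[1:])
    return sign * n

print(encode_int(123456789))   # ['+', 'D1', 'D2345', 'D6789']
```

Floating-point numbers get a similar treatment in the paper (sign, mantissa, exponent tokens at fixed precision); the point is that the transformer sees short token sequences rather than raw digit strings.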
{842}  
Distilling free-form natural laws from experimental data
Since his PhD, Michael Schmidt has gone on to found Nutonian, which produced the Eureqa software, apparently without dramatic new features other than being able to use the cloud for equation search. (He probably improved many other detailed facets of the software.) Nutonian received $4M in seed funding, according to Crunchbase. In 2017, Nutonian was acquired by DataRobot (for an undisclosed amount), where Michael has worked since, rising to the title of CTO. Always interesting to follow up on the authors of these classic papers!  
{1556} 
ref: 0
tags: concept net NLP transformers graph representation knowledge
date: 11042021 17:48 gmt
revision:0
[head]


Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
Human-designed knowledge graphs are described here: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. And employed for profit here: https://www.luminoso.com/  
{1544}  
The HSIC Bottleneck: Deep learning without Backpropagation
In this work, the authors use a kernelized estimate of statistical independence as part of an 'information bottleneck' to set per-layer objective functions for learning useful features in a deep network. They use the HSIC, or Hilbert-Schmidt independence criterion, as the independence measure. The information bottleneck was proposed by Bialek (spikes..) et al. in 1999, and aims to increase the mutual information between the layer representation and the labels while minimizing the mutual information between the representation and the input: $\min_{P_{T_i | X}} I(X; T_i) - \beta I(T_i; Y)$ where $T_i$ is the hidden representation at layer i (later output), $X$ is the layer input, and $Y$ are the labels. By replacing $I()$ with the HSIC, and some derivation (?), they show that $HSIC(D) = (m-1)^{-2} tr(K_X H K_Y H)$ where $D = \{(x_1,y_1), ..., (x_m, y_m)\}$ are samples and labels, $K_{X_{ij}} = k(x_i, x_j)$ and $K_{Y_{ij}} = k(y_i, y_j)$ -- that is, it's the kernel function applied to all pairs of (vectoral) input variables. $H$ is the centering matrix. The kernel is simply a Gaussian kernel, $k(x,y) = \exp(-\frac{1}{2} ||x-y||^2 / \sigma^2)$. So, if all the x and y are on average independent, then the inner product will be mean zero, the kernel will be mean one, and after centering will lead to zero trace. If the inner product is large within the realm of the derivative of the kernel, then the HSIC will be large (and negative, I think). In practice they use three different widths for their kernel, and they also center the kernel matrices. But still, the feedback is an aggregate measure (the trace) of the product of two kernelized (a nonlinearity) outer-product spaces of similarities between inputs. It's not unimaginable that feedback networks could be doing something like this... 
For example, a neural network could calculate & communicate aspects of joint statistics to reward / penalize weights within a layer of a network, and this is parallelizable / per-layer / adaptable to an unsupervised learning regime. Indeed, that was done almost exactly by this paper: Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks, albeit in a much less intelligible way. Robust Learning with the Hilbert-Schmidt Independence Criterion is another, later paper using the HSIC. Their interpretation: "This loss-function encourages learning models where the distribution of the residuals between the label and the model prediction is statistically independent of the distribution of the instances themselves." Hence, given the above nomenclature, $E_X( P_{T_i | X} I(X; T_i) ) = 0$ (I'm not totally sure about the weighting, but it might be required given the definition of the HSIC.) As I understand it, the HSIC loss is a kernelized loss between the input, output, and labels that encourages a degree of invariance to the input ('covariate shift'). This is useful, but I'm unconvinced that making the layer output independent of the input is absolutely essential (??)  
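For concreteness, the HSIC estimator is only a few lines of numpy. This is a sketch straight from the formula as written (function names and the single kernel width are my choices; the paper sums the criterion over three widths):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Pairwise Gaussian kernel: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-0.5 * np.clip(d2, 0.0, None) / sigma ** 2)

def hsic(X, Y, sigma=1.0):
    # Biased estimator: HSIC(D) = (m-1)^-2 tr(K_X H K_Y H),
    # with H = I - (1/m) 1 1^T the centering matrix.
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kx = gaussian_kernel(X, sigma)
    Ky = gaussian_kernel(Y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print(hsic(x, x ** 2))                       # dependent: clearly above zero
print(hsic(x, rng.normal(size=(200, 1))))    # independent: near zero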
{1548} 
ref: 2021
tags: gated multi layer perceptrons transformers ML Quoc_Le Google_Brain
date: 08052021 06:00 gmt
revision:4
[3] [2] [1] [0] [head]


Pretty remarkable that an industrial lab freely publishes results like this. I guess the ROI is that they get the resultant improved ideas? Or, perhaps, Google is in such a dominant position in terms of data and compute that even if they give away ideas and code, provided some of the resultant innovation returns to them, they win. The return includes trained people as well as ideas. Good for us, I guess!  
{1527} 
ref: 0
tags: inductive logic programming deepmind formal propositions prolog
date: 11212020 04:07 gmt
revision:0
[head]


Learning Explanatory Rules from Noisy Data
 
{305}  
PMID101388[0] Fine control of operantly conditioned firing patterns of cortical neurons.
____References____  
{1440}  
 
{1207} 
ref: 0
tags: Shenoy eye position BMI performance monitoring
date: 01252013 00:41 gmt
revision:1
[0] [head]


PMID18303802 Cortical neural prosthesis performance improves when eye position is monitored.
 
{1087} 
ref: Timmermann2003.01
tags: DBS double tremor oscillations DICS beamforming parkinsons
date: 02292012 00:39 gmt
revision:4
[3] [2] [1] [0] [head]


PMID12477707[0] The cerebral oscillatory network of parkinsonian resting tremor.
____References____
 
{1132}  
PMID20400953 Dissolvable films of silk fibroin for ultrathin conformal bio-integrated electronics.
 
{255} 
ref: BarGad2003.12
tags: information dimensionality reduction reinforcement learning basal_ganglia RDDR SNR globus pallidus
date: 01162012 19:18 gmt
revision:3
[2] [1] [0] [head]


PMID15013228[0] Information processing, dimensionality reduction, and reinforcement learning in the basal ganglia (2003)
____References____  
{806}  
I've recently tried to determine the bitrate conveyed by one gaussian random process about another in terms of the signal-to-noise ratio between the two. Assume $x$ is the known signal to be predicted, and $y$ is the prediction. Let's define $SNR(y) = \frac{Var(x)}{Var(err)}$ where $err = x - y$. Note this is a ratio of powers; for the conventional SNR, $SNR_{dB} = 10 \log_{10} \frac{Var(x)}{Var(err)}$. $Var(err)$ is also known as the mean-squared error (MSE). Now, $Var(err) = \sum (x - y - \bar{err})^2 = Var(x) + Var(y) - 2 Cov(x,y)$; assume x and y have unit variance (or scale them so that they do), then $\frac{2 - SNR(y)^{-1}}{2} = Cov(x,y)$. We need the covariance because the mutual information between two jointly Gaussian zero-mean variables can be defined in terms of their covariance matrix (see http://www.springerlink.com/content/v026617150753x6q/ ). Here Q is the covariance matrix, $Q = \left[ \array{Var(x) & Cov(x,y) \\ Cov(x,y) & Var(y)} \right]$, $MI = \frac{1}{2} \log \frac{Var(x) Var(y)}{\det(Q)}$, and $\det(Q) = 1 - Cov(x,y)^2$. Then $MI = -\frac{1}{2} \log_2 \left[ 1 - Cov(x,y)^2 \right]$ or $MI = -\frac{1}{2} \log_2 \left[ SNR(y)^{-1} - \frac{1}{4} SNR(y)^{-2} \right]$. This agrees with intuition. If we have an SNR of 10 dB, or 10 (power ratio), then we would expect to be able to break a random variable into about 10 different categories or bins (recall stdev is the sqrt of the variance), with the probability of the variable being in the estimated bin being 1/2. (This, at least in my mind, is where the 1/2 constant comes from: if there is gaussian noise, you won't be able to determine exactly which bin the random variable is in, hence log_2 is an overestimator.) Here is a table with the respective values, including the amplitude (not power) ratio representations of SNR.
Now, to get the bitrate, you take the SNR, calculate the mutual information, and multiply it by the bandwidth (not the sampling rate in a discrete-time system) of the signals. In our particular application, I think the bandwidth is between 1 and 2 Hz, hence we're getting 1.6-3.2 bits/second/axis, hence 3.2-6.4 bits/second for our normal 2D tasks. If you read this blog regularly, you'll notice that others have achieved 4 bits/sec with one neuron and 6.5 bits/sec with dozens {271}.  
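The SNR-to-bitrate conversion above is easy to evaluate directly (function names are mine; this just plugs into the final MI formula and multiplies by bandwidth):

```python
import math

def mi_bits(snr):
    # Bits/sample between unit-variance jointly Gaussian x and y at a given
    # SNR (power ratio): MI = -1/2 log2( SNR^-1 - (1/4) SNR^-2 ).
    return -0.5 * math.log2(1.0 / snr - 0.25 / snr ** 2)

def bitrate_bits_per_sec(snr_db, bandwidth_hz):
    # Per-sample MI times the signal bandwidth (not the sampling rate).
    snr = 10.0 ** (snr_db / 10.0)
    return mi_bits(snr) * bandwidth_hz

# 10 dB SNR gives ~1.7 bits/sample; over a 2 Hz bandwidth, ~3.4 bits/sec/axis.
print(mi_bits(10.0), bitrate_bits_per_sec(10.0, 2.0))
```

Note the formula assumes the covariance derivation above (unit variances, Gaussian statistics), so it only makes sense for SNR well above 1/2, where $Cov(x,y) \in [0, 1]$.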
{5} 
ref: bookmark0
tags: machine_learning research_blog parallel_computing bayes active_learning information_theory reinforcement_learning
date: 12312011 19:30 gmt
revision:3
[2] [1] [0] [head]


hunch.net interesting posts:
 
{968} 
ref: Bassett2009.07
tags: Weinberger cognitive efficiency beta band neuroimaging EEG task performance optimization network size effort
date: 12282011 20:39 gmt
revision:1
[0] [head]


PMID19564605[0] Cognitive fitness of cost-efficient brain functional networks.
____References____
 
{922}  
PMID20011034[0] A Wireless Brain-Machine Interface for Real-Time Speech Synthesis
____References____
 
{252}  
PMID15022843[0] A simulation study of information transmission by multiunit microelectrode recordings key idea:
____References____  
{289}  
PMID11395017[0] Neuronal correlates of motor performance and motor learning in the primary motor cortex of monkeys adapting to an external force field
____References____  
{565} 
ref: Walker2005.12
tags: algae transfection transformation protein synthesis bioreactor
date: 03212008 17:22 gmt
revision:1
[0] [head]


Microalgae as bioreactors PMID16136314
 
{530}  
 
{520}  
http://www.dspguide.com/ch34.htm  awesome!!  
{344}  
PMID2027042[0] Making arm movements within different parts of space: the premotor and motor cortical representation of a coordinate system for reaching to visual targets.
____References____  
{294}  
PMID2376768[0] Making arm movements within different parts of space: dynamic aspects in the primate motor cortex
____References____  
{229} 
ref: notes0
tags: SNR MSE error multidimensional mutual information
date: 03082007 22:33 gmt
revision:2
[1] [0] [head]


http://ieeexplore.ieee.org/iel5/516/3389/00116771.pdf or http://hardm.ath.cx:88/pdf/MultidimensionalSNR.pdf
 
{146} 
ref: van2004.11
tags: anterior cingulate cortex error performance monitoring 2004
date: 002007 0:0
revision:0
[head]


PMID15518940 Errors without conflict: implications for performance monitoring theories of anterior cingulate cortex.
 
{7} 
ref: bookmark0
tags: book information_theory machine_learning bayes probability neural_networks mackay
date: 002007 0:0
revision:0
[head]


http://www.inference.phy.cam.ac.uk/mackay/itila/book.html  free! (but i liked the book, so I bought it :)  
{66} 
ref: bookmark0
tags: machine_learning classification entropy information
date: 002006 0:0
revision:0
[head]


http://iridia.ulb.ac.be/~lazy/  Lazy Learning.  
{57}  
http://www.cs.rug.nl/~rudy/matlab/
