m8ta

{1578}  
God Help us, let's try to understand AI monosemanticity

Commentary: To some degree, superposition seems like a geometric "hack" invented in the process of optimization to squeeze a great many (largely mutually-exclusive) sparse features into a limited number of neurons. GPT-3 has a latent dimension of only 96 * 128 = 12288, and with 96 layers this is only 1.17 M neurons (*). A fruit fly has 100k neurons (and can't speak). All communication must be through that 12288-dimensional vector, which is passed through LayerNorm many times (**), so naturally the network learns to take advantage of locally linear subspaces.

That said, the primate visual system does seem to use superposition, though not via local subspaces; instead, neurons seem to encode multiple axes somewhat linearly (e.g. global spaces: linearly combined position and class). That was a few years ago, and I suspect that newer results may contest this. The face area seems to do a good job of disentanglement, for example.

Treating everything as high-dimensional vectors is great for analogy making, like the wife - husband + king = queen example. But having fixed-size vectors for representing arbitrary-dimensioned relationships inevitably leads to compression ~= superposition. Provided those subspaces are semantically meaningful, it all works out from a generalization standpoint - but this is then equivalent to allocating an additional axis for said relationship or attribute. Additional axes would also put less decoding burden on the downstream layers, and make optimization easier. Google has demonstrated allocation in transformers. It's also prevalent in the cortex. Trick is getting it to work!

(*) GPT-4 is unlikely to have more than an order of magnitude more 'neurons'; PaLM-540B has only 2.17 M. Given that GPT-4 is something like 3-4x larger, it should have 6-8 M neurons, which is still 3 orders of magnitude fewer than the human neocortex (never mind the cerebellum ;)

(**) I'm of two minds on LayerNorm. PV interneurons might be seen to do something like this, but it's all local - you don't need everything to be vector rotations. (LayerNorm effectively removes one degree of freedom, so really it's a 12287-dimensional vector.)

Update: After reading https://transformer-circuits.pub/2023/monosemantic-features/index.html, I find the idea of local manifolds / local codes to be quite appealing: why not represent sparse yet conditional features using superposition? This also expands the possibility of pseudo-hierarchical representation, which is great.
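The geometry behind superposition is easy to demonstrate numerically: random unit vectors in d dimensions are nearly orthogonal, so many more than d sparse features can be summed into one vector and still be decoded by projection. A minimal sketch (the sizes, sparsity, and 0.5 threshold are arbitrary choices for illustration, not anything taken from GPT-3):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2048          # 512 "neurons", 2048 sparse features

# Random unit feature directions: pairwise dot products ~ N(0, 1/d),
# so they are nearly orthogonal in high dimension.
F = rng.standard_normal((n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Superpose a sparse subset of active features into one d-vector.
active = set(rng.choice(n, size=5, replace=False).tolist())
x = F[sorted(active)].sum(axis=0)

# Decode by projection: active features score ~1, inactive ~0.
scores = F @ x
recovered = set(np.where(scores > 0.5)[0].tolist())
print(recovered == active)
```

The interference terms scale as sqrt(k/d) for k active features, which is why this breaks down gracefully (not abruptly) as sparsity decreases.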
{1577}  
Sketch - Program synthesis by sketching
The essential algorithm, in words:
1. Take the sketch and expand it to a set of parameterized variables, holes, and calling contexts.
2. Convert these to a DAG, aka (?) a data/code-flow graph w/ dependencies.
3. Try to simplify the DAG, one-hot encode integers, and convert to either a conjunctive-normal-form (CNF) SAT problem for MiniSat, or to a boolean circuit for the ABC solver.
4. Apply MiniSat or ABC to the problem to select a set of control values = values for the holes & permutations that satisfy the boolean constraints.
5. Using this solution, use the SAT solver to find an input variable configuration that does not satisfy the problem. This serves as a counterexample.
6. Run this through the validator function (oracle) to see what it does; use the counterexample (inputs and outputs) to add clauses to the SAT problem.
7. Repeat until either no counterexamples can be found or the problem is `unsat`.
Though the thesis describes a system that was academic & relatively small back in 2008, Sketch has enjoyed continuous development, and remains used. I find the work that went into it remarkable and impressive - even with incremental improvements, you need accurate expansion of the language & manipulations to show proof-of-principle. Left wondering what limits its application to even larger problems - the need for a higher-level loop that further subdivides / factorizes the problem, or DFS for filling out elements of the sketch? Interesting links discovered while reading the dissertation:
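The counterexample-guided loop (steps 4-7) can be caricatured in a few lines. This is a toy with exhaustive enumeration standing in for MiniSat/ABC; the names (spec, candidate, cegis) and the one-hole example are mine, not Sketch's API:

```python
# Toy CEGIS loop: synthesize an integer hole h such that
# candidate(h, x) = x + h matches spec(x) = x + 3 on all inputs.

def spec(x):          # the validator / oracle
    return x + 3

def candidate(h, x):  # a "sketch" with one integer hole h
    return x + h

def cegis(hole_domain, input_domain):
    examples = []
    for _ in range(100):               # bounded refinement loop
        # "SAT solve": find any hole consistent with examples so far
        hs = [h for h in hole_domain
              if all(candidate(h, x) == y for x, y in examples)]
        if not hs:
            return None                # unsat: no hole works
        h = hs[0]
        # Look for a counterexample input where the candidate disagrees
        cex = next((x for x in input_domain
                    if candidate(h, x) != spec(x)), None)
        if cex is None:
            return h                   # verified on the whole domain
        examples.append((cex, spec(cex)))
    return None

print(cegis(range(-10, 11), range(-50, 51)))  # → 3
```

Each counterexample prunes the hole space, which is exactly the role the added SAT clauses play in the real system.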
 
{1576} 
ref: 0
tags: GFlowNet Bengio probability modelling reinforcement learning
date: 10-29-2023 19:17 gmt
revision:3


 
{1562}  
Modern SAT solvers: fast, neat and underused (part 1 of N) A set of posts that are worth re-reading. See also: https://www.borealisai.com/research-blogs/tutorial-11-sat-solvers-iii-factor-graphs-and-smt-solvers/ Of note: Selsam 2019 indicates that Survey propagation (Knuth 2015), an extension to Belief propagation, does not seem to extend well to hard SAT problems.
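For reference, the core of a pre-modern (pre-CDCL) SAT solver is just unit propagation plus case splitting; everything the posts describe - watched literals, clause learning, VSIDS, restarts - is engineering layered on this recursion. A minimal DPLL sketch:

```python
# Minimal DPLL sketch. CNF input: a list of clauses, each clause a list
# of nonzero ints (negative = negated variable). Returns a tuple of
# satisfying literals, or None if unsatisfiable.

def dpll(clauses, assignment=()):
    # Simplify the formula under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(lit in assignment for lit in clause):
            continue                     # clause already satisfied
        reduced = [l for l in clause if -l not in assignment]
        if not reduced:
            return None                  # empty clause: conflict
        simplified.append(reduced)
    if not simplified:
        return assignment                # all clauses satisfied
    # Unit propagation: a unit clause forces its literal.
    unit = next((c[0] for c in simplified if len(c) == 1), None)
    lit = unit if unit is not None else simplified[0][0]
    for choice in ([lit] if unit is not None else [lit, -lit]):
        result = dpll(simplified, assignment + (choice,))
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x2): forces x2 = true
print(dpll([[1, 2], [-1, 2]]))  # → (1, 2)
```

Survey/belief propagation attacks the same problem from the factor-graph side instead of this search side, which is why the Selsam result above is interesting.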
{1549}  
Put this in ~/.config/gtk-3.0/gtk.css and ~/.config/gtk-4.0/gtk.css to make scrollbars larger & permanently visible on high-DPI screens.

.scrollbar {
  -GtkScrollbar-has-backward-stepper: 1;
  -GtkScrollbar-has-forward-stepper: 1;
  -GtkRange-slider-width: 16;
  -GtkRange-stepper-size: 16;
}

scrollbar slider {
  /* Size of the slider */
  min-width: 16px;
  min-height: 16px;
  border-radius: 16px;
  /* Padding around the slider */
  border: 2px solid transparent;
}

.scrollbar.vertical slider, scrollbar.vertical slider {
  min-height: 16px;
  min-width: 16px;
}

.scrollbar.horizontal.slider, scrollbar.horizontal slider {
  min-width: 16px;
  min-height: 16px;
}

/* Scrollbar trough squeezes when cursor hovers over it. Disabling that */
.scrollbar.vertical:hover:dir(ltr), .scrollbar.vertical.dragging:dir(ltr) {
  margin-left: 0px;
}
.scrollbar.vertical:hover:dir(rtl), .scrollbar.vertical.dragging:dir(rtl) {
  margin-right: 0px;
}
.scrollbar.horizontal:hover, .scrollbar.horizontal.dragging,
.scrollbar.horizontal.slider:hover, .scrollbar.horizontal.slider.dragging {
  margin-top: 0px;
}

undershoot.top, undershoot.right, undershoot.bottom, undershoot.left {
  background-image: none;
}

Also add export GTK_OVERLAY_SCROLLING=0 to your ~/.bashrc

This does not work with GTK4, though - to do that, put the following in ~/.config/gtk-4.0/settings.ini:

[Settings]
gtk-overlay-scrolling = false

To make the scrollbars a bit easier to see in QT5 applications, run qt5ct (after apt-getting it), and add in a new style sheet, /usr/share/qt5ct/qss/scrollbar-simple-backup.qss

/* SCROLLBARS (NOTE: Changing 1 sub-control means you have to change all of them) */
QScrollBar {
  background: palette(alternate-base);
}
QScrollBar:horizontal {
  margin: 0px 0px 0px 0px;
}
QScrollBar:vertical {
  margin: 0px 0px 0px 0px;
}
QScrollBar::handle {
  background: #816891;
  border: 1px solid transparent;
  border-radius: 1px;
}
QScrollBar::handle:hover, QScrollBar::add-line:hover, QScrollBar::sub-line:hover {
  background: palette(highlight);
}
QScrollBar::add-line {
  subcontrol-origin: none;
}
QScrollBar::add-line:vertical, QScrollBar::sub-line:vertical {
  height: 0px;
}
QScrollBar::add-line:horizontal, QScrollBar::sub-line:horizontal {
  width: 0px;
}
QScrollBar::sub-line {
  subcontrol-origin: none;
}
{1575}  
 
{1574} 
ref: 0
tags: ocaml application functional programming
date: 10-11-2022 21:36 gmt
revision:2


https://stackoverflow.com/questions/26475765/ocaml-function-with-variable-number-of-arguments

From this I learned that in OCaml you can return not just functions (e.g. currying) but applications of yet-to-be-named functions.

let sum f = f 0 ;;
let arg a b c = c (b + a) ;;
let z a = a ;;

Then sum (arg 1) ;; is well-typed as (int -> 'a) -> 'a = <fun>, e.g. an application of a function that converts int to 'a. Think of it as the application of Xa to the argument (0 + 1), where Xa is the argument (per the type signature). Zero is supplied by the definition of 'sum'.

sum (arg 1) (arg 2) ;; can be parsed as (sum (arg 1)) (arg 2) ;; and '(arg 2)' outputs an application of an int & a yet-to-be-determined function to 'a, e.g. it's typed as int -> (int -> 'a) -> 'a = <fun>. So, you can call it Xa passed to the above. Or, Xa = Xb ((0 + 1) + 2) where, again, Xb is a yet-to-be-defined function that is supplied as an argument. Therefore, you can collapse the whole chain with the identity function z. But, of course, it could be anything else - square root perhaps for MSE? All very clever.
{1573}  
PMID: 36070680 - Extracellular vesicles mediate the communication of adipose tissue with brain and promote cognitive impairment associated with insulin resistance
 
{1571}  
One model for the learning of language
A more interesting result is Deep symbolic regression for recurrent sequences, where the authors (Facebook/Meta) use a Transformer - in this case, directly taken from Vaswani 2017 (8-head, 8-layer QKV w/ a latent dimension of 512) - to do both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise!

While the language learning paper shows that small generative programs can be inferred from a few samples, the Meta symbolic regression shows that Transformers can evince either amortized memory (less likely) or algorithms for perception - both new and interesting. It suggests that 'even' abstract symbolic learning tasks are sufficiently decomposable that the sorts of algorithms available to an 8-layer transformer can give a useful search heuristic. (N.B. the transformer doesn't spit out perfect symbolic or numerical results directly - it also needs post-processing search. Also, the transformer algorithm has search (in the form of softmax) baked into its architecture.)

This is not a light architecture: they trained the transformer for 250 epochs, where each epoch was 5M equations in batches of 512. Each epoch took 1 hour on 16 Volta GPUs w/ 32GB of memory. So, 4k GPU-hours x ~10 TFlops = 1.4e20 Flops. Compare this with grammar learning above: 7 days on 32 cores operating at ~3 Gops/sec is ~5.8e16 ops. Much, much smaller compute. All of this is to suggest a central theme of computer science: a continuum between search and memorization.
Most interesting for a visual neuroscientist (not that I'm one per se, but bear with me) is where on these axes (search, heuristic, memory) visual perception sits. Clearly there is a high degree of recurrence, and a high degree of plasticity / learning. But is there search or local optimization? Is this coupled to the recurrence via some form of energy-minimizing system? Is recurrence approximating EM?
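As a concreteness check on the search-vs-memorization continuum: given a good hypothesis space, the recurrence task itself is tiny - the hard part the transformer learns is proposing that space. A brute-force stand-in (my toy, not the paper's method):

```python
# Toy symbolic regression of a recurrence: enumerate small integer
# coefficients for u[n] = a*u[n-1] + b*u[n-2] + c and keep the first
# candidate consistent with the observed sequence.

def candidates():
    for a in range(-3, 4):
        for b in range(-3, 4):
            for c in range(-3, 4):
                yield (a, b, c)

def fits(coeffs, seq):
    a, b, c = coeffs
    return all(seq[n] == a * seq[n - 1] + b * seq[n - 2] + c
               for n in range(2, len(seq)))

fib = [1, 1, 2, 3, 5, 8, 13, 21]
sol = next(co for co in candidates() if fits(co, fib))
print(sol)  # → (1, 1, 0), i.e. u[n] = u[n-1] + u[n-2]
```

The 7^3 = 343 hypotheses here are trivially searchable; the paper's contribution is a learned prior that makes an astronomically larger expression space searchable the same way.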
{1572} 
ref: 2019
tags: Piantadosi cognition combinators function logic
date: 09-05-2022 01:57 gmt
revision:0


 
{1570}  
Kickback cuts Backprop's red-tape: Biologically plausible credit assignment in neural networks

Bit of a meh - the idea is, rather than propagating error signals backwards through a hierarchy, you propagate only one layer + use a signed global reward signal. This works by keeping the network 'coherent' - positive neurons have positive input weights, and negative neurons have negative weights, such that the overall effect of a weight change does not change sign when propagated forward through the network. This is kind of a lame shortcut, imho, as it limits the types of functions that the network can model & the computational structure of the network. This is already quite limited by the common dot-product-rectifier structure (as is used here). Much more interesting, and possibly necessary (given much deeper architectures now), is to allow units to change sign. (Open question as to whether they actually frequently do!) As such, the model is in the vein of "how do we make backprop biologically plausible by removing features / communication" rather than "what sorts of signals and changes does the brain use to perceive and generate behavior". This is also related to the literature on what ResNets do: what are the skip connections for? Anthropic has some interesting analyses for Transformer architectures, but checking the literature on other ResNets is for another time.
{1569} 
ref: 2022
tags: symbolic regression facebook AI transformer
date: 05-17-2022 20:25 gmt
revision:0


Deep symbolic regression for recurrent sequences

Surprisingly, they do not make any network structure changes; it's Vaswani 2017, w/ an 8-head, 8-layer transformer (sequence to sequence, not decoder-only) with a latent dimension of 512. Significant work went into feature / representation engineering (e.g. base-10k representations of integers and fixed-precision representations of floating-point numbers; both of these involve a vocabulary size of ~10k ... amazing still that this works..) + the significant training regimen they worked with (16 Turing GPUs, 32GB ea). Note that they do perform a bit of beam search over the symbolic regressions by checking how well each node fits the starting sequence, but the models work even without this degree of refinement. (As always, there undoubtedly was significant effort spent in simply getting everything to work.)

The paper does both symbolic (estimate the algebraic recurrence relation) and numeric (estimate the rest of the sequence) training / evaluation. Symbolic regression generalizes better, unsurprisingly. But both can be made to work even in the presence of (log-scaled) noise! Analysis of how the transformers work on these problems is weak; only one figure showing that the embeddings of the integers follow some meandering but continuous path in t-SNE space. Still, the trained transformer is usually able to best the hand-coded sequence inference engine(s) in Mathematica, and does so without memorizing all of the training data. Very impressive and important result - enough to convince me that this learned representation (and undiscovered cleverness, perhaps) beats human mathematical engineering, which probably took longer and took more effort. It follows, without too much imagination (but vastly more compute), that you can train an 'automatic programmer' in the very same way.
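The base-10k integer representation is worth spelling out, since it is what keeps the vocabulary at ~10k tokens while covering huge integers: a sign token plus one token per base-10000 "digit". The token naming below is my guess at the scheme, not the paper's exact encoding:

```python
# Sketch of a sign + base-10000 integer tokenization.
# Each "digit" token covers 0..9999, so ~10k tokens suffice.

def encode_int(n, base=10000):
    sign = '+' if n >= 0 else '-'
    n = abs(n)
    digits = []
    while True:
        digits.append(n % base)
        n //= base
        if n == 0:
            break
    return [sign] + [f'D{d}' for d in reversed(digits)]

print(encode_int(123456789))  # → ['+', 'D1', 'D2345', 'D6789']
print(encode_int(-5))         # → ['-', 'D5']
```

So a 9-digit integer costs only 4 tokens, at the price of the model having to learn arithmetic over 10k-way symbols.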
{1568}  
Burstdependent synaptic plasticity can coordinate learning in hierarchical circuits
 
{1567} 
ref: 0
tags: evolution simplicity symmetry kolmogorov complexity polyominoes protein interactions
date: 04-21-2022 18:22 gmt
revision:5


Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution
The paper features an excellent set of references, including:
Letter to a friend following her article "Machine learning in evolutionary studies comes of age"

Read your PNAS article last night - super interesting that you can get statistical purchase on long-lost evolutionary 'sweeps' via GANs and other neural network models. I feel like there is some sort of statistical power issue there? DNNs are almost always overparameterized... slightly suspicious.

This morning I was sleepily mulling things over & thought about a walking conversation that we had a long time ago in the woods of NC: Why is evolution so effective? Why does it seem to evolve to evolve? Thinking more - and having years more perspective - it seems almost obvious in retrospect: it's a consequence of Bayes' rule. Evolution finds solutions in spaces that have an overwhelming prevalence of working solutions. The prior has an extremely strong effect. These representational / structural spaces by definition have many nearby & associated solutions, hence appear post-hoc 'evolvable'. (You probably already know this.)

I think proteins very much fall into this category: AAs were added to the translation machinery based on ones that happened to solve a particular problem... but because of the 'generalization prior' (to use NN parlance), they were useful for many other things. This does not explain the human-engineering-like modularity of mature evolved systems, but maybe that is due to the strong simplicity prior [1].

Very very interesting to me is how the sciences of evolution and neural networks are drawing together, vis a vis the lottery ticket hypothesis. Both evince a continuum of representational spaces, too, from high-dimensional vectoral (how all modern deep learning systems work) to low-dimensional, modular, specific, and general (phenomenological human cognition). I suspect that evolution uses a form of this continuum, as seen in the human high-dimensional long-range gene regulatory / enhancer network (= a structure designed to evolve).

Not sure how selection works here, though; it's hard to search a high-dimensional space. The brain has an almost identical problem: it's hard to do 'credit assignment' in a billions-large, deep and recurrent network. Finding which set of synapses caused a good / bad behavior takes a lot of bits.
{1566}  
Interactions between learning and evolution
Altogether (historically) interesting, but some of these ideas might well have been anticipated by some simple hand calculations.  
{1565}  
Compiling a list of saturated matrix-matrix Gflops for various Nvidia GPUs.
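A quick way to generate entries for such a list, using the standard 2·n³ flop count for a dense matmul. This measures whatever BLAS numpy is linked against on the CPU; the same loop with cupy arrays would presumably give the Nvidia GPU number:

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b                       # warm-up (BLAS thread pool, caches)
reps = 10
t0 = time.time()
for _ in range(reps):
    a @ b
dt = time.time() - t0

flops = 2 * n**3 * reps     # ~2*n^3 flops per dense n x n matmul
gflops = flops / dt / 1e9
print(f"{n}x{n} f32 matmul: {gflops:.1f} Gflops")
```

Run it a few times and take the max; the first measurements are usually below saturation.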
 
{1564}  
“Visualizing data using t-SNE”
 
{1563}  
The Sony Xperia XZ1 Compact is a better phone than an Apple iPhone 12 mini

I don't normally write any personal opinions here - just half-finished paper notes riddled with typos (haha) - but this one has been bothering me for a while. In November 2020 I purchased an iPhone 12 mini to replace my aging Sony Xperia XZ1 Compact. (Thinking of staying with Android, I tried out a Samsung S10e as well, but didn't like it.) Having owned and used the iPhone for a year and change, I still prefer the Sony. Here is why:
Summary: I'll try to get my money's worth out of the iPhone; when it dies, I will buy the smallest waterproof Android phone that supports my carrier's bands.
{1561}  
Cortical response selectivity derives from strength in numbers of synapses
 
{842}  
Distilling freeform natural laws from experimental data
Since his PhD, Michael Schmidt has gone on to found Nutonian, which produced the Eureqa software, apparently without dramatic new features other than being able to use the cloud for equation search. (Probably he improved many other detailed facets of the software.) Nutonian received $4M in seed funding, according to Crunchbase. In 2017, Nutonian was acquired by DataRobot (for an undisclosed amount), where Michael has worked since, rising to the title of CTO. Always interesting to follow up on the authors of these classic papers!