m8ta
{1578}
God Help Us, Let's Try to Understand AI Monosemanticity

Commentary: To some degree, superposition seems like a geometric "hack" invented in the process of optimization to squeeze a great many (largely mutually exclusive) sparse features into a limited number of neurons. GPT-3 has a latent dimension of only 96 * 128 = 12288, and with 96 layers this is only ~1.18 M neurons (*). A fruit fly has ~100k neurons (and can't speak). All communication between layers must pass through that 12288-dimensional vector, which goes through LayerNorm many times (**), so naturally the network learns to take advantage of locally linear subspaces.

That said, the primate visual system does seem to use superposition, though not via local subspaces; instead, neurons seem to encode multiple axes somewhat linearly (e.g. global spaces: linearly combined position and class). That result is a few years old, and I suspect newer work may contest it -- the face area seems to do a good job of disentanglement, for example.

Treating everything as high-dimensional vectors is great for analogy-making, as in the wife - husband + king = queen example. But using fixed-size vectors to represent relationships of arbitrary dimensionality inevitably leads to compression ~= superposition. Provided those subspaces are semantically meaningful, it all works out from a generalization standpoint -- but this is then equivalent to allocating an additional axis for that relationship or attribute. Additional axes would also put less decoding burden on downstream layers and make optimization easier. Google has demonstrated allocation in transformers, and it's also prevalent in the cortex. The trick is getting it to work!

(*) GPT-4 is unlikely to have more than an order of magnitude more 'neurons'; PaLM-540B has only 2.17 M. If GPT-4 is something like 3-4x larger, it should have 6-8 M neurons -- still three orders of magnitude fewer than the human neocortex (never mind the cerebellum ;-)

(**) I'm of two minds on LayerNorm. PV interneurons might be seen to do something like this, but it's all local -- you don't need everything to be vector rotations. (LayerNorm effectively removes one degree of freedom, so really it's a 12287-dimensional vector.)

Update: After reading https://transformer-circuits.pub/2023/monosemantic-features/index.html, I find the idea of local manifolds / local codes quite appealing: why not represent sparse yet conditional features using superposition? This also opens the possibility of pseudo-hierarchical representation, which is great.
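The neuron-count arithmetic above can be checked directly. A quick sketch: "neuron" here means one entry of the residual stream per layer (d_model * n_layers), ignoring the ~4x wider MLP hidden layers, to match the rough accounting in the commentary; PaLM-540B's shape (118 layers, d_model 18432) is taken from its published description, and the GPT-4 figure is only the guess above.

```python
# Back-of-envelope "neuron" counts: residual-stream width times depth.

def neuron_count(d_model: int, n_layers: int) -> int:
    return d_model * n_layers

gpt3 = neuron_count(96 * 128, 96)   # d_model = 12288, 96 layers
palm = neuron_count(18432, 118)     # PaLM-540B published shape

print(f"GPT-3 residual dim: {96 * 128}")          # 12288
print(f"GPT-3 neurons: ~{gpt3 / 1e6:.2f} M")      # ~1.18 M
print(f"PaLM-540B neurons: ~{palm / 1e6:.2f} M")  # ~2.17 M
```

At 3-4x the PaLM count, the guessed GPT-4 range of 6-8 M falls out of the same arithmetic.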
{1523}
ref: -0
tags: tenenbaum compositional learning character recognition one-shot learning
date: 02-23-2021 18:56 gmt
revision:2

One-shot learning by inverting a compositional causal process
{1527}
ref: -0
tags: inductive logic programming deepmind formal propositions prolog
date: 11-21-2020 04:07 gmt
revision:0

Learning Explanatory Rules from Noisy Data
{1510}
ref: -2017
tags: google deepmind compositional variational autoencoder
date: 04-08-2020 01:16 gmt
revision:7

SCAN: learning hierarchical compositional concepts
{1388}
ref: -0
tags: PEDOT PSS electroplate electrodeposition neural recording michigan probe stimulation CSC
date: 04-27-2017 01:36 gmt
revision:1

PMID-19543541 Poly(3,4-ethylenedioxythiophene) as a micro-neural interface material for electrostimulation
{1307}
ref: -2000
tags: polyimide acrylic aluminum electro deposition imide insulation ultra thin
date: 02-27-2015 19:42 gmt
revision:0

Ultrathin, Layered Polyamide and Polyimide Coatings on Aluminum
{1207}
ref: -0
tags: Shenoy eye position BMI performance monitoring
date: 01-25-2013 00:41 gmt
revision:1

PMID-18303802 Cortical neural prosthesis performance improves when eye position is monitored.
{34}
ref: bookmark-0
tags: linear_algebra solution simultaneous_equations GPGPU GPU LUdecomposition clever
date: 0-0-2006 0:0
revision:0