I’ve been reading some of Yudkowsky’s Sequences recently and I’ve got to say, even though I think Eliezer’s conduct on Twitter and views on safety are fairly extreme, I really like the way he thinks about probability, information, and causality. He clearly has an intuition built from thinking deeply about the implications of basic ideas, and I can respect that. I read somewhere that people complain that he doesn’t bring any new ideas to the table. I disagree. I think in the graph of ideas, he hasn’t found new nodes, but he has a unique and interesting set of edges that map a bit better to human intuition than normal.
The thing that annoys me about the LessWrong brand of rationalism and probability theory is that it often isn’t careful about the specifics of what it’s talking about, and so it’s impossible to develop new intuition or even bake the intuition that exists into pre-existing mathematical machinery. It feels sort of like he’s telling a tribesman in New Guinea that a computer is sort of like a vision that lives on the water, and the vision tells you things if you act in certain ways or press on the water, and that’s how he knows what the weather will be tomorrow. The tribesman doesn’t know what a computer is after this and certainly couldn’t build one, even if this conversation gave him some fuzzy intuition.
I have a hardline opinion that the importance of baking new concepts into pre-existing mental machinery is the most underemphasized thing in all of education, and that if schools were built around adding to existing knowledge rather than creating temporary isolated subgraphs that disappear after exams, we’d be a lot better off as a society. My specific experience has been that if I don’t add edges to what I already know, I lose a new node of knowledge almost immediately; however, if I bake it in, it permanently adds to my map of the world.
The specific set of examples I’m thinking about with this LessWrong stuff is all the times Eliezer talks about probabilities and the scientific method in terms of possible-future-worlds without connecting it to any formalization. So let’s get clear about what the fuck we’re talking about and build a proper formalization that maps onto existing machinery.
Defining Universes
First, we need to define what we mean by “possible future worlds”. Probability uses sets under the hood to define “things that happen”, and includes a probability measure $P$ that maps sets to scalar values between 0 and 1. So, let’s map out a universe. Set an origin point (I prefer the location right in front of my nose). Call that origin point $\mathbf{0}$. Now define a Cartesian coordinate system in 3D spreading out from that origin point. It extends across the entire known universe. Units are totally arbitrary, so let’s say that a meter is one unit. Now the location of every point in the universe is associated with a 3-vector $\mathbf{p}$.
We can define shapes in 3D with planes of boundary points. There can be objects at these planes. Let’s build a simple abstraction: every point is associated with some density (characterizing how many mass-units are in its local volume). We can think of each* such point as an index into a tensor, with the scalar value at the index being the local density. If we do this, the universe becomes a big tensor $\mathbf{U} \in \mathbb{R}^{p \times p \times p}$.
$\mathbf{U}$ defines the state of the universe at a particular point in time. But we don’t just want frozen points, so let’s add a time dimension, so that we can move along it and watch the universe change. We get $\mathbf{U} \in \mathbb{R}^{t \times p \times p \times p}$.
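To make the tensor picture concrete, here is a minimal numpy sketch; the grid size, units, and the little blob of drifting density are arbitrary choices of mine, not anything canonical:

```python
import numpy as np

t, p = 10, 32                     # arbitrary: 10 time steps, a 32x32x32 spatial grid
U = np.zeros((t, p, p, p))        # the universe: local density at every (time, x, y, z) cell

# a small blob of matter near the origin that drifts one cell along x per time step
for step in range(t):
    U[step, step, p // 2, p // 2] = 1000.0   # kg/m^3 in a single voxel

print(U.shape)      # (10, 32, 32, 32) -- one such 4-tensor is one "possible universe"
print(U[3].sum())   # total density in the t=3 slice
```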
(*One problem with this growing formalization is that we’re forced to discretize, meaning, if we zoom in enough we lose continuity. The fix for this is saying that $\mathbf{U}$ is a function: you give me four real numbers, and I give you the local density at that point in spacetime. But let’s think with tensors for now since that’s easier to work with, and all the intuitions are the same).
There are problems with this formalization. For instance, there are different types of densities: One $kg/m^3$ of hydrogen is different than a $kg/m^3$ of Oganesson-118 (an incredibly unstable superheavy element). But we’ve reached an important point: we have the exact amount of formalism that we need to move on into probabilities, and we can just figure that there’s some close-to-isomorphism that puts the rest of the details of the actual universe into the same-ish mapping. (I am comfortable being handwavy here because we already have sufficient detail to get into probabilities)
Probabilities of Possible Universes
An outcome in the formalization we’re building is exactly one possible history of the world across space and time: $\mathbf{U} \in \mathbb{R}^{t \times p \times p \times p}$. But in probability theory, probabilities are functions defined on sets of outcomes called “events”.
What do we want events to be? Well, we are considering different possible universes, and we want to be able to get at probabilities of possible futures, or of possible alternative realities that fit some guess or another. This is just a bunch of $\mathbf{U}$s. So an event is some collection of universes $E = \{\mathbf{U}_i\}$, with $E \subseteq \mathcal{U}$. Here, $\mathcal{U}$ is the set of all possible universes: it’s every 4-tensor that exists. There is some element of $\mathcal{U}$ that is the actual spacetime trajectory of our universe (we can call this $\mathcal{R}$ if we need it, for ‘real universe’). To think about future events, we often want to condition on the set of universes that agree with $\mathcal{R}$ up to the present time $t_0$ (however, we don’t ever have perfect information, so we always have to approximate that conditioning). The set of possible universes that we might actually care about is some low-entropy manifold in $\mathcal{U}$, because if you just pick a random outcome/universe $\mathbf{U}_i$ it will almost certainly have roughly uniform density across the whole tensor, like noise.
Now we can start to get specific, and we can start bringing in the machinery of probability, because we know what we mean by ‘outcome’ and ‘events’. The set of all possible events is called the event space $\mathcal{E}$. The probability measure $P$ simply maps events to probabilities, $P: \mathcal{E} \rightarrow [0, 1]$, such that $P(E) = p$. Each outcome isn’t associated with a probability, but with a local probability density. We integrate over outcomes (“possible universes”) to create probabilities. (Note that we are integrating rather than summing: because each element of our 4-tensor lives on the real number line, the set of universes is a continuous space.)
$\mathcal{U}$ is itself an event ($\mathcal{U} \in \mathcal{E}$), also called the sample space, and it has probability 1. Every other event is a subset of $\mathcal{U}$. If we divide our sample space into disjoint events (remember, “groups of universes”) $E_1, \cdots, E_n$ that together cover $\mathcal{U}$, then $\sum_i P(E_i) = 1$.
We often want to define events based on some condition about the world. The condition can be anything. “the set of universes in which my hypothesis about the world is true” is a useful family of conditions. “The set of universes in which I guess that I hung up my keys by the door, and then my keys are actually by the door” is a member of that family, and defines one event.
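Here is a hedged sketch of how “events as sets of universes” cashes out computationally: sample a pile of toy universes, write events as predicates over them, and estimate probabilities as fractions. The two-variable “keys” universe and all of its numbers are invented for illustration:

```python
import random

random.seed(0)

# a toy "universe" is just a dict of facts; sample many possible ones
def sample_universe():
    guessed_door = random.random() < 0.5          # I guess that I hung my keys by the door
    # assume my guesses are decent: the keys are there more often when I guess they are
    keys_by_door = random.random() < (0.8 if guessed_door else 0.3)
    return {"guessed_door": guessed_door, "keys_by_door": keys_by_door}

universes = [sample_universe() for _ in range(100_000)]

# events are sets of universes, written here as membership predicates
H = lambda u: u["guessed_door"]                         # "I guessed the door"
E = lambda u: u["guessed_door"] and u["keys_by_door"]   # "... and they really are there"

P = lambda event: sum(event(u) for u in universes) / len(universes)
print(P(H), P(E), P(E) / P(H))   # the last number is P(keys by door | I guessed the door)
```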
The Scientific Method
The box below (lifted from the 3blue1brown video on Bayes’ theorem) represents $\mathcal{U}$. The entire box has probability $1$ (think about it as area). $E$ and $H$ are both collections of universes (somewhat confusingly, because I didn’t bother to change from 3b1b notation to my own, ‘E’ stands for ‘evidence’ in this box rather than being used for ‘event’, and ‘H’ stands for ‘hypothesis’).
$P(E \mid H)$ means: zoom into the part of the box corresponding to $H$, the bright-gray column on the left, and look at what proportion of it $E$ (the bright blue column) takes up. Mapping everything I built up above into this box visualization: a particular universe $\mathbf{U}$ is a single point in the box. Some event $E$ is an area in the box. The event space $\mathcal{E}$ is the set of all areas of the box (think an infinite number of bounding boxes or weird shapes). Conditioning on an event means zooming into that part of the box and making it your entire universe. $\mathcal{R}$ is some very small area in this box, and in practice we pretty much always want to be conditioning on the low-entropy manifold that $\mathcal{R}$ lives on (think of a wavy sliver of the box: the shapes we condition on don’t have to be rectangles).

Let’s say I’m publishing a paper and I have some data $D$ (“D” is shorthand for “the set of universes $E_D = \{\mathbf{U}_i\}$ in which the data was measured”). I want to figure out if some hypothesis $H$ that I have is true (a “hypothesis” is a set of universes $E_H = \{\mathbf{U}_j\}$ in which a guess I have about the universe is true).
Well, we can think about this: if I take all possible disjoint hypotheses, then they cover the entire sample space, $\bigcup_{i} H_i = \mathcal{U}$, and some subset of them has to overlap with $D$, i.e., be consistent with it: $D = D \cap \mathcal{U} = D \cap (\bigcup_{i} H_i)$ (this is the same as the law of total probability; we’re just not applying the probability measure after the set operations).
The human brain can’t literally enumerate all possible hypotheses that fit the observations we see, so we do a brainstorming session to take the top $k$: $\{H_i : P(D \mid H_i) > t\}$, and we want the threshold $t$ to be as high as possible given the time we’ve given ourselves to come up with hypotheses.
Zoom into one hypothesis with high probability, $H = \arg\max_{i} P(D \mid H_i)$. Remember that this is an event: a set of possible universes in which our hypothesis is true. What we’re really doing here is trying to find the $H$ that maximally overlaps with $D$, i.e., maximizing some notion of the volume of $(H \cap D)/(H \cup D)$ by varying $H$ (after defining what we mean by ‘volume’). Finding the hypothesis that maximizes this intersection-over-union is essentially the same thing as finding $\max_{i} P(H_i \mid D)$ (the best $H_i$ we’ve found so far is what we call a ‘theory’ in physics). This is (related to) what Yudkowsky means when he says “a map that corresponds to the territory”.
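A small sketch of the intersection-over-union idea, in the same sampled-universes style as above; the wet-grass universe and its probabilities are made up, and ‘volume’ is approximated by counting samples:

```python
import random

random.seed(1)

def sample_universe():
    rain = random.random() < 0.4
    sprinkler = random.random() < 0.3
    dog_walked_by = random.random() < 0.1                  # unrelated to the grass
    wet = rain or sprinkler or random.random() < 0.05
    return {"rain": rain, "sprinkler": sprinkler, "dog": dog_walked_by, "wet": wet}

universes = [sample_universe() for _ in range(50_000)]
D = [u for u in universes if u["wet"]]                     # data: the grass was observed wet

hypotheses = {
    "it rained":          lambda u: u["rain"],
    "the sprinkler ran":  lambda u: u["sprinkler"],
    "a dog walked by":    lambda u: u["dog"],
}

def iou(h, data):
    H = [u for u in universes if h(u)]
    intersection = sum(1 for u in data if h(u))
    union = len(H) + len(data) - intersection
    return intersection / union

scores = {name: iou(h, D) for name, h in hypotheses.items()}
print(scores, max(scores, key=scores.get))                 # the best-overlapping hypothesis
```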
Improving Scientific Thinking
Now that we have a somewhat reasonable formalization, let’s use it to make observations that make our scientific thinking more clear and give us action items when it comes to things like experimental design.
What does it mean to collect more data?
Collecting more data means that we now condition on $\hat{D} = \bigcap_i D_i$ instead. This is a smaller-volume subset of universes, and consequently there aren’t as many compatible hypotheses (i.e., it’s harder for a hypothesis to overlap in our intersection-over-union). This is a good thing: it means that it’s easier to pick hypotheses during our brainstorming sessions that might ‘match’, because we’ve narrowed down the search space.
What type of data should we collect?
It seems to me like we would want to collect data that maximally reduces the volume of $\hat{D}$ (we really should define what we mean by ‘volume’ - I guess this is just the Lebesgue measure?). That way we’re maximally restricting the set of hypotheses compatible with that data (i.e., the ones with high intersection-over-union). The best way to do this is to find a hypothesis as disjoint with what we’ve already conditioned on as possible, with as large a volume as possible, and then eliminate it (or fail to disprove it) by performing an experiment that could result in collecting data incompatible with the hypothesis.
What does it mean to eliminate a hypothesis?
We’ve conditioned on our data, so (in the basis of our data rather than the basis of $\mathcal{U}$ or $\mathcal{R}$), we have $P(D) = 1$. So if we had access to a full set of hypotheses that together fully overlapped the volume of our data, then we’d have $\sum_i P(H_i \mid D) = 1$. If we eliminate a hypothesis $H_j$ by collecting new data $\hat{D}$, then in the basis of that new data, $\sum_{i \neq j} P(H_i \mid \hat{D}) = 1$. Our other hypotheses weren’t eliminated, so the amount of probability that $H_j$ was responsible for must be dispersed into the other hypotheses.
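A tiny numerical sketch of what eliminating a hypothesis does to the remaining probability mass (the three hypotheses and their numbers are invented):

```python
# posterior masses over three mutually exclusive hypotheses, in the basis of the current data D
posterior = {"H1": 0.5, "H2": 0.3, "H3": 0.2}   # sums to 1

# new data arrives that is flatly incompatible with H3
del posterior["H3"]

# in the basis of the new data, the surviving hypotheses absorb H3's mass
Z = sum(posterior.values())
posterior = {h: p / Z for h, p in posterior.items()}
print(posterior)    # {'H1': 0.625, 'H2': 0.375} -- still sums to 1
```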
(I’ll actually make the observation now that we should be taking integrals rather than sums, and doing everything on a continuous space as defined before, but I don’t feel like going back and fixing this; just note that this is true and make the appropriate changes in your mental model. It’s not that different.)
How can we incorporate causality?
In the causality literature, $P(A \mid do(B))$ means: let $do(B) = E_{do(B)}$ be the set of universes in which we force $B$ to occur. In what proportion of these universes does $A$ also occur?
This is different from $P(A \mid B)$, because the set of universes in which we force $B$ is different from the set in which $B$ occurs naturally. For instance, suppose $B$ is the set of universes in which I carry an umbrella, and $A \mid B$ is the set of those in which it also rains. Normally, a very large proportion of the $\mathbf{U} \in B$ are universes in which I also expect rain. However, $do(B)$ zooms in on a very particular subset of $B$, in which I have the umbrella because I was just handed it by force rather than because I expect rain. My expectation of rain is the true reason that $P(A \mid B)$ was high; removing the universes with this expectation makes the actual rain probability plummet. (We can map ‘expectation of rain’ in as well by defining it as its own set of universes $\{\mathbf{U}_{Ra}\}$.)
(This one, unfortunately, doesn’t really get at the essence of Judea Pearl-style causality, which is that the $do$-operator severs the incoming edges in the causal graph for whatever it operates on.)
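Here is the umbrella example as a simulation sketch: conditioning filters the naturally generated worlds, while $do$ overrides the umbrella variable regardless of its usual cause. All the probabilities are invented:

```python
import random

random.seed(2)

def sample_world(force_umbrella=None):
    expects_rain = random.random() < 0.3
    if force_umbrella is None:
        umbrella = expects_rain and random.random() < 0.9   # I carry it because I expect rain
    else:
        umbrella = force_umbrella                            # the do() intervention
    rain = random.random() < (0.8 if expects_rain else 0.1)  # rain tracks the weather, not the umbrella
    return umbrella, rain

worlds = [sample_world() for _ in range(200_000)]
p_rain_given_umbrella = (
    sum(rain for umb, rain in worlds if umb) / sum(1 for umb, _ in worlds if umb)
)

do_worlds = [sample_world(force_umbrella=True) for _ in range(200_000)]
p_rain_given_do_umbrella = sum(rain for _, rain in do_worlds) / len(do_worlds)

print(p_rain_given_umbrella)     # high: umbrella worlds are mostly expect-rain worlds
print(p_rain_given_do_umbrella)  # much lower: forcing the umbrella doesn't change the weather
```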
Closing Thoughts
It definitely feels like there should be some notion of turning everything into a vector space equipped with an inner product (our IoU metric or some derivative), and then ‘finding good hypotheses’ becomes a search to find the few hypotheses that have high inner products with our data. I’d have to think about exactly how that translation would work. This might be a useful thing to think through later. I suppose orthogonal vectors in this case would correspond to fully disjoint events?
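As a quick gesture at that idea: if you represent events as indicator vectors over a finite sample of universes, the cosine between indicators already behaves like an overlap score, and genuinely disjoint events come out orthogonal. A minimal sketch, with invented events:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                    # sampled universes

rain = rng.random(n) < 0.4                    # indicator vector for "it rained"
wet = rain | (rng.random(n) < 0.1)            # indicator for "the grass is wet"
dry_heatwave = ~rain & (rng.random(n) < 0.2)  # an event disjoint from "rain" by construction

def overlap(a, b):
    # cosine similarity between indicator vectors; driven by the size of the intersection
    return (a & b).sum() / np.sqrt(a.sum() * b.sum())

print(overlap(rain, wet))          # high: the "hypothesis" overlaps the "data" heavily
print(overlap(rain, dry_heatwave)) # 0.0: disjoint events are orthogonal vectors
```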
I am about to start a PhD at Northeastern University with David Bau’s lab, and the beginning of a PhD is a great time to think carefully about what areas I can maximally have an impact in.
So, here are some collected thoughts on research directions in interpretability I think would be impactful. A good starting point is to begin with application areas and end goals, then backpropagate into research that could potentially be fruitful for those areas and end goals. Therefore I will begin with downstream research areas I’m interested in, starting with general philosophy, and then move into more specific project ideas.
Philosophy
In general, it seems clear that large foundation models learn useful things about the world. This is true for language, but I think vision and multimodal (and video, which I think will develop over the coming years!) are just as important, and possibly often easier from a research perspective, because visual information is so much more semantically rich from a (human) cognitive perspective than language information. Using prompts to access that information feels to me like an incredibly coarse method of information acquisition, and I think I want one of the central pushes of my PhD to be towards finding richer ways of accessing that information.
Motivations
This is useful for a number of reasons not necessarily transparent to computer scientists with a background entirely in deep learning. I’m coming from a biomedical engineering / statistical machine learning / data science background. In that world, a big source of hesitation in using deep learning - besides the obvious bulkiness involved in training a big model over running sklearn.decomposition.PCA.fit_transform() and the lack of usefulness in low-data regimes - is that it’s much more difficult to find interpretable features. I’m coming out of a very computational drug discovery startup, for instance. In that world, language models trained on RNA are starting to make their appearance (for instance, profluent is active in that space, and my company has started to play with SpliceBERT). Biologists and chemists can’t access the information in these models from prompting, since the whole point is that the model has learned something rich about how RNA sequences are organized in a language that they don’t know. However, these models are clearly learning interesting and nontrivial information about biology, as evidenced by the downstream tasks (branchpoint prediction, splice site prediction, variant effect inference, nucleotide clustering…) they were able to get surprisingly good results on. A microscope to peer at this information directly would be critically useful for scientists.
In computer vision, my girlfriend’s company is another great example. They are using straightforward classification architectures - think EfficientNets and ResNets - to classify veterinary X-rays. One of their big challenges, particularly in moving into human data, is that doctors need to be able to trust the model. In order to trust the model, there needs to be a great deal of transparency in how the model is making its decisions. They’ve been able to take advantage of pixel attribution methods, e.g., SHAP, LIME, and GradCAM, to get some insight. But these methods don’t tell the whole story.
Furthermore, some of these methods (GradCAM, according to her) don’t have easily accessible implementations that can hook into existing codebases. For cases where the problem is wild-west levels of implementation for useful techniques, it is useful to collect all of them into a single codebase with a methods paper attached. Ideally this codebase would become popular, but trying to get that code PR’d into a major library like Hugging Face’s or something torch-adjacent might have more impact. This worked well during my master’s in the graph world for graspologic, during which time a labmate PR’d his work into scipy.
In general, here are a few areas in which better interpretability would be very useful:
- customer-facing medical applications
- legal applications
- self-driving
- drug discovery
- all of science
- people who want to know why their model isn’t training
Basically, anywhere in which a central push is to learn something rather than do something, and any application area in which there’s a high cost for mistakes.
The biggest, most useful big-picture goal would be: can I create a set of tools which let users tell the extent to which a model is hallucinating, based on what’s happening in its activation space?
More Motivations
Being successful in accomplishing a long-term goal will require constant adjustment, both in the goal itself and in the actions I take that lead me towards it. In this section I will keep updating this blog post to keep refining what exactly I am trying to accomplish.
Interests
With the above impact motivations in mind, there are also things that just strike my fancy, and which I think are interesting and fun to think about. Good projects, in my opinion, will also contain elements of this. A somewhat representative but not at all exhaustive list:
- Constrained optimization methods which create model edits that respect the data manifold. Riemannian optimization methods, for instance, or using spatial derivatives to move around on manifolds. Methods which take advantage of the tangent space and local linearity.
- In the same vein, activation steering and vector arithmetic in the latent space. Anything that takes advantage of the geometric properties of algebraic spaces to perform some useful and unintuitive task.
- The whole set of LoRA/DoRA/ReLoRA etc. PEFT methods are interesting largely because I have a geometric intuition for low-rank updates – i.e., making linear updates which are constrained to a low-dimensional hyperplane in the original space (see the sketch just after this list).
- I have a side-interest in explorable explanations and interactive documents, e.g., distill.pub, betterexplained, anything by Bret Victor, anything by flowingdata.
- Petar Veličković and friends have been working on mathematical frameworks for categorical deep learning recently, extending geometric deep learning to use the tools of category theory to attempt to build a language of architectures. It looks interesting; I don’t know at first glance how it would be useful in interpretability.
- My areas of mathematical interest include causal inference (e.g., judea pearl’s work), high-dimensional statistical inference, information theory, information geometry, and differential geometry. My intuition is that all of these areas contain useful ideas for interpretability research.
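To make the low-rank-update intuition from the PEFT bullet above concrete, here is a small numpy sketch; the dimensions and rank are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                       # model width and adapter rank, chosen arbitrarily

W = rng.standard_normal((d, d))     # "pretrained" weight matrix
B = rng.standard_normal((d, r))     # LoRA-style factors: the update is B @ A
A = rng.standard_normal((r, d))
delta = B @ A

print(np.linalg.matrix_rank(delta))             # 8: the update lives in an r-dimensional subspace
x = rng.standard_normal(d)
print(np.allclose((W + delta) @ x, W @ x + B @ (A @ x)))   # True: same map, far fewer parameters
```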
Research Directions
With all of the above in mind, here is an equally non-exhaustive list of directions and/or projects. Some of the below is more fleshed out than others; I basically just threw something down whenever I had an idea. My hope is that ideas will be quickly whittled down when they aren’t feasible and/or don’t have clear impact and/or are too vague and ill-defined, and useful ideas can be expanded into concrete projects with clear goals.
- Starting out by simply helping out labmates with their current work seems like a great way to get myself involved quickly with existing research. It seems reasonable to have a first-year goal be to at least touch everybody else’s work in the lab at least once, and contribute something useful for them.
- Papers which explore the same idea, but under a lot of different model architectures, implementation details, etc. Finding a set of golden retriever ear detecting neurons for a ConvNext trained on exactly ImageNet using exactly AdamW at exactly a learning rate of 1e-4 is much more interesting if it’s also true for vision transformers and whatever else, using different optimizers and hyperparameters. Which results generalize most strongly?
- Example: efforts to transfer interp results in GANs to diffusion models, efforts to transfer results in transformers to mamba…
- addendum to this: to what extent do the same concepts apply when we change modalities? Can we discover ideas in interpretability in text, and do they also apply to vision, video, audio…
- Volumetric and video models are cool and largely underexplored right now, primarily (as far as I can tell) because of compute constraints. Possibly soon, particularly with video, this will be interesting to explore (although moving from $O(n^2)$ to $O(n^3)$ is a big jump). A paper which applies existing image-interpretability techniques to video models could be fun to write.
- For LLMs: Can we find or create a ‘factuality’ or ‘confidence’ direction in activation space, in a way that lets us add that information into model responses?
- for instance, you ask why some code doesn’t work, and the LLM responds, and above the response there is something that says “the model is 60% sure that this answer is correct”.
- The loss function for this particular example would involve measuring the empirical distribution of the actual amount of times it was correct, and then looking at the difference.
- possibly data attribution methods could do this? It feels like the model knows. Sonnet, for instance, says “I’m sorry” and changes its mind when you probe it on something that was wrong. But when you probe it on something that was correct, it’s firm and tells you why it was right.
- What kinds of interpretability questions can we ask about the kinds of updates PEFT methods are constrained to make under the low-rank regime? Is the difference between a low rank update and a full rank update just a quantitative change, or is there a qualitative difference in the kinds of updates that can be made?
- In the same vein: What are we actually doing geometrically to weight matrices by adding low-rank information into them? I suppose (via geometric intuition rather than formal proof, so possibly wrong) that higher-magnitude low-rank updates causes weight matrices to be closer to low rank themselves. Do we care about this?
- Something that would be very useful if solved is “densifying” context windows. Say you’re at the edge of a very large context window. Is there a way to map the set of token embeddings you’re conditioning on to a lower rank matrix, or to a single information-rich vector, or just anything more compressed, so that you can still condition on the context without running out of space? This is not necessarily interpretability research, but it is interesting.
- the attention operation, viewed as a linear transformation, has access to (i.e., not in the nullspace) at most a hyperplane of dimension $m$ when operating on an embedding matrix, where $m$ is the number of tokens.
- Multi-headed attention takes subsets of dimensions, and moves the $m$ vectors around on the hyperplane within those subsets. Then they are concatenated.
- is this linear? The softmax in the attention matrix isn’t, but can you pull out all the softmaxes?
- If the above is true, can you shuffle things around algebraically so that the splitting, attention-updating, and concatenating all happen in a single matrix multiplication (block attention matrices * head concatenations), since it’s all one linear transformation?
- if you can do the above, is there a technique that makes doing it this way faster, since then you’re taking advantage of optimized matrix multiplication on the gpu as much as possible?
- My competitive advantage is a geometric intuition for linear algebra and being able to take ideas from the graph stats textbook. How many ideas that I know well from that textbook could be useful in deep learning for interpretability research?
- MASE and OMNI embeddings - joint representation learning for multiple graphs (project nodes from different graphs into the same subspace) - possibly useful for comparing attention matrices
- I wonder if the spectral embedding of the Laplacians of attention matrices tells us anything
- graph matching - maybe not useful? I could potentially see an experiment where we look at whether attention matrices become harder or easier to graph-match (e.g., the loss function gets bigger or smaller) in deeper layers.
- random graph models and their properties, e.g., Erdős–Rényi, SBMs, RDPGs, etc - probably not useful here. Maybe as a statistical framework. Not sure. Don’t see it right now.
- Network summary statistics: network density, clustering coefficient, shortest path length, average node degree, etc: would have to binarize attention matrices with a weight threshold for most of these, but looking through these statistics across attention matrices might tell us something interesting. For example, do attention matrices tend to be more or less dense than random graphs?
- signal subgraph estimation and anomaly detection in timeseries of networks: There is an interesting interpretability question here: which signal subgraph, e.g., subsets of token-nodes, tend to change the most over time?
- Scree plots and automatic elbow selection: Could also be interesting. This helps answer the question: how does the latent dimensionality of attention matrices change across heads and layers? is it fixed, or does it change?
- Community detection: how do nodes in different attention matrices cluster? Do they tend to form the same clusters across heads, or different clusters? what about across layers?
- vertex nomination: find communities for community-less nodes when we know the communities of the other nodes. Don’t think this is super interesting here but maybe I’m wrong.
- out-of-sample embedding: Estimate the embedding for a new node given the embeddings of existing nodes. You need to have already decomposed a graph with SVD in order to be able to do this. I suppose you could measure the extent to which you can estimate what the embedding for a token should be given the embeddings of other tokens, but would have to think about downstream use cases for this.
- two-sample testing: test whether two attention matrices are drawn from the same distribution. Could be interesting for comparing attention matrices across heads and layers.
- Mixture of Depths is an interesting paper which commits to the idea: “why apply every transformer block to every token when presumably some tokens need more computation than others?”. An interesting interpretability follow-up to this would be: Which tokens ended up needing more compute than others, and what does that tell us about what is easy and difficult to compute for a language model?
- It feels to me like PEFT methods would work better if the weight matrices they are updating are low-rank. There is a more rigorous way to think about this: how much of the mass of the singular values is concentrated in the top singular vectors? You can think of this as something like the entropy of the singular values. If the entropy is low, i.e., the magnitude of all the singular values is concentrated in the highest ones, then PEFT methods should work better, because the weight matrices are a closer approximation to low rank.
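A quick sketch of the “entropy of the singular values” measurement from that last bullet; the matrices here are synthetic stand-ins for real weight matrices:

```python
import numpy as np

def singular_value_entropy(W):
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()                           # treat the normalized spectrum as a distribution
    return -(p * np.log(p + 1e-12)).sum()     # low entropy = mass concentrated in a few directions

rng = np.random.default_rng(0)
full_rank = rng.standard_normal((256, 256))
near_low_rank = (rng.standard_normal((256, 4)) @ rng.standard_normal((4, 256))
                 + 0.01 * rng.standard_normal((256, 256)))

print(singular_value_entropy(full_rank))      # high: the spectrum is spread across many directions
print(singular_value_entropy(near_low_rank))  # much lower: a handful of directions dominate
```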
Whittling it Down
So, after discussing with David, there are a few potential ways these directions are harmonious with existing work:
Erasing
Low-rank updates have overlap with unlearning. There are two ways in which you can get a model to unlearn through gradient updates or whatever.
First, you can use a model’s knowledge against itself. So, you ask a model what it knows about X, save the resulting activation vectors, and fine-tune against those vectors. Second, you can localize the model’s knowledge about X using probes or causal mediation analysis or whatever, and then once you’ve localized, change the model at that location so that the probe can no longer see X. That’s what LEACE does, for instance.
Rohit has a paper in review in this general area called sliders. The idea is that you can modify outputs of diffusion models by using an unlearning loss function to dynamically control stuff like age; erase concepts like ‘youngness’, for instance, so that it always makes people old or whatever. That part is already published. The new idea is, instead of using the erasing loss to permanently change a model, use the erasing loss to train a LoRA; then you can just slide the LoRA up and down.
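My rough mental model of the slider mechanic, as a toy sketch rather than the actual implementation: train a low-rank adapter with whatever erasing loss you like, freeze it, and then scale its contribution at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, d))          # frozen pretrained weight
B = rng.standard_normal((d, r)) * 0.1    # stand-in for an adapter trained with an erasing loss
A = rng.standard_normal((r, d)) * 0.1

def forward(x, slider=0.0):
    # slider = 0 leaves the model untouched; positive/negative values dial the concept up or down
    return (W + slider * (B @ A)) @ x

x = rng.standard_normal(d)
for s in (-1.0, 0.0, 0.5, 1.0):
    print(s, np.linalg.norm(forward(x, s) - forward(x, 0.0)))
```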
There is also a visiting HCI student named Imca Grabe, who is training a bunch of sliders. One idea here is: you have a left-hand vector and a right-hand vector, right? What if they were from different sliders? Then you’re composing sliders together and maybe they can do interesting composed things.
Reverse-Engineering PEFT updates
Another line of research is in reverse-engineering. The idea is: Say you edit a concept in a model with ROME or whatever. You do this with a rank-one change.
Anybody can figure out that you’ve changed the model, that’s easy - you just take the difference in weights and see that it’s nonzero. You can also look at the rank of the difference matrix and see whether the change is just a low-rank update or something major.
What you can’t do is figure out what’s actually changed. So the question is: How do you figure it out?
Koyena is working on this right now, so I’d reach out to her and brainstorm.
Function vectors
This is related to the context windows densification idea. There is also an industry application here.
So, the idea is essentially compression. Say you have a huge prompt. Like, you’re a company and you’ve done a bunch of prompt engineering and now you have this gargantuan thing that’s half the context window. It turns out that, under the hood, the model essentially represents this whole thing with a single vector.
So, you can take this vector out, jam it back in later, and voila, the model will do the same thing it was going to do originally; you don’t need all these examples.
Eric is working on this. He’s mostly been looking at it for theoretical reasons, but there’s room for turning it into an industrial-strength method that people would want to use. So, for instance, you have these chipmaking companies that need knowledge in these old programming languages that people don’t know how to use anymore. And copilot doesn’t work either because it wasn’t trained on these languages. So they have these super long prompts that cost lots of money (in token cost) to run inference with.
Instead of these companies writing in these super long prompts, we can basically compress their big prompts into a single vector, and then they can just use that vector to get the same results for cost savings. There are various techniques for compressing context instructions, and we’d do a literature review. But here we’d try to beat a benchmark: can we use function vectors to compress in ways that beat the current state of the art?
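A hedged sketch of the basic extract-and-reinject move, using GPT-2 from Hugging Face transformers just because it’s small; the layer index, the choice of the last prompt token, and the idea that this single vector carries the instruction are all assumptions of mine, and whether the reinjected vector actually reproduces the prompted behavior is exactly the empirical question:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary middle layer

prompt = "Translate to French: cheese -> fromage, dog -> chien, house ->"
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
fv = out.hidden_states[LAYER + 1][0, -1]      # candidate "function vector": last prompt token, block LAYER output

def inject(module, inputs, output):
    hidden = output[0]
    hidden[:, -1, :] = hidden[:, -1, :] + fv  # add the vector back in at the current position
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
with torch.no_grad():
    short = model(**tok(" cat ->", return_tensors="pt"))
handle.remove()
print(tok.decode(short.logits[0, -1].argmax().item()))   # did the compressed "instruction" carry over?
```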
Mechanisms of pause tokens
Related to the question about Mixture of Depths, as well as the question about ranks of attention matrices when viewed as linear transformations, is the question: what are the mechanisms of pause tokens? Why do they work and what do they do? What kinds of extra computations become available when we add pause tokens?
A guy named Rihanna Brinkman is doing some stuff related to this. There’s also a student in Chris Manning’s lab at Stanford, Shikar, who is doing a research project around this. Learning why pause tokens work seems potentially fruitful; it seems like the type of thing that, if figured out, would open more research doors.
Linear Relations and graphs
This has to do with the linear relations paper.
Take the phrase “The Eiffel Tower is located in the city of Paris”. Run this through a forward pass and get the activations. Take the hidden state for Eiffel Tower in an early layer, and for Paris in a later layer. What is the relationship from one token to the other?
Well, take the Jacobian of the thing and you have this locally linear mapping between the two. But the cool thing is that, using this Jacobian, you can, say, replace “Eiffel Tower” with “Great Wall”, and the later hidden state will now output “China” – with the same Jacobian, using the same linear transformation matrix!
So this tells us that transformers are not just locally linear, but they are linear over a pretty broad area.
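Here is a toy sketch of that mechanic, with a small random MLP standing in for the stretch of transformer between the two hidden states (getting real hidden states out of an LM takes more plumbing); the interesting empirical claim in the paper is that in real LMs the same Jacobian keeps working far outside a tiny neighborhood:

```python
import torch

torch.manual_seed(0)

# stand-in for "early-layer subject hidden state -> later-layer object hidden state"
f = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.GELU(), torch.nn.Linear(64, 16)
)

h_eiffel = torch.randn(16)                              # pretend: hidden state for "Eiffel Tower"
J = torch.autograd.functional.jacobian(f, h_eiffel)     # 16x16 local linear map
b = f(h_eiffel) - J @ h_eiffel                          # affine offset, so f(h) ~ J h + b locally

h_wall = h_eiffel + 0.1 * torch.randn(16)               # pretend: hidden state for "Great Wall"
print(torch.norm(f(h_wall) - (J @ h_wall + b)).item())  # small if f is close to linear around h_eiffel
```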
Evan Hernandez at MIT wrote the linear relation embeddings paper, but he just graduated.
Blog Post Ideas
- How much longer would it take to train a transformer if you expressed everything in residual stream format?
More Ideas
- Not interpretability, but initialization seems weirdly underexplored. Is it possible to distil a transformer down to an MLP, and then reverse-engineer what the nearest easy initialization is? If someone could figure out how to encode the attention mechanism inside an MLP, for instance, that would be a massive breakthrough.
- what if instead of next word prediction, you predicted the embedding for the next word, using the loss of some embedding distance metric instead of cross entropy?
- Could test a bunch of different distance metrics, using cross-entropy as a baseline, and see if any of them is better
- Nobody seems to care about proper model initialization because it isn’t flashy. Is correct model initialization solved for every type of common layer? Is there a guide somewhere that says “for this type of layer, use this type of initialization” for all common layer types? If not, could I write a review paper?
- by ‘correct model initialization’, I mean: ‘model initialization such that, when data with mean 0 and variance 1 is passed through the untrained model, mean and variance statistically stay 0 and 1 respectively at every layer’
- I ask this because many implementations I’ve seen only change the initialization for linear layers, even relatively well-known models (dinov2, for instance)
- Anthropic recently came out with some great sparse autoencoders work which seems to open a lot of doors. Near the end they track activations for the features which combine to result in a particular output, with examples being a bunch of sentences and how strongly these sparse features they created (from the SAE) fire for each of the tokens in the sentences.
- I wonder if we could use this to figure out what in the training data created that feature in the first place. Could you use the strength of activations in training tokens to create a ‘map’ of where in the training data the model is using to make its decisions?
- This could be very broadly useful: companies training models could use this to get much more specific with what their model ends up learning, by ablating sections of training data they don’t want (based on this map).
- Method: Using only activations as input, train a network to predict what the output will be. Which layers contain the most output-relevant information?
- Method: Use SAE features to build better search pipelines. A k-nearest neighbor search on an SAE feature space might be much better than one on a polysemantic feature space.
- Method: initialization technique –> warm up for some number of epochs on a loss function that encourages a weight space such that mean is 0, variance is 1 for activations. Then replace with the real loss function.
- Benchmark / Dataset: Pieces of code followed by what happens if you execute the code. Rather than next-token prediction, it’d be: predict terminal output given code input. Problems:
- What do you predict? You’re not predicting one token at a time, so it would need to be something other than a probability distribution over tokens. Predicting every token of output seems computationally intractable, because of combinatorial explosion when you try to predict n tokens at once. Maybe there’s a way I don’t know about.
- The extent to which linear updating methods work depends on the curvature of the concept-space corresponding to that feature. Quantifying the curvature using methods outlined here should tell you the extent to which editing methods work.
- Crazy idea: what if I recreated this paper, but for Starcraft 2, and AlphaStar? https://arxiv.org/abs/2310.16410
- An algebra of editing methods: take n model-editing techniques. What algebraic properties do they have? e.g., are they commutative? associative? invertible? transitive?
- One of the most famous studies of all time in neuroscience was the discovery of hippocampal place cells. Rats were led through a maze, and their location in the maze was reconstructable from place cell firing patterns. Can I do an equivalent study on video transformers? e.g., are there neurons or activation directions responsible for where objects in the video are located?
- proof of concept for this would be: changing one of these activation directions or whatever and showing that it moves objects around in the video that gets generated (this would work for image transformers as well).
- the end goal of this system would be precision editing. Change activations to move objects in a picture around (in a diffusion model). Change activations to change where objects are predicted to be via bounding boxes (in an image transformer trained on ImageNet).
- Fuzzy idea for another way to get function vectors. I was using Claude to rewrite text. At the beginning of the session, I said something like “I want to improve the sentence structure in this text, …”. Then I pasted in some text and it improved the structure. Then I pasted in more text and it kept improving the structure. I wonder if you could marginalize out all the times that it improved structure to get at the instruction itself, across all the responses.
- To what extent is this video true? e.g., could I try to find an attention head whose query/key matrix searches for adjectives behind nouns, and then updates the vectors for the nouns according to those adjectives? (If not a full paper, this would make for a great blog post!)
- Feature importance technique: Quantifying which features in an LLM are the most monosemantic is likely a good feature importance metric, because features with their own orthogonal directions contain the least interference with other features when matrix-multiplied (and thus, are more ‘valuable real-estate’)
- Creating a Privileged Basis: Can you fit an orthogonal Procrustes transformation iteratively at each layer of the residual stream to map everything to the same basis? (this thought may be half-formed because I haven’t thought very deeply about what ‘privileged basis’ means yet)
- Negative studies, up the alley of ‘sanity checks for saliency maps’. Take an untrained transformer with random weights. How many common single-neuron results are reproducible with standard techniques on this? It seems like if you’re probing for a neuron that does a particular thing, for instance one that activates for compound words, many techniques I’m reading about would be able to extract one from a random network, just because there are a lot of parameters and some of them are randomly gonna activate for concepts.
- the whole issue with superposition is that you can’t have more than d orthogonal vectors in a d-dimensional vector space. However, does this change when we think about floating-point numbers? e.g., if we represent vectors with 16 bits, 32 bits, etc., is there a dimension size at which you can fit more than d vectors that are orthogonal as far as the numerics can tell, just because of finite precision? (see the sketch at the end of this list)
- Transformers, because of their residual connections, can be decomposed into the sum of a bunch of different smaller networks, e.g., they can be thought of as an ensemble. What are those smaller networks doing? can we run their outputs through tuned lens or logit lens or whatever to see how each contributes to the result, or causal tracing?
- what does the relationship between deleting subnetworks and performance look like? could you make a model smaller and not lose performance by deleting non-important subnetworks?
- model stitching would be extremely useful if figured out, because it would allow us to usefully combine the abilities of more than one foundation model.
- for david’s class: reproduce anthropic’s induction heads ICL loss thing, and then do it with random text instead of ICL data
- test the hypothesis that facts are stored in MLP layers by fine-tuning a model on new facts with MLP layers frozen and seeing if it learns them. Then train with attention layers frozen and see if it knows facts as well as an unfrozen model would.
- is there room for a sanity-check-for-saliency-maps style attack on SAEs somewhere? seems to be in the zeitgeist
- say we do topic clustering on a big factual recall dataset. Is there a relationship between which MLP layers have the highest causal impact and what the topic is? e.g., is knowledge localized by topic at all?
- Can you create a new concept, e.g., add a new fact in and then adjust the probability of that fact being returned inside some conditional token distribution?
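Related to the superposition bullet above: strictly orthogonal vectors cap out at $d$, but “orthogonal up to a tolerance” does not, and you can count this directly. The sketch below is about tolerance rather than bit-precision per se, but it’s the same flavor of question; the dimension, vector count, and tolerance are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, tol = 128, 1000, 0.1          # dimension, number of vectors, cosine tolerance

V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # n random unit vectors, with n >> d

cos = V @ V.T
np.fill_diagonal(cos, 0.0)
print((np.abs(cos) < tol).mean())   # most pairs are already "orthogonal within tolerance",
                                    # so you can pack far more than d almost-orthogonal directions
```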
Anna and Mark sat across from each other at dinner, both exhausted from long workdays. Anna went to the fridge and grabbed some sliced mango, Mark’s favorite snack, which she had purchased earlier that day for him despite her exhaustion. Later that evening, when Anna’s migraine struck, Mark silently dimmed the lights and brought her medication and a glass of water, sitting by her side. Neither kept score; both simply responded to the other’s needs without hesitation.
Sarah spent every Friday night for three months helping James prepare presentations for his startup’s investors. When her father suffered a heart attack, James texted that he was “swamped with deadlines” and couldn’t visit the hospital. Later, when she lost her job, he was annoyed with her; had she been working hard enough?
David and Elena maintained separate bank accounts, divided household bills exactly in half, and kept meticulous records of who last bought groceries. When Elena fell ill, David reminded her that he’d “covered” for her last month when he drove her to the airport. Their apartment functioned with cold efficiency, but warmth and spontaneity were noticeably absent.
The ideal type of human relationship, both romantic and platonic, involves mutual reciprocity. Each person gives selflessly without expectation of return. However, this works only when neither party exploits the other.
Consider two people, each choosing either selfless love or self-interest. There are three possible outcomes: (1) both love selflessly, (2) one person loves while the other acts selfishly, (3) both act selfishly.
When both people love selflessly, we find happiness, warmth, connection, and mature reliance. Each person nourishes the other. Richard Davidson, a neuroscientist at the University of Wisconsin-Madison who studies meditation and has a close relationship with the Dalai Lama, once did a study on altruism. He divided people into two groups and gave each $200, with the requirement to spend it by the end of the day. The first group was told to keep the money for themselves and do whatever they’d like. The second group was told that they could only spend the money on others. The second group reported feeling massively happier than the first group. The effects weren’t even close; the effect size was pretty much as large as you can get in a psychology study like this.
When one person loves while the other acts selfishly, vampirism occurs. The loving person becomes drained and is at risk of becoming jaded and selfish in future relationships. The “loving” one here also isn’t doing their partner any favors. By allowing the other person to act selfishly, they are teaching their partner that he or she can get their needs met without giving anything in return. And the partner won’t even be content - without the opportunity to act altruistically, they lose the happiness they could have gotten from it.
When both partners act selfishly, we have a stable state far inferior to mutual love. Both parties look after themselves and it’s more like a roommate situation than a mutually nourishing relationship.
The key to creating environments in which option (1) is the most likely is social trust. We need to believe that acting kind to others will not result in negative consequences. This happens through repeated direct experience. Culture plays a huge part and can generate either a positive or a negative feedback loop. In the positive feedback loop, members of a community are encouraged to act in service of each other, and respected when they do. This generates more people acting altruistically, which encourages even more people towards service, and so forth.
In the negative feedback loop, cultures embrace selfishness, and the selfless people become exploitable rather than honorable. The smart ones quickly switch to selfishness; the others keep getting taken advantage of until they learn.
I am concerned that the selfish feedback loop has gained popularity over the past twenty years. Consider common phrases: “Always look out for yourself,” “Focus on self-care,” or an excessive emphasis on leaving relationships whenever any problems pop up. I have ideas for why this is happening (and I’ll also need to show some evidence that it is, because it could also be the case that I’m just idolizing the past).
We remain social creatures requiring nourishment from others and ourselves, and denying this need contradicts our basic humanity.