Transformer Circuits Thread

Progress on Attention

Adam Jermyn, Jack Lindsey, Rodrigo Luger, Nick Turner, Trenton Bricken, Adam Pearce, Callum McDougall, Ben Thompson, Jeff Wu, Joshua Batson, Kelley Rivoire, and Christopher Olah; edited by Jack Lindsey


We report some developing work on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

Last year, we announced that studying attention superposition was one of our team’s priorities, motivated in part by a toy model demonstrating attention features in superposition. Since then, we’ve tried several ideas to study this problem in real language models, and we’d like to give an update on what we’ve tried and where we think the problem is now.

At a high level, we found significant evidence of attention superposition, and additionally of cross-layer attention representations. We looked for and did not find evidence of preferred OV dimensions for attention features (one might have expected features to prefer either a single dimension, to encode a skip-trigram, or a high dimension, to copy large chunks of information; instead we see a continuous spectrum). We further found evidence that the QK-condition (“where is information moving to/from?”) and the OV-condition (“what information is being moved?”) are coupled, and have found a practical method for studying these in the form of Multitoken Transcoders.

We believe that the core remaining problem is understanding why models attend where they do, which is to say understanding attention pattern formation. We believe we have a promising path forward on this problem, in the form of QK diagonalization, which we report on below.

Factoring the Problem

We want to understand the nature of computation in attention. To get a handle on this problem, we’ve found it useful to decompose it into a number of smaller pieces, each of which would either inform further investigations or else provide a practical tool for analysis if solved. Those pieces are:

  1. To what extent is there superposition across attention heads within a layer?
  2. To what extent can we combine the QK-condition (“where is information moving to/from?”) and the OV-condition (“what information is being moved?”) into individual attention features? Are these naturally coupled?
  3. To what extent are attention features multidimensional in the OV circuit (e.g. \mathrm{Rank}(W_{\rm OV}) > 1)?
  4. To what extent are attention features multidimensional in the QK circuit (e.g. \mathrm{Rank}(W_{\rm QK}) > 1)?
  5. To what extent are attention features spread across layers?
  6. How are attention patterns formed / how can we best understand the nonlinear action of the QK circuit?

We began our study with an ambitious approach that, if successful, would have answered most of these in one go. We then conducted several more targeted studies to solve individual pieces, and we now have preliminary results on several of these.

In the rest of this update we document our approaches to date, explain how each has updated us, and provide a roadmap of next steps.

Approaches

Attention Replacement Layer

Our first approach was an “Attention Replacement Layer”. The basic idea was to form a wide (many-head) attention layer and train it to mimic the behavior of an attention layer in the base model of interest (i.e. given the same inputs, train on MSE versus the outputs of the base layer). The wide model would be regularized to be simple, in the sense of the QK and OV circuits being low-rank. In some variants we also regularized to make the attention pattern sparse. With this approach, an attention feature is a single head in the replacement layer.

We made some progress with this approach, including some signs of life (interpretable heads in the replacement layer). At the same time, we hit significant ML difficulties that we were unable to resolve. Our current view is that it is likely possible to get this approach to work, but not easy.

We also made some conceptual progress pursuing this approach. We studied a wide array of different regularization schemes, and found that a regularization penalty of the form \left[L_{\rm nuc}(W_{\rm QK}) L_{\rm nuc}(W_{\rm OV})\right]^\beta, with 1/2<\beta<1, has all of the theoretical properties we want in a simplicity regularizer. (Computing the nuclear norm directly is expensive, but there is a cheap proxy when regularizing factored matrices: if W_{\rm QK}=W_{\rm Q} W_{\rm K}, then the minima of L_2(W_{\rm Q})^2 + L_2(W_{\rm K})^2 are the same as those of L_{\rm nuc}(W_{\rm QK}), and the same is true of, e.g., functions of the nuclear norm. See Mazumder+2010 and Kunin+2019 for more details; a code sketch of this proxy follows the list below.) Those properties include:

  1. If the base model implements a set of independent skip-trigrams, this penalty incentivizes splitting these into separate heads, each with rank-1 OV and QK circuits.
  2. If the base model implements an induction feature, where a rank-N QK circuit performs prefix-matching and a rank-M OV circuit moves information, and if these circuits act in mostly-disjoint subspaces, the penalty incentivizes representing this as a single head with rank-N in the QK circuit and rank-M in the OV circuit.
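As a concrete illustration of the factored-matrix proxy mentioned above, here is a minimal sketch for a single replacement head, written in PyTorch with hypothetical parameter names and an arbitrary choice of \beta; it shows the general shape of the penalty rather than the exact version we used.

    import torch

    def simplicity_penalty(W_Q, W_K, W_O, W_V, beta=0.75):
        """Proxy for [L_nuc(W_QK) * L_nuc(W_OV)]^beta for one head.

        For a factored matrix W_QK = W_Q @ W_K, minimizing
        ||W_Q||_F^2 + ||W_K||_F^2 reaches the same minima as minimizing
        the nuclear norm of W_QK (and likewise for functions of the
        nuclear norm), so we can regularize the factors directly instead
        of computing any SVDs.
        """
        qk_proxy = W_Q.pow(2).sum() + W_K.pow(2).sum()
        ov_proxy = W_O.pow(2).sum() + W_V.pow(2).sum()
        return (qk_proxy * ov_proxy) ** beta

In a full replacement layer this penalty would be summed over heads and added to the MSE objective.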

We found some preliminary signs of life with this approach. In particular, we were able to identify heads in the replacement layer implementing specific skip-trigrams and we found heads that appeared to be performing induction. The results were always somewhat messy though, and we spent a fair bit of time trying to understand why.

In the end, our most likely hypothesis is that the ML of our setup was not well optimized. Unfortunately, because the replacement layer has a lot of moving parts and differs more from the models we had worked with previously, we found it difficult to improve the ML. This was both for engineering reasons (scaling this approach to large numbers of heads efficiently proved challenging) and because there was significant design freedom (multiple hyperparameters for the loss, multiple choices of loss function, etc.) with no clear quantitative metric for comparing the options.

As a result of these challenges, our approach is now focused on getting signal on different aspects of the problem in isolation through a variety of simpler experiments.

Multitoken Transcoders

To get at questions (1, 2, 3), we trained Multitoken Transcoders (MTC’s). An MTC takes as input the residual stream over a full context at the input to a layer in the base transformer, and predicts (over the full context) either the residual stream after that layer or the output of the attention part of that layer. It does this with three components (a minimal sketch follows the list):

  1. A token-by-token encoder on the source position, much like the encoder in an SAE. This produces source-side feature activations.
  2. An attention operation. For each feature we have a vector specifying a linear combination of attention patterns from heads in the base layer, padded with an identity pattern to imitate the residual connection and facilitate modeling of MLP computation, if desired. We call this the head loading vector, and we use it to form a feature-specific attention pattern, which we then apply to the source-side activations to move feature activations to their destinations. (For performance reasons we only move the top-k features produced by the encoder on any given context, sorting by the summed activation across the context. This is technically an acausal operation if k is less than the number of active features, but we find that by choosing a large enough k the top-k selection only matters early in training, as fewer than k features are active on any given context late in training.)
  3. A decoder on the destination position.
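Here is the minimal sketch referred to above: a single MTC forward pass in PyTorch, with hypothetical parameter names, a ReLU encoder assumed, and the top-k feature selection omitted. It is meant only to pin down the shapes and the flow of information.

    import torch

    def mtc_forward(resid, base_attn_patterns, W_enc, b_enc, head_loadings, W_dec, b_dec):
        """One MTC forward pass over a full context.

        resid:              [n_ctx, d_model]         residual stream at the layer input
        base_attn_patterns: [n_heads, n_ctx, n_ctx]  attention patterns from the base layer
        W_enc:              [d_model, n_features];   head_loadings: [n_features, n_heads + 1]
        W_dec:              [n_features, d_model]
        """
        n_ctx = resid.shape[0]

        # 1. Token-by-token encoder on the source position (SAE-style).
        acts = torch.relu(resid @ W_enc + b_enc)               # [n_ctx, n_features]

        # Pad the base patterns with an identity "head" to imitate the residual
        # connection and allow MLP-like computation.
        eye = torch.eye(n_ctx).unsqueeze(0)                    # [1, n_ctx, n_ctx]
        patterns = torch.cat([base_attn_patterns, eye], dim=0)

        # 2. Feature-specific attention patterns: a linear combination of head
        #    patterns given by each feature's head loading vector.
        feat_patterns = torch.einsum('fh,hqk->fqk', head_loadings, patterns)

        # Move each feature's source-side activations to destination positions.
        # (In practice only the top-k features are moved, which keeps this cheap.)
        moved = torch.einsum('fqk,kf->qf', feat_patterns, acts)   # [n_ctx, n_features]

        # 3. Decoder at the destination position.
        return moved @ W_dec + b_dec                           # [n_ctx, d_model]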

An advantage of this architecture is that it can implement a powerful fused Attention+MLP operation (with the identity-like “attention head” allowing MLP-like computation to be learned), in which information is moved and simultaneously run through an MLP. This allows MTC’s to potentially produce simpler representations of circuits by merging attention-like operations with MLP-like operations that come either before or after the information is moved. (Ameisen+2025 discusses how transcoders can be used to study circuits in models; in this context, MTC’s can be seen as extending transcoders to incorporate information about attention.)

Additionally, MTC’s form attention patterns in the same way as our toy model of attention superposition, which makes their performance something of a proxy for the hypothesis of attention superposition.

We train our MTC’s on MSE relative to the objective, plus an L1 penalty on the encoder activations and a separate L1 penalty on the negative pre-activations, which we’ve found to be an easy and effective way to avoid dead features. (The penalty on the negative pre-activations in some sense sets the desired mean feature density, since it is active on most examples and balances against the regular L1 penalty times the feature density; as a result, the desired pre-activation penalty is often many orders of magnitude smaller than the regular L1 penalty.)
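A corresponding loss sketch, again with hypothetical names and placeholder coefficients (the exact normalization and coefficient values are not specified here):

    import torch

    def mtc_loss(pred, target, acts, preacts, l1_coef, neg_preact_coef):
        """MSE against the objective, plus an L1 penalty on encoder activations
        and a (typically much smaller) L1 penalty on negative pre-activations,
        which discourages dead features."""
        mse = (pred - target).pow(2).mean()
        l1 = l1_coef * acts.abs().sum(dim=-1).mean()
        neg = neg_preact_coef * torch.relu(-preacts).sum(dim=-1).mean()
        return mse + l1 + neg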

With MTC’s we are explicitly not studying QK dimensionality (question 4), cross-layer behavior (question 5), or how attention patterns form (question 6), since we’ve restricted ourselves to a single layer and in some sense inherit the attention patterns from that layer.

We trained MTC’s with 100,000 features on an 18-layer transformer. We found that:

  1. Many features show clean & interpretable attention patterns. As an example, we found a feature that attends to outlying/surprising inputs (e.g. in a list of all 1’s with a few 2’s sprinkled in, it attends to the 2’s; in code with repeating patterns it attends to the first place where the pattern breaks; etc.). We also found a “valence” feature which tracks valence when negative sentiment is about to get negated: the feature writes out with positive value up until the “flip,” and then with negative value once the flip has occurred.
  2. We see features that clearly correspond to induction behavior, and these have head loading vectors that primarily rely on known induction heads. Moreover, induction features cluster strongly by their head-loading vectors, and it is common to see several features with near-identical head loadings.
  3. We further found evidence that clustering features by their head-loading vectors produces semantically-meaningful clusters beyond the induction example. So for instance we found a cluster of features involved in identifying where the present token is in a sentence/clause, another likely involved in predicting delimiters, and another involved in induction in code contexts.
  4. We see features that attend to the present token and negatively to the previous token (which is allowed since we use linear combinations of attention patterns). We’re not sure what behavior this enables, but it is a common motif. One possibility is that this could be an analog of “edge detectors” for features across tokens.
  5. When we train MTC’s to target the residual stream after a layer (rather than just the output of the attention layer), we see many features that purely attend to the present token, indicating that these are fully MLP-like and do not participate in information movement. When we train MTC’s to target the attention output alone we still see some such features (indicating some purely local computation being done by attention heads), but many fewer.

Attention Superposition

To understand the extent to which attention features exist in superposition (spread across heads), we compared MTC’s against a baseline where each feature is bound to and moved by a single attention head (we implemented this by training an MTC with a Top-1 selection on top of the head loading vector; a sketch follows this paragraph). The single-head MTC’s performed significantly worse on (L0, MSE), usually a factor of 2-4x worse on L0 at matching MSE. We inspected the features in these single-head MTC’s and they looked qualitatively worse than those in regular MTC’s. In particular, they had a much higher incidence of attending between seemingly-unrelated tokens, or tokens that don’t conform to the pattern shown by most token-pairs where the feature is active.
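For concreteness, here is one way the Top-1 restriction on the head loading vector could be implemented (a sketch under our assumptions; the exact gradient treatment, e.g. any straight-through tricks, is not specified above):

    import torch

    def top1_head_loading(head_loadings):
        """Keep only each feature's largest-magnitude head loading and zero the
        rest, so that every feature is moved by exactly one base-model head."""
        idx = head_loadings.abs().argmax(dim=-1, keepdim=True)
        mask = torch.zeros_like(head_loadings)
        mask.scatter_(-1, idx, 1.0)
        return head_loadings * mask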

The fact that restricting MTC features to a single head degrades both performance and interpretability was a moderate update for us towards attention superposition as a hypothesis, especially since MTC’s form attention patterns in precisely the same way as our attention superposition toy model.

QK/OV Coupling

We want to understand the extent to which the QK-condition (“where is information moving to/from?”) and the OV-condition (“what information is being moved?”) are naturally coupled.

If these are coupled, we should expect MTC’s to perform better than a baseline shaped like “move SAE features through attention heads”, since the difference between the two is that the MTC’s can jointly learn what information is moving and how to move it, whereas the SAE doesn’t get to see how the information is moved.

We weren’t sure of exactly the right way to make this comparison, so we compared MTC’s against a few different baselines. First, we compared MTC features qualitatively against SAE features run through the single attention head that most moves that feature in a given context. We made this comparison with the same contexts for both MTC and SAE features. Restricting the SAE features to use a single head favors the MTC in this comparison, but choosing that head dynamically according to the context favors the SAE, since the MTC cannot make that choice dynamically. (It would also be interesting to compare the MTC features against SAE features moved through all attention heads, though we have not yet done this.) In this setup, we found that the MTC features looked somewhat cleaner and more interpretable, but not hugely so, which we interpret as a weak update towards the QK and OV conditions being coupled.

As a second baseline, and to get some quantitative signal, we trained a model jointly on an SAE target and an MTC target. That is, we trained an encoder and a decoder as an SAE (with no attention operation), and we trained a different decoder as a Top-1 MTC. The SAE & MTC steps see the same dataset examples and are interleaved. The MTC training in this case detaches gradients through the encoder, so the SAE training alone determines the feature activations. The idea in this setup is to see whether we can pick better features for modeling attention by accounting for attention in the training process, or whether SAE features perform comparably well.

We qualitatively inspected the features that come from this training process and found that they were substantially harder to interpret than the features we see in either regular MTC’s or regular Top-1 MTC’s. The SAE features moved through a single-head MTC decoder were often quite fuzzy (i.e. broadly-attending) or else polysemantic.

We also quantitatively compared this baseline to regular Top-1 MTC’s, and found that the jointly-trained model performed somewhat better in the low-L_0, high-MSE limit but comparably or substantially worse everywhere else.

Overall, these qualitative and quantitative results suggest that the QK and OV conditions are likely coupled in models, and we should choose approaches to attention that allow us to study them jointly.

Multidimensional OV

Above we found semantically-meaningful clusters of features in head-loading space, including clusters of induction features, clusters of features involving delimiters in text, and clusters of features involving code contexts. Since clustering in head-loading space means a shared attention pattern, the differences among these features have to be in what information they move rather than in their patterns.

This raises a question: are these clusters evidence of attention features with many OV dimensions? For instance, an induction feature might have a large OV dimension so it can move a wide range of different pieces of information. Such features would naturally appear as clusters in head-loading space since our MTC only allows one OV dimension per feature.

Since we know that different attention behaviors require different OV dimensions (e.g. one dimension for skip-trigrams, versus many for induction/copying behavior), we might expect to see many small clusters corresponding to skip-trigrams and related behaviors, and a smaller number of large clusters corresponding to induction/copying and related behaviors.

To study this, we clustered the head-loading vectors of our features using k-means clustering, sweeping over k. Instead of seeing two distinct populations of clusters, we see a smooth, power-law distribution of cluster sizes, and this remains true across many orders of magnitude of k.
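A sketch of this clustering analysis, using scikit-learn with arbitrary example values of k and no particular preprocessing of the head-loading vectors:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_size_spectra(head_loadings, ks=(30, 100, 300, 1000, 3000)):
        """Run k-means over the features' head-loading vectors for several k and
        return the sorted cluster sizes for each, so the size distribution can
        be inspected (e.g. on a log-log plot)."""
        spectra = {}
        for k in ks:
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(head_loadings)
            spectra[k] = np.sort(np.bincount(labels, minlength=k))[::-1]
        return spectra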

This suggests that attention behaviors might form more of a continuum in OV-dimension. For instance, it might be that there are broad induction features that move large chunks of the residual stream as well as narrow ones that move smaller chunks. And similarly there might be generalizations of skip-trigram behavior that involve moving more than one dimension at a time.

One potential limitation of this analysis is that we trained each MTC on just one layer. It is conceivable that, if there is a substantial amount of cross-layer attention superposition, our feature clusters might be “blurred” by being limited to a single layer. We address this limitation using multi-layer methods in the next section.

Attention Value SAE’s

After studying MTC’s, we turned to Attention Value SAE’s, which are SAE’s trained on the concatenated value vectors of multiple attention heads.

Cross-Layer Representations

We first used Attention Value SAE’s to study cross-layer attention representations. Concretely, we trained three Attention Value SAE’s on an 18-layer model: one on the concatenated value vectors of the heads in layer 4, one on the concatenated value vectors of the heads in layer 10, and one on the concatenation of the value vectors of the heads in both of these layers. We chose layers 4 and 10 because they contained the clearest induction heads, and so seemed interesting to study.

We compared the two-layer SAE to the two one-layer SAE’s by adding the L_0’s of the two single-layer runs and combining their normalized MSE’s (summing and dividing by \sqrt{2}). Across a wide range of sparsity penalties, the two-layer SAE performed better than the combination of the one-layer SAE’s, achieving roughly half the L_0 at fixed normalized MSE or roughly 0.3 better normalized MSE at fixed L_0. This is significant evidence in our view of cross-layer attention representations, and an update towards approaches that can capture properties of multiple layers at once.
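For clarity, the combination rule we used for the two single-layer runs, restated as code with hypothetical variable names:

    import numpy as np

    def combine_single_layer_metrics(l0_layer4, nmse_layer4, l0_layer10, nmse_layer10):
        """Combine the two single-layer SAE runs into one (L0, normalized MSE)
        point for comparison against the two-layer SAE: the L0's are summed, and
        the normalized MSE's are summed and divided by sqrt(2)."""
        combined_l0 = l0_layer4 + l0_layer10
        combined_nmse = (nmse_layer4 + nmse_layer10) / np.sqrt(2)
        return combined_l0, combined_nmse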

Multidimensional OV

In our study of MTC’s we used feature clustering to study the distribution of OV dimensions of attentional features. One concern we had is that our results may have been distorted or “blurred” by the MTC being limited to a single layer.

To address this, we repeated our analysis using an Attention Value SAE trained on two layers. We constructed a head loading vector for each SAE feature using the norm of the decoder vector (per head) as a measure of head loading. We then clustered these vectors as before, using k-means clustering and sweeping over k.
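A sketch of this construction, assuming the decoder's output dimensions are ordered head-by-head (shapes and names are hypothetical):

    import torch

    def head_loadings_from_decoder(W_dec, n_heads, d_head):
        """Per-head norms of each feature's decoder vector, used as that
        feature's head-loading vector for the two-layer Attention Value SAE.

        W_dec: [n_features, n_heads * d_head], with value dimensions
        concatenated across the heads of both layers."""
        per_head = W_dec.view(W_dec.shape[0], n_heads, d_head)
        return per_head.norm(dim=-1)   # [n_features, n_heads]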

The resulting clusters look very similar to those we found with a single-layer MTC: we see a power-law distribution of cluster sizes and no evidence of discrete populations of OV dimensions.

Attention Output SAEs

Another strategy we have explored is simply training SAEs on the output of attention layers, as explored in this prior work. As in that work, we observe that these features are broadly interpretable and include induction features and succession features. In addition, we have explored combining them with cross-layer transcoders (CLT’s, which predict MLP outputs) to compute attribution graphs that include attention information. This improves on our CLT-only attribution graphs by localizing attention computation to particular layers and context positions (CLT-only attribution graphs integrate across all possible paths through attention heads in all layers to compute edge strengths), and by allowing us to see a feature visualization for each attention-mediated step. We have found this moderately useful in making attribution graphs more interpretable, but not as much as we expected. Attention output features often seem similar to those found in a CLT-only attribution graph, or easily “guessable” from them.

Notably, attention output SAE features do not help us address the question of how attention patterns are formed. In addition, compared to MTC’s, attention output features provide less information: attention output SAE features have no fixed assignment to attention heads (they receive variable-by-prompt head contributions), and we can only visualize their “query-side” activations (unlike MTC features, where we can visualize both key-side and query-side activations). For these reasons, we generally find MTC features to be preferable. However, attention output SAEs are somewhat more convenient to train (in particular, we can train them using only one token position’s activations at a time, unlike MTC’s, where we need to train on full contexts).

We have also explored cross-layer variants of attention output SAEs that integrate with cross-layer transcoders.  One architecture we have found useful for attribution graphs is as follows:

The reason Type 2 features need to be weakly causal is that attention layer outputs are not fully predictable from the preceding residual stream at a single token position. We find this architecture improves MSE/L0 and simplifies attribution graphs somewhat, compared to combining CLTs and plain attention output SAEs. We take this as some evidence for cross-layer representations in attention outputs.

Attention Pattern Formation

Aside from the Attention Replacement Layer, none of the methods above can tell us, even in principle, about how attention patterns are formed. In many cases this is fine: we can interpret MTC features just by looking at examples they fire on and studying the resulting attention patterns. But in many cases the important question is why the model attended where it did.

For instance, in answering a multiple choice question the model has to choose one of the answers to attend to. All of the interesting logic is involved in this choice! So we really do want a means to understand how attention patterns form.

To this end, we are pursuing an idea we call “QK diagonalization”. The basic premise is to route the query and key vectors from a given attention head in the base model through separate encoders to form query-side and key-side features, respectively. The idea behind “diagonalization” is to then only consider the interaction between the i^{\rm th} query feature and the i^{\rm th} key feature. Each such interaction corresponds to a “QK feature”, which yields an n_{\rm ctx} by n_{\rm ctx} matrix of attention scores. We are experimenting with both rank-1 QK features (where the query-side and key-side encoders are vectors, and the corresponding attention score matrix for each feature is rank 1) and high-rank QK features (where the query-side and key-side encoders are matrices, their interaction is computed using a dot product rather than a product of scalars, and the corresponding attention score matrix may be high-rank).

We then sum these feature-wise attention scores across all features and softmax the result to obtain the reconstructed attention pattern for the given head (a minimal sketch appears below). During training, we penalize the MSE between this pattern and the base model pattern; we also apply various sparsity penalties (separate L1 penalties on the query- and key-side activations, an L1 penalty on the product of the activations, and a negative pre-activation penalty on the query- and key-side features, as explained above).

In follow-up experiments, we jointly trained on all attention heads within a given layer, allowing features to read from the keys/queries of multiple heads and write to the attention patterns of multiple heads; in these cases, how much a feature writes to a given head is determined by a learned head-weights matrix. We also considered variations of the setup in which we use the features to predict the attention patterns directly (i.e., a linear version in which we don’t apply the softmax).
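Here is the minimal sketch referenced above, for the rank-1, single-head version with a softmax. Parameter names are hypothetical, ReLU encoders are assumed, and the sparsity penalties and the multi-head head-weights matrix are omitted.

    import torch

    def qk_diag_pattern(queries, keys, Wq_enc, Wk_enc, causal_mask):
        """Reconstruct one head's attention pattern from diagonal QK-feature
        interactions.

        queries, keys:  [n_ctx, d_head]        query/key vectors from the base head
        Wq_enc, Wk_enc: [d_head, n_features]   query-side / key-side encoders
        causal_mask:    [n_ctx, n_ctx] boolean, True where attention is allowed
        """
        q_acts = torch.relu(queries @ Wq_enc)   # query-side feature activations
        k_acts = torch.relu(keys @ Wk_enc)      # key-side feature activations

        # Diagonal interactions only: the i-th query feature pairs with the i-th
        # key feature, so scores[q, k] = sum_i q_acts[q, i] * k_acts[k, i], and
        # each feature contributes a rank-1 matrix of attention scores.
        scores = q_acts @ k_acts.T              # [n_ctx, n_ctx]

        scores = scores.masked_fill(~causal_mask, float('-inf'))
        return torch.softmax(scores, dim=-1)    # trained on MSE against the base pattern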

Our results so far have shown some signs of life, but using rank-1 QK features it has proved challenging to find a setup that faithfully matches the base attention patterns (i.e., normalized MSE \lessapprox 0.25) with reasonable L0 (\ll d_{\rm head}). One interesting (yet puzzling) result is that in almost all variations we tried, we find a relatively small (\mathcal{O}(1000)) number of induction features that are doing nearly all of the work. The remaining features contribute negligibly to the MSE (as measured by ablation experiments), and this is true regardless of the total number of features. (In the single-head experiments, the non-induction features are usually completely dead, or so rarely active that they don’t meaningfully contribute to the MSE. In the multi-head experiments, if we don’t place constraints on the sign of the head weights, we find that these features tend to cancel out: half contribute negative attention scores and half contribute positive attention scores, resulting in net zero effect on the MSE.)

This behavior is puzzling, particularly in the multi-head case, because the model is choosing to increase L0 without any improvement in MSE, again regardless of the total number of features. Unlike when training autoencoders, increasing the total number of features here does not improve MSE (at fixed L0). This points to either an architectural issue (perhaps it is simply not possible to faithfully model attention pattern formation using a reasonable number of rank-1 QK features) or an ML issue in our training setup. We suspect it may be the former, and as a result we are now pursuing the high-rank QK feature approach.

That said, the induction features we do find using the rank-1 approach are typically quite interpretable and often significantly more complex than token-based induction. For instance, we have found features that mediate induction on surnames, on syllables within words that start with a C or a K, or on words related to cardinal or spatial directions.

We have only been working on this approach in earnest for a short time, so we expect that there is more to be learned by pushing more in this direction.

Conclusion and Outlook

For the first time it feels like we have real traction on understanding attention. In particular, we are now more confident that there is attention superposition in models across both heads and layers. We have updated away from thinking that attentional features form discrete families grouped by dimensionality. And we have developed a practical method in the form of MTC’s to answer the questions of “where is the information moving to/from?” and “what information is being moved?”.

Our biggest remaining question is how to understand why models attend where they do. If we had a solution to that, we could combine it with our solutions above to form a complete picture of information movement in models.

Given this, our next steps split into two classes: understanding pattern formation, and further developing our results above to improve our circuit analyses:

  1. Attention Pattern Formation
     a. We intend to continue studying attention pattern formation through the lens of QK diagonalization and related methods.
  2. Circuit Analysis
     a. We may decide to extend MTC’s to be Multitoken Cross-layer Transcoders (MCTC’s). If we did this, we could potentially replace CLT’s and Attention Output SAE’s in our circuit analysis framework with just MCTC’s, which would simplify both training and analysis.
     b. Even if we do not extend MTC’s to MCTC’s, we may decide to use MTC’s in place of Attention Output SAE’s in studying circuits.
     c. Even though we did not see signs of a single preferred dimensionality in MTC features, we think there is likely value in studying MTC’s with multi-dimensional information movement using a distribution of dimensionalities, analogous to attention heads with varying head dimensions d_{\rm head} > 1.

We may also return to study attention replacement layers, likely through a lens that looks like a combination of QK diagonalization and MTC’s.