Transformer Circuits Thread

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

An informal note on some intuitions related to Mechanistic Interpretability by Chris Olah.

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program. After all, neural network parameters are in some sense a binary computer program which runs on one of the exotic virtual machines we call a neural network architecture.

This is actually quite a deep analogy. We'll discuss it more as this essay unfolds, but some parallels are listed in the table below:

Regular Computer Programs    | Neural Networks
-----------------------------|------------------------------------
Reverse Engineering          | Mechanistic Interpretability
Program Binary               | Network Parameters
VM / Processor / Interpreter | Network Architecture
Program State / Memory       | Layer Representation / Activations
Variable / Memory Location   | Neuron / Feature Direction

Taking this analogy seriously can let us explore some of the big picture questions in mechanistic interpretability. Often, questions that feel speculative and slippery for reverse engineering neural networks become clear if you pose the same question for reverse engineering of regular computer programs. And it seems like many of these answers plausibly transfer back over to the neural network case.

Perhaps the most interesting observation is that this analogy seems to suggest that finding and understanding interpretable neurons – analogous to understanding variables in a computer program – isn't just one of many interesting questions. Arguably, it's the central task.

Attacking the Curse of Dimensionality

Every approach to interpretability must somehow overcome the curse of dimensionality. Neural networks are functions which typically have extremely high-dimensional input spaces, and the n-dimensional volume of such a space grows exponentially as the number of dimensions increases, making it incredibly large. This is the curse of dimensionality. It is normally brought up as a challenge to learning functions: how can we learn a function over such a large input space without an exponential amount of data? But it's also a challenge for interpretability: how can we hope to understand a function over such a large space without an exponential amount of time?

A possible objection is that, while the input space is high-dimensional, we only need to understand the behavior of the function over the data manifold of actual inputs. (I was, at one point, personally quite invested in this approach!) However, while the data manifold is certainly lower-dimensional than the input space, for tasks we care about like vision or language it seems like it must still be very high-dimensional. For example, for any given natural image, it seems like one could move along the data manifold by having an arbitrary object enter the image from any side of the field of view – that's a lot of dimensions!
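To get a feel for the scale involved, here is a back-of-the-envelope sketch (illustrative numbers of my own, not from the original essay): even a coarse discretization of the input space contains astronomically many points once the dimensionality is large.

```python
import math

# Illustrative only: count the points in a coarse grid over the input space,
# allowing 256 values per dimension (e.g. one 8-bit pixel intensity).
values_per_dim = 256
for n_dims in [1, 10, 100, 150_528]:  # 150,528 = 224 * 224 * 3, a typical image input
    digits = n_dims * math.log10(values_per_dim)
    print(f"{n_dims:>7} dims -> roughly 10^{digits:,.0f} distinct inputs")
```

No amount of exhaustive enumeration survives that kind of growth.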

One answer is to study toy neural networks with low-dimensional inputs, dodging the problem entirely and making full understanding easy. Another answer is to study the behavior of neural networks in a neighborhood around an individual data point of interest – this is roughly the answer of saliency maps.

How does mechanistic interpretability solve the curse of dimensionality? It's worth asking this question of regular reverse engineering as well! Somehow, a programmer reverse engineering a computer program is able to understand its behavior, often over an incredibly high-dimensional space of inputs. They are able to do this because the code gives a non-exponential description of the program's behavior, which is an alternative to understanding the program as a function over a huge input space. We can aim for this same answer in the context of artificial neural networks. Ultimately, the parameters are a finite description of a neural network, if we can somehow understand them.

Of course, the parameters may be very large – hundreds of billions of parameters for the largest language models! But binary computer programs like a compiled operating system can also be very large, and we're often able to eventually understand them.

Another consequence is that we shouldn't expect mechanistic interpretability to be easy or to follow a cookie-cutter process. People often want interpretability to provide simple answers and short explanations. But we should expect mechanistic interpretability to be at least as difficult as reverse engineering a large, complicated computer program.

Variables & Activations

Understanding a computer program requires us to both understand variables and understand operations acting on those variables. A statement like y = x + 5; is meaningless unless one understands what y and x are. At the same time, ultimately the meaning of y and x comes from how they're used by operations somewhere in the program. Presumably this is why variable names are so useful!

Reverse engineers generally don't have the benefit of variable names. They need to figure out what each variable actually represents. In fact, that understates the problem. Programs actually act on a collection of computer memory – their state – which humans think about in terms of discrete variables. In many cases, it's obvious how memory maps to variables, but this isn't always the case. So a reverse engineer must figure out how to segment the memory into variables which can be understood separately, and then what the meaning of those variables is. Put another way, a reverse engineer must assign meaning to positions in memory.

Reverse engineering neural networks faces an almost identical challenge. As discussed by Voss et al., neural network parameters might be thought of as binary instructions, while neuron activations are analogous to variables or memory. Each parameter describes how previous activations affect later activations. We can only understand the meaning of a parameter if we understand the input and output activations.

But the activations are high dimensional vectors: how can we hope to understand them? We've run back into the curse of dimensionality, but again, regular reverse engineering points at a solution. It's possible to understand computer program memory – also a high-dimensional space! – because we can segment it into variables which can be reasoned about and understood separately. Similarly, we need to break neural network activations into independently understandable pieces.

In some very special cases, including attention-only transformers (see Elhage et al.), we can use linearity to describe all of the network's operations in terms of the inputs and outputs of the model. This is similar to how some functions in a computer program can be described solely in terms of their arguments and return value, without intermediate variables. Since we generally understand the inputs and outputs, this allows us to sidestep the problem. But in most cases, we can't do this – what then?
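To make the special case above concrete, here is a minimal sketch (a toy setup of my own, not the formalism of Elhage et al.): for a single attention head with a fixed attention pattern, the head's output is a linear function of its input, which is the kind of structure that lets attention-only models be described end to end without reference to intermediate activations.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4

X   = rng.normal(size=(seq_len, d_model))     # residual stream inputs
W_V = rng.normal(size=(d_model, d_head))      # value projection
W_O = rng.normal(size=(d_head, d_model))      # output projection

# A fixed attention pattern with rows summing to 1. In a real model this
# comes from the QK circuit; holding it fixed exposes the linear structure.
A = rng.random(size=(seq_len, seq_len))
A = A / A.sum(axis=1, keepdims=True)

head_output = A @ (X @ W_V) @ W_O             # the usual per-head computation

# The same map written as a single linear operator acting on X:
W_OV = W_V @ W_O
print(np.allclose(head_output, A @ X @ W_OV))  # True
```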

If we can't avoid the problem with special tricks, mechanistic interpretability requires that we somehow decompose activations into independently understandable pieces.

Simple Memory Layout & Neurons

Computer programs often have memory layouts that are convenient to understand. For example, most bytes in memory represent a single thing, rather than having each bit represent something unrelated. This is partly because these layouts are easier for programmers to think about, but it's also because our hardware often makes "simple" contiguous memory layouts more efficient. These simple memory layouts make it easier to reverse engineer computer programs, because it's easier to figure out how memory should be broken up into independently understandable variables.

Something kind of analogous to this often happens in neural networks.

Let's assume for a moment that neural networks can be understood in terms of operations on a collection of independent "interpretable features". In principle, one could imagine "interpretable features" being embedded as arbitrary directions in activation space. But often, it seems like neural network layers with activation functions align features with their basis dimensions. This is because the activation functions in some sense make these directions natural and useful. Just as a CPU having operations that act on bytes encourages information to be grouped in bytes rather than randomly scattered over memory, activation functions often encourage features to be aligned with a neuron, rather than correspond to a random linear combination of neurons. We call this a privileged basis.
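One way to see why, sketched below with a toy example of my own (not from the essay): an elementwise nonlinearity like ReLU does not commute with an arbitrary change of basis, so the neuron basis is mathematically distinguished in a way it isn't for a purely linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=4)                         # an activation vector
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # a random orthogonal change of basis

relu = lambda v: np.maximum(v, 0.0)

# For a linear layer W, changing basis first is just another linear layer:
# W @ (Q @ x) == (W @ Q) @ x, so no basis is special.
# For ReLU, the order matters:
a = relu(Q @ x)     # change basis, then apply the nonlinearity
b = Q @ relu(x)     # apply the nonlinearity, then change basis

print(np.allclose(a, b))  # almost always False: ReLU "sees" the neuron basis
```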

Having features align with neurons would make neural networks much easier to reverse engineer. This isn't to say that a neural network is impossible to reverse engineer without neurons being individually understandable. But it seems much harder, just as it is harder to reverse engineer computer programs with strange memory layouts.

Unfortunately, many neurons can't be understood this way. These polysemantic neurons seem to help represent features which are not best understood in terms of individual neurons. This is a really tricky problem for reverse engineering neural networks, which we discuss more in the SoLU paper (see especially Section 3).
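One hypothesized mechanism behind polysemanticity, discussed in that work, is superposition: representing more features than there are neurons by spreading them across non-orthogonal directions. The toy construction below is my own sketch, not a model from the paper.

```python
import numpy as np

# Three sparse features embedded as three directions in a 2-neuron space.
angles = np.deg2rad([15.0, 135.0, 255.0])
feature_dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3 features, 2 neurons)

def neuron_activations(feature_intensities):
    """Map a vector of 3 feature intensities to the 2 neuron activations."""
    return feature_intensities @ feature_dirs

for i in range(3):
    acts = neuron_activations(np.eye(3)[i])    # present feature i on its own
    print(f"feature {i} alone -> neurons {np.round(acts, 2)}")

# Each neuron responds to all three features, so neither neuron is individually
# interpretable, even though the three feature directions themselves are.
```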

For now, the main point we wish to make is that the ability to decompose representations into independently understandable parts seems essential for the success of mechanistic interpretability.

Learn More

This essay is a very informal discussion of some of the intuitions that motivate me in personally working on mechanistic interpretability. If you found it interesting and want to learn more about this line of research, check out the original Circuits thread and the new Transformer Circuits thread.

Random Related Topics FAQ

Doesn't having interpretable neurons go against having a distributed representation?

That isn't how we understand the term "distributed representation" in the context of Deep Learning. For example, a word embedding might have directions corresponding to interpretable features such as singular-plural, male-female, and large-small, and one could rotate the embedding so that those directions become basis dimensions. The thing that makes it "distributed" is that independent aspects of words are represented by different directions in representation space.
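As a concrete sketch of this answer (toy numbers of my own, not from the essay): whether the interpretable features lie along the coordinate axes or along rotated directions changes nothing about what the embedding represents, so an interpretable basis and a distributed representation are entirely compatible.

```python
import numpy as np

# Toy embeddings; columns: [singular-plural, male-female, large-small]
embeddings = {
    "king":   np.array([-1.0,  1.0,  0.5]),
    "queens": np.array([ 1.0, -1.0,  0.5]),
    "mice":   np.array([ 1.0,  0.0, -1.0]),
}

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # an arbitrary orthogonal change of basis

plural_axis = np.array([1.0, 0.0, 0.0])        # the feature as a basis dimension
rotated_plural = R @ plural_axis               # the same feature, no longer basis-aligned

for word, vec in embeddings.items():
    from_coordinate = vec @ plural_axis              # read the feature off a coordinate
    from_direction  = (R @ vec) @ rotated_plural     # read it off a rotated direction
    print(word, round(from_coordinate, 3), round(from_direction, 3))  # identical
```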

What do you mean by "feature"?

Defining a "feature" in a satisfying way is surprisingly hard. This isn't necessarily a bad thing. Sometimes when you're trying to understand something, you need to struggle with lots of definitions. (Lakatos' famous Proofs and Refutations is a beautiful articulation of this.)

The easy answer would be to say a feature is something like the curve detector or car detector neurons found in InceptionV1 – some meaningful, articulable property of the input which the network encodes as a direction in activation space. The unsatisfactory thing about this definition is that it's human-centric: it excludes the possibility of features humans don't understand.

An alternative definition, which I find more satisfactory, is to say that features are whatever "independent units" a neural network representation can be decomposed into. That definition is a bit fuzzier, but I suspect it might be getting at something more fundamental. Another idea for a potential human-independent definition is that features are the things a network would ideally dedicate a neuron to if you gave it enough neurons.

Some researchers define features simply as any function of the input. The thing I find unsatisfactory about this is that, under this definition, any mixture of features is also a feature. It seems to me that there's something importantly different between a "car detector" (which seems to correspond to an important latent variable) and a "car and cat detector" (which seems to be an arbitrary mixture of things). I'd like a definition of a feature to make this distinction.

A lot of people seem to think that modularity is the key thing for understanding neural networks, but the circuits work keeps focusing on whether neurons can be understood. Why the difference? 

The brief answer is that both seem very helpful – and we're hopeful they can be linked, as in branch specialization – but it seems to us that being able to decompose representations into understandable features is what makes mechanistic interpretability possible at all, while modularity may make it easier once it is possible.

The intuitive focus on modularity seems very natural. In day-to-day programming, modularity is often the difference between code being easy or hard to understand. Another intuition is that the only way we'll be able to understand neural networks is if we can break them down into small enough chunks to understand, which seems like a kind of modularity. However, while modularity helps in day-to-day programming, we think it isn't what makes it possible to understand programs at all. Understanding a computer program written as one giant function might be difficult, but trying to understand a program where memory isn't broken down into variables with coherent meanings (let alone variable names) seems terrifying!

In some ways, interpretable features create the ultimate kind of modularity: the network can be broken down into individual parameters, since each parameter can be thought of independently as connecting two features.
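As a small toy illustration of this last point (hypothetical feature names and weights, not taken from any real model): once both layers are described in terms of interpretable features, each individual weight can be read on its own as a connection strength between two features.

```python
import numpy as np

prev_features = ["curve detector", "fur texture", "wheel"]
next_features = ["cat head", "car"]

# Hypothetical weights, shape (next features, previous features).
W = np.array([
    [ 0.9,  1.2, -0.3],   # "cat head": excited by curves and fur, inhibited by wheels
    [ 0.4, -0.8,  1.5],   # "car": excited by curves and wheels, inhibited by fur
])

for i, out_f in enumerate(next_features):
    for j, in_f in enumerate(prev_features):
        print(f"W[{i},{j}] = {W[i, j]:+.1f}   ({in_f} -> {out_f})")
```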

Publication

Published June 27, 2022 by Chris Olah.

Acknowledgments

The ideas in this note are based on many conversations with others, especially Nelson Elhage, Catherine Olsson, Tristan Hume, Dario Amodei, Chelsea Voss, Nick Cammarata, Gabriel Goh, Neel Nanda, Martin Wattenberg, Shan Carter, and Ludwig Schubert. The note itself was greatly improved by the feedback of Nelson Elhage and Martin Wattenberg.