Why a neuron isn't a concept — and why networks pack more features than they have dimensions
Day 37 of 60
The tempting picture of a neural network is one neuron = one concept: a "dog" neuron, a "refusal" neuron, a "French" neuron. If that were true, interpretability would be easy — just label every neuron. It isn't true, and understanding why is the single most important idea in the field. Real neurons are polysemantic: a single neuron fires for an unrelated grab-bag of things — say, academic citations, HTTP requests, and Korean text — with no clean meaning at all.
Neurons are polysemantic because networks use superposition: they represent more features than they have neurons by encoding features as overlapping directions in activation space rather than as individual units. This is a feature, not a bug — it lets a model pack enormous representational capacity into limited width — and it is precisely what makes the model hard to read.
The right unit of meaning isn't a neuron; it's a feature — a direction in the high-dimensional activation space. "Refusal" might be a direction that several neurons jointly contribute to, not any single neuron you can point at.
Most features are rare — any given input activates only a few of the thousands a model has learned. Because they rarely co-occur, the network can safely share neurons across many features, storing far more directions than it has dimensions. The cost is interference, but if features are sparse, the interference is usually tolerable. The result: capacity goes up, legibility goes down.
You cannot interpret a model by reading its neurons one at a time, because each neuron is a blurry superposition of many features. The clean, human-meaningful unit is hiding in a combination of neurons — which is exactly the disentangling problem the rest of the week (sparse autoencoders) is built to solve.
Polysemanticity is the symptom (one neuron, many meanings). Superposition is the cause (more features than neurons, packed as overlapping directions). And the feature — a direction in activation space — is the real unit you actually want to read. Get these three words straight and the whole week clicks.
This is why interpretability is genuinely hard, not just tedious. If meaning lived neatly in neurons, decompiling a model would be a labeling exercise. Because meaning lives in superposed directions, you first have to recover the features before you can read them — and there are far more of them than there are neurons to inspect. Superposition is the reason the field needed a new tool, and it sets up tomorrow exactly: sparse autoencoders are the attempt to pull the overlapping features back apart into clean, monosemantic ones you can name and steer.
When you read about SAEs tomorrow, hold this question in mind: "What problem is this solving?" The answer is always superposition. An SAE is a machine for un-mixing the model's overloaded activations into the underlying features — and that framing keeps you from treating SAEs as magic.
A novice points at a neuron and calls it the "refusal neuron." An expert knows that's almost never how it works: meaning lives in directions, not units, and superposition means the model deliberately packs more features than it has neurons. The altitude jump is from "label the neurons" to "recover the features" — understanding that the unit of interpretation has to be found, not read off.
Say this in an interview: "Interpretability is hard mainly because of superposition — networks represent more features than they have neurons, so neurons are polysemantic and meaning lives in overlapping directions. That's the precise reason we needed sparse autoencoders: to disentangle those superposed features into monosemantic ones we can actually read."