Week 8 of 12 · Part B — Alignment Literacy

Superposition & Why Interp Is Hard

Why a neuron isn't a concept — and why networks pack more features than they have dimensions

Day 37 ~60 minutes Concept

Day 37 of 60

The obvious idea that turns out to be wrong

The tempting picture of a neural network is one neuron = one concept: a "dog" neuron, a "refusal" neuron, a "French" neuron. If that were true, interpretability would be easy — just label every neuron. It isn't true, and understanding why is the single most important idea in the field. Real neurons are polysemantic: a single neuron fires for an unrelated grab-bag of things — say, academic citations, HTTP requests, and Korean text — with no clean meaning at all.

The thesis

Neurons are polysemantic because networks use superposition: they represent more features than they have neurons by encoding features as overlapping directions in activation space rather than as individual units. This is a feature, not a bug — it lets a model pack enormous representational capacity into limited width — and it is precisely what makes the model hard to read.

Superposition, concretely

Core Theory

Features are directions, not neurons

The right unit of meaning isn't a neuron; it's a feature — a direction in the high-dimensional activation space. "Refusal" might be a direction that several neurons jointly contribute to, not any single neuron you can point at.

Why the model overloads its neurons

Most features are rare — any given input activates only a few of the thousands a model has learned. Because they rarely co-occur, the network can safely share neurons across many features, storing far more directions than it has dimensions. The cost is interference, but if features are sparse, the interference is usually tolerable. The result: capacity goes up, legibility goes down.

The consequence for you

You cannot interpret a model by reading its neurons one at a time, because each neuron is a blurry superposition of many features. The clean, human-meaningful unit is hiding in a combination of neurons — which is exactly the disentangling problem the rest of the week (sparse autoencoders) is built to solve.

Read this back to yourself

Polysemanticity is the symptom (one neuron, many meanings). Superposition is the cause (more features than neurons, packed as overlapping directions). And the feature — a direction in activation space — is the real unit you actually want to read. Get these three words straight and the whole week clicks.

Why this is the wall interpretability hits

This is why interpretability is genuinely hard, not just tedious. If meaning lived neatly in neurons, decompiling a model would be a labeling exercise. Because meaning lives in superposed directions, you first have to recover the features before you can read them — and there are far more of them than there are neurons to inspect. Superposition is the reason the field needed a new tool, and it sets up tomorrow exactly: sparse autoencoders are the attempt to pull the overlapping features back apart into clean, monosemantic ones you can name and steer.

Carry this forward

When you read about SAEs tomorrow, hold this question in mind: "What problem is this solving?" The answer is always superposition. An SAE is a machine for un-mixing the model's overloaded activations into the underlying features — and that framing keeps you from treating SAEs as magic.

Your work today

Read the Source of the Hardness

~60 minutes

Read the intro of Toy Models of Superposition (Elhage et al., 2022) — and skim the key results figures. You're after the mechanism, not the math.
Write, in your own words, a one-sentence definition of superposition and a one-sentence definition of polysemanticity, and state which causes which.
Answer in writing: why can't you interpret this model by reading its neurons one at a time? Connect your answer explicitly to what sparse autoencoders are for — that connection is the bridge to Day 39.

The expert move

A novice points at a neuron and calls it the "refusal neuron." An expert knows that's almost never how it works: meaning lives in directions, not units, and superposition means the model deliberately packs more features than it has neurons. The altitude jump is from "label the neurons" to "recover the features" — understanding that the unit of interpretation has to be found, not read off.

Say this in an interview: "Interpretability is hard mainly because of superposition — networks represent more features than they have neurons, so neurons are polysemantic and meaning lives in overlapping directions. That's the precise reason we needed sparse autoencoders: to disentangle those superposed features into monosemantic ones we can actually read."

Today's Takeaways

One neuron ≠ one concept — neurons are polysemantic, firing for unrelated things.
Superposition is the cause: networks pack more features than neurons by encoding them as overlapping directions.
The real unit of meaning is a feature (a direction in activation space), which must be recovered, not read off a neuron.
Superposition is exactly the problem sparse autoencoders were built to solve — keep that link for Day 39.