Pulling superposed features apart into millions of monosemantic, steerable concepts
Day 39 of 60
Day 37 named the wall: superposition packs more features than there are neurons, so you can't read meaning off individual units. Day 38 showed the wall in miniature — a probe stops separating cleanly when features overlap. Today is the tool the field built to climb it: the sparse autoencoder (SAE). It's the move from a toy probe on a few hand-labeled examples to disentangling millions of features from a production model — and the result is the current state of the art in reading a frontier system's internals.
An SAE learns to re-express a model's crowded activations as a much wider but sparse set of features, with each activation reconstructed from just a handful of active units. Sparsity is the whole trick: forcing only a few features on at a time pushes the SAE toward dedicating one clean, monosemantic feature to each concept, instead of overloading units the way the model's own neurons do. It's an un-mixing machine for superposition.
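To make the "wider but sparse" idea concrete, here is a minimal NumPy sketch of one SAE forward pass and its training loss. Everything in it is an illustrative assumption rather than the setup used on any production model: the sizes (`d_model`, `d_features`), the random untrained weights, the ReLU encoder, and the `l1_coeff` penalty weight are all hypothetical choices for a toy example.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into a wide, non-negative feature vector, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU: a feature is either off (0) or on (> 0)
    x_hat = f @ W_dec + b_dec               # reconstruction = weighted sum of decoder rows
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Toy sizes (assumed): 16-dimensional activations expanded into 128 features.
rng = np.random.default_rng(0)
d_model, d_features, batch = 16, 128, 4
x = rng.normal(size=(batch, d_model))       # stand-in for real model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
print(f.shape, x_hat.shape)
```

With random weights the features are not yet sparse or meaningful. Training is what drives the loss down, and the L1 term is what makes "only a few features on at a time" the cheapest way to reconstruct each activation.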
Applied to a production model (Claude 3 Sonnet), an SAE extracted millions of features that are genuinely interpretable — a feature that fires on the Golden Gate Bridge, features for code bugs, for specific people, for abstract concepts. Not toy data: the actual internals of a deployed system.
Critically, some features are safety-relevant — features that track deception, sycophancy, dangerous content, and bias. And they're not just readable: clamping a feature on or off measurably steers the model's behavior. That's the Day 38 read-and-steer idea, realized at scale on real features.
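Clamping can be sketched in a few lines. The version below is a toy under stated assumptions: the SAE weights are random, `clamp_feature` is a hypothetical helper name, and feature index 7 and clamp value 10.0 are arbitrary. The idea it illustrates is to encode an activation, overwrite one feature, decode, and add back whatever the SAE failed to reconstruct so that only the chosen feature's contribution changes.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, value):
    """Return activations with one SAE feature forced to `value`."""
    f = np.maximum(0.0, x @ W_enc + b_enc)     # current feature activations
    x_hat = f @ W_dec + b_dec                  # SAE reconstruction
    error = x - x_hat                          # part of x the SAE does not capture
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = value        # force the feature on (or off with 0.0)
    x_hat_clamped = f_clamped @ W_dec + b_dec
    return x_hat_clamped + error               # steered activation to pass onward

# Toy setup (assumed sizes and random weights, not a trained SAE).
rng = np.random.default_rng(0)
d_model, d_features, batch = 16, 128, 4
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx=7, value=10.0)
print(steered.shape)
```

Because the error term is added back, the edit reduces to shifting the activation along one decoder direction, `W_dec[7]`, by the difference between the clamped value and the feature's original value. In a real experiment the steered activation would replace the original inside the model's forward pass, and the change in behavior would be measured downstream.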
If a deception feature reliably fires when a model is being deceptive, you have something black-box testing could never provide: a window into the model's internal state, not just its output. That is the promise that makes interpretability the field's most ambitious safety bet.
Here is where credibility is won or lost. Interpretability is genuinely exciting and nowhere near a safety guarantee at frontier scale. A practitioner states both without flinching.
Coverage is incomplete. SAEs capture many features but not all; nobody can claim to have found every safety-relevant feature, so "no deception feature fired" is not proof of safety. The method doesn't scale freely: extracting and validating features is expensive and lags the largest models. Features aren't perfectly clean: interpretation involves judgment, and a "deception" feature may merely correlate with deception rather than being the mechanism that produces it. And a sufficiently capable model could, in principle, learn representations that evade the tools. Interpretability today is a powerful research instrument, not a certified verifier.
"Promising but not yet sufficient" is not a hedge — it's the correct, defensible position. Overselling interpretability ("we can now read minds, models are safe") is as much a failure as dismissing it. Your job is to convey real progress and real limits in the same breath. That balance is exactly what Day 40's Part B brief has to capture.
An enthusiast says "interpretability solved alignment — we can read deception now." An expert delivers progress and limits in one sentence: SAEs extract millions of monosemantic, steerable features including safety-relevant ones, and coverage is incomplete, validation is expensive, and a capable model could learn to evade the tools — so this is the best instrument we have, not a certified verifier. The altitude is holding genuine optimism and honest skepticism simultaneously.
Say this in an interview: "Sparse autoencoders are the real advance against superposition. Scaling Monosemanticity pulled millions of interpretable, steerable features out of a production model, including ones for deception and bias. But I'd never call it a safety guarantee: coverage is incomplete and a capable model could learn to evade the tools. It's the field's best lie-detector prototype, not a finished one."