Week 8 of 12 · Part B — Alignment Literacy

Sparse Autoencoders at Scale

Pulling superposed features apart into millions of monosemantic, steerable concepts

Day 39 of 60 · ~65 minutes · Concept

The tool built for the hardest problem

Day 37 named the wall: superposition packs more features than there are neurons, so you can't read meaning off individual units. Day 38 showed the wall in miniature — a probe stops separating cleanly when features overlap. Today is the tool the field built to climb it: the sparse autoencoder (SAE). It's the move from a toy probe on a few hand-labeled examples to disentangling millions of features from a production model — and the result is the current state of the art in reading a frontier system's internals.

The thesis

An SAE learns to re-express a model's crowded activations as a much wider but sparse set of features, with each activation reconstructed from just a handful of active units. Sparsity is the whole trick: forcing only a few features on at a time pushes the autoencoder to allocate one clean, monosemantic feature per concept instead of overloading its units the way the model's neurons are overloaded. It's an un-mixing machine for superposition.
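To make the thesis concrete, here is a minimal NumPy sketch of a generic sparse autoencoder: a ReLU encoder into a wider feature space, a linear decoder back, and a loss that adds an L1 sparsity penalty to the reconstruction error. The sizes, the weight initialization, the `l1_coeff` value, and the function names are illustrative assumptions, and the weights are untrained. Published setups differ in details such as how the penalty is weighted, so read this as the general shape of the idea rather than any paper's exact recipe.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x (batch, d_model) into features, then reconstruct."""
    # ReLU keeps feature activations non-negative; after training with the
    # sparsity penalty below, most of them should be exactly zero.
    f = np.maximum(0.0, x @ W_enc + b_enc)   # (batch, n_features)
    x_hat = f @ W_dec + b_dec                # (batch, d_model)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Hypothetical toy sizes: the feature dictionary is 8x wider than the activation.
rng = np.random.default_rng(0)
d_model, n_features = 16, 128
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))            # stand-in for real model activations
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
print(f.shape, x_hat.shape, float(loss))
```

Training would minimize `sae_loss` over many activations. The two terms pull against each other: reconstruction wants many features on, the L1 term wants few, and the compromise is what encourages each feature to stand for one concept.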

What Scaling Monosemanticity actually showed

Core Theory

1 · Monosemantic features, from a real model

Applied to a production model (Claude 3 Sonnet), SAEs with dictionaries of roughly 1 million, 4 million, and 34 million features extracted a large number of genuinely interpretable features: a feature that fires on the Golden Gate Bridge, features for code bugs, for specific people, for abstract concepts. Not toy data: the actual internals of a deployed system.

2 · Safety-relevant features exist and are steerable

Critically, some features are safety-relevant — features that track deception, sycophancy, dangerous content, and bias. And they're not just readable: clamping a feature on or off measurably steers the model's behavior. That's the Day 38 read-and-steer idea, realized at scale on real features.
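The clamping idea can be sketched in a few lines. This is an illustrative assumption about the mechanics, not the paper's code: encode the activation, force one feature to a chosen value, decode, and add back the part of the activation the SAE could not reconstruct so that only the chosen feature direction changes. The function name `clamp_feature`, the toy sizes, and the random weights are all hypothetical; in a real experiment the steered activation would replace the original one in the model's forward pass, which is not shown here.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, idx, value):
    """Return activations x with SAE feature `idx` forced to `value`."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # feature activations
    x_hat = f @ W_dec + b_dec                # SAE reconstruction
    error = x - x_hat                        # what the SAE fails to capture
    f_clamped = f.copy()
    f_clamped[:, idx] = value                # turn the feature up, down, or off
    # Re-adding the error term leaves everything except feature `idx` untouched.
    return f_clamped @ W_dec + b_dec + error

# Hypothetical toy setup with untrained random weights.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=(2, d_model))
feature_idx = 3                              # pretend this is the feature of interest
x_steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, 10.0)
shift = x_steered - x                        # lies along decoder row `feature_idx`
```

The shift is a multiple of one decoder direction, which is why a clean feature gives a targeted intervention. Setting `value` to 0.0 corresponds to switching the feature off.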

3 · This is what a "lie detector" would mean in practice

If a deception feature reliably fires when a model is being deceptive, you have something black-box testing never could: a window into the model's internal state, not just its output. That is the promise that makes interpretability the field's most ambitious safety bet.

The honest limits — say these out loud

Here is where credibility is won or lost. Interpretability is genuinely exciting and nowhere near a safety guarantee at frontier scale. A practitioner states both without flinching.

What interpretability cannot yet do

Coverage is incomplete: SAEs capture many features but not all, and nobody can claim to have found every safety-relevant feature, so "no deception feature fired" is not proof of safety. The method doesn't scale freely: extracting and validating features is expensive and lags the largest models. Features aren't perfectly clean: interpretation involves judgment, and a "deception" feature may be correlational rather than the true mechanism. And a sufficiently capable model could, in principle, learn representations that evade the tools. Interpretability today is a powerful research instrument, not a certified verifier.
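One partial, quantitative handle on the coverage gap is how much of an activation the SAE fails to reconstruct. The sketch below computes the fraction of variance unexplained (FVU), a common reconstruction metric; the function name and the toy inputs are assumptions for illustration. A low FVU does not show that every safety-relevant feature has been found. A high one does show that part of the model's state is invisible to the dictionary.

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """Residual energy of the reconstruction relative to the variance of x."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return residual / total

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))                 # stand-in for real activations

# A perfect reconstruction leaves nothing unexplained.
fvu_perfect = fraction_variance_unexplained(x, x)

# Predicting only the mean activation explains none of the variance.
mean_only = np.broadcast_to(x.mean(axis=0), x.shape)
fvu_mean = fraction_variance_unexplained(x, mean_only)
```

A real SAE would land somewhere between these two extremes, and whatever it leaves unexplained is where an undetected feature could live.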

The posture to hold

"Promising but not yet sufficient" is not a hedge — it's the correct, defensible position. Overselling interpretability ("we can now read minds, models are safe") is as much a failure as dismissing it. Your job is to convey real progress and real limits in the same breath. That balance is exactly what Day 40's Part B brief has to capture.

Your work today

Read the State of the Art — and Its Edges

~65 minutes

Browse Scaling Monosemanticity (Anthropic, 2024) and read the safety-relevant features section closely. Pick one specific feature (e.g., a deception or bias feature) and note what was shown about reading and steering it.
Write the two-column note that anchors your week: in your own words, what interpretability can do today, and what it cannot yet guarantee at frontier scale. Be specific — this goes straight into the Day 40 brief.
Connect back: in one sentence, explain why SAEs are the answer to the superposition problem you read about in Toy Models of Superposition.
Optional: if you're going hands-on, revisit Mechanistic Interpretability — Getting Started (Neel Nanda) for the open problems list.

The expert move

An enthusiast says "interpretability solved alignment; we can read deception now." An expert delivers progress and limits in one sentence: SAEs extract millions of monosemantic, steerable features including safety-relevant ones, but coverage is incomplete, validation is expensive, and a capable model could learn to evade the tools, so this is the best instrument we have, not a certified verifier. The skill is holding genuine optimism and honest skepticism simultaneously.

Say this in an interview: "Sparse autoencoders are the real advance against superposition. Scaling Monosemanticity pulled millions of interpretable, steerable features out of a production model, including ones for deception and bias. But I'd never call it a safety guarantee: coverage is incomplete and a capable model could learn to evade the tools. It's the field's best lie-detector prototype, not a finished one."

Today's Takeaways

A sparse autoencoder re-expresses crowded activations as a wide, sparse set of monosemantic features — the answer to superposition.
Scaling Monosemanticity extracted millions of interpretable features from a production model, including steerable, safety-relevant ones (deception, bias).
The honest limits: incomplete coverage, high cost, imperfect features, and possible evasion — "no feature fired" is not proof of safety.
"Promising but not yet sufficient" is the credible position — convey progress and limits in the same breath.