Week 8 of 12 · Part B — Alignment Literacy

Reading a Concept with a Probe

If a concept is a direction in activation space, you can read it — and steer it

Day 38 ~70 minutes Build

Day 38 of 60

From theory to a thing you can run

Two days of theory said the unit of meaning is a direction in activation space, hidden by superposition. Today you make that concrete with the smallest honest version of the idea: a linear probe. Given activation vectors labeled by whether they encode some concept — say "refusal" — you find the direction that best separates the labels, then measure how cleanly the concept reads off that direction. If it's separable, the concept is linearly readable, and the very same direction is what you'd push along to steer the model. Read and steer are two ends of one idea.

The thesis

A linear probe is the seed of all of feature analysis and activation steering. If a concept corresponds to a direction, then (1) you can read it — project an activation onto the direction and see how much of the concept is present — and (2) you can steer it — add the direction back into the activations to amplify the concept or subtract it to suppress it. The toy below is the 20-line version of the whole promise.
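The read and steer halves of the thesis can be sketched on a single toy vector. This is a minimal illustration, not the course's feature_probe.py: the `direction` and `activation` values and the `steer` helper are made up for the example. In a real model the addition would happen inside the forward pass, and whether behavior actually changes is something you would have to test.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(activation, direction, alpha):
    """Add alpha * direction to the activation; a negative alpha suppresses."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Assumed values for illustration only (not taken from any real model).
direction = [4.0, 1.0]    # hypothetical concept direction
activation = [-1.0, 0.0]  # hypothetical activation with little of the concept

before = dot(activation, direction)            # read: how much concept is present
amplified = steer(activation, direction, 0.5)  # steer toward the concept
suppressed = steer(activation, direction, -0.5)  # steer away from it

print(before)                      # -4.0
print(dot(amplified, direction))   # 4.5
print(dot(suppressed, direction))  # -12.5
```

Each steering step shifts the projection by alpha times the squared length of the direction (here 0.5 × 17 = 8.5), which is why the same vector serves both for reading and for pushing.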

How the probe works

Core Theory

1 · A concept direction is just a difference of means

Take the activations labeled "concept present" and the ones labeled "absent." Subtract the average of the negatives from the average of the positives. That difference vector points from "no concept" toward "concept" — a crude but real concept direction.
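As a rough sketch of that subtraction, here is the difference of means on six made-up 2-D vectors. The data and the helper names `mean_vector` and `concept_direction` are assumptions for illustration; real activations would have hundreds or thousands of dimensions.

```python
# Toy 2-D "activations" (invented for this example).
positives = [[2.0, 1.0], [3.0, 1.5], [2.5, 0.5]]      # labeled "concept present"
negatives = [[-1.0, 0.0], [-2.0, 0.5], [-1.5, -0.5]]  # labeled "concept absent"

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_direction(pos, neg):
    """Difference of class means: points from 'absent' toward 'present'."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

direction = concept_direction(positives, negatives)
print(direction)  # [4.0, 1.0]
```

The positive mean is (2.5, 1.0) and the negative mean is (-1.5, 0.0), so the direction is (4.0, 1.0): mostly along the first axis, which is where these toy classes differ most.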

2 · Reading = projecting onto the direction

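To check whether a new activation encodes the concept, take its dot product with the direction. A projection that is high relative to the negatives means the concept is present. Pick a threshold (the midpoint between the two classes' mean projections is a reasonable default) and you have a classifier built entirely from the model's internals.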
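A small sketch of reading by projection, under assumed values: the `direction` and class means below are invented toy numbers, and the midpoint threshold is one simple choice among several, not necessarily what feature_probe.py uses.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Assumed toy values for illustration.
direction = [4.0, 1.0]  # hypothetical concept direction
mu_pos = [2.5, 1.0]     # hypothetical mean of "concept present" activations
mu_neg = [-1.5, 0.0]    # hypothetical mean of "concept absent" activations

# Midpoint between the two projected class means (an assumed thresholding rule).
threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2

def reads_concept(activation):
    """True if the activation projects above the threshold."""
    return dot(activation, direction) > threshold

print(threshold)                    # 2.5
print(reads_concept([2.0, 0.0]))    # True  (projection 8.0)
print(reads_concept([-1.0, 1.0]))   # False (projection -3.0)
```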

3 · Separability is the score that matters

If the projections cleanly split the two classes, the concept is linearly separable: it lives along a readable direction. Low separability means the concept isn't there, isn't encoded linearly, or is tangled up in superposition with other features. That last case is exactly the limitation that motivates sparse autoencoders.
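One simple way to put a number on separability is the fraction of examples that land on the correct side of a midpoint threshold. That scoring rule is an assumption made here for illustration; the course's feature_probe.py may define the score differently. The data and function names are likewise invented.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def separability(pos, neg):
    """Fraction of examples on the correct side of the midpoint threshold."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2
    correct = sum(dot(x, direction) > threshold for x in pos)
    correct += sum(dot(x, direction) <= threshold for x in neg)
    return correct / (len(pos) + len(neg))

# Toy data (invented): two clearly distinct clusters.
positives = [[2.0, 1.0], [3.0, 1.5], [2.5, 0.5]]
negatives = [[-1.0, 0.0], [-2.0, 0.5], [-1.5, -0.5]]

print(separability(positives, negatives))  # 1.0
```

Because this score is computed on the same examples used to find the direction, it is optimistic. On real activations you would hold out some labeled examples to check that the direction generalizes.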

Build it

In the Try This box is feature_probe.py — a runnable, dependency-free probe. It takes toy "activations" labeled by whether they encode a concept, computes the concept direction as a difference of class means, projects every example onto it, and reports the concept's linear separability. Run it, then change the data and predict the separability before you re-run — that prediction is where the intuition gets built.

Make it yours

First make the two classes obviously distinct and confirm separability hits 100%. Then deliberately blur them — make a "positive" example that looks like a negative — and watch the score drop. That drop is superposition in miniature: when features overlap, the clean direction stops existing. Write one sentence on what you'd do next if this were a real model (hint: it's why SAEs exist).
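The blur experiment might look like the sketch below. It uses an assumed accuracy-style separability score and invented toy data, so the exact numbers will differ from what feature_probe.py prints; the pattern to look for is the drop.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def separability(pos, neg):
    """Assumed score: fraction of examples on the correct side of the midpoint."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2
    correct = sum(dot(x, direction) > threshold for x in pos)
    correct += sum(dot(x, direction) <= threshold for x in neg)
    return correct / (len(pos) + len(neg))

# Invented toy data: two clean clusters.
positives = [[2.0, 1.0], [3.0, 1.5], [2.5, 0.5]]
negatives = [[-1.0, 0.0], [-2.0, 0.5], [-1.5, -0.5]]
clean = separability(positives, negatives)

# Blur: add a "positive" that sits in the middle of the negative cluster.
blurred_positives = positives + [[-1.5, 0.0]]
blurred = separability(blurred_positives, negatives)

print(clean)              # 1.0
print(round(blurred, 3))  # 0.857 (6 of 7 examples on the correct side)
```

Note that the ambiguous example also drags the positive mean toward the negatives, so the direction itself shifts from (4.0, 1.0) to (3.0, 0.75). Noisy examples cost you twice: they are misread, and they degrade the direction used to read everything else.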

Your work today

Probe a Concept, Then Break It

Run feature_probe.py from the Try This box and read its output — confirm it reports the concept direction and the linear separability.
Edit the toy data: add ambiguous examples and predict the separability first, then re-run and check whether you were right. Repeat until your predictions are reliable.
Read how this scales up: in Scaling Monosemanticity (Anthropic, 2024), skim how real, safety-relevant features are not just read but steered by adding their direction to activations. Write one sentence connecting your toy probe to their feature steering.
Optional on-ramp: if you want to go hands-on beyond the toy, browse Mechanistic Interpretability — Getting Started (Neel Nanda) for the next concrete steps.

The expert move

A beginner treats a probe's accuracy as the result. An expert reads it as a claim about geometry: high separability means the concept is a linear direction you can both read and steer; low separability is itself a finding, because it suggests the concept is absent or entangled with other features, and entanglement is the signature of superposition. The altitude jump is realizing that read and steer use the same direction, and that a probe's failure tells you why you need SAEs.

Say this in an interview: "A linear probe is the minimal version of feature analysis: if a concept is a direction in activation space, I can read it by projection and steer it by adding the direction back. When separability is poor, that's not a dead end. It's evidence the feature may be in superposition, which is exactly what sparse autoencoders are built to disentangle."

Today's Takeaways

A linear probe finds a concept direction as a difference of class means and reads the concept by projection.
Separability is the score: high = the concept is a readable direction; low = it's absent, not linearly encoded, or tangled in superposition.
Reading and steering are one idea — the direction you read along is the direction you'd add to amplify or suppress the concept.
A probe that fails to separate is a finding, not a failure — it's the motivation for sparse autoencoders (Day 39).