If a concept is a direction in activation space, you can read it — and steer it
Day 38 of 60
Two days of theory said the unit of meaning is a direction in activation space, hidden by superposition. Today you make that concrete with the smallest honest version of the idea: a linear probe. Given activation vectors labeled by whether they encode some concept — say "refusal" — you find the direction that best separates the labels, then measure how cleanly the concept reads off that direction. If it's separable, the concept is linearly readable, and the very same direction is what you'd push along to steer the model. Read and steer are two ends of one idea.
A linear probe is the seed of all of feature analysis and activation steering. If a concept corresponds to a direction, then (1) you can read it — project an activation onto the direction and see how much of the concept is present — and (2) you can steer it — add the direction back into the activations to amplify the concept or subtract it to suppress it. The toy below is the 20-line version of the whole promise.
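Separate from feature_probe.py, the read-and-steer pairing can be sketched in a few lines. This is a minimal illustration, not a real model hook: the vectors are made up, and `steer` and `alpha` are hypothetical names introduced here. It assumes the concept direction is already known and has unit length.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(activation, direction, alpha):
    """Return activation + alpha * direction (negative alpha suppresses)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

direction = [1.0, 0.0, 0.0]     # assumed unit-length concept direction
activation = [0.2, 0.5, -0.3]   # made-up activation vector

read_before = dot(activation, direction)          # read: project onto direction
amplified = steer(activation, direction, 2.0)     # steer: push along direction
suppressed = steer(activation, direction, -2.0)   # steer: push against it

print(read_before, dot(amplified, direction), dot(suppressed, direction))
```

Because the direction has unit length here, steering by `alpha` shifts the read-out projection by `alpha`. For a direction that is not normalized, the shift would be `alpha` times its squared length.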
Take the activations labeled "concept present" and the ones labeled "absent." Subtract the average of the negatives from the average of the positives. That difference vector points from "no concept" toward "concept" — a crude but real concept direction.
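The difference-of-means step can be sketched as follows. The data is invented for illustration, and the helper names `mean_vector` and `concept_direction` are assumptions, not names taken from feature_probe.py.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_direction(positives, negatives):
    """Mean of the positives minus mean of the negatives."""
    mu_pos = mean_vector(positives)
    mu_neg = mean_vector(negatives)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

# Toy activations: the concept shows up only in the first coordinate.
positives = [[2.0, 1.0], [3.0, 1.0]]   # labeled "concept present"
negatives = [[0.0, 1.0], [1.0, 1.0]]   # labeled "absent"

direction = concept_direction(positives, negatives)
print(direction)  # [2.0, 0.0]
```

The second coordinate is identical in both classes, so it cancels out of the difference and the direction points purely along the first coordinate.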
To check whether a new activation encodes the concept, take its dot product with the direction. High projection → the concept is present. Pick a threshold and you have a classifier built entirely from the model's internals.
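A sketch of that projection-plus-threshold classifier, under stated assumptions: the direction and class means are toy values, and the threshold is placed halfway between the projected class means, which is one reasonable choice rather than the only one.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify(activation, direction, threshold):
    """True if the projection onto the direction exceeds the threshold."""
    return dot(activation, direction) > threshold

direction = [2.0, 0.0]   # toy concept direction (assumed already computed)
mu_pos = [2.5, 1.0]      # toy mean of the "concept present" activations
mu_neg = [0.5, 1.0]      # toy mean of the "absent" activations

# Assumed threshold: midpoint between the two projected class means.
threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2

looks_present = classify([2.8, -4.0], direction, threshold)
looks_absent = classify([0.4, 9.0], direction, threshold)
print(threshold, looks_present, looks_absent)
```

Note that the second coordinate of each new activation is large but has no effect on the decision, because the direction has no component there.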
If the projections cleanly split the two classes, the concept is linearly separable: it lives along a readable direction. Low separability means the concept is absent, is encoded nonlinearly, or is tangled up in superposition with other features. That last case is exactly the limitation that motivates sparse autoencoders (SAEs).
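One simple way to test for a clean split is to look at the gap between the lowest positive projection and the highest negative one. The sketch below assumes toy data and a fixed direction, and `split_margin` is a hypothetical helper name. In practice you would recompute the direction whenever the data changes.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def split_margin(positives, negatives, direction):
    """Lowest positive projection minus highest negative projection.

    A value above zero means some threshold separates the classes perfectly.
    """
    lowest_pos = min(dot(x, direction) for x in positives)
    highest_neg = max(dot(x, direction) for x in negatives)
    return lowest_pos - highest_neg

direction = [2.0, 0.0]                  # toy direction, held fixed for simplicity
positives = [[2.0, 1.0], [3.0, 1.0]]
negatives = [[0.0, 1.0], [1.0, 1.0]]

clean = split_margin(positives, negatives, direction)

# Add a "positive" that sits among the negatives.
tangled = split_margin(positives + [[0.5, 1.0]], negatives, direction)

print(clean > 0, tangled > 0)
```

A negative margin says that no single threshold on this direction can classify every example correctly.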
In the Try This box is feature_probe.py — a runnable, dependency-free probe. It takes toy "activations" labeled by whether they encode a concept, computes the concept direction as a difference of class means, projects every example onto it, and reports the concept's linear separability. Run it, then change the data and predict the separability before you re-run — that prediction is where the intuition gets built.
First make the two classes obviously distinct and confirm separability hits 100%. Then deliberately blur them by adding a "positive" example that looks like a negative, and watch the score drop. That drop is a miniature of what superposition does: when features overlap, a single clean direction no longer separates them. Write one sentence on what you'd do next if this were a real model (hint: it's why sparse autoencoders, or SAEs, exist).
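If you want something to check your prediction against, the experiment can be sketched end to end like this. It is not the feature_probe.py from the Try This box. It assumes separability means the accuracy of a midpoint-threshold classifier on the difference-of-means direction, and all data is invented.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def separability(positives, negatives):
    """Accuracy of a midpoint-threshold classifier on the probe direction."""
    mu_pos, mu_neg = mean_vector(positives), mean_vector(negatives)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    # Assumed threshold: halfway between the projected class means.
    threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2
    correct = sum(dot(x, direction) > threshold for x in positives)
    correct += sum(dot(x, direction) <= threshold for x in negatives)
    return correct / (len(positives) + len(negatives))

positives = [[2.0, 1.0], [3.0, 1.0]]
negatives = [[0.0, 1.0], [1.0, 1.0]]

clean_score = separability(positives, negatives)

# Blur the classes: one "positive" that looks like a negative.
blurred_score = separability(positives + [[0.5, 1.0]], negatives)

print(clean_score, blurred_score)
```

With this toy data the clean set scores 1.0 and the blurred set scores 0.8, because the one ambiguous positive lands on the wrong side of the threshold.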
Run feature_probe.py from the Try This box and read its output. Confirm it reports the concept direction and the linear separability.
Change the data: add ambiguous examples and predict the separability first, then re-run and check whether you were right. Repeat until your predictions are reliable.
A beginner treats a probe's accuracy as the result. An expert reads it as a claim about geometry: high separability means the concept is a linear direction you can both read and steer, while low separability is itself a finding that the concept is entangled, which is the signature of superposition. The altitude jump is realizing that read and steer are two uses of the same direction, and that a probe's failure tells you why you need SAEs.
Say this in an interview: "A linear probe is the minimal version of feature analysis: if a concept is a direction in activation space, I can read it by projection and steer it by adding the direction back. When separability is poor, that's not a dead end — it's evidence the feature is in superposition, which is exactly what sparse autoencoders are built to disentangle."