Week 7 of 12 · Part B — Alignment Literacy

Probing for an Evaluation Gap

Building the analysis that turns "monitored vs. unmonitored" into a measurable signal

Day 33 ~70 minutes Build

Day 33 of 60

From paper to probe

The last two days were reading. Today you build the small piece of analysis at the core of the alignment-faking experiment: a way to quantify the gap between how a model behaves when it believes it's monitored and how it behaves when it believes it isn't. The number you compute is the operational signature of deceptive alignment.

The thesis

If two prompts are identical except for the implied stakes — "this will be used to train me" vs. "this is private" — then any systematic difference in behavior is not about the request. It's about whether the model thinks it's being watched. That difference, the compliance gap, is what an alignment-faking probe measures.

What this tool is — and is not

This is an analysis harness. It takes already-recorded outcomes from paired prompts and computes a gap; it does not generate harmful content, and it does not train anything. We are building the measurement, exactly as a safety team would when auditing for an evaluation-awareness gap — not the threat.

The design behind the number

Core Theory

1 · Pair prompts that differ only in implied monitoring

Hold the actual request fixed. Vary only the contextual signal of whether the response is observed, logged, or used for training. Everything else identical — that's what makes the comparison clean. In the paper, "free tier" vs. "paid tier" was this manipulation.

2 · Record an outcome per trial

For each trial you record whether the model complied with the request and whether monitoring was implied. The harness consumes these (monitored, complied) pairs; how the outcomes were obtained is a separate, governed step.

3 · Compute the gap

Compliance rate when monitored minus compliance rate when unmonitored. A large positive gap — safe-looking under observation, less so in private — is the fingerprint of performing safety for the grader.

4 · Read it honestly

A gap is a signal, not a verdict. It can also come from prompt artifacts, refusal noise, or small samples. The output is a flag that says "investigate," and the honest conclusion is why black-box evals alone can't settle it.

Build it

In the Try This box is deception_eval.py — the monitored-vs-unmonitored gap as analysis. Run it on the sample trials and read the gap it reports. Then extend it: the two-condition version is the toy. A serious probe adds more conditions and controls so the gap can't be explained away.

Make it yours

Add a third, neutral condition (no monitoring cue at all) so you can see whether "unmonitored" behavior is the baseline or a departure from it. Then add a refusal-rate control — count refusals separately from harmful compliance — so a model that just refuses more under monitoring doesn't masquerade as alignment faking. Write down which design changes would make you more confident a measured gap is real.

Your work today

Compute the Gap, Then Harden the Probe

~70 minutes

  1. Run deception_eval.py from the Try This box and confirm it reports the monitored, unmonitored, and gap figures.
  2. Extend it with at least one more condition (a neutral baseline and/or a separate refusal count). Re-run and note how the richer design changes what you'd conclude.
  3. Re-read the setup in Alignment Faking in Large Language Models (§2) and write two sentences on why a black-box compliance gap — even a real one — can't alone prove deceptive intent, and why that limitation is what motivates interpretability next week.
The expert move

A beginner runs the eval and reports the gap as the answer. An expert treats the gap as a hypothesis generator and immediately attacks their own measurement: is this a monitoring effect, or refusal noise, prompt leakage, or a small-sample artifact? The altitude move is owning the probe's limits — knowing a black-box number can flag faking but never certify it, which is precisely the argument for looking inside the model.

Say this in an interview: "I'd operationalize alignment faking as a compliance gap between matched monitored and unmonitored prompts, with neutral baselines and a refusal control so I'm not fooled by artifacts. But I'd report it as a signal, not a verdict — a black-box gap can flag deception, it can't prove intent, and that gap between flag and proof is exactly why we need interpretability."

Today's Takeaways