Week 9 of 12 · Part B — Alignment Literacy

Comparing the Techniques

Turning a pile of alignment papers into a decision matrix you can defend

Day 44 of 60 · ~70 minutes · Build

From survey to decision tool

You've now read five techniques: RLHF, RLAIF / Constitutional AI, debate, weak-to-strong, and, added today, process supervision. A practitioner doesn't recite them; a practitioner compares them on the axes that decide which one to reach for. Today you build the comparison itself: a small decision matrix that scores each technique on what actually matters, so that "which alignment approach?" gets a defensible answer instead of a vibe.

The thesis

Three axes matter for any oversight technique: scalability (does it hold up as the model outgrows its supervisor?), low human cost (how much fallible, expensive human feedback does it need?), and deception-robustness (how well does it survive a model that's trying to game it?). No technique wins on all three, and closing that gap is where the open research questions sit.

The fifth technique: process supervision

Core Theory

Reward the reasoning, not just the answer

Outcome supervision rewards a correct final answer; process supervision rewards each correct step of the reasoning. On hard math problems, Lightman et al. (2023) found that a reward model trained on step-level feedback was more reliable than one trained on final answers alone. Crucially, step-level grading is also better aligned by construction: you're rewarding the model for reasoning the way you want, not for landing on the right answer by any route (including a deceptive one).

Why it scores high on deception-robustness

Outcome-only rewards are easy to game: a model can reach the right answer via a wrong or hidden process. Supervising the process makes that gaming harder to hide, because the steps themselves are graded. That's its comparative strength on the matrix — at a higher human/annotation cost.
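To make the contrast concrete, here is a minimal sketch of the two reward signals on a toy arithmetic task. Everything in it is an assumption made for illustration: the Step tuple, the function names, and the use of an arithmetic check in place of a human step label. Averaging step validity is also a simplification chosen here, not the scoring rule from any paper.

```python
import operator
from typing import List, Tuple

# One reasoning step: (left operand, operator symbol, right operand, claimed result).
# Hypothetical representation, chosen only for this example.
Step = Tuple[int, str, int, int]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def step_is_valid(step: Step) -> bool:
    """Stand-in for a human step label: does the claimed result match the arithmetic?"""
    left, op, right, claimed = step
    return OPS[op](left, right) == claimed


def outcome_reward(steps: List[Step], target: int) -> float:
    """Outcome supervision: 1.0 if the final claimed result equals the target, else 0.0."""
    return 1.0 if steps and steps[-1][3] == target else 0.0


def process_reward(steps: List[Step]) -> float:
    """Process supervision (simplified): fraction of steps that are individually valid."""
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)


TARGET = 14  # the task: (3 + 4) * 2

honest = [(3, "+", 4, 7), (7, "*", 2, 14)]
# A wrong first step, followed by a step picked to land on the target anyway.
gamed = [(3, "+", 4, 8), (8, "+", 6, 14)]

for name, steps in [("honest", honest), ("gamed", gamed)]:
    print(name, outcome_reward(steps, TARGET), process_reward(steps))
```

Both solutions earn the full outcome reward, but only the honest one earns the full process reward; the gamed route is exposed at its first step. The price is that someone (or something) has to grade every step, which is the annotation cost the matrix charges against this technique.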

Build it

The Try This box contains oversight_compare.py, which scores each technique 1–5 on scalability, low human cost, and deception-robustness, then sorts by total. Higher is better on every axis, which is why the cost axis is phrased as low human cost. Run it, read the ranking, and then argue with it: change a score and be ready to defend why. The output is less important than your ability to justify each number from what you read this week.
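If you want to see the shape of such a script before opening the real one, the sketch below is a stand-in, not the course's oversight_compare.py. The scores are placeholders invented for this example (1 = weak, 5 = strong), and the names AXES, SCORES, rank, and print_matrix are hypothetical.

```python
AXES = ("scalability", "low_human_cost", "deception_robustness")

# Placeholder scores, one value per axis in AXES order.
# These are illustrative assumptions, not the course's numbers.
SCORES = {
    "RLHF": (2, 2, 3),
    "RLAIF / Constitutional AI": (4, 5, 2),
    "Debate": (4, 3, 3),
    "Weak-to-strong": (5, 3, 1),
    "Process supervision": (2, 1, 5),
}


def rank(scores):
    """Return (name, axis_scores, total) rows sorted by total, highest first."""
    rows = [(name, vals, sum(vals)) for name, vals in scores.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def print_matrix(scores, axes=AXES):
    """Print the ranked matrix as a fixed-width text table."""
    header = f"{'technique':<28}" + "".join(f"{a:>22}" for a in axes) + f"{'total':>7}"
    print(header)
    for name, vals, total in rank(scores):
        print(f"{name:<28}" + "".join(f"{v:>22}" for v in vals) + f"{total:>7}")


print_matrix(SCORES)
```

An unweighted total treats the three axes as equally important. That is a modelling choice in its own right, and one more thing you should be ready to defend or change.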

Make it yours

Add a fourth axis that matters to you — maturity / real-world adoption, or auditability — score every technique on it, and re-sort. Then write one sentence naming the technique you'd bet on for a specific deployment and why. The honors move (per the day's goal) is to notice that no row dominates, and say what that implies: you stack techniques rather than pick one.
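One way to add an axis and re-sort is sketched below. The base scores and the maturity scores are placeholders made up for this example, and add_axis and ranked are hypothetical helper names; swap in your own numbers and your own axis.

```python
# Placeholder three-axis scores (1 = weak, 5 = strong); illustrative assumptions only.
scores = {
    "RLHF": {"scalability": 2, "low_human_cost": 2, "deception_robustness": 3},
    "RLAIF / Constitutional AI": {"scalability": 4, "low_human_cost": 5, "deception_robustness": 2},
    "Debate": {"scalability": 4, "low_human_cost": 3, "deception_robustness": 3},
    "Weak-to-strong": {"scalability": 5, "low_human_cost": 3, "deception_robustness": 1},
    "Process supervision": {"scalability": 2, "low_human_cost": 1, "deception_robustness": 5},
}


def add_axis(matrix, axis, values):
    """Return a new matrix with one extra axis; every technique must be scored on it."""
    missing = set(matrix) - set(values)
    if missing:
        raise ValueError(f"no {axis} score for: {sorted(missing)}")
    return {name: {**row, axis: values[name]} for name, row in matrix.items()}


def ranked(matrix):
    """Technique names sorted by total score, highest first; ties broken alphabetically."""
    return sorted(matrix, key=lambda name: (-sum(matrix[name].values()), name))


# Hypothetical fourth axis: maturity / real-world adoption.
maturity = {
    "RLHF": 5,
    "RLAIF / Constitutional AI": 4,
    "Debate": 1,
    "Weak-to-strong": 1,
    "Process supervision": 3,
}

with_maturity = add_axis(scores, "maturity", maturity)

print("before:", ranked(scores))
print("after: ", ranked(with_maturity))
```

With these placeholder numbers, RLHF moves from last place to second once maturity counts. The point is not the specific ranking but how sensitive it is to which axes you include.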

Your work today

Build the Comparison

~70 minutes

Run oversight_compare.py from the Try This box and read the ranked matrix.
Read §1–3 of Let's Verify Step by Step (Lightman et al., 2023) — the process-supervision result — then justify its scores in the matrix.
Add a fourth axis, re-rank, and write one sentence on which technique you'd bet on for a concrete system and why. Confirm for yourself that no technique dominates every axis.
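The last check in the list above can be done by eye, or mechanically. A minimal sketch follows, assuming placeholder scores invented for this example; dominates, pareto_front, and best_on_every_axis are hypothetical names, not functions from the course script.

```python
# Placeholder scores in the order (scalability, low human cost, deception-robustness).
# Illustrative assumptions only; replace them with the numbers you can defend.
SCORES = {
    "RLHF": (2, 2, 3),
    "RLAIF / Constitutional AI": (4, 5, 2),
    "Debate": (4, 3, 3),
    "Weak-to-strong": (5, 3, 1),
    "Process supervision": (2, 1, 5),
}


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Techniques that no other technique dominates."""
    return [
        name
        for name, vals in scores.items()
        if not any(dominates(other, vals) for o, other in scores.items() if o != name)
    ]


def best_on_every_axis(scores):
    """Techniques whose score equals the column maximum on every axis."""
    column_max = [max(col) for col in zip(*scores.values())]
    return [name for name, vals in scores.items() if list(vals) == column_max]


print("undominated:", pareto_front(SCORES))
print("best on every axis:", best_on_every_axis(SCORES))
```

With these numbers, four techniques are undominated and none tops every column, so the matrix cannot crown a single winner. A technique can still be dominated by one specific rival (here RLHF by debate) without any technique being best everywhere.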

The expert move

A novice ranks the techniques and crowns a winner. An expert reports the more useful finding: no technique dominates every axis, so the real skill is choosing per context and stacking them. That means a constitution for cheap broad coverage, process supervision where deception is the threat, and debate or weak-to-strong where the task outruns the supervisor. Owning a defensible scoring across scalability, low human cost, and deception-robustness is what turns "I've read the papers" into "I can advise a deployment."
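Choosing per context can be sketched as weighting the axes differently for each deployment. The scores, the context names, and the weights below are all placeholders invented for this example, and weighted_total and pick are hypothetical helpers; treat the output as an illustration of the mechanism, not a recommendation.

```python
# Placeholder scores in the order (scalability, low human cost, deception-robustness).
SCORES = {
    "RLHF": (2, 2, 3),
    "RLAIF / Constitutional AI": (4, 5, 2),
    "Debate": (4, 3, 3),
    "Weak-to-strong": (5, 3, 1),
    "Process supervision": (2, 1, 5),
}

# Hypothetical deployment contexts, each with per-axis weights in the same order.
CONTEXTS = {
    "cheap broad coverage": (1, 4, 1),
    "deception is the main threat": (1, 1, 4),
    "task outruns the supervisor": (4, 1, 1),
}


def weighted_total(vals, weights):
    """Weighted sum of one technique's axis scores."""
    return sum(v * w for v, w in zip(vals, weights))


def pick(scores, weights):
    """Name of the technique with the highest weighted total under these weights."""
    return max(scores, key=lambda name: weighted_total(scores[name], weights))


winners = {context: pick(SCORES, weights) for context, weights in CONTEXTS.items()}

for context, name in winners.items():
    print(f"{context}: {name}")
```

Under these assumptions each context picks a different technique, which is the argument for stacking: a real system usually faces more than one of these contexts at once.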

Say this in an interview: "I compare alignment techniques on three axes — scalability, human-feedback cost, and robustness to a model gaming the signal. The honest conclusion is that none dominates: RLAIF is cheap and scalable but weak to deception, process supervision is robust but costly, debate and weak-to-strong push past the supervisor's limits but rest on fragile assumptions. So I'd stack them by context rather than pick one."

Today's Takeaways

Compare oversight techniques on three axes: scalability, low human cost, deception-robustness.
Process supervision rewards correct reasoning steps, not just answers — harder to game, but costlier to label.
No technique dominates every axis; the skill is choosing per context and stacking them.
A defensible scoring is what turns "I read the papers" into "I can advise a deployment."