Turning a pile of alignment papers into a decision matrix you can defend
Day 44 of 60
You've now read five techniques: RLHF, RLAIF / Constitutional AI, debate, weak-to-strong, and — added today — process supervision. A practitioner doesn't recite them; they compare them on the axes that decide which to reach for. Today you build the comparison itself: a small decision matrix that scores each technique on what actually matters, so "which alignment approach?" becomes a defensible answer instead of a vibe.
The three axes that matter for any oversight technique: scalability (does it hold up as the model outgrows its supervisor?), low human cost (how much fallible, expensive human feedback does it need?), and deception-robustness (how well does it survive a model that's trying to game it?). No technique wins on all three — and seeing that clearly is the actual research frontier.
Outcome supervision rewards a correct final answer; process supervision rewards each correct step of the reasoning. On hard reasoning tasks such as multi-step math, grading the steps tends to produce more reliable models. It is also the better-aligned signal: you're rewarding the model for reasoning the way you want, not for landing on the right answer by any route (including a deceptive one).
Outcome-only rewards are easy to game: a model can reach the right answer via a wrong or hidden process. Supervising the process makes that gaming harder to hide, because the steps themselves are graded. That's its comparative strength on the matrix, and it is paid for with a higher annotation cost, since every step needs a label rather than just the final answer.
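To make the contrast concrete, here is a minimal toy sketch, not a real training setup. The Step class, the is_valid labels, and both reward functions are hypothetical names invented for illustration; in practice the per-step labels would come from human annotators or a trained step-level grader.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    is_valid: bool  # hypothetical per-step label from a grader or annotator


def outcome_reward(final_answer: str, reference: str) -> float:
    """1.0 if the final answer matches, regardless of how it was reached."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: list[Step]) -> float:
    """Fraction of steps graded valid, so one bad step lowers the reward."""
    if not steps:
        return 0.0
    return sum(s.is_valid for s in steps) / len(steps)


# Toy task: simplify 16/64. "Cancelling the 6s" is invalid reasoning
# that happens to land on the correct answer, 1/4.
steps = [
    Step("Start from 16/64", True),
    Step("Cancel the 6 in the numerator and the denominator", False),
    Step("State the result as 1/4", True),
]
final_answer = "1/4"

print("outcome reward:", outcome_reward(final_answer, "1/4"))
print("process reward:", round(process_reward(steps), 2))
```

The outcome reward gives this solution full marks, while the process reward is pulled down by the invalid middle step. That gap is the gaming the paragraph above describes, and the extra is_valid label on every step is where the added annotation cost comes from.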
In the Try This box is oversight_compare.py — it scores each technique 1–5 on scalability, low-human-cost, and deception-robustness, then sorts by total. Run it, read the ranking, and then argue with it: change a score and be ready to defend why. The output is less important than your ability to justify each number from what you read this week.
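If you want to see the shape of such a script before opening the real one, the following is a rough sketch under stated assumptions. It is not the contents of oversight_compare.py: the names AXES, SCORES, and rank are made up here, and every score is a placeholder chosen for illustration, not a finding.

```python
# Hypothetical sketch of a scoring matrix. All scores (1-5) are placeholders.
AXES = ("scalability", "low_human_cost", "deception_robustness")

SCORES = {
    "RLHF": (2, 2, 2),
    "RLAIF / Constitutional AI": (4, 5, 2),
    "Debate": (4, 3, 3),
    "Weak-to-strong": (5, 4, 2),
    "Process supervision": (3, 1, 4),
}


def rank(scores):
    """Return (technique, total) pairs sorted by total score, highest first.

    sorted() is stable, so techniques with equal totals keep insertion order.
    """
    totals = {name: sum(row) for name, row in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Print each row with its per-axis scores and total.
print(f"{'technique':<28}" + "".join(f"{axis:>22}" for axis in AXES) + f"{'total':>8}")
for name, total in rank(SCORES):
    row = SCORES[name]
    print(f"{name:<28}" + "".join(f"{score:>22}" for score in row) + f"{total:>8}")
```

Summing the axes treats them as equally important, which is itself a choice you should be ready to defend; a weighted sum is an easy variation if one axis matters more for your deployment.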
Add a fourth axis that matters to you — maturity / real-world adoption, or auditability — score every technique on it, and re-sort. Then write one sentence naming the technique you'd bet on for a specific deployment and why. The honors move (per the day's goal) is to notice that no row dominates, and say what that implies: you stack techniques rather than pick one.
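One way to check the "no row dominates" observation rather than eyeball it is a small dominance test. The sketch below assumes a fourth axis called maturity and uses placeholder scores throughout; the function names dominates and undominated are hypothetical and are not part of the provided script.

```python
# Hypothetical four-axis matrix. All scores (1-5) are placeholders.
AXES = ("scalability", "low_human_cost", "deception_robustness", "maturity")

SCORES = {
    "RLHF": (2, 2, 2, 5),
    "RLAIF / Constitutional AI": (4, 5, 2, 4),
    "Debate": (4, 3, 3, 1),
    "Weak-to-strong": (5, 4, 2, 2),
    "Process supervision": (3, 1, 4, 3),
}


def dominates(a, b):
    """True if row a is at least as good as b on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def undominated(scores):
    """Techniques that no other row dominates (the Pareto front)."""
    return [
        name
        for name, row in scores.items()
        if not any(
            dominates(other, row)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Re-sort by total with the new axis included.
ranking = sorted(SCORES, key=lambda name: sum(SCORES[name]), reverse=True)
print("ranking by total:", ranking)
print("undominated rows:", undominated(SCORES))
```

With these placeholder numbers every technique ends up undominated, which is the case for stacking: each row is the best available on at least one trade-off. If your own scores leave a row dominated, that is a claim worth double-checking against what you read.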
Run oversight_compare.py from the Try This box and read the ranked matrix.
A novice ranks the techniques and crowns a winner. An expert delivers the higher-altitude finding: no technique dominates every axis, so the real skill is choosing per context and stacking them, with a constitution for cheap broad coverage, process supervision where deception is the threat, and debate or weak-to-strong where the task outruns the supervisor. Owning a defensible scoring across scalability, human cost, and gaming-robustness is what turns "I've read the papers" into "I can advise a deployment."
Say this in an interview: "I compare alignment techniques on three axes — scalability, human-feedback cost, and robustness to a model gaming the signal. The honest conclusion is that none dominates: RLAIF is cheap and scalable but weak to deception, process supervision is robust but costly, debate and weak-to-strong push past the supervisor's limits but rest on fragile assumptions. So I'd stack them by context rather than pick one."