Week 9 of 12 · Part B — Alignment Literacy

Debate as Oversight

Letting a weak judge supervise a hard question by making two strong models argue

Day 42 of 60 · ~60 minutes · Concept

A different bet on the same problem

Yesterday's answer to scalable oversight was to move labeling onto a constitution. Debate makes a different wager: even if a human judge can't solve a hard question, maybe they can judge an argument about it, provided a second, equally capable model is incentivized to expose every flaw in the first. The bet is that it's easier to point out a lie than to make one survive scrutiny, so adversarial pressure surfaces problems a lone judge would miss.

The thesis

Two strong models argue opposite sides of a question; a weaker judge (a human, or a weaker model) decides who won. If the game is designed so that honesty is the winning strategy — telling the truth is easier to defend than a lie a sharp opponent will dismantle — then the judge can adjudicate questions far beyond their own ability to answer directly. Oversight scales because the burden shifts from answering to evaluating a contest.
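To make the shape of the game concrete, here is a minimal sketch in Python. It is a toy, not the protocol from any paper: the names `run_debate`, `claims_prime`, `shows_factors`, and `weak_judge` are hypothetical, the debaters are hard-coded functions rather than models, and the judge is assumed to be able to check one multiplication but not to factor a number on its own.

```python
import re
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, statement)
Debater = Callable[[str, List[Turn]], str]
Judge = Callable[[str, List[Turn]], str]


def run_debate(question: str, debater_a: Debater, debater_b: Debater,
               judge: Judge, rounds: int = 2) -> Tuple[str, List[Turn]]:
    """Alternate turns between two debaters, then ask the judge for a verdict."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    return judge(question, transcript), transcript


# Hypothetical stand-ins for the two strong models.
def claims_prime(question: str, transcript: List[Turn]) -> str:
    return "91 is prime: no small number divides it"


def shows_factors(question: str, transcript: List[Turn]) -> str:
    return "91 = 7 * 13, so 91 is not prime"


def weak_judge(question: str, transcript: List[Turn]) -> str:
    """Assumed judge: cannot factor, but can verify a multiplication it is shown."""
    for speaker, statement in transcript:
        match = re.search(r"(\d+) = (\d+) \* (\d+)", statement)
        if match:
            n, a, b = (int(g) for g in match.groups())
            if a > 1 and b > 1 and a * b == n:
                return speaker  # a checkable refutation settles the contest
    return "undecided"


winner, transcript = run_debate("Is 91 prime?", claims_prime, shows_factors, weak_judge)
print(winner)
```

The judge never answers the question directly. It only checks the one claim a debater puts in front of it, which is the shift from answering to evaluating that the thesis describes.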

How debate is supposed to work

Core Theory

The asymmetry it exploits

The core conjecture: in the debate game, it is harder to lie convincingly than to refute a lie. A dishonest debater has to make their falsehood survive a motivated opponent who can zoom in on any weak step. If that asymmetry holds, the honest side wins, and a limited judge can trust the outcome.

Recursive decomposition

Debaters drill into the single sub-claim they most disagree on, narrowing a sprawling question down to one cruxy point the judge can check. The judge never needs the whole answer — only to evaluate the decisive disagreement.
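The narrowing step can be sketched with a toy problem: two debaters disagree about the sum of a long list, and the judge can only afford to look at one element. This is an illustrative assumption, not the construction from the reading. The helper `find_crux` and both debater functions are hypothetical, and the sketch assumes each debater's claims are internally consistent (their claim for a range equals the sum of their claims for its halves).

```python
from typing import Callable, List, Tuple

Claim = Callable[[int, int], int]  # (lo, hi) -> claimed sum of values[lo:hi]


def find_crux(n: int, claim_a: Claim, claim_b: Claim) -> Tuple[int, int]:
    """Bisect toward the single index where the two debaters disagree.

    Assumes they disagree on the whole range and are each self-consistent.
    Returns (index, number_of_narrowing_steps).
    """
    lo, hi, steps = 0, n, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if claim_a(lo, mid) != claim_b(lo, mid):
            hi = mid  # the disagreement lives in the left half
        else:
            lo = mid  # they agree on the left, so it must be on the right
        steps += 1
    return lo, steps


values: List[int] = list(range(16))


def honest(lo: int, hi: int) -> int:
    return sum(values[lo:hi])


def dishonest(lo: int, hi: int) -> int:
    # Toy lie: pretends the element at index 5 is 10 larger than it is.
    return sum(values[lo:hi]) + (10 if lo <= 5 < hi else 0)


crux, steps = find_crux(len(values), honest, dishonest)

# The judge checks only the one disputed element, never the whole sum.
truth = values[crux]
verdict = "honest" if honest(crux, crux + 1) == truth else "dishonest"
print(crux, steps, verdict)
```

With 16 elements the dispute shrinks to a single element in 4 steps, so the judge's work grows with the depth of the decomposition rather than with the size of the question.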

Where it can break

Debate assumes the asymmetry actually holds and that the judge isn't simply persuadable by rhetoric, length, or confident tone. If a judge can be manipulated, or if some true claims are genuinely harder to defend than to attack, the equilibrium tilts the wrong way. Debate is a proposal under active test, not a settled solution.
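One way to picture the failure mode is a toy scoring rule in which the judge's verdict mixes soundness with a bias toward longer arguments. The function `judge_pick`, the `length_weight` parameter, and the scoring formula are all assumptions made up for illustration; nothing here is a measured model of real judges.

```python
from typing import Dict, Union

Argument = Dict[str, Union[str, bool]]


def judge_pick(arg_a: Argument, arg_b: Argument, length_weight: float) -> str:
    """Toy judge: one point for a sound argument, plus a bonus per word.

    length_weight = 0 models a judge who only rewards soundness;
    a positive value models a judge who is swayed by sheer length.
    """
    def score(arg: Argument) -> float:
        soundness = 1.0 if arg["sound"] else 0.0
        return soundness + length_weight * len(str(arg["text"]).split())

    return "A" if score(arg_a) >= score(arg_b) else "B"


truthful: Argument = {"sound": True, "text": "The cited figure is wrong."}
padded_lie: Argument = {"sound": False, "text": " ".join(["impressive"] * 30)}

careful = judge_pick(truthful, padded_lie, length_weight=0.0)
swayed = judge_pick(truthful, padded_lie, length_weight=0.1)
print(careful, swayed)
```

In this sketch the same pair of arguments produces opposite verdicts depending on the judge's bias, which is the sense in which a manipulable judge tilts the equilibrium the wrong way.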

Make it measurable

The clean way to frame all of this is empirical: does human + model beat human alone on questions the human can't answer unaided? That's the experimental shape of scalable oversight — and the bar any technique, debate included, has to clear before you trust it.
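The comparison reduces to a difference in accuracy between two conditions. A minimal sketch follows, assuming each trial is recorded with a condition label and whether the final answer was correct. The `Trial` record, the condition names, and the eight placeholder trials are hypothetical and carry no empirical weight.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    condition: str  # "human_alone" or "human_plus_model" (assumed labels)
    correct: bool


def accuracy(trials: List[Trial], condition: str) -> float:
    """Fraction of correct answers within one condition."""
    subset = [t for t in trials if t.condition == condition]
    if not subset:
        raise ValueError(f"no trials for condition {condition!r}")
    return sum(t.correct for t in subset) / len(subset)


def oversight_gain(trials: List[Trial]) -> float:
    """Positive when the assisted human beats the unassisted human."""
    return accuracy(trials, "human_plus_model") - accuracy(trials, "human_alone")


# Placeholder data for illustration only, not results from any study.
trials = (
    [Trial("human_alone", c) for c in (True, False, False, False)]
    + [Trial("human_plus_model", c) for c in (True, True, True, False)]
)

gain = oversight_gain(trials)
print(gain)
```

A real experiment would also need enough trials for the difference to be statistically meaningful, and questions the human demonstrably cannot answer unaided.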

Debate vs the constitution approach

Constitutional AI bakes oversight into training via a written spec; debate puts oversight at evaluation time via an adversarial contest. They aren't rivals so much as different layers — and you'll formally compare them on Day 44. The thing to carry forward is that debate's strength is exactly its risk: it relies on an adversary to keep the system honest, which works beautifully when the asymmetry holds and fails quietly when a judge can be gamed.

Your work today

Read the Proposal

~60 minutes

  1. Read §1–2 of AI Safety via Debate (Irving et al., 2018) — focus on why honesty is meant to be the winning strategy and where that assumption is fragile.
  2. Skim the setup of Measuring Progress on Scalable Oversight (Bowman et al., 2022) — note the "human + model beats human alone" framing that makes oversight testable.
  3. Write one situation where debate would clearly help, and one where it would clearly fail (a manipulable judge, or a truth harder to defend than to attack).

The expert move

A novice describes debate as "two AIs argue and a human picks." An expert states the load-bearing assumption and what voids it: debate only scales oversight if a lie is harder to defend than to refute, and if the judge can't be won over by rhetoric instead of truth. Naming the precondition, along with the empirical bar of "human + model beats human alone," is the altitude jump from describing a mechanism to evaluating whether to trust it.

Say this in an interview: "Debate is a scalable-oversight proposal: it lets a weaker judge supervise a harder question by making an adversary expose every flaw. I'd be precise that it rests on one assumption — that defending a lie is harder than refuting it — and that it fails if the judge is persuadable, which is exactly what 'human + model beats human alone' experiments are built to measure."

Today's Takeaways