Letting a weak judge supervise a hard question by making two strong models argue
Day 42 of 60
Yesterday's answer to scalable oversight was to move labeling onto a constitution. Debate makes a different wager: even if a human judge can't solve a hard question, maybe they can judge an argument about it, provided a second, equally capable model is incentivized to expose every flaw in the first. The bet is that it's easier to point out a lie than to defend one, so adversarial pressure surfaces problems a lone judge would miss.
Two strong models argue opposite sides of a question; a weaker judge (a human, or a weaker model) decides who won. If the game is designed so that honesty is the winning strategy — telling the truth is easier to defend than a lie a sharp opponent will dismantle — then the judge can adjudicate questions far beyond their own ability to answer directly. Oversight scales because the burden shifts from answering to evaluating a contest.
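To make the shape of the game concrete, here is a minimal sketch of the loop described above, assuming debaters and the judge are plain callables. Every name here (`run_debate`, `scripted_debater`, `placeholder_judge`) is hypothetical, and the judge is a hard-coded stand-in: in a real setup it would be a human or a weaker model reading the transcript.

```python
from typing import Callable, List, Tuple

# (speaker, statement) pairs in the order they were spoken.
Transcript = List[Tuple[str, str]]
# A debater sees the question, its assigned answer, and the transcript so far.
Debater = Callable[[str, str, Transcript], str]
# A judge sees only the question and the transcript, and names a winner.
Judge = Callable[[str, Transcript], str]


def run_debate(question: str, answer_a: str, answer_b: str,
               debater_a: Debater, debater_b: Debater,
               judge: Judge, rounds: int = 3) -> Tuple[str, Transcript]:
    """Alternate turns between two debaters, then ask the judge who won."""
    transcript: Transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, answer_a, transcript)))
        transcript.append(("B", debater_b(question, answer_b, transcript)))
    return judge(question, transcript), transcript


def scripted_debater(question: str, answer: str, transcript: Transcript) -> str:
    # Placeholder: a real debater would be a strong model generating arguments.
    return f"Turn {len(transcript) + 1}: the answer is {answer!r}."


def placeholder_judge(question: str, transcript: Transcript) -> str:
    # Placeholder verdict, hard-coded for illustration only.
    return "A"


verdict, transcript = run_debate(
    "Is 91 prime?", "no", "yes",
    scripted_debater, scripted_debater, placeholder_judge, rounds=2,
)
```

The point of the sketch is the interface: the judge never receives a solver, only the transcript, which is why the protocol can in principle scale past what the judge could answer alone.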
The core conjecture: in the debate game, it is harder to lie convincingly than to refute a lie. A dishonest debater has to make their falsehood survive a motivated opponent who can zoom in on any weak step. If that asymmetry holds, the honest side wins, and a limited judge can trust the outcome.
Debaters drill into the single sub-claim they most disagree on, narrowing a sprawling question down to one cruxy point the judge can check. The judge never needs the whole answer — only to evaluate the decisive disagreement.
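One way to picture crux narrowing is a toy debate over the sum of a long list. This is a sketch under stated assumptions, not a description of any particular debate implementation: each debater is modeled as holding its own version of the list, claims are sums over slices of that version, and the judge is able to check one single entry directly. The names `make_debater`, `find_crux`, and `judge` are hypothetical.

```python
from typing import Callable, List, Tuple

Claim = Callable[[int, int], int]  # (lo, hi) -> claimed sum of values[lo:hi]


def make_debater(believed: List[int]) -> Claim:
    """A debater whose claims are sums over the list it asserts is correct."""
    def claim(lo: int, hi: int) -> int:
        return sum(believed[lo:hi])
    return claim


def find_crux(n: int, claim_a: Claim, claim_b: Claim) -> Tuple[int, int]:
    """Halve the disputed range until one index remains.

    Assumes the debaters disagree about the total over [0, n).
    Returns (index of the crux, number of narrowing rounds).
    """
    lo, hi, rounds = 0, n, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        rounds += 1
        if claim_a(lo, mid) != claim_b(lo, mid):
            hi = mid   # the disagreement lives in the left half
        else:
            lo = mid   # left half agreed, so the right half must differ
    return lo, rounds


def judge(values: List[int], claim_a: Claim, claim_b: Claim) -> Tuple[str, int, int]:
    """The judge checks exactly one entry: the crux."""
    idx, rounds = find_crux(len(values), claim_a, claim_b)
    truth = values[idx]  # the only fact the judge looks up
    if claim_a(idx, idx + 1) == truth:
        return "A", idx, rounds
    if claim_b(idx, idx + 1) == truth:
        return "B", idx, rounds
    return "neither", idx, rounds


values = list(range(1, 17))           # 16 true entries
tampered = values.copy()
tampered[11] = 99                     # the dishonest debater's one false entry

result = judge(values, make_debater(values), make_debater(tampered))
print(result)  # ('A', 11, 4)
```

With 16 entries, four rounds of narrowing isolate the disputed entry, and the judge verifies a single number instead of re-adding the whole list. The toy works because sums decompose cleanly; whether real questions decompose this way is part of what debate research has to establish.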
Debate assumes the asymmetry actually holds and that the judge isn't simply persuadable by rhetoric, length, or confident tone. If a judge can be manipulated, or if some true claims are genuinely harder to defend than to attack, the equilibrium tilts the wrong way. Debate is a proposal under active test, not a settled solution.
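A crude first check on whether a judge is being swayed by length could look like the sketch below. It is an illustrative diagnostic only, with a hypothetical function name and made-up placeholder records, and it says nothing about tone or rhetoric.

```python
from typing import Iterable, Optional, Tuple


def longer_side_win_rate(records: Iterable[Tuple[int, int, str]]) -> Optional[float]:
    """Fraction of debates won by whichever side wrote more.

    Each record is (length_of_A, length_of_B, winner), winner being "A" or "B".
    Debates with equal lengths are skipped. Returns None if nothing was counted.
    """
    longer_wins = 0
    counted = 0
    for len_a, len_b, winner in records:
        if len_a == len_b:
            continue
        longer = "A" if len_a > len_b else "B"
        counted += 1
        if winner == longer:
            longer_wins += 1
    return longer_wins / counted if counted else None


# Placeholder records, invented for illustration (not experimental data).
toy_records = [
    (120, 80, "A"),
    (60, 200, "B"),
    (90, 150, "A"),
    (300, 100, "A"),
    (100, 100, "B"),  # tie in length, skipped
]
rate = longer_side_win_rate(toy_records)
print(rate)  # 0.75
```

A rate well above 0.5 is only a warning sign, not proof: the longer side may also be the correct side more often, so a serious check would compare against ground truth on questions where the answer is known.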
The clean way to frame all of this is empirical: does human + model beat human alone on questions the human can't answer unaided? That's the experimental shape of scalable oversight — and the bar any technique, debate included, has to clear before you trust it.
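As a sketch of what that comparison might compute, assume each question is answered twice by the same judge, once unaided and once with model assistance, and graded against a known answer. The function name and the boolean lists below are hypothetical placeholders, not results from any study.

```python
from typing import Dict, Sequence


def oversight_uplift(alone: Sequence[bool], assisted: Sequence[bool]) -> Dict[str, float]:
    """Compare judge accuracy with and without model assistance.

    alone[i] and assisted[i] say whether the judge got question i right
    in each condition (paired by question).
    """
    if len(alone) != len(assisted) or not alone:
        raise ValueError("need two non-empty, equal-length lists")
    n = len(alone)
    acc_alone = sum(alone) / n
    acc_assisted = sum(assisted) / n
    # Paired view: questions the assistance fixed versus questions it broke.
    helped = sum(1 for a, b in zip(alone, assisted) if not a and b)
    hurt = sum(1 for a, b in zip(alone, assisted) if a and not b)
    return {
        "acc_alone": acc_alone,
        "acc_assisted": acc_assisted,
        "uplift": acc_assisted - acc_alone,
        "helped": helped,
        "hurt": hurt,
    }


# Placeholder grades, invented for illustration only.
alone = [True, False, False, True, False, False, True, False]
assisted = [True, True, False, True, True, False, True, True]
report = oversight_uplift(alone, assisted)
print(report["uplift"])  # 0.375
```

Tracking `hurt` alongside `helped` matters: a technique that raises average accuracy while flipping some correct answers to wrong ones may be persuading the judge as much as informing them.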
Constitutional AI bakes oversight into training via a written spec; debate puts oversight at evaluation time via an adversarial contest. They aren't rivals so much as different layers — and you'll formally compare them on Day 44. The thing to carry forward is that debate's strength is exactly its risk: it relies on an adversary to keep the system honest, which works beautifully when the asymmetry holds and fails quietly when a judge can be gamed.
A novice describes debate as "two AIs argue and a human picks." An expert states the load-bearing assumption and what voids it: debate scales oversight only if a lie is harder to defend than to refute, and only if the judge can't be won over by rhetoric instead of truth. Naming the precondition, along with the empirical bar of "human + model beats human alone," is the altitude jump from describing a mechanism to evaluating whether to trust it.
Say this in an interview: "Debate is a scalable-oversight proposal: it lets a weaker judge supervise a harder question by having a second, equally capable model try to expose every flaw in the first's argument. I'd be precise that it rests on one assumption, that defending a lie is harder than refuting it, and that it fails if the judge can be swayed by rhetoric rather than truth, which is exactly what 'human + model beats human alone' experiments are built to measure."