When the system is smarter than its supervisor, you change who does the labeling
Day 41 of 60
Part B has been diagnosis: why capable optimizers can pursue the wrong goal, fake alignment, or hide intent. This week turns to the engineering answer. The hardest version of the question is the scalable oversight problem: if a system is better than you at the task, how do you supervise it? You can't grade an answer you can't fully check. Every technique this week is one bet on how to get reliable supervision signal without a human who can out-think the model.
RLHF — reinforcement learning from human feedback — aligns a model by having people rank outputs. That works until the labeling itself becomes the bottleneck: humans are slow, inconsistent, expensive, and eventually outmatched. Constitutional AI and RLAIF attack that bottleneck by moving most of the harm-labeling from humans to an AI critiquing against a written constitution. The supervision signal stops scaling with human hours.
Collect pairs of model outputs, have humans pick the better one, train a reward model on those preferences, then optimize the policy against it. The alignment quality is capped by the quality and quantity of human labels — and humans labeling harmful content is slow, traumatic, and inconsistent.
Replace (most of) the human preference labels with an AI model judging outputs against an explicit set of principles. The expensive human step becomes writing the principles once instead of labeling millions of outputs. This is the move that lets oversight scale.
CAI is the concrete recipe: a written constitution (a short list of principles) drives a self-critique-and-revise phase, then those AI-generated preferences train the reward model. The model learns to be harmless by being shown how to critique itself against the constitution — not by humans hand-labeling every harm.
RLHF moves the bottleneck onto human labelers. RLAIF moves it onto a written constitution and an AI critic. The win isn't "no humans" — it's that humans now spend their scarce attention writing and auditing principles, the highest-leverage place to put it, instead of grading individual outputs.
Cheaper labeling is the obvious benefit, but the safety argument is deeper. A constitution is an auditable, versioned artifact: you can read it, argue with it, and change one principle and re-derive behavior. Compare that to a reward model trained on a million private human judgments nobody can inspect. CAI makes the value specification legible — and legibility is exactly what every later governance and model-card claim depends on.
RLAIF doesn't escape the alignment problem — it relocates it. Now the AI critic's judgment and the constitution's wording carry the load, and a flaw in either propagates everywhere. "The constitution said so" is only as safe as the constitution. Keep this caveat; it's what separates someone who read the abstract from someone who understood it.
An enthusiast says "Constitutional AI means the AI trains itself." An expert names the real shift in altitude: CAI doesn't remove human judgment, it relocates it from labeling individual outputs to writing and auditing a small, legible constitution — moving the scarce human attention to the highest-leverage point and making the value spec inspectable. They also name the cost: the constitution and the AI critic now carry the alignment, so a flaw in either scales.
Say this in an interview: "RLAIF and Constitutional AI matter because they move oversight off human labeling and onto an auditable constitution — that's how supervision scales. But I'd flag that it relocates the alignment problem rather than solving it: the model is only as aligned as the principles and the AI critic enforcing them."