Week 9 of 12 · Part B — Alignment Literacy

Constitutional AI & RLAIF

When the system is smarter than its supervisor, you change who does the labeling

Day 41 of 60 · ~60 minutes · Concept

The problem this week is built on

Part B has been diagnosis: why capable optimizers can pursue the wrong goal, fake alignment, or hide intent. This week turns to the engineering answer. The hardest version of the question is the scalable oversight problem: if a system is better than you at the task, how do you supervise it? You can't grade an answer you can't fully check. Every technique this week is one bet on how to get a reliable supervision signal without a human who can out-think the model.

The thesis

RLHF (reinforcement learning from human feedback) aligns a model by having people rank outputs. That works until the labeling itself becomes the bottleneck: humans are slow, inconsistent, expensive, and eventually outmatched. Constitutional AI and RLAIF attack that bottleneck by moving most of the harm-labeling from humans to an AI that critiques outputs against a written constitution. The supervision signal no longer has to scale with human hours.

RLHF vs RLAIF — what actually changes

Core Theory

RLHF — humans in the labeling loop

Collect pairs of model outputs, have humans pick the better one, train a reward model on those preferences, then optimize the policy against it. The alignment quality is capped by the quality and quantity of human labels, and having humans label harmful content is slow, inconsistent, and traumatic for the labelers.
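To make "train a reward model on those preferences" concrete, here is a minimal sketch of the pairwise logistic (Bradley-Terry) loss that reward models are commonly trained with. The lesson does not name a specific objective, so treat the choice of loss and the function names as assumptions made for illustration.

```python
import math
from typing import Iterable, Tuple


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for one labeled pair.

    Assumption: a Bradley-Terry style objective; the lesson does not
    specify which loss is used.
    """
    margin = reward_chosen - reward_rejected
    # Two algebraically equal forms, chosen so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(scored_pairs: Iterable[Tuple[float, float]]) -> float:
    """Average loss over (reward_chosen, reward_rejected) pairs."""
    pairs = list(scored_pairs)
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)


# The loss is small when the reward model agrees with the label,
# and large when it ranks the rejected output higher.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
print(agrees < disagrees)
```

Every pair fed to this loss needs a label saying which output was chosen. In RLHF each of those labels costs a human judgment, which is where the bottleneck comes from.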

RLAIF — an AI provides the preference labels

Replace (most of) the human preference labels with an AI model judging outputs against an explicit set of principles. The expensive human step becomes writing the principles once instead of labeling millions of outputs. This is the move that lets oversight scale.
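The swap described above can be sketched as a labeling function whose judge is a model call rather than a person. Everything named here is hypothetical: `Preference`, `label_with_ai`, and especially `toy_judge`, which is a keyword-counting stand-in for a real AI critic so that the example runs without any model. Sampling one principle per comparison is also an assumption of this sketch.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Preference:
    prompt: str
    chosen: str
    rejected: str
    principle: str  # recorded so each label can be audited later


# A judge returns 0 if the first response is better, 1 if the second is.
Judge = Callable[[str, str, str, str], int]


def label_with_ai(
    pairs: Sequence[Tuple[str, str, str]],
    principles: Sequence[str],
    judge: Judge,
    seed: int = 0,
) -> List[Preference]:
    """Produce preference labels from an AI judge instead of a human."""
    rng = random.Random(seed)
    labels = []
    for prompt, first, second in pairs:
        principle = rng.choice(list(principles))  # assumption: one per pair
        winner = judge(principle, prompt, first, second)
        chosen, rejected = (first, second) if winner == 0 else (second, first)
        labels.append(Preference(prompt, chosen, rejected, principle))
    return labels


FLAGGED_TERMS = ("bypass", "exploit")  # hypothetical toy vocabulary


def toy_judge(principle: str, prompt: str, first: str, second: str) -> int:
    """Stand-in for a model call: prefers the response with fewer flagged terms."""
    def score(text: str) -> int:
        lowered = text.lower()
        return sum(lowered.count(term) for term in FLAGGED_TERMS)
    return 0 if score(first) <= score(second) else 1


pairs = [(
    "How do locks work?",
    "Use this exploit to bypass any lock.",
    "Pin tumbler locks use spring-loaded pins that a key lifts into line.",
)]
labels = label_with_ai(pairs, ["Choose the less harmful response."], toy_judge)
print(labels[0].chosen)
```

The output has the same shape as a human preference label, so the downstream reward-model training does not need to change. Only the source of the judgment does.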

Constitutional AI — the principles are the spec

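CAI is the concrete recipe, and it has two phases. In the supervised phase, a written constitution (a short list of principles) drives a self-critique-and-revise loop: the model drafts a response, critiques it against a principle, revises it, and is then fine-tuned on the revisions. In the reinforcement learning phase, an AI model compares pairs of responses against the constitution, and those AI-generated preferences train the preference model that supplies the reward. The model learns to be harmless by critiquing itself against the constitution, not by humans hand-labeling every harm.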
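The critique-and-revise loop can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `critique_and_revise`, the prompt wording, and `stub_model` are all hypothetical, and the stub returns canned strings so the example runs without a real model.

```python
import random
from typing import Callable, Sequence, Tuple

# Any text-in, text-out model call fits this signature.
Generate = Callable[[str], str]


def critique_and_revise(
    prompt: str,
    principles: Sequence[str],
    generate: Generate,
    rounds: int = 2,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return (prompt, final revision), usable as a fine-tuning example."""
    rng = random.Random(seed)
    response = generate(prompt)  # initial draft
    for _ in range(rounds):
        principle = rng.choice(list(principles))
        critique = generate(
            "Critique the response against the principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    return prompt, response


def stub_model(text: str) -> str:
    """Hypothetical stand-in for a language model, with canned outputs."""
    if text.startswith("Critique"):
        return "The response gives unsafe detail."
    if text.startswith("Rewrite"):
        return "A safer answer."
    return "An unsafe first draft."


example = critique_and_revise("How do I do X?", ["Be harmless."], stub_model)
print(example)
```

Note that no human appears inside the loop. The human contribution is the list passed as `principles`.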

Say it back

RLHF puts the bottleneck on human labelers. RLAIF moves it onto a written constitution and an AI critic. The win isn't "no humans": humans now spend their scarce attention writing and auditing principles, the highest-leverage place to put it, instead of grading individual outputs.

Why this is a safety move, not just an efficiency one

Cheaper labeling is the obvious benefit, but the safety argument is deeper. A constitution is an auditable, versioned artifact: you can read it, argue with it, change a single principle, and retrain to see how behavior shifts. Compare that to a reward model trained on a million private human judgments nobody can inspect. CAI makes the value specification legible, and legibility is exactly what every later governance and model-card claim depends on.
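One way to picture "auditable, versioned artifact" is to treat the constitution as plain data that can be fingerprinted and diffed like any other spec. The sketch below is hypothetical throughout: the `Constitution` class, the principle ids, and the principle wordings are invented for illustration and are not taken from any published constitution.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Constitution:
    version: str
    principles: Dict[str, str]  # principle id -> wording

    def fingerprint(self) -> str:
        """Short content hash, so a trained model can cite the exact spec."""
        payload = json.dumps(
            {"version": self.version, "principles": self.principles},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def diff_principles(
    old: Constitution, new: Constitution
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each changed principle id to its (old wording, new wording)."""
    ids = sorted(set(old.principles) | set(new.principles))
    return {
        pid: (old.principles.get(pid), new.principles.get(pid))
        for pid in ids
        if old.principles.get(pid) != new.principles.get(pid)
    }


v1 = Constitution("1.0", {
    "harm": "Choose the less harmful response.",
    "honesty": "Choose the more honest response.",
})
v2 = Constitution("1.1", {
    "harm": "Choose the response that is less harmful without being evasive.",
    "honesty": "Choose the more honest response.",
})

# A reviewer sees exactly which principle changed between versions.
for pid, (before, after) in diff_principles(v1, v2).items():
    print(pid, "|", before, "->", after)
```

Nothing comparable exists for a pile of individual human judgments: there is no single document to diff.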

The honest limit

RLAIF doesn't escape the alignment problem — it relocates it. Now the AI critic's judgment and the constitution's wording carry the load, and a flaw in either propagates everywhere. "The constitution said so" is only as safe as the constitution. Keep this caveat; it's what separates someone who read the abstract from someone who understood it.

Your work today

Read the Method

~60 minutes

Read the method section of Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — focus on the critique-and-revise loop and how AI feedback replaces human harm-labeling.
Skim the constitution principles in the appendix. Note how short and human-readable the spec is — and write one principle you'd add.
In a notebook, write two sentences: one distinguishing RLHF from RLAIF, and one defining "scalable oversight" in your own words.

The expert move

An enthusiast says "Constitutional AI means the AI trains itself." An expert names the real shift in altitude: CAI doesn't remove human judgment, it relocates it from labeling individual outputs to writing and auditing a small, legible constitution — moving the scarce human attention to the highest-leverage point and making the value spec inspectable. They also name the cost: the constitution and the AI critic now carry the alignment, so a flaw in either scales.

Say this in an interview: "RLAIF and Constitutional AI matter because they move oversight off human labeling and onto an auditable constitution — that's how supervision scales. But I'd flag that it relocates the alignment problem rather than solving it: the model is only as aligned as the principles and the AI critic enforcing them."

Today's Takeaways

Scalable oversight: how do you supervise a system better than you at the task? It's the week's organizing problem.
RLHF is bottlenecked on human labelers; RLAIF moves the bottleneck onto a written constitution + AI critic.
Constitutional AI's win is legibility: an auditable, versioned spec instead of opaque human preferences.
It relocates the alignment problem, it doesn't dissolve it — the constitution and critic now carry the load.