Week 9 of 12 · Part B — Alignment Literacy

Weak-to-Strong Generalization

If future supervisors are weaker than the models they oversee, can a weak teacher still elicit a strong student's full capability?

Day 43 of 60 · ~65 minutes · Concept

The superalignment framing

Constitutional AI and debate try to build a stronger supervision signal. Weak-to-strong asks the question from the other end: if models become more capable than we are, humans will be the weak supervisors. So the empirical question is whether a weak teacher's imperfect labels can still draw out the full capability of a stronger student, or whether the student just learns to imitate the teacher's mistakes. That's the testable core of the superalignment problem.

The thesis

The clever move is to study the future problem with today's models: use a weak model to supervise a strong model, as an analogy for humans supervising superhuman systems. If the strong student can be made to generalize beyond the weak supervisor's errors — to do better than its own teacher — then weak human oversight of strong models isn't hopeless. The result is partial and fragile, which is exactly why it's a research frontier and not a victory lap.

What the setup actually shows

Core Theory

The analogy: weak supervisor, strong student

Fine-tune a strong pretrained model using labels generated by a much weaker model. The weak labels are noisy and wrong in places — a stand-in for fallible human supervision of a system we can't fully check.

The hopeful result

The strong student often recovers much of its latent capability and generalizes past the weak teacher's mistakes, rather than perfectly mimicking them. The pretrained model already "knows" more than the weak labels convey; good supervision is partly about eliciting that latent knowledge, not installing it.
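To make the setup and the hopeful result concrete, here is a toy sketch in Python. It is not the paper's experiment: the "weak supervisor" is just ground truth with random label flips, and the "strong student" is a linear model whose hypothesis class cannot memorize those flips, which stands in loosely for a pretrained model's prior. The function name, sizes, and flip rate are all assumptions chosen for illustration. Real weak supervisors make systematic errors that a strong model can learn to imitate, which is where the hard part begins.

```python
import numpy as np


def run_toy_weak_to_strong(n_train=4000, n_test=2000, dim=10,
                           flip_prob=0.3, seed=0):
    """Toy analogy only: train a student on noisy 'weak' labels and
    score it against ground truth it never saw during training."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=dim)  # hidden rule that defines the truth

    def make_split(n):
        x = rng.normal(size=(n, dim))
        y = np.where(x @ true_w > 0, 1.0, -1.0)  # ground-truth labels
        return x, y

    x_train, y_train = make_split(n_train)
    x_test, y_test = make_split(n_test)

    # Weak supervisor (assumption): correct labels, flipped at random.
    flips = rng.random(n_train) < flip_prob
    weak_labels = np.where(flips, -y_train, y_train)

    def fit_and_score(labels):
        # Student: least-squares linear fit, scored on true test labels.
        w, *_ = np.linalg.lstsq(x_train, labels, rcond=None)
        preds = np.where(x_test @ w > 0, 1.0, -1.0)
        return float(np.mean(preds == y_test))

    return {
        # How often the weak labels themselves are right.
        "weak_supervisor": float(np.mean(weak_labels == y_train)),
        # Student trained only on the weak labels.
        "weak_to_strong": fit_and_score(weak_labels),
        # Same student trained on ground truth (the ceiling).
        "strong_ceiling": fit_and_score(y_train),
    }


if __name__ == "__main__":
    for name, acc in run_toy_weak_to_strong().items():
        print(f"{name}: {acc:.3f}")
```

In this toy the student ends up more accurate than the labels it was trained on, because random flips do not fit any pattern the student can represent. That is the intuition behind "generalizing past the teacher's mistakes", and it is also why the toy is optimistic: swap the random flips for consistent, learnable errors and the student has something to imitate.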

The sober caveat

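It only recovers part of the gap between the weak supervisor's performance and the strong model's ceiling (what the strong model reaches when trained on ground-truth labels), the effect depends heavily on method and task, and "elicit the truth the model already represents" is far from solved. It's encouraging evidence that weak supervision isn't doomed, not a guarantee that it works for the cases that matter most.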
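"Part of the gap" can be stated as a number. Burns et al. report a metric they call performance gap recovered (PGR): the fraction of the distance between the weak supervisor and the strong ceiling that the weakly supervised student closes. The sketch below computes it. The function name and the example accuracies are made up for illustration, not taken from the paper.

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """PGR = (weak-to-strong - weak) / (strong ceiling - weak).

    1.0 means the student reached the ceiling despite weak labels;
    0.0 means it did no better than its weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        # No gap to recover: the metric is undefined in this case.
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to show the arithmetic.
    pgr = performance_gap_recovered(0.60, 0.70, 0.80)
    print(f"PGR = {pgr:.2f}")
```

With these made-up numbers the student closes half the gap. When you read the results, check which method and task each PGR figure belongs to, since the caveat above is precisely that the number moves a lot between them.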

The reframe to keep

Alignment is partly an elicitation problem. A strong model may already represent the correct answer internally; the challenge is getting weak supervision to draw it out instead of overwriting it with the supervisor's errors. That reframing connects directly back to interpretability (Week 8): if you could read the model's internals, eliciting what it knows would be far easier.

How it sits next to debate and CAI

Debate manufactures a better signal through adversarial structure; CAI manufactures one through a written constitution; weak-to-strong asks whether the model's own latent capability can cover for a weak signal. They're complementary answers to one question — and none of them dominates, which is the honest state of the field. Tomorrow you'll put all of them in one matrix and score the tradeoffs explicitly.

Your work today

Read the Setup + Results

~65 minutes

  1. Read §1 and the results of Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (Burns et al., 2023). Focus on the weak-supervisor / strong-student setup and exactly what was and wasn't recovered.
  2. Write two sentences: one stating the result plainly, one stating its strongest limitation honestly.
  3. In your Part B notes, contrast weak-to-strong with debate: one tries to elicit latent capability, the other tries to structure a better signal. When would you reach for each?

The expert move

A reader who skimmed the headline says "weak-to-strong proves weak humans can align strong AI." An expert holds both altitudes at once: it's encouraging evidence that supervision can generalize beyond a teacher's errors, and it's also a partial, method-dependent result that leaves elicitation unsolved. The jump is from reporting a finding to framing alignment as an elicitation problem, and stating the limits as crisply as the promise.

Say this in an interview: "Weak-to-strong generalization reframes alignment as elicitation: the analogy of a weak model supervising a strong one shows the student can generalize past the teacher's mistakes — which is real hope for weak human oversight of strong models. But it only recovers part of the gap and depends on method and task, so I'd treat it as evidence, not a solution."

Today's Takeaways