Why "just do more RLHF" is not a safety plan — and what "shallow alignment" really means
Day 29 of 60
Reinforcement learning from human feedback (RLHF) is how most deployed models are made helpful and harmless. It works well enough that it's easy to assume more of it equals more safety. This week's literacy payoff is being able to say, precisely, why that assumption is false — not as cynicism, but as a sober read of where the technique's guarantees actually stop.
RLHF has fundamental limitations at three layers: the human feedback feeding it, the reward model trained from that feedback, and the policy optimized against the reward model. None of these is fixed by scaling the same loop. And the alignment it produces can be shallow — concentrated in the first few output tokens — which is exactly why it's so easy to strip off.
Human raters are inconsistent, time-pressured, and not expert in everything they judge. They reward answers that look good — confident, fluent, agreeable — which trains sycophancy and rewards persuasive wrongness. And no human can reliably supervise a task they can't themselves verify, which becomes the scalable-oversight problem you meet in Week 9.
The reward model is a learned approximation of human preference — itself a proxy, and therefore Goodhart-able. The policy finds inputs where the reward model scores high but a human wouldn't agree. Optimizing harder against an imperfect reward model is reward hacking (Day 27) one level up.
Even with decent feedback and a decent reward model, the optimizer exploits the gaps between them, and pushing optimization pressure (more steps, higher KL budget) widens the divergence rather than closing it. There is no setting of "more RLHF" that makes the proxies become the real goal.
The most consequential recent result here: today's safety alignment is often shallow. The safe behavior is concentrated in the opening tokens of a response — the model learns to start a refusal — rather than being a deep property of how it reasons. That's why so many attacks work: prefill a compliant opening, nudge past the first few tokens, and the safety behavior frequently doesn't recover.
This is the mechanism under the jailbreaks you studied. Many-shot priming, prefilling, and refusal-suppression all work because they get the model past the shallow safety layer. Shallow alignment also explains why a few steps of fine-tuning — even benign-looking fine-tuning — can undo safety training: there wasn't much depth to remove. The implication for anyone offering fine-tuning access is direct and serious.
An enthusiast treats RLHF as the safety solution. An expert treats it as a useful technique with a known ceiling, and can name exactly where the ceiling is — flawed human feedback, a Goodhart-able reward model, gameable policy optimization, and alignment shallow enough to strip in a few tokens. The altitude jump is reasoning about a method's limits, not just its benchmark wins, because that's what tells you which residual risks survive the training you already did.
Say this in an interview: "RLHF is necessary but not sufficient. Its feedback, reward model, and policy each fail in characteristic ways, and recent work shows the resulting safety can be shallow — a few tokens deep — which is why it's so easy to jailbreak or fine-tune away. So I don't read 'we did RLHF' as 'it's aligned'; I ask which residual failure modes survive it."