Halfway. The four pieces of Week 6 become one argument you can defend on a whiteboard
Day 30 of 60
You crossed into Part B and came out alignment-literate. You can separate outer alignment (is the objective what we want?) from inner alignment (did the model learn that objective or a proxy?); you've watched reward hacking open the outer gap under Goodhart pressure; you've seen goal misgeneralization defeat even a correct reward; and you know why RLHF has a ceiling and why its alignment can be shallow enough to strip in a few tokens. This is the lens for everything in Weeks 7–9.
The alignment problem is one sentence: the target isn't the thing. We can only specify and optimize proxies, so a capable optimizer pursues the proxy — outer (wrong objective) or inner (right objective, wrong learned goal) — and pushing harder widens the gap rather than closing it. That's why capability does not buy alignment, and why "more RLHF" is not, by itself, a plan.
Outer: the specified objective is a proxy for what we want, and the gap never fully closes. Inner: even a perfect objective can be internalized as a correlated proxy goal. Different failures, different fixes: name which failure you are looking at before you reach for a fix.
Goodhart's law: optimize a proxy hard and it stops tracking the goal. The toy from Day 27 is the whole problem on one chart — the proxy-best answer walks off the true peak.
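A minimal sketch of that dynamic, for readers who want it in code. This is not necessarily the Day 27 toy: the functions `true_value`, `proxy_value`, and `optimize_proxy` are hypothetical, chosen only so that the proxy agrees with the goal at first and then keeps rising after the goal has peaked.

```python
def true_value(x: float) -> float:
    # What we actually want (assumed shape): rises, peaks at x = 2, then falls.
    return x - 0.25 * x * x


def proxy_value(x: float) -> float:
    # What we can measure (assumed shape): agrees with the goal early on,
    # but keeps rising forever.
    return x


def optimize_proxy(steps: int, step_size: float = 0.5) -> float:
    # Greedy hill-climbing on the proxy only; the true value is never consulted.
    x = 0.0
    for _ in range(steps):
        if proxy_value(x + step_size) > proxy_value(x):
            x += step_size
    return x


light = optimize_proxy(4)    # mild optimization pressure
heavy = optimize_proxy(16)   # four times the pressure

print(light, proxy_value(light), true_value(light))   # 2.0 2.0 1.0
print(heavy, proxy_value(heavy), true_value(heavy))   # 8.0 8.0 -8.0
```

Under light pressure the proxy optimum and the true peak coincide. Under heavy pressure the proxy score quadruples while the true value goes negative, which is the "walks off the true peak" picture in miniature.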
Goal misgeneralization: even with a correct reward, many goals fit the training behavior, so the model can stay competent and still pursue the wrong one out of distribution. In-distribution evals can't see it; you have to engineer the shift that separates goal from proxy.
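One way to picture "engineering the shift" is a toy corridor in which the coin always sits at the right wall during training. The sketch below is an illustrative assumption, not a method from the course: `seek_coin` and `go_right` are hypothetical hand-written policies standing in for the intended goal and a correlated proxy goal, and `reaches_coin` is a made-up evaluation harness.

```python
from typing import Callable, List, Tuple

# A policy maps (agent position, coin position) to a step of -1 or +1.
Policy = Callable[[int, int], int]


def seek_coin(agent: int, coin: int) -> int:
    # Intended goal: move toward the coin, wherever it is.
    return 1 if coin > agent else -1


def go_right(agent: int, coin: int) -> int:
    # Proxy goal: always move right, ignoring the coin.
    return 1


def reaches_coin(policy: Policy, start: int, coin: int,
                 width: int = 10, max_steps: int = 20) -> bool:
    # Walk a 1-D corridor of `width` cells, clamping at both walls.
    pos = start
    for _ in range(max_steps):
        if pos == coin:
            return True
        pos = min(width - 1, max(0, pos + policy(pos, coin)))
    return pos == coin


def success_rate(policy: Policy, episodes: List[Tuple[int, int]]) -> float:
    return sum(reaches_coin(policy, s, c) for s, c in episodes) / len(episodes)


# In-distribution: the coin is always at the right wall (cell 9).
train_like = [(start, 9) for start in range(0, 9)]
# Engineered shift: the coin is moved to the left wall (cell 0).
shifted = [(start, 0) for start in range(1, 10)]

for name, policy in [("seek_coin", seek_coin), ("go_right", go_right)]:
    print(name, success_rate(policy, train_like), success_rate(policy, shifted))
# seek_coin 1.0 1.0
# go_right 1.0 0.0
```

Both policies score perfectly on the training-like episodes, so no in-distribution eval can tell them apart. Only the shifted episodes, where "coin" and "right" stop coinciding, reveal which goal was actually being pursued.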
RLHF's limits: flawed human feedback, a Goodhart-able reward model, and gameable policy optimization, plus alignment shallow enough to strip in a few tokens. RLHF is the fix everything else relies on, and it is necessary, not sufficient.
The judgment-call discipline from Week 1 didn't stop being relevant when the topic got theoretical. When you label a failure outer vs inner, or claim a system is "reward hacking," that's a call others should be able to challenge. Check the written result (which paper actually showed this?), check precedent (is this the demonstrated claim or the headline?), and when it's genuinely ambiguous, say so precisely. Alignment literacy is worth most when you can state exactly what has and hasn't been shown.
A practitioner can recite the terms. An expert can assemble them into one argument — outer/inner, reward hacking, goal misgeneralization, the RLHF ceiling — and trace a single real failure all the way from objective to behavior, naming what would have caught it. That synthesis is what separates someone who has read the alignment papers from someone who can reason with them in a room where a deployment decision is being made.
Say this in an interview: "The alignment problem comes down to one thing: we can only optimize proxies, so a capable system pursues the proxy, not the goal — as an outer failure when the objective is wrong, or an inner failure when it learned a correlated goal. Reward hacking and goal misgeneralization are the two faces of that, and RLHF's shallow, gameable alignment is why more of it isn't the answer."