Adding the deeper layers — robustness, alignment, the risk register, and the governance gaps — so the program is coherent across all three layers of the field
Day 58 of 60
Yesterday's spine covered the applied core: policy, red-team, evals. But this whole track moved through three layers — applied safety, alignment research literacy, and governance — and a program that only shows the applied layer reads as shallow to anyone senior. Today you bolt on the rest: the robustness report, the alignment and interpretability note, the risk register, and the governance gap list. When you're done, the binder spans all three layers and tells one continuous story.
Applied safety tells you whether the model fails today. Alignment literacy tells you why a more capable version might fail in ways your evals can't see. Governance tells you who's accountable and against what framework. A credible program answers all three — because a deployment decision that ignores any one of them is overconfident.
Drop in your robustness report and its defense-in-depth matrix. Its job in the program is honesty about brittleness: safety-tuning alone is breakable, so the report lists the attack classes (jailbreaks, indirect prompt injection), the layered defenses against each, and the residual attack-success rate after those defenses. This is where the program admits what it can't fully stop, and says how it'll monitor for it.
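One way to sanity-check the matrix is to combine per-layer bypass rates into a single residual attack-success rate. The sketch below is only illustrative: the layer names and rates are made up, and it assumes each layer fails independently, which is usually optimistic because one attack often defeats several layers at once. Treat the result as a floor to compare against your measured end-to-end rate, not a replacement for it.

```python
from math import prod

def residual_asr(bypass_rates):
    """Estimate the share of attacks that get past every layer.

    ASSUMPTION: layers fail independently, so the combined rate is the
    product of the per-layer bypass rates. Real layers are often correlated,
    so the true rate can be higher than this estimate.
    """
    rates = list(bypass_rates)
    for r in rates:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"bypass rate must be in [0, 1], got {r}")
    return prod(rates)

# Hypothetical numbers for one attack class (not real measurements).
layers = {
    "safety_tuning": 0.20,      # 20% of attempts bypass the tuned model
    "input_filter": 0.50,       # half of those also slip past the input filter
    "output_classifier": 0.25,  # a quarter of those evade the output check
}

combined = residual_asr(layers.values())
print(f"Estimated residual attack-success rate: {combined:.1%}")
```

If your measured end-to-end rate is well above this estimate, the layers are probably failing together, and that correlation is itself worth a line in the report.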
Include your brief on deception, sycophancy, and the limits of interpretability. Its job is epistemic humility: behavioral evals can only catch what they probe, and a capable model can pass them while pursuing the wrong objective. You're not claiming to have solved alignment — you're showing you know which of your assurances are behavioral (and therefore bounded) versus mechanistic.
Include the risk register mapped to a recognized framework (NIST AI RMF) and the governance gap list from your compliance check. Their job is to turn findings into owned, tracked items against an external standard — so "we found risks" becomes "here are the residual risks, their owners, and the gaps we must close before or shortly after launch."
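A register entry can be as small as a typed record. The sketch below is one hypothetical shape, not a prescribed schema: the field names, IDs, and sample risks are invented for illustration. The only external fact it relies on is that the NIST AI RMF core is organized into four functions (Govern, Map, Measure, Manage).

```python
from dataclasses import dataclass
from typing import Optional

# The four core functions of the NIST AI RMF.
RMF_FUNCTIONS = {"GOVERN", "MAP", "MEASURE", "MANAGE"}

@dataclass
class RiskEntry:
    # All field names here are illustrative assumptions.
    risk_id: str
    description: str
    rmf_function: str          # which RMF function the item is tracked under
    residual_level: str        # e.g. "low", "medium", "high"
    owner: Optional[str] = None

    def __post_init__(self):
        if self.rmf_function not in RMF_FUNCTIONS:
            raise ValueError(f"unknown RMF function: {self.rmf_function}")

def unowned(register):
    """Return the IDs of entries that nobody is accountable for yet."""
    return [entry.risk_id for entry in register if not entry.owner]

# Hypothetical register contents.
register = [
    RiskEntry("R-001", "Indirect injection via retrieved documents",
              "MANAGE", "medium", owner="platform-security"),
    RiskEntry("R-002", "Sycophantic agreement on medical questions",
              "MEASURE", "high"),
]

print(unowned(register))
```

Any ID this returns is a governance gap by definition: a found risk that has not yet become an owned, tracked item.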
Before you look at your assembled results, write down what verdict each outcome implies: which findings force a NO-GO, which allow GO-with-conditions, which are acceptable residual risk. Setting the decision rule after seeing the results is how programs rationalize shipping. Pre-registering it is what makes tomorrow's recommendation defensible.
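Writing the rule as code is one way to make sure it is fixed before the results arrive. This is a minimal sketch with hypothetical thresholds and field names; your own rule will differ, and the point is only that it is committed in advance and then applied mechanically.

```python
# Hypothetical severity scale; substitute the one your policy defines.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def verdict(findings):
    """Apply a pre-registered rule to a list of findings.

    Each finding is a dict with a 'severity' key and a boolean 'mitigated'
    key. ASSUMED rule, for illustration only:
      - any unmitigated high or critical finding -> NO-GO
      - any unmitigated medium finding           -> GO-WITH-CONDITIONS
      - otherwise                                -> GO (acceptable residual risk)
    """
    worst_open = max(
        (SEVERITY_RANK[f["severity"]] for f in findings if not f["mitigated"]),
        default=-1,  # no open findings at all
    )
    if worst_open >= SEVERITY_RANK["high"]:
        return "NO-GO"
    if worst_open == SEVERITY_RANK["medium"]:
        return "GO-WITH-CONDITIONS"
    return "GO"

# Hypothetical findings, filled in only after the rule above is frozen.
findings = [
    {"id": "F-1", "severity": "high", "mitigated": True},
    {"id": "F-2", "severity": "medium", "mitigated": False},
    {"id": "F-3", "severity": "low", "mitigated": False},
]

print(verdict(findings))
```

Commit the rule, with a timestamp, before you open the results. If you later want to change a threshold, record that as a separate, explained change rather than a silent edit.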
The trap at this stage is a binder that's complete but not coherent — eight sections that each make sense alone but don't connect. Coherence means the robustness report's residual attack-success rate shows up in the risk register; the alignment note's "we can't fully verify intent" caveats the eval section's pass; the governance gaps name owners who appear in the risk register. The reader should be able to trace one risk from threat model, through how it was tested, to its residual level and who owns it.
Pick your single highest risk. Can you trace it across the whole binder: named in the threat model, defined by the policy, tested by the red-team, measured by an eval, defended in the robustness report, caveated by the alignment note, logged in the risk register, and owned in the governance section? If any link is missing, that's today's last edit. Then quantify the residual risk, meaning what is left once your defenses are applied, for example as a residual attack-success rate, so the risk register carries a number rather than an adjective.
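The trace can be run as a mechanical check if each section records which risk IDs it mentions. The sketch below assumes a hypothetical index from section name to risk IDs; the section keys mirror the eight links in the chain, and the IDs are invented.

```python
# The eight links in the chain, in the order a reader would follow them.
BINDER_SECTIONS = [
    "threat_model", "policy", "red_team", "eval",
    "robustness_report", "alignment_note", "risk_register", "governance",
]

def missing_links(risk_id, binder):
    """List the sections that never mention the given risk."""
    return [s for s in BINDER_SECTIONS if risk_id not in binder.get(s, set())]

# Hypothetical index: which risk IDs each section references.
binder = {
    "threat_model": {"R-001", "R-002"},
    "policy": {"R-001", "R-002"},
    "red_team": {"R-001"},
    "eval": {"R-001", "R-002"},
    "robustness_report": {"R-001"},
    "alignment_note": {"R-002"},
    "risk_register": {"R-001", "R-002"},
    "governance": {"R-001"},
}

print(missing_links("R-001", binder))
```

An empty list means the risk traces end to end. Anything else names the sections that need today's last edit.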
A junior shows the model passed the evals. An expert shows the evals' blind spots too — the brittleness in the robustness report, the limits of behavioral testing in the alignment note, the residual risks the register tracks — and still makes a call. The altitude jump is from "it passed" to "here's exactly how confident I am, why, and what I'm watching that could change my mind."
Say this in an interview: "My program spans all three layers — I don't just show the model passed today's evals, I show what those evals can't see: the brittleness, the limits of behavioral assurance, the residual risks mapped to a framework with owners. I pre-register the verdict criteria so the recommendation is a rule applied, not a result rationalized."