Bolting policy, red-team, and evals into one document that's internally consistent on a single definition of "safe enough"
Day 57 of 60
Yesterday you scoped: a subject, a falsifiable bar, a map of which artifact fills each section. Today you build the spine — the one document that names every part of the program and points to the artifact behind it. The deliverable is safety_program.py in the Try This box: a single dictionary where every line is a real thing you produced. Run it and it prints your whole program back to you as a contents page.
The spine's value isn't the code — it's the forcing function. If a line can't point to an artifact, that part of your program doesn't exist yet. Assembling the spine turns "I think I covered everything" into a checklist you can fail, which is exactly what a review board will do to you.
Today's focus is the applied core: policy, red-team plan, and eval suite. Each already exists. The build work is making them agree — because assembled carelessly, they contradict each other, and a reviewer will find it.
Drop in your taxonomy and refusal policy as the program's definition of harm. Everything downstream must use these category names and severity tiers. If your eval scores "harmful" using different categories than your policy defines, the program isn't internally consistent — fix the eval to speak the policy's language.
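One way to check that the eval "speaks the policy's language" is to compare the two vocabularies directly. The sketch below is illustrative only: `POLICY_TAXONOMY`, `EVAL_LABELS`, and the category names are hypothetical stand-ins for whatever your taxonomy file and safety_eval.py actually define.

```python
# Hypothetical stand-ins: replace with the names your own taxonomy and
# safety_eval.py actually use. Category -> severity tier.
POLICY_TAXONOMY = {
    "self_harm": "high",
    "weapons": "high",
    "harassment": "medium",
    "spam": "low",
}

# Labels the eval scores against (also hypothetical).
EVAL_LABELS = ["self_harm", "weapons", "hate", "spam"]


def vocabulary_mismatch(policy, eval_labels):
    """Return two sorted lists:
    labels the eval uses that the policy never defines, and
    policy categories the eval never scores."""
    policy_names = set(policy)
    eval_names = set(eval_labels)
    unknown_to_policy = sorted(eval_names - policy_names)
    unscored_by_eval = sorted(policy_names - eval_names)
    return unknown_to_policy, unscored_by_eval


unknown, unscored = vocabulary_mismatch(POLICY_TAXONOMY, EVAL_LABELS)
print("Eval labels missing from policy:", unknown)
print("Policy categories the eval skips:", unscored)
```

Either list being non-empty means the two sections use different category sets; under the rule above, the eval is the side that changes.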
Include the attack categories, success criteria, logging fields, and the coverage report from redteam_log.py. The plan must cover the same severity tiers your policy names — an attack class your policy calls high-severity but your red-team never tested is a visible hole.
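The "visible hole" check can be sketched as a small coverage query. This is not the coverage report from redteam_log.py; the record fields (`category`, `succeeded`) and the sample data are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical policy taxonomy: category -> severity tier.
POLICY_TAXONOMY = {
    "self_harm": "high",
    "weapons": "high",
    "harassment": "medium",
    "spam": "low",
}

# Hypothetical red-team log records; your redteam_log.py fields may differ.
REDTEAM_ATTEMPTS = [
    {"category": "self_harm", "succeeded": False},
    {"category": "self_harm", "succeeded": True},
    {"category": "spam", "succeeded": False},
]


def untested_categories(policy, attempts, tier):
    """Policy categories at the given severity tier with zero logged attempts."""
    tested = Counter(attempt["category"] for attempt in attempts)
    return sorted(
        category
        for category, severity in policy.items()
        if severity == tier and tested[category] == 0
    )


holes = untested_categories(POLICY_TAXONOMY, REDTEAM_ATTEMPTS, "high")
print("High-severity categories never attacked:", holes)
```

Running the same query for each tier the policy names gives a per-tier list of gaps to close or to justify in the plan.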
Include safety_eval.py and its safe-refusal / harmful-compliance / over-refusal scorecard. Its pass bar must be the same "safe enough" number you pre-registered on Day 56. One definition of safe enough, used by the eval, referenced by the policy, targeted by the red-team — that's what "internally consistent" means.
The most common capstone failure is three sections that each quietly use a different bar. The policy says one thing is high-severity, the red-team prioritizes another, the eval's pass line is a third number. Pick one definition of "safe enough" from your scope and make all three sections cite it by reference, not restate it from memory.
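"Cite it by reference, not restate it from memory" has a literal reading in code: define the bar once and have every section hold that same object. A minimal sketch, assuming the bar is a single threshold on harmful-compliance rate; the metric name and both numbers are placeholders, not your Day 56 values.

```python
# Placeholder value: substitute the number you pre-registered on Day 56.
SAFE_ENOUGH = {"metric": "harmful_compliance_rate", "max_allowed": 0.01}

# Hypothetical section records. Two cite the shared bar; one restates it.
SECTIONS = {
    "policy": {"bar": SAFE_ENOUGH},
    "red_team": {"bar": SAFE_ENOUGH},
    # Restated from memory instead of referenced, and it has drifted.
    "eval_suite": {"bar": {"metric": "harmful_compliance_rate", "max_allowed": 0.02}},
}


def sections_restating_bar(sections, bar):
    """Names of sections whose bar is a separate copy, not the shared object."""
    return [name for name, section in sections.items() if section["bar"] is not bar]


def passes(observed_rate, bar=SAFE_ENOUGH):
    """True when the observed rate is at or below the shared bar."""
    return observed_rate <= bar["max_allowed"]


print("Sections that restate the bar:", sections_restating_bar(SECTIONS, SAFE_ENOUGH))
```

The identity check (`is not`) is deliberately strict: it flags a copy even when the numbers currently match, because a copy is what drifts later.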
Open the Try This box and run safety_program.py as written. Then replace every value with your artifact and a one-line description of where it lives. The output is your program's table of contents — and the moment a line reads "TODO" instead of a filename, you've found work to do before the binder is real.
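The version in the Try This box is the one to run. For orientation only, here is a hypothetical sketch of the same shape: section names mapped to artifacts, printed as a contents page, with unfinished lines flagged. The section titles and the flagging helper are assumptions, not the course's code.

```python
# Hypothetical spine: section -> artifact (or "TODO" if it does not exist yet).
PROGRAM = {
    "Threat model": "threat_model.py",
    "Taxonomy and refusal policy": "TODO",
    "Red-team plan": "redteam_log.py",
    "Eval suite": "safety_eval.py",
}


def is_todo(artifact):
    """Treat any value starting with 'TODO' (any case) as missing work."""
    return artifact.strip().upper().startswith("TODO")


def contents_page(program):
    """Render the program as numbered lines, marking sections with no artifact."""
    lines = []
    for number, (section, artifact) in enumerate(program.items(), start=1):
        flag = "  <-- work to do" if is_todo(artifact) else ""
        lines.append(f"{number}. {section}: {artifact}{flag}")
    return lines


def missing_sections(program):
    """Sections that still read TODO instead of pointing at a file."""
    return [section for section, artifact in program.items() if is_todo(artifact)]


for line in contents_page(PROGRAM):
    print(line)
```

An empty `missing_sections(PROGRAM)` is the checklist passing; anything else is the list of work left before the binder is real.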
Edit the PROGRAM dict so each line points to a file you actually have: threat_model.py, your taxonomy, redteam_log.py, safety_eval.py. Then add your top-3 failure modes with a mitigation and a monitoring trigger for each — the program isn't just "here's what I built," it's "here's what I'd watch in production."
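The failure-mode entries can sit alongside the artifact lines in the same file. The sketch below assumes a simple list-of-dicts layout; the three failure modes, their mitigations, and their triggers are invented placeholders to show the shape, not recommendations.

```python
# Placeholder entries: replace each with a failure mode from your own program.
FAILURE_MODES = [
    {
        "name": "Role-play framing bypasses refusal",
        "mitigation": "Add role-play variants to the red-team plan",
        "monitoring_trigger": "Harmful-compliance rate exceeds the pre-registered bar",
    },
    {
        "name": "Over-refusal on benign medical questions",
        "mitigation": "Expand benign examples in the eval suite",
        "monitoring_trigger": "",  # left blank on purpose: an incomplete entry
    },
    {
        "name": "Eval labels drift from policy taxonomy",
        "mitigation": "Check label names against the taxonomy before each run",
        "monitoring_trigger": "Any eval label not found in the policy",
    },
]

REQUIRED_FIELDS = ("name", "mitigation", "monitoring_trigger")


def incomplete_entries(modes):
    """Names of failure modes missing any required field or leaving one empty."""
    return [
        mode.get("name") or "<unnamed>"
        for mode in modes
        if any(not mode.get(field) for field in REQUIRED_FIELDS)
    ]


print("Failure modes listed:", len(FAILURE_MODES))
print("Incomplete entries:", incomplete_entries(FAILURE_MODES))
```

A failure mode without a monitoring trigger is only a description of what was built; the trigger is what makes it something you would watch in production.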
Run safety_program.py from the Try This box, then rewrite every value to point to one of your artifacts.
A junior hands over three documents and lets the reviewer reconcile them. An expert hands over one program whose sections already agree (same categories, same severity tiers, same bar), so the review is about the verdict, not about untangling contradictions. The altitude jump is from "I produced the parts" to "I made the parts cohere," which is the actual job of a lead.
Say this in an interview: "I assemble the program around a single definition of 'safe enough.' The policy defines the categories, the red-team tests those exact categories, and the eval's pass bar is the number I pre-registered — so the document is internally consistent and a reviewer spends their time on the decision, not on reconciling my sections."
The spine (safety_program.py) is a forcing function: every line must point to a real artifact or it's a TODO.