Week 12 of 12 · Part C — Governance

Assembling the Program

Bolting policy, red-team, and evals into one document whose sections all share a single definition of "safe enough"

Day 57 of 60 · ~75 minutes · Build

From a table of contents to a spine

Yesterday you scoped the program: a subject, a falsifiable bar, and a map of which artifact fills each section. Today you build the spine, the one document that names every part of the program and points to the artifact behind it. The deliverable is safety_program.py in the Try This box: a single dictionary in which every entry points to something you actually produced. Run it and it prints your whole program back to you as a contents page.

The thesis

The spine's value isn't the code; it's the forcing function. If a line can't point to an artifact, that part of your program doesn't exist yet. Assembling the spine turns "I think I covered everything" into a checklist you can fail, and failing you against that checklist is exactly what a review board will do.

The three Part-A sections, made consistent

Today's focus is the applied core: policy, red-team plan, and eval suite. Each already exists. The build work is making them agree, because sections assembled carelessly contradict each other, and a reviewer will find every contradiction.

Core Theory

1 · Policy section — the taxonomy and refusal rules (Week 2)

Drop in your taxonomy and refusal policy as the program's definition of harm. Everything downstream must use these category names and severity tiers. If your eval scores "harmful" using different categories than your policy defines, the program isn't internally consistent — fix the eval to speak the policy's language.
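One way to check that the eval speaks the policy's language is a set comparison between the two vocabularies. The sketch below is a minimal illustration, not the course's code: POLICY_CATEGORIES, EVAL_LABELS, and category_drift are hypothetical names, and the categories are placeholder data standing in for your own Week 2 taxonomy and Week 4 eval labels.

```python
# Hypothetical stand-in for a Week 2 taxonomy: category name -> severity tier.
POLICY_CATEGORIES = {
    "self_harm": "high",
    "weapons": "high",
    "harassment": "medium",
    "spam": "low",
}

# Hypothetical category labels the eval attaches to its test cases.
EVAL_LABELS = ["self_harm", "weapons", "hate", "spam"]


def category_drift(policy, eval_labels):
    """Compare the two vocabularies.

    Returns a pair of sorted lists:
      - labels the eval uses that the policy never defines
      - policy categories the eval never scores
    """
    policy_names = set(policy)
    eval_names = set(eval_labels)
    unknown = sorted(eval_names - policy_names)
    unscored = sorted(policy_names - eval_names)
    return unknown, unscored


unknown, unscored = category_drift(POLICY_CATEGORIES, EVAL_LABELS)
print("Eval labels missing from the policy:", unknown)
print("Policy categories the eval never scores:", unscored)
```

With this placeholder data the eval uses a "hate" label the policy never defines and never scores "harassment". Both are drift to fix on the eval side, since the policy is the canonical vocabulary.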

2 · Red-team section — the plan and coverage (Week 3)

Include the attack categories, success criteria, logging fields, and the coverage report from redteam_log.py. The plan must cover the same severity tiers your policy names — an attack class your policy calls high-severity but your red-team never tested is a visible hole.
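The "visible hole" can be found mechanically: list every attack class the policy rates high-severity, then count logged attempts against it. The sketch below assumes a log shaped as a list of rows with an attack_class field. That shape, the names POLICY_TIERS, REDTEAM_LOG, and untested_classes, and the data are all assumptions for illustration; the real redteam_log.py may record things differently.

```python
from collections import Counter

# Hypothetical policy view: attack class -> severity tier.
POLICY_TIERS = {
    "prompt_injection": "high",
    "jailbreak_roleplay": "high",
    "data_exfiltration": "high",
    "toxicity_bait": "medium",
}

# Hypothetical red-team log rows; yours would come from your own log file.
REDTEAM_LOG = [
    {"attack_class": "prompt_injection", "success": False},
    {"attack_class": "prompt_injection", "success": True},
    {"attack_class": "toxicity_bait", "success": False},
]


def untested_classes(policy_tiers, log, tier="high"):
    """Return attack classes at the given tier with zero logged attempts."""
    attempts = Counter(row["attack_class"] for row in log)
    return sorted(
        name
        for name, class_tier in policy_tiers.items()
        if class_tier == tier and attempts[name] == 0
    )


print("High-severity classes never tested:", untested_classes(POLICY_TIERS, REDTEAM_LOG))
```

An empty list is the result you want to show a reviewer. Any name that appears is a class to test before the program ships.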

3 · Eval section — the scorecard (Week 4)

Include safety_eval.py and its safe-refusal / harmful-compliance / over-refusal scorecard. Its pass bar must be the same "safe enough" number you pre-registered on Day 56. One definition of safe enough, used by the eval, referenced by the policy, targeted by the red-team — that's what "internally consistent" means.

One definition, three uses

The most common capstone failure is three sections that each quietly use a different bar. The policy says one thing is high-severity, the red-team prioritizes another, the eval's pass line is a third number. Pick one definition of "safe enough" from your scope and make all three sections cite it by reference, not restate it from memory.
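"Cite by reference" has a literal equivalent in code: define the bar once and have every section hold a reference to that one object, never a retyped copy. This is a minimal sketch under stated assumptions. SAFE_ENOUGH, the section dicts, shares_one_bar, and passes are hypothetical names, and the 0.01 threshold is a placeholder for the number you actually pre-registered.

```python
# Hypothetical bar: replace the metric and threshold with your pre-registered ones.
SAFE_ENOUGH = {"metric": "harmful_compliance_rate", "max": 0.01}

# Each section points at the same object instead of restating the number.
policy_section = {"bar": SAFE_ENOUGH}
redteam_section = {"bar": SAFE_ENOUGH}
eval_section = {"bar": SAFE_ENOUGH}


def shares_one_bar(*sections):
    """True only if every section references the very same bar object."""
    first = sections[0]["bar"]
    return all(section["bar"] is first for section in sections)


def passes(scorecard, bar):
    """Apply the shared bar to an eval scorecard (a dict of metric -> rate)."""
    return scorecard[bar["metric"]] <= bar["max"]


print(shares_one_bar(policy_section, redteam_section, eval_section))

# A section that retyped the bar "from memory" holds a copy, which can drift.
drifted_section = {"bar": dict(SAFE_ENOUGH)}
print(shares_one_bar(policy_section, drifted_section))
```

The identity check (is) is deliberately stricter than equality: a copy that happens to match today can silently diverge after the next edit, which is the drift this section warns about.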

Build the spine

Open the Try This box and run safety_program.py as written. Then replace every value with your artifact and a one-line description of where it lives. The output is your program's table of contents — and the moment a line reads "TODO" instead of a filename, you've found work to do before the binder is real.
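If you don't have the Try This box in front of you, the idea fits in a few lines. What follows is a hypothetical stand-in, not the course's safety_program.py: the section names, the paths, and the helper functions contents_page and missing_sections are assumptions chosen for illustration.

```python
# Hypothetical spine: section name -> artifact that backs it.
# "TODO" marks a section with no artifact yet.
PROGRAM = {
    "Threat model": "threat_model.py",
    "Policy (taxonomy + refusal rules)": "TODO",
    "Red-team plan + coverage": "redteam_log.py",
    "Eval scorecard": "safety_eval.py",
}


def contents_page(program):
    """Render one numbered line per section, pointing at its artifact."""
    return [
        f"{number}. {section}: {artifact}"
        for number, (section, artifact) in enumerate(program.items(), start=1)
    ]


def missing_sections(program):
    """Return sections whose artifact is blank or still a TODO placeholder."""
    return [
        section
        for section, artifact in program.items()
        if not artifact.strip() or artifact.strip().upper() == "TODO"
    ]


for line in contents_page(PROGRAM):
    print(line)
print("Still to do:", missing_sections(PROGRAM))
```

Run as written, the sketch prints a four-line contents page and flags the policy section as unfinished, which is the checklist-you-can-fail behavior the spine exists to provide.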

Make it yours

Edit the PROGRAM dict so each line points to a file you actually have: threat_model.py, your taxonomy, redteam_log.py, safety_eval.py. Then add your top-3 failure modes with a mitigation and a monitoring trigger for each — the program isn't just "here's what I built," it's "here's what I'd watch in production."
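The failure-mode list can live in the same file as structured data, so a blank mitigation or trigger is as easy to spot as a TODO. The sketch below is one possible shape, assuming a small dataclass. FailureMode, FAILURE_MODES, incomplete, and the three entries are illustrative placeholders, not findings from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    mitigation: str
    trigger: str  # the production signal that should prompt action


# Hypothetical entries; yours come from your own threat model and eval results.
FAILURE_MODES = [
    FailureMode(
        name="Role-play jailbreak bypasses a refusal rule",
        mitigation="Tighten the refusal rule and add the prompt as a regression case",
        trigger="Sampled harmful-compliance rate rises above the pre-registered bar",
    ),
    FailureMode(
        name="Over-refusal on benign medical questions",
        mitigation="Add benign look-alike prompts to the eval suite",
        trigger="Over-refusal rate on the benign set rises week over week",
    ),
    FailureMode(
        name="New attack class missing from the taxonomy",
        mitigation="Periodic taxonomy review against fresh red-team findings",
        trigger="Red-team log contains an attack with no matching policy category",
    ),
]


def incomplete(modes):
    """Return names of failure modes lacking a mitigation or a trigger."""
    return [
        mode.name
        for mode in modes
        if not (mode.mitigation.strip() and mode.trigger.strip())
    ]


for mode in FAILURE_MODES:
    print(f"{mode.name} | mitigate: {mode.mitigation} | watch: {mode.trigger}")
print("Incomplete entries:", incomplete(FAILURE_MODES))
```

Writing each trigger as an observable signal, not an intention, is the design choice that matters here: it is what turns the list into something you could actually monitor in production.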

Your work today

Assemble the Spine

~75 minutes

Run safety_program.py from the Try This box, then rewrite every value to point to one of your artifacts.
Assemble the policy section from your Week 2 taxonomy + refusal rules — these become the program's canonical category names and tiers.
Assemble the red-team section from your Week 3 plan + coverage report; confirm it covers the severity tiers your policy names.
Assemble the eval section from your Week 4 scorecard; set its pass bar to the exact "safe enough" number you pre-registered on Day 56.
Sweep for consistency: do all three sections use one definition of "safe enough"? Fix any that drifted. Then list your top-3 failure modes with mitigations and triggers.

The expert move

A junior hands over three documents and lets the reviewer reconcile them. An expert hands over one program whose sections already agree — same categories, same severity tiers, same bar — so the review is about the verdict, not about untangling contradictions. The altitude jump is from "I produced the parts" to "I made the parts cohere," which is the actual job of a lead.

Say this in an interview: "I assemble the program around a single definition of 'safe enough.' The policy defines the categories, the red-team tests those exact categories, and the eval's pass bar is the number I pre-registered — so the document is internally consistent and a reviewer spends their time on the decision, not on reconciling my sections."

Today's Takeaways

The spine (safety_program.py) is a forcing function: every line must point to a real artifact or it's a TODO.
Policy (W2), red-team (W3), and evals (W4) must share one definition of "safe enough" — categories, tiers, and pass bar.
An attack class your policy calls high-severity but your red-team never tested is a visible hole.
A program states not just what you built, but the failure modes you'd monitor in production.