Week 2 of 12 · Part A — Applied Safety

Testing & Versioning the Policy

A policy is never finished — it's tested against real samples, found wanting, and versioned with reasons

Day 9 ~70 minutes Concept

Day 9 of 60

A policy is a hypothesis until you test it

Yesterday you encoded your taxonomy. Today you try to break it. A taxonomy that has only ever seen the examples its author wrote is untested — it's a hypothesis, not a policy. Real policy quality comes from running it against samples you didn't design for, finding the cases where its definitions are ambiguous or its tiers disagree with your judgment, and fixing them deliberately. The edge cases are where the standard actually gets written.

The thesis

The value of a policy isn't in its first draft — it's in the loop: test on real-ish samples → find a definitional gap → fix it → record why. Every ambiguity you resolve makes the next thousand decisions more consistent. A policy that never reaches v2 was never actually stress-tested.

This is the Week 1 Reflection Ritual applied to your own document: when a sample is ambiguous, you don't guess silently — you document the ambiguity precisely, make a provisional call, and feed it back into the policy as a sharper definition or a new tie-breaker. Disagreement is data.

How to test a taxonomy honestly

Core Theory

1 · Use samples you didn't write

Author-written examples confirm what you already believe. Pull or construct real-ish samples — borderline cases, benign look-alikes, things from a different angle than you designed for. The AILuminate hazard categories are a good source of realistic case types to probe each of your categories against.

2 · Measure agreement, not just correctness

Have someone else (or your code, or a separate reading) apply your taxonomy to the same 10+ samples and compare. Where you disagree is a definitional gap — the definition admitted two readings. Each disagreement is a precise pointer to a sentence that needs tightening.

3 · Look for the two failure shapes

Definitional gaps come in two flavors: under-coverage (a real harm has no category, or falls between two) and over-firing (a benign item matches a category because the definition forgot an exclusion). The second is your over-refusal source — fix it at the policy level, before it ever reaches a model.

Why labs publish their taxonomies and revise them

The reason frontier policies like the OpenAI Model Spec are versioned, public documents is that no one gets the categories right the first time. They ship a version, observe where it's applied inconsistently, and revise — the same loop you're running on a smaller scale. Treating your policy as a living, versioned artifact is what a mature safety org does.

Versioning: the changelog is the artifact

When you fix a definitional gap, the fix itself is only half the value — the other half is the recorded reason. A policy changelog (v1 → v2, with what changed and why) is what lets a team trust the policy and lets a future reviewer understand why a borderline case is ruled the way it is. It's also your evidence: in an interview or a review, "I revised the self-harm definition in v2 to exclude clinical/medical discussion after it over-fired on three benign samples" is a far stronger claim than "I wrote a good policy."

Produce a v2 today

The goal for today is concrete: test on 10+ samples, find and fix at least two definitional gaps, and produce a v2 of your policy with a changelog entry for each fix. The honors move is having someone else apply your taxonomy and comparing results — their disagreements are the gaps you can't see yourself.

Your work today

Test, Fix, and Version

~70 minutes

Assemble 10+ real-ish samples to test your taxonomy — include borderline cases and benign look-alikes. Use the AILuminate hazard categories to make sure you probe every category you defined.
Run each sample through your taxonomy.py and your own judgment. Where they disagree, write down the definitional gap — which sentence admitted two readings.
Fix at least two gaps (tighten a definition, add an exclusion, or add a tie-breaker). Re-run the samples to confirm the fix.
Write a policy changelog: v1 → v2, each change with its reason. Skim how the OpenAI Model Spec frames revisions for the tone.

The expert move

A practitioner writes a policy and ships it. An expert treats the policy as a living artifact under test — and knows the disagreements it produces are not failures but the highest-signal data the policy will ever generate. The altitude jump is owning the revision loop: turning each ambiguous call into a tightened definition and a changelog entry, so one person's judgment compounds into a standard a whole team can apply consistently.

Say this in an interview: "I version my policies. I test the taxonomy on samples I didn't write, treat every reviewer disagreement as a definitional gap, and fix it in a v2 with a changelog that records why. A policy's quality isn't in its first draft — it's in how disciplined the revision loop is, because that loop is what makes thousands of downstream decisions consistent."

Today's Takeaways

An untested policy is a hypothesis; quality comes from running it on samples you didn't design for.
Reviewer disagreement = a definitional gap — a precise pointer to a sentence to tighten.
Watch both failure shapes: under-coverage and over-firing (the over-refusal source).
The changelog is the artifact: v1 → v2 with reasons is what makes a policy trustworthy and defensible.