Week 2 of 12 · Part A — Applied Safety

Reading Real Safety Policies

Before you write a content policy, learn to read one — how a frontier lab turns "be safe" into a document

Day 6 ~60 minutes Concept

Day 6 of 60

Why a policy is the contract everything inherits

Last week you learned to threat-model: "what could go wrong, for whom?" This week you answer the question that immediately follows — "and what exactly do we do about it?" The answer is never a vibe. It's a document: a safety taxonomy that names the categories of harm, a severity scheme that ranks them, and a policy that says what the model should refuse, allow, or safely complete. Every filter, every eval, every human reviewer downstream is measured against that document. Authoring it well is one of the highest-leverage things a safety practitioner does.

The thesis

"Is this output harmful?" is meaningless until someone defines the categories, the tiers, and the edge rules. The policy is the contract. A label is only as good as the policy it inherited; a refusal is only as defensible as the rule behind it. Before you write one, you read the best ones in the world.

So today is deliberately a reading day. You're going to study two real, published policies from frontier labs and reverse-engineer their structure — because the fastest way to write a good taxonomy is to see exactly how the people who do this for a living shaped theirs.

Policy-as-document: the Model Spec move

The clearest example of policy-as-document is the OpenAI Model Spec. It's not a vague mission statement — it's a behavioral specification with a chain of command (platform rules > developer rules > user requests > defaults), explicit defaults, and worked examples of how the model should resolve conflicts. Crucially, it states how the model should trade helpfulness against safety, rather than pretending the two never collide.

Core Theory

1 · Hierarchy — whose instruction wins?

A real policy answers "what if the user asks for something the platform forbids?" up front. The Model Spec's chain of command makes the precedence explicit, so the model isn't improvising priority under pressure.
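To make the precedence idea concrete, here is a deliberately simplified sketch of a chain-of-command resolver. It assumes strict top-down override using the four levels summarized above; the published spec is more nuanced than this, and the `Instruction` type, the `resolve` helper, and the topic names are all hypothetical illustrations rather than anything taken from the Model Spec.

```python
from dataclasses import dataclass
from typing import List, Optional

# Precedence order as summarized in this lesson; a lower index wins.
# The level names are illustrative labels, not an official API.
PRECEDENCE = ["platform", "developer", "user", "default"]


@dataclass(frozen=True)
class Instruction:
    source: str   # one of PRECEDENCE
    topic: str    # hypothetical topic key, e.g. "profanity"
    verdict: str  # "allow" or "forbid"


def resolve(instructions: List[Instruction], topic: str) -> Optional[Instruction]:
    """Return the instruction on `topic` from the highest-precedence source."""
    relevant = [i for i in instructions if i.topic == topic]
    if not relevant:
        return None  # nothing speaks to this topic at any level
    return min(relevant, key=lambda i: PRECEDENCE.index(i.source))


# Hypothetical rule set: the user relaxes a default, and also tries to
# override a platform rule.
rules = [
    Instruction("default", "profanity", "forbid"),
    Instruction("user", "profanity", "allow"),
    Instruction("platform", "weapons_synthesis", "forbid"),
    Instruction("user", "weapons_synthesis", "allow"),
]

print(resolve(rules, "profanity").verdict)          # user outranks default
print(resolve(rules, "weapons_synthesis").verdict)  # platform outranks user
```

The point of the sketch is that precedence is decided once, in the `PRECEDENCE` list, instead of being re-argued each time two instructions conflict.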

2 · Defaults — what happens when nothing says otherwise?

Most interactions aren't edge cases. A policy specifies the default posture (assume good intent, be helpful, ask for clarification when a request is ambiguous) so the common path is defined, not accidental.

3 · The helpful↔harmless tension — stated, not hidden

The hardest part of any policy is the borderline. A mature spec says explicitly what to refuse, what to allow, and what to safe-complete (answer partially / with caveats) — because a policy that refuses everything borderline is as broken as one that allows real harm.

Read it as a writer, not a user

As you browse the Model Spec, keep asking: why did they phrase it this way? Notice where a rule is written to be actionable (a reviewer could apply it consistently) versus aspirational. That distinction is the whole craft, and it's exactly what you'll imitate when you build your own taxonomy this week.

From spec to category structure: the AUP

Where the Model Spec governs behavior, an acceptable-use policy governs permitted use — and it's where you'll see the category structure your own taxonomy will mirror. Read the Anthropic Usage Policy and pay attention to how each prohibited category is defined to be actionable: not "don't be harmful," but specific, decidable categories a reviewer can apply without re-litigating intent every time.

That actionability is the difference between a policy that scales to a team and one that lives only in the author's head. A good category definition is one where two trained reviewers, given the same item, reach the same verdict. That property — inter-rater agreement — is what you're really designing for.
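One common way to put a number on inter-rater agreement is Cohen's kappa, which corrects raw agreement for the agreement two reviewers would reach by chance. The lesson does not prescribe a specific metric, so treat this as one reasonable choice; the function name and the sample labels below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each reviewer
    # labelled independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two reviewers on the same eight items.
reviewer_1 = ["violation", "violation", "benign", "benign",
              "violation", "benign", "benign", "benign"]
reviewer_2 = ["violation", "benign", "benign", "benign",
              "violation", "benign", "violation", "benign"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))  # raw agreement is 6/8, but kappa is noticeably lower
```

Here the reviewers agree on 6 of 8 items (75%), yet kappa comes out near 0.47 because much of that agreement is expected by chance. A low kappa on a category is a signal that its definition is not yet decidable enough.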

The components every safety taxonomy needs

By the end of today you should be able to list them from memory: (1) named categories of harm, (2) a precise definition per category, (3) severity tiers, (4) worked examples (and ideally benign look-alikes), and (5) a routing/refusal rule for what to do when each one fires. You'll build all five this week.
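The five components can be sketched as a small data structure with a completeness check. This is only one possible shape, under stated assumptions: the field names, the three-tier scheme, the `Action` values, and the example category are all hypothetical and are not drawn from either published policy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Action(Enum):
    ALLOW = "allow"
    SAFE_COMPLETE = "safe_complete"
    REFUSE = "refuse"


@dataclass
class Category:
    name: str                       # (1) named category of harm
    definition: str                 # (2) precise definition
    severity_tiers: Dict[int, str]  # (3) tier number -> what that tier means
    routing: Dict[int, Action]      # (5) tier number -> what to do when it fires
    worked_examples: List[str] = field(default_factory=list)    # (4)
    benign_lookalikes: List[str] = field(default_factory=list)  # (4, optional)

    def missing_components(self) -> List[str]:
        """List which of the five required components are still empty."""
        missing = []
        if not self.name.strip():
            missing.append("name")
        if not self.definition.strip():
            missing.append("definition")
        if not self.severity_tiers:
            missing.append("severity_tiers")
        if not self.worked_examples:
            missing.append("worked_examples")
        # Every declared tier needs a routing rule.
        unrouted = set(self.severity_tiers) - set(self.routing)
        if unrouted or not self.routing:
            missing.append("routing")
        return missing


# Hypothetical, fully specified category for a coding agent.
complete = Category(
    name="malicious_code",
    definition="Code whose primary purpose is unauthorized access to, "
               "or damage of, systems the requester does not control.",
    severity_tiers={1: "conceptual discussion",
                    2: "partial building blocks",
                    3: "deployable tooling"},
    routing={1: Action.ALLOW, 2: Action.SAFE_COMPLETE, 3: Action.REFUSE},
    worked_examples=["Request for a working ransomware encryptor"],
    benign_lookalikes=["Request to explain how ransomware works for a class"],
)

# Hypothetical first draft with gaps.
draft = Category(name="harassment", definition="",
                 severity_tiers={1: "mild"}, routing={})

print(complete.missing_components())
print(draft.missing_components())
```

Running the check on a draft shows at a glance which components still need writing before a reviewer could apply the category consistently.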

Your work today

Read Two Real Policies + Pick Your Domain

~60 minutes

  1. Browse the OpenAI Model Spec — focus on the chain of command, the defaults, and any example where it resolves the helpful-vs-safe tension. Note one rule that is clearly written to be actionable.
  2. Read the Anthropic Usage Policy and list its prohibited-use categories. For two of them, write why the definition would (or wouldn't) let two reviewers agree.
  3. In a notebook, write down the five components every safety taxonomy needs, then pick the domain you'll write a taxonomy for this week (e.g. a general chat assistant, an image generator, a coding agent). You'll build on this choice every day.

The expert move

A practitioner reads a safety policy to learn the rules. An expert reads it to learn the design decisions: why is this category split from that one, why is the threshold here, why is this phrased to be decidable rather than merely correct? The altitude jump is from following a policy to being able to author and defend one — and the fastest path there is reverse-engineering the best published examples.

Say this in an interview: "I treat a content policy as the contract every downstream filter and reviewer inherits, so I study real specs — the OpenAI Model Spec's chain of command, a lab AUP's category structure — for their design decisions, not just their rules. The test I hold a category to is whether two trained reviewers would agree on it."

Today's Takeaways