Week 2 of 12 · Part A — Applied Safety

Building a Harm Taxonomy

Categories with precise definitions, severity tiers, and worked examples — the design craft, not the vibe

Day 7 of 60 · ~70 minutes · Concept

What a taxonomy actually is

A harm taxonomy is a structured set of categories of unsafe content, each with a precise definition, a severity tier, and enough worked examples that someone other than you can apply it consistently. It is not a list of bad words and it is not a feeling. It's the schema that makes "harmful" a decidable property instead of an argument. Today you design yours for the domain you picked yesterday.

The thesis

A taxonomy is good not when its categories are complete but when they're mutually decidable: any single item lands in one obvious bucket, and two trained reviewers agree on which. Coverage matters, but agreement is the property that makes a taxonomy survive contact with a team.
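Agreement can be measured rather than asserted. This lesson does not prescribe a metric; one common choice is Cohen's kappa, which corrects the raw agreement rate between two reviewers for the agreement they would reach by chance. The sketch below is a minimal implementation under that assumption. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both picked the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each reviewer labeled at
    # random according to their own category frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two reviewers on the same six items.
reviewer_1 = ["hate", "hate", "self_harm", "benign", "benign", "privacy"]
reviewer_2 = ["hate", "benign", "self_harm", "benign", "benign", "privacy"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))  # 0.769
```

A value near 1 suggests reviewers land in the same bucket. When the score is low, the disagreements usually cluster on one pair of categories, which points at the definition or exclusion that needs rewriting.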

The best way to learn the design moves is to study how real moderation systems were built. The paper A Holistic Approach to Undesired Content Detection in the Real World (Markov et al., 2023) documents exactly this: how OpenAI defined its moderation categories, why the definitions are shaped the way they are, and the labeling decisions behind them. Read §2–3 with one question: what makes their category definitions work as instructions?

The three design moves

Core Theory

1 · Precise definitions — write rules, not adjectives

"Hate" is an adjective; a category needs a rule. A usable definition states what counts (e.g. content that demeans a protected group or incites hatred against it) and, just as importantly, what doesn't (e.g. neutral discussion of the concept of hate, or quoting hateful content in order to condemn it). The exclusions are where agreement is won or lost.

2 · Severity tiers — because not all violations are equal

A taxonomy without severity can't triage, because it gives every violation the same response. The most serious categories (e.g. content that abets violent extremism or endangers children) route to escalation and senior review; mid-tier categories route to confirmation and labeling; low-severity items may be allowed with a note. The tier, not your gut, decides what happens next.
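The routing described above can be written down as a lookup table, so that the tier alone determines the next step. This is a minimal sketch: the three tier names and the route identifiers are illustrative assumptions, not a prescribed scheme.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical three-tier scale; a higher value is more serious."""
    LOW = 1
    MID = 2
    HIGH = 3


# Assumed route names, mirroring the three outcomes in the paragraph above.
ROUTES = {
    Severity.HIGH: "escalate_to_senior_review",
    Severity.MID: "confirm_and_label",
    Severity.LOW: "allow_with_note",
}


def route(tier: Severity) -> str:
    """Return the next step for an item, decided by its tier alone."""
    return ROUTES[tier]


print(route(Severity.MID))  # confirm_and_label
```

Keeping the table separate from the category definitions means a routing change (say, sending mid-tier items to escalation) is a one-line edit that does not touch the definitions.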

3 · Worked examples + benign look-alikes

Every category needs at least two positive examples and — the honors move — a benign look-alike: the thing that resembles the violation but isn't (a medical question that sounds like self-harm; security research that sounds like an attack). The look-alikes are how you prevent over-refusal before it starts.
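Taken together, the three moves describe a record with required fields. The sketch below shows one possible shape for that record, with a check that reports which design move is still missing. The class, field names, and tier labels are assumptions for illustration, and the severity assigned to the sample category is arbitrary.

```python
from dataclasses import dataclass, field
from typing import List

VALID_TIERS = ("low", "mid", "high")  # assumed tier labels


@dataclass
class Category:
    name: str
    counts: str                # rule stating what is a violation
    exclusions: List[str]      # what explicitly does not count
    severity: str              # one of VALID_TIERS
    positive_examples: List[str] = field(default_factory=list)
    benign_lookalikes: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the design gaps in this category (empty list if none)."""
        found = []
        if not self.exclusions:
            found.append("no exclusion: the definition only says what counts")
        if self.severity not in VALID_TIERS:
            found.append(f"unknown severity tier: {self.severity!r}")
        if len(self.positive_examples) < 2:
            found.append("fewer than two positive examples")
        if not self.benign_lookalikes:
            found.append("no benign look-alike")
        return found


hate = Category(
    name="hate",
    counts="content that demeans or incites against a protected group",
    exclusions=[
        "neutral discussion of the concept of hate",
        "quoting hateful content in order to condemn it",
    ],
    severity="high",  # illustrative choice only
    positive_examples=["<worked example 1>", "<worked example 2>"],
    benign_lookalikes=["a news report quoting a slur to criticize its use"],
)
unfinished = Category(name="privacy", counts="TODO", exclusions=[], severity="mid")

print(hate.problems())        # []
print(unfinished.problems())  # three gaps: exclusion, examples, look-alike
```

Storing exclusions and look-alikes as first-class fields, rather than as prose inside the definition, makes an incomplete category easy to spot before reviewers start labeling with it.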

Mirror a real category structure

You don't have to invent categories from nothing. Reference structures like the MLCommons AILuminate hazard taxonomy give you an industry-standard starting set of hazard categories. Borrow the structure, then specialize the definitions for your domain — a coding agent's "privacy" category looks different from an image generator's.

The refusal-vs-safe-complete rule

A taxonomy that only knows "violation / not violation" is half a policy. The other half is the response rule: for each category and tier, does the model refuse, allow, or safe-complete? Safe-completion is the underrated middle path — answering a borderline request partially, with caveats, or by addressing the legitimate need while declining the harmful part. A policy that can only refuse will over-refuse, and over-refusal is itself a failure (you'll measure it directly in Week 4).
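One way to make the response rule concrete is a table keyed by (category, tier), plus a check that every cell has been filled in. The sketch below assumes two hypothetical categories and three tiers; the specific actions assigned are placeholders, not recommendations.

```python
from itertools import product

ACTIONS = {"refuse", "allow", "safe_complete"}
CATEGORIES = ["violent_extremism", "privacy"]  # hypothetical category names
TIERS = ["low", "mid", "high"]                 # hypothetical tier labels

# Placeholder policy: each (category, tier) cell maps to exactly one action.
RESPONSE_RULES = {
    ("violent_extremism", "high"): "refuse",
    ("violent_extremism", "mid"): "safe_complete",
    ("violent_extremism", "low"): "allow",
    ("privacy", "high"): "refuse",
    ("privacy", "mid"): "safe_complete",
    # ("privacy", "low") is left out on purpose to show the gap check.
}


def missing_rules(categories, tiers, rules):
    """List every (category, tier) cell that has no response rule yet."""
    return [pair for pair in product(categories, tiers) if pair not in rules]


def respond(category, tier, rules=RESPONSE_RULES):
    """Look up the action; a KeyError exposes an undefined cell."""
    action = rules[(category, tier)]
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return action


print(missing_rules(CATEGORIES, TIERS, RESPONSE_RULES))  # [('privacy', 'low')]
print(respond("privacy", "mid"))                          # safe_complete
```

Failing loudly on a missing cell is a deliberate choice in this sketch: a silent default to "refuse" would hide exactly the over-refusal the paragraph above warns about.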

Write the tie-breakers now

The honors-tier move today is to write your tie-breaker rules for ambiguous cases before you hit them: "when an item could be category A or B, prefer the higher-severity one," or "when intent is unclear, default to safe-complete and flag for review." Edge cases are where policy is actually written — Week 1's Reflection Ritual applies here directly.
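The two sample tie-breakers above are simple enough to encode directly. The sketch below is one possible reading of them: candidate categories arrive with a numeric severity rank, and intent is a boolean supplied by the reviewer. The names, the rank scale, and the alphabetical fallback for equal severities are assumptions added for the example.

```python
from typing import Dict, NamedTuple


class Decision(NamedTuple):
    category: str
    force_safe_complete: bool
    flag_for_review: bool


def break_tie(candidates: Dict[str, int], intent_clear: bool) -> Decision:
    """Apply the two sample tie-breakers to an ambiguous item.

    candidates maps each plausible category name to its severity rank
    (higher is more severe).
    """
    if not candidates:
        raise ValueError("at least one candidate category is required")
    # Rule 1: when an item could be A or B, prefer the higher-severity one.
    # Equal severities fall back to name order (an assumption) so the
    # result is deterministic.
    category = max(sorted(candidates), key=candidates.get)
    # Rule 2: when intent is unclear, default to safe-complete and flag.
    unclear = not intent_clear
    return Decision(category, force_safe_complete=unclear, flag_for_review=unclear)


print(break_tie({"privacy": 2, "harassment": 3}, intent_clear=True))
# Decision(category='harassment', force_safe_complete=False, flag_for_review=False)
```

Writing the rules as code forces a decision about cases the prose leaves open, such as two candidates with equal severity, which is the kind of edge case worth settling before reviewers hit it.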

Your work today

Draft Your Taxonomy

~70 minutes

Read §2–3 of A Holistic Approach to Undesired Content Detection in the Real World. Note two design decisions in how they defined categories that you'll borrow.
Browse the MLCommons AILuminate hazard categories and pick a starting set for your domain. Re-read the relevant categories of the Anthropic Usage Policy for definition phrasing.
Draft at least 5 categories. For each: a precise definition (with at least one exclusion), a severity tier, and 2+ worked examples. Add a benign look-alike to at least two of them today; the finished taxonomy should have one for every category.
Write your refusal-vs-safe-complete rule and your tie-breaker rules for ambiguous cases.
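If you keep your draft in a structured form, today's minimums can be checked mechanically. The sketch below assumes each category is a plain dict with the keys shown; the key names and the function are hypothetical, and passing the check says nothing about whether the definitions are any good.

```python
def check_draft(categories, response_rule, tie_breakers):
    """Return the gaps between a draft and today's stated minimums."""
    gaps = []
    if len(categories) < 5:
        gaps.append(f"only {len(categories)} categories; need at least 5")
    for cat in categories:
        name = cat.get("name", "<unnamed>")
        if not cat.get("definition"):
            gaps.append(f"{name}: missing definition")
        if not cat.get("exclusions"):
            gaps.append(f"{name}: definition has no exclusion")
        if not cat.get("severity"):
            gaps.append(f"{name}: missing severity tier")
        if len(cat.get("examples", [])) < 2:
            gaps.append(f"{name}: fewer than 2 worked examples")
    with_lookalike = sum(1 for cat in categories if cat.get("lookalikes"))
    if with_lookalike < 2:
        gaps.append("fewer than two categories have a benign look-alike")
    if not response_rule.strip():
        gaps.append("no refusal-vs-safe-complete rule")
    if not tie_breakers:
        gaps.append("no tie-breaker rules")
    return gaps


# Hypothetical draft: five placeholder categories, two with look-alikes.
draft = [
    {
        "name": f"category_{i}",
        "definition": "<rule>",
        "exclusions": ["<what does not count>"],
        "severity": "mid",
        "examples": ["<example 1>", "<example 2>"],
        "lookalikes": ["<benign look-alike>"] if i < 2 else [],
    }
    for i in range(5)
]
print(check_draft(draft, "<response rule>", ["<tie-breaker>"]))  # []
```

An empty list means the draft meets the minimums for today's exercise; whether two reviewers would actually agree on it is a separate question that only labeling real items can answer.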

The expert move

A practitioner writes categories that feel right. An expert writes categories that are decidable — and proves it by designing the exclusions and benign look-alikes that keep reviewers (and the model) from over-firing. The altitude jump is realizing that a taxonomy's quality lives in its boundaries, not its center: anyone can label the obvious cases, but the policy earns its keep on the look-alikes.

Say this in an interview: "When I author a harm taxonomy I design for inter-rater agreement, not just coverage. Every category gets a precise definition with explicit exclusions, a severity tier so we can triage, worked examples, and a benign look-alike — because over-refusal is a failure too, and the boundary cases are where the policy actually lives."

Today's Takeaways

A taxonomy makes "harmful" decidable: categories, precise definitions, severity tiers, worked examples.
Write rules, not adjectives — and the exclusions matter as much as what counts.
Severity tiers are how a taxonomy triages; without them it can't route.
Pair refusal with safe-completion, and write tie-breakers before you hit the edge cases.