Categories with precise definitions, severity tiers, and worked examples — the design craft, not the vibe
Day 7 of 60
A harm taxonomy is a structured set of categories of unsafe content, each with a precise definition, a severity tier, and enough worked examples that someone other than you can apply it consistently. It is not a list of bad words and it is not a feeling. It's the schema that makes "harmful" a decidable property instead of an argument. Today you design yours for the domain you picked yesterday.
A taxonomy is good not when its categories are complete but when they're mutually decidable: any single item lands in one obvious bucket, and two trained reviewers agree on which. Coverage matters, but agreement is the property that makes a taxonomy survive contact with a team.
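Agreement between two reviewers can be measured rather than assumed. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, over two reviewers' labels for the same items. The label names and sample data are made up for illustration, and the lesson itself does not prescribe a particular metric.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where both reviewers chose the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each reviewer labeled at random with their own rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two trained reviewers on the same six items.
reviewer_1 = ["hate", "none", "self_harm", "none", "hate", "none"]
reviewer_2 = ["hate", "none", "none", "none", "hate", "self_harm"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 means the categories are being applied the same way by both reviewers; a low value on a particular pair of categories usually points at a definition whose boundary needs rewriting.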
The best way to learn the design moves is to study how real moderation systems were built. The paper A Holistic Approach to Undesired Content Detection in the Real World (Markov et al., 2023) documents exactly this: how OpenAI defined its moderation categories, why the definitions are shaped the way they are, and the labeling decisions behind them. Read §2–3 with one question: what makes their category definitions work as instructions?
"Hate" is an adjective; a category needs a rule. A usable definition states what counts (e.g. content that demeans or incites against a protected group) and, just as importantly, what doesn't (e.g. neutral discussion of the concept of hate, or quoting it to condemn it). The exclusions are where agreement is won or lost.
A taxonomy without severity can't triage. Tiers are the point. The most serious categories (e.g. content abetting violent extremism or endangering children) route to escalation and senior review; mid-tier categories route to confirmation and labeling; low-severity items may be allowed with a note. The tier, not your gut, decides what happens next.
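One way to make the tier decide what happens next is to store the routing as data. This is a minimal sketch: the three tier names, the action strings, and the mid- and low-tier category assignments are assumptions for illustration, not part of any published taxonomy.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MID = 2
    HIGH = 3


# What happens next is a property of the tier, not of the individual reviewer.
NEXT_STEP = {
    Severity.HIGH: "escalate_to_senior_review",
    Severity.MID: "confirm_and_label",
    Severity.LOW: "allow_with_note",
}

# Hypothetical tier assignments; replace with your own domain's categories.
CATEGORY_TIER = {
    "violent_extremism": Severity.HIGH,
    "child_safety": Severity.HIGH,
    "harassment": Severity.MID,
    "profanity": Severity.LOW,
}


def next_step(category: str) -> str:
    """Look up a category's tier and return the action that tier mandates."""
    return NEXT_STEP[CATEGORY_TIER[category]]


def triage_order(flagged: list[str]) -> list[str]:
    """Sort flagged categories so the highest-severity items are handled first."""
    return sorted(flagged, key=lambda c: CATEGORY_TIER[c], reverse=True)


print(next_step("child_safety"))
print(triage_order(["profanity", "child_safety", "harassment"]))
```

Because the routing is a lookup, changing what a tier does is a one-line policy edit rather than a retraining of every reviewer's intuition.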
Every category needs at least two positive examples and — the honors move — a benign look-alike: the thing that resembles the violation but isn't (a medical question that sounds like self-harm; security research that sounds like an attack). The look-alikes are how you prevent over-refusal before it starts.
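A definition with explicit exclusions and a set of worked examples can live in one record per category, which lets you check a draft for gaps mechanically. The sketch below assumes a hypothetical `CategorySpec` shape; the field names are illustrative, and the angle-bracket strings are placeholders for examples you would write and review yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategorySpec:
    """One taxonomy category (hypothetical shape, for illustration only)."""
    name: str
    rule: str                            # what counts, stated as a test
    excludes: tuple[str, ...]            # what explicitly does not count
    positive_examples: tuple[str, ...]   # clear violations
    benign_lookalikes: tuple[str, ...]   # resembles a violation but is not one

    def gaps(self) -> list[str]:
        """List what is still missing before reviewers can apply this category."""
        found = []
        if not self.rule.strip():
            found.append("no rule: the name alone is only an adjective")
        if not self.excludes:
            found.append("no explicit exclusions")
        if len(self.positive_examples) < 2:
            found.append("fewer than two positive examples")
        if not self.benign_lookalikes:
            found.append("no benign look-alike")
        return found


hate = CategorySpec(
    name="hate",
    rule="Content that demeans or incites against a protected group.",
    excludes=(
        "neutral discussion of the concept of hate",
        "quoting hateful content to condemn it",
    ),
    positive_examples=("<reviewed example 1>", "<reviewed example 2>"),
    benign_lookalikes=("<reviewed look-alike>",),
)

self_harm_draft = CategorySpec(
    name="self_harm",
    rule="",
    excludes=(),
    positive_examples=("<reviewed example 1>",),
    benign_lookalikes=("a medical question that sounds like self-harm",),
)

for spec in (hate, self_harm_draft):
    print(spec.name, spec.gaps() or "ready for review")
```

A check like this only confirms that the slots are filled. Whether the exclusions and look-alikes actually sit on the right boundary is still a judgment that reviewers have to test on real items.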
You don't have to invent categories from nothing. Reference structures like the MLCommons AILuminate hazard taxonomy give you an industry-standard starting set of hazard categories. Borrow the structure, then specialize the definitions for your domain — a coding agent's "privacy" category looks different from an image generator's.
A taxonomy that only knows "violation / not violation" is half a policy. The other half is the response rule: for each category and tier, does the model refuse, allow, or safe-complete? Safe-completion is the underrated middle path — answering a borderline request partially, with caveats, or by addressing the legitimate need while declining the harmful part. A policy that can only refuse will over-refuse, and over-refusal is itself a failure (you'll measure it directly in Week 4).
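The response half of the policy can be written as a table keyed by category and tier, which makes two failure modes easy to detect: pairs with no rule at all, and a policy that never does anything but refuse. The categories, tier names, and individual assignments below are hypothetical choices for a coding agent, shown only to illustrate the shape.

```python
from enum import Enum
from itertools import product


class Response(Enum):
    ALLOW = "allow"
    SAFE_COMPLETE = "safe_complete"
    REFUSE = "refuse"


# Hypothetical categories and tiers; substitute your own taxonomy.
CATEGORIES = ("privacy", "malware")
TIERS = ("low", "mid", "high")

RESPONSE_RULES = {
    ("privacy", "low"): Response.ALLOW,
    ("privacy", "mid"): Response.SAFE_COMPLETE,
    ("privacy", "high"): Response.REFUSE,
    ("malware", "low"): Response.ALLOW,
    ("malware", "mid"): Response.SAFE_COMPLETE,
    # ("malware", "high") is deliberately left out to show the gap check.
}


def unmapped_pairs(rules):
    """Return every (category, tier) pair that has no response rule yet."""
    return [pair for pair in product(CATEGORIES, TIERS) if pair not in rules]


def can_only_refuse(rules):
    """True when the policy has no allow or safe-complete path at all."""
    return set(rules.values()) == {Response.REFUSE}


print(unmapped_pairs(RESPONSE_RULES))
print(can_only_refuse(RESPONSE_RULES))
```

Running the gap check before labeling starts is cheaper than discovering mid-project that a whole tier has no agreed response.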
The honors-tier move today is to write your tie-breaker rules for ambiguous cases before you hit them: "when an item could be category A or B, prefer the higher-severity one," or "when intent is unclear, default to safe-complete and flag for review." Edge cases are where policy is actually written — Week 1's Reflection Ritual applies here directly.
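The two example tie-breakers above are simple enough to state as code, which forces you to decide their order before an ambiguous item arrives. This is a sketch under assumptions: the severity numbers, the category names, and the `Decision` shape are hypothetical, and a real policy would likely have more rules than these two.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical severity scores; higher means more serious.
SEVERITY = {"profanity": 1, "harassment": 2, "violent_extremism": 3}


@dataclass(frozen=True)
class Decision:
    category: str
    response_override: Optional[str]  # None means "use the normal response rule"
    flag_for_review: bool


def resolve(candidates: list[str], intent_clear: bool) -> Decision:
    """Apply the tie-breakers to an item that matched one or more categories."""
    if not candidates:
        raise ValueError("resolve() needs at least one candidate category")
    # Rule 1: when an item could be category A or B, prefer the higher-severity one.
    category = max(candidates, key=lambda c: SEVERITY[c])
    # Rule 2: when intent is unclear, default to safe-complete and flag for review.
    if not intent_clear:
        return Decision(category, "safe_complete", True)
    return Decision(category, None, False)


print(resolve(["harassment", "violent_extremism"], intent_clear=True))
print(resolve(["profanity"], intent_clear=False))
```

Writing the rules as an ordered function also exposes conflicts early, for example whether an unclear-intent item in the highest tier should still default to safe-complete or go straight to escalation.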
A practitioner writes categories that feel right. An expert writes categories that are decidable — and proves it by designing the exclusions and benign look-alikes that keep reviewers (and the model) from over-firing. The altitude jump is realizing that a taxonomy's quality lives in its boundaries, not its center: anyone can label the obvious cases, but the policy earns its keep on the look-alikes.
Say this in an interview: "When I author a harm taxonomy I design for inter-rater agreement, not just coverage. Every category gets a precise definition with explicit exclusions, a severity tier so we can triage, worked examples, and a benign look-alike — because over-refusal is a failure too, and the boundary cases are where the policy actually lives."