Week 4 of 12 · Part A — Applied Safety

The Two-Sided Scorecard

A model that refuses everything scores perfectly on one axis and is useless — measure both

Day 17 ~60 minutes Concept

Day 17 of 60

The number that lies

Here is the most important idea in this entire week, and it's almost embarrassingly simple. Suppose you measure a model's safety by one number: how often does it refuse harmful requests? Now build a model that refuses every request — harmful, benign, "what's the capital of France." It scores a perfect 100% on your safety metric. It is also completely useless. Your one number called a brick "maximally safe."

The thesis

Safety is not a single axis. You must measure both harmful-compliance (did it do the bad thing?) and over-refusal (did it wrongly refuse a benign request?). Optimize only the first and you get a model that refuses everything; optimize only the second and you get one that helps with anything. The two-sided scorecard is the point of the whole week — one number can be gamed, the pair cannot.

This is the helpful↔harmless tension from Week 2, finally turned into a measurement. A real safety review never reports one safety number, because a single number always has a degenerate way to max it out. The scorecard exists precisely to make that cheat visible.

The two axes, precisely

Core Theory

Axis 1 · Harmful-compliance (want LOW)

Of the requests that should be refused, how many did the model comply with? This is the axis everyone remembers — the model doing the dangerous thing. Its complement is the safe-refusal rate (want HIGH). Measured on a set of genuinely harmful prompts.

Axis 2 · Over-refusal (want LOW)

Of the requests that are perfectly benign, how many did the model wrongly refuse? This is the axis teams forget. Measured on a set of benign prompts — including benign look-alikes: "how do I kill a Python process," "where can I buy a knife for cooking." A model that fails here is exhausting, drives users away, and erodes trust in safety itself.

Why the second axis is the hard one

Harmful-compliance is easy to care about — the harm is vivid. Over-refusal is quiet: no headline, no incident report, just a slow bleed of usefulness and the cultural cost of safety looking like an obstacle. XSTest (Röttger et al., 2024) exists specifically to measure this exaggerated safety — and the fact that it had to be built tells you how often the second axis gets dropped.

The degenerate-model test

Here's a habit to adopt for any metric you'll ever design, not just this one: ask what's the dumbest model that maxes this score? For a one-axis safety metric, it's "refuse everything." For a one-axis helpfulness metric, it's "comply with everything." The scorecard is exactly the pair of metrics whose degenerate models are opposites — so you can't win both by cheating. That mutual tension is what makes the pair trustworthy.

Carry this further than safety

The two-sided-scorecard instinct generalizes: precision needs recall, sensitivity needs specificity, growth needs retention. Any time someone hands you a single optimization target, find the axis it can be gamed against and measure that too. This is one of the most portable habits in all of evaluation.

Building toward the eval set

To measure both axes you need a balanced eval set: a class of genuinely harmful prompts and a class of benign ones (with deliberate look-alikes that bait over-refusal). You'll design that set properly on Day 19 and score it with code on Day 18. Today, just internalize the shape — two prompt classes, two metrics, one scorecard — because the rest of the week is filling it in.

Your work today

Read + Sketch the Scorecard

~60 minutes

Read §2–4 of XSTest: Identifying Exaggerated Safety Behaviours (Röttger et al., 2024). Note how they construct safe prompts that look unsafe — that's the over-refusal bait.
Skim Challenges in Evaluating AI Systems for why a single score misleads, and connect it to the degenerate-model idea.
Write the two axes in your own words, each with the formula (rate over which prompt class) and which direction is good.
Sketch a 10-prompt balanced eval set on paper: 5 harmful, 5 benign — and make at least two of the benign ones benign look-alikes that a jumpy model would wrongly refuse.

The expert move

A novice reports "our model is 98% safe." An expert refuses to report a single safety number at all, because they know a single number always has a degenerate maximizer — and the one for safety is a model that refuses everything. The altitude jump is from a score to a scorecard: measuring the axis your metric can be gamed against, so the number can't lie by omission.

Say this in an interview: "I never report one safety number, because the dumbest way to max it is to refuse everything — which is useless. I report a two-sided scorecard: harmful-compliance and over-refusal. The first keeps the model safe; the second keeps it from becoming a brick. A safety metric you can game by refusing more isn't a safety metric."

Today's Takeaways

A single safety number lies: "refuse everything" scores 100% safe and is useless.
Measure both axes — harmful-compliance (want low) and over-refusal (want low).
Over-refusal is the forgotten axis; benign look-alikes are how you bait and catch it.
The degenerate-model test — "what's the dumbest model that maxes this?" — is portable beyond safety.