Week 4 of 12 · Part A — Applied Safety

Building the Safety Eval

Turning the two-sided scorecard from an idea into a runnable, defensible number

Day 18 ~75 minutes Build

Day 18 of 60

From scorecard to code

Yesterday the two-sided scorecard was an argument. Today it's a program. safety_eval.py takes a set of labeled rows — each one a prompt tagged harmful or benign and a record of whether the model refused — and computes the three numbers that a safety review actually needs: safe-refusal rate, harmful-compliance rate, and over-refusal rate. Running it makes the central lesson of the week impossible to forget.

The thesis

The eval's whole reason to exist is in its last two lines: a model that refuses everything scores 100% safe-refusal AND 100% over-refusal. One number lies; the scorecard tells the truth. Building it yourself — watching both rates move — is how the idea stops being a slogan and becomes an instinct.

What the code actually does

Core Theory

1 · Split by the ground truth, not the model

Every row carries request_is_harmful — the label you assigned, the ground truth. The code splits rows into the harmful set and the benign set by that label, never by what the model did. The model's behavior is what you're scoring, so it can't define the buckets.

2 · Two denominators, two rates

Safe-refusal and harmful-compliance are rates over the harmful rows. Over-refusal is a rate over the benign rows. Different denominators — that's why a single overall "accuracy" blurs the two failures together and hides the one teams forget.

3 · The degenerate check, in code

Feed it a "refuse everything" transcript and watch safe-refusal hit 100% while over-refusal also hits 100%. The scorecard surfaces the cheat the single number conceals. That's the demonstration, not a footnote.

Make it yours

Replace the toy rows with 12–20 of your own labeled cases — and seed at least three benign look-alikes ("how do I kill a stuck process," "what household chemicals should I never mix") so the over-refusal axis has something real to catch. The set you build here becomes the seed of your Day 19 eval-set design.

Scaling the scoring: the LLM-as-judge

The toy version assumes you already know whether the model "refused." At real scale, a human can't read every transcript, so teams use an LLM-as-judge: another model reads each response and labels refused/complied. It scales scoring — but it imports the judge's own biases (it may favor verbose answers, or its own house style, or mislabel a polite hedge as a refusal). The honest move is to calibrate the judge against human labels on a sample before you trust it on the rest.

Honors: wire in a judge stub

Add a judge(response) -> "refused" | "complied" stub (even a keyword heuristic), run it over your rows, and compare its labels to your own on ~10 items. Where it disagrees with you is exactly where the judge's bias lives — and that disagreement rate is a number you'd report alongside the scorecard, because a judge you haven't calibrated is a confident guess. Challenges in Evaluating AI Systems covers judge bias directly.

Your work today

Build the Scorecard

~75 minutes

  1. Run safety_eval.py from the Try This box and read all three rates plus the final two lines about the model that refuses everything.
  2. Replace rows with 12–20 of your own labeled cases, including 3+ benign look-alikes that bait over-refusal. Re-run and read the scorecard.
  3. Construct a "refuses everything" version of your rows and confirm safe-refusal and over-refusal both hit 100% — write one sentence on why that proves the single number misleads.
  4. (Honors) Add a judge() stub, label your rows with it, and compare to your own labels on 10 items. Note where it's biased. For the rigor bar, skim §1–3 of Evaluating Frontier Models for Dangerous Capabilities on how labs report eval results honestly. Browse HELM to see a serious suite report many axes at once instead of one score.
The expert move

A beginner writes an eval that prints a number. An expert writes an eval whose structure embeds the failure mode it's defending against — the scorecard reports both axes precisely because a single number has a degenerate maximizer, and the code proves it by showing the "refuse everything" model failing both axes at once. And the expert never trusts an LLM-judge they haven't calibrated against human labels, because an uncalibrated judge just launders its own bias into your headline metric.

Say this in an interview: "My safety eval reports safe-refusal, harmful-compliance, and over-refusal — three numbers, two denominators — because the structure itself blocks the 'refuse everything' cheat. When I scale scoring with an LLM-as-judge, I calibrate it against human labels on a sample first and report the disagreement rate, so the judge's bias is visible instead of baked silently into the result."

Today's Takeaways