Week 4 of 12 · Part A — Applied Safety

Designing an Honest Eval Set

The code is easy — the prompts, the controls, and the pass bar are where evals earn trust

Day 19 ~75 minutes Concept

Day 19 of 60

The scorecard is only as honest as the set under it

Day 18's code is twenty lines. The hard part of evaluation was never the arithmetic — it's the set the arithmetic runs over. A two-sided scorecard computed on a biased, contaminated, or unrepresentative set produces a number that's precise and wrong. Today you design the eval set properly: the prompt classes, the controls that keep it honest, the metrics, and the pass bar you'd defend to a release committee.

The thesis

An eval set is a sampling claim: "scoring well on these prompts predicts safe behavior in deployment." Every design decision either strengthens or quietly breaks that claim. The flattering-but-broken eval — green checkmark, dangerous model — almost always fails at the set, not the code.

The anatomy of an honest eval set

Core Theory

1 · Balanced prompt classes

A harmful class and a benign class — because the scorecard has two axes. The benign class must include benign look-alikes (surface features of harm, fully legitimate intent), or your over-refusal number is measured on softballs and reads better than reality.

2 · Coverage, not convenience

Span your harm taxonomy (Week 2) across categories and severities, and span benign requests across the domains real users actually bring. A set that's all one category measures one thing and implies everything.

3 · Contamination controls

Could these prompts be in the model's training data? Use freshly authored or held-out items, paraphrase known benchmarks, and keep a private split you never publish. A leaked set measures memorization and silently inflates the score.

4 · Metrics + a justified pass bar

Report safe-refusal, harmful-compliance, and over-refusal — and justify each. Then set the bar before you see results (e.g. "harmful-compliance < 2%, over-refusal < 10%"). A pass bar chosen after the fact isn't a bar; it's a rationalization.

Why the pass bar comes first

If you decide what counts as "passing" after seeing the numbers, you'll unconsciously fit the bar to the result you want. Pre-registering the bar — and the metrics, and why each matters — is what makes the eval a decision tool instead of a justification. This is the same discipline that keeps red-team rankings honest in Week 3.

Detecting a flattering-but-broken eval

The dangerous eval isn't the one that fails — it's the one that passes a model it shouldn't. Learn to smell it. Warning signs: a suspiciously high safe-refusal rate with an over-refusal axis that's missing or measured on trivial prompts; benign prompts with no real look-alikes; a public set the model likely trained on; a pass bar that appeared after the results. Your job designing an eval is to try to break your own set before it breaks a release decision.

Red-team your own eval

For your draft eval set, write down the single way it could pass a genuinely unsafe model. That sentence is your eval's weakest link — and naming it is the difference between an eval you trust and one you merely hope is fine. (You'll name this weakest link again tomorrow in the Week 4 review.)

Judge biases, written down

If your design scales scoring with an LLM-as-judge (Day 18), the design doc must name the biases you'll control for: position/verbosity preference, self-preference for the judge's own style, leniency on polite refusals, and sensitivity to formatting. The control is calibration against human labels plus reporting the human-judge disagreement rate. An eval doc that uses a judge but doesn't account for its bias is hiding a variable in plain sight.

Your work today

Write the Eval-Set Design Doc

~75 minutes

Re-read §2–4 of XSTest for how a benign-look-alike class is constructed, and skim Challenges in Evaluating AI Systems for the contamination and judge-bias sections.
Write a one-page eval-set design doc: the harmful + benign prompt classes (with look-alikes), how many per class and why, and the taxonomy coverage you're claiming.
Specify contamination controls (held-out / freshly authored / private split) and justify each of the three metrics you'll report.
Pre-register the pass bar — exact thresholds for each metric — before running anything, and add a note on which judge biases you'd control for.
Finish with the "flattering-but-broken" sentence: the one way this set could pass an unsafe model. Browse HELM for how a mature suite documents many metrics and their rationale.

The expert move

A novice grades their eval by whether it runs. An expert grades it by whether it could pass a model that doesn't deserve to pass — and pre-registers metrics and the pass bar before seeing a single result, so the eval is a decision tool, not a rationalization. The altitude jump is from "I scored the model" to "I can defend that this score predicts deployment safety, and here's the one way it might not."

Say this in an interview: "I treat an eval set as a sampling claim and attack it before I trust it: balanced harmful and benign classes with real look-alikes, contamination controls, metrics and a pass bar I pre-register before seeing results. The question I'm really answering isn't 'did the model pass' — it's 'could this set pass a model that shouldn't,' and I write that weakest link down."

Today's Takeaways

An eval set is a sampling claim; the honesty lives in the set, not the code.
Anatomy: balanced classes (with look-alikes), coverage, contamination controls, justified metrics + a pre-registered pass bar.
The dangerous eval is the flattering-but-broken one — name the way it could pass an unsafe model.
If you use an LLM-judge, document its biases and your calibration in the design doc.