Week 4 of 12 · Part A — Applied Safety

The Eval Harness

Locking in Week 4 — and assembling the Part A chain: taxonomy → red-team → eval

Day 20 of 60 · ~50 minutes · Review

What you now hold

Four weeks in, you can do something most people who "care about AI safety" cannot: produce a number you'd defend. You separate evaluation from red-teaming (measurement vs discovery), you measure safety on two axes instead of one, you've built a runnable scorecard that proves the single number lies, and you can design an eval set honest enough to survive its own contamination and degenerate-model checks.

The through-line of Week 4

A safety eval is only as trustworthy as the failure it can't hide. The whole week turns on one move: measure both harmful-compliance and over-refusal, because a model that refuses everything scores perfectly on harmful-compliance and is still useless. A single number can be gamed by refusing more; the two-sided scorecard makes that trade visible.
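To make the two-axis move concrete, here is a minimal sketch of a scorecard. It assumes one common way of defining the axes: harmful-compliance rate is the share of harmful prompts the model answered, and over-refusal rate is the share of benign prompts it refused, with lower being better on both. The `Result` type, the `"harmful"`/`"benign"` labels, and the `scorecard` function are hypothetical names for illustration, not the harness from the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    prompt_class: str  # "harmful" or "benign" (hypothetical labels)
    refused: bool      # True if the model declined to answer


def scorecard(results):
    """Return (harmful_compliance_rate, over_refusal_rate). Lower is better on both."""
    harmful = [r for r in results if r.prompt_class == "harmful"]
    benign = [r for r in results if r.prompt_class == "benign"]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign prompt")
    # Share of harmful prompts the model went along with.
    harmful_compliance = sum(not r.refused for r in harmful) / len(harmful)
    # Share of benign prompts the model wrongly refused.
    over_refusal = sum(r.refused for r in benign) / len(benign)
    return harmful_compliance, over_refusal


# Degenerate model: refuses every prompt, harmful or benign.
refuse_all = [Result("harmful", True)] * 50 + [Result("benign", True)] * 50
hc, orr = scorecard(refuse_all)
print(hc, orr)  # 0.0 1.0
```

Read alone, the first number looks like a perfect safety score. The second number is what shows the model never helps anyone.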

The Part A chain, assembled

This is the week the first part of the track clicks into a single story. Each artifact you built feeds the next, and being able to tell that chain end-to-end is the real deliverable of Part A:

The Chain

1 · Taxonomy (Week 2) — defines "harmful"

Categories and severity tiers turn "is this bad?" into a labelable question. Nothing downstream can be measured until this exists. Your eval's harmful prompt classes are sampled from these categories.

2 · Red-team (Week 3) — finds the failures

Open-ended discovery surfaces how the taxonomy's categories actually show up as failures in this model. Red-team incidents then become eval cases: discovery refills the measurement set.

3 · Eval (Week 4) — measures them repeatably

The two-sided scorecard, run on an honest set, tells you whether the model is safe and still useful — repeatably, so you can compare across models and time. The number a release decision can rest on.
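One way to see the three links as a single pipeline is to sketch the data that flows between them. This is an illustrative sketch only: every type name, field, and the `incident_to_case` helper are assumptions made for the example, and the placeholder strings stand in for real content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:        # Week 2: one taxonomy entry
    name: str
    severity: int      # hypothetical tier, higher means more severe


@dataclass(frozen=True)
class Incident:        # Week 3: one red-team finding
    category: Category
    prompt: str
    model_output: str


@dataclass(frozen=True)
class EvalCase:        # Week 4: one repeatable test case
    prompt: str
    prompt_class: str  # "harmful" or "benign"
    category: str
    severity: int


def incident_to_case(incident):
    """Promote a red-team finding into a labeled harmful eval case."""
    return EvalCase(
        prompt=incident.prompt,
        prompt_class="harmful",
        category=incident.category.name,
        severity=incident.category.severity,
    )


cat = Category(name="example-category", severity=2)
inc = Incident(category=cat, prompt="<red-team prompt>", model_output="<unsafe output>")
case = incident_to_case(inc)
print(case)
```

The eval case inherits its label and severity from the taxonomy and its prompt from the red-team finding, so a gap in either upstream link shows up as a gap in the eval set.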

Why telling the chain is the skill

Anyone can describe one artifact. The hire-able skill is narrating the whole pipeline — how a taxonomy makes red-teaming labelable, how red-team finds make an eval set real, how the eval turns it all into a defensible number — and then naming the weakest link in your own chain and how you'd strengthen it. Owning the chain is owning the safety story end to end.

Self-quiz — can you do these without notes?

Prove the Week

State the difference between red-teaming and evaluation in one sentence each — and describe the loop between them.
Name the two axes of the safety scorecard, the formula for each, and which direction is good. Then explain why a model that refuses everything exposes the single-number lie.
List the four parts of an honest eval set (balanced classes, coverage, contamination controls, metrics + pre-registered pass bar) and why each one keeps the score honest.
Explain how you'd use an LLM-as-judge to scale scoring and which judge biases you'd control for.
Tell the Part A chain (taxonomy → red-team → eval) as one paragraph, then name the weakest link in your own version and how you'd fix it.

Re-skim §1–3 of Evaluating Frontier Models for Dangerous Capabilities if any answer feels shaky.

The expert move

A practitioner ships an eval and reports its number. An expert treats the eval as a claim to be attacked — and can tell the whole Part A chain as one argument, then point unprompted at its weakest link. The altitude jump is from owning an artifact to owning a pipeline: the ability to say not just "the model scored X," but "here's how taxonomy, red-team, and eval combine to make X mean something, and here's exactly where I'd distrust it."

Say this in an interview: "Part A is one chain for me: a taxonomy defines harm, red-teaming discovers how it shows up, and a two-sided eval measures it repeatably — both harmful-compliance and over-refusal, so the number can't be gamed by refusing more. And I'll tell you the weakest link in my own harness before you ask, because the eval I trust is the one I've already tried to break."

Week 4 Takeaways

Evaluation is measurement, red-teaming is discovery — and the two form a loop.
The two-sided scorecard is the week: harmful-compliance AND over-refusal, because one number always lies.
An honest eval set survives contamination, look-alike, and degenerate-model checks, with a pre-registered pass bar.
Part A is one chain — taxonomy → red-team → eval — and the skill is telling it and naming its weakest link.
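As a closing sketch, here is one way a degenerate-model check against a pre-registered pass bar might look. The thresholds in `PASS_BAR` are hypothetical values chosen for illustration, and `rates` and `passes` are assumed helper names. The idea is that both an always-refuse and an always-comply baseline should fail a bar that was fixed before any model was scored.

```python
# Hypothetical pass bar, written down before any model is scored.
PASS_BAR = {"max_harmful_compliance": 0.05, "max_over_refusal": 0.10}


def rates(labels, refused):
    """labels: 'harmful' or 'benign' per prompt; refused: bool per prompt."""
    harmful = [r for lab, r in zip(labels, refused) if lab == "harmful"]
    benign = [r for lab, r in zip(labels, refused) if lab == "benign"]
    harmful_compliance = sum(not r for r in harmful) / len(harmful)
    over_refusal = sum(benign) / len(benign)
    return harmful_compliance, over_refusal


def passes(harmful_compliance, over_refusal, bar=PASS_BAR):
    """A model passes only if it clears the bar on both axes."""
    return (harmful_compliance <= bar["max_harmful_compliance"]
            and over_refusal <= bar["max_over_refusal"])


labels = ["harmful"] * 40 + ["benign"] * 40  # balanced classes
always_refuse = [True] * len(labels)         # degenerate baseline 1
always_comply = [False] * len(labels)        # degenerate baseline 2

degenerate_passes = {
    "always_refuse": passes(*rates(labels, always_refuse)),
    "always_comply": passes(*rates(labels, always_comply)),
}
print(degenerate_passes)  # {'always_refuse': False, 'always_comply': False}
```

If either baseline passed, the set or the bar would be rewarding a model that does no real work, which is the signal to fix the eval before trusting any score from it.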