Week 3 of 12 · Part A — Applied Safety

Designing a Red-Team Plan

Turning "let's try to break it" into a coverage plan with success criteria, logging, and an escalation path

Day 12 · ~60 minutes · Concept

Day 12 of 60

A red-team without a plan is just poking

Yesterday's reframe makes red-teaming defensive. A plan makes it measurable. Without one, a red-team is a pile of ad-hoc attempts: you don't know what you covered, what you skipped, how often you succeeded, or what happens when you find something real. The plan is the document that turns the activity into an operation — and it's the artifact a safety lead actually owns. Today you design one (you'll run the bookkeeping in code tomorrow on Day 13).

The thesis

A red-team plan answers four questions before anyone attacks anything: what categories will we cover, what counts as a success, what do we log on every attempt, and what happens when we find a real failure. Decide these up front and your results are comparable, auditable, and actionable. Decide them after the fact and you have anecdotes.

The four parts of a red-team plan

Core Theory

1 · Coverage — which attack categories, and how much?

List the attack categories from Day 11 (persona pressure, prompt injection, harmful-instruction elicitation, privacy extraction, multimodal evasion) and allocate effort across them deliberately. Coverage is itself a safety metric: an untested category is a blind spot, not a pass. The plan names what you'll probe so gaps are visible instead of accidental.
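One way to make the allocation deliberate is to write it down as data. The sketch below is a minimal illustration, not a prescribed format: the category keys mirror the Day 11 list, but the weights, the 200-attempt budget, and the helper name `allocate_attempts` are all assumptions made up for the example.

```python
# Hypothetical effort weights per attack category (illustrative numbers only).
EFFORT_WEIGHTS = {
    "persona_pressure": 2,
    "prompt_injection": 3,
    "harmful_instruction_elicitation": 3,
    "privacy_extraction": 1,
    "multimodal_evasion": 1,
}


def allocate_attempts(weights: dict[str, int], total_attempts: int) -> dict[str, int]:
    """Split an attempt budget across categories in proportion to their weights."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError("every planned category needs a positive weight")
    weight_sum = sum(weights.values())
    # Floor each share first, then give leftover attempts to the heaviest categories.
    plan = {cat: total_attempts * w // weight_sum for cat, w in weights.items()}
    leftover = total_attempts - sum(plan.values())
    for cat in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        plan[cat] += 1
    # A category with zero attempts is a blind spot, so refuse to emit such a plan.
    untested = [cat for cat, n in plan.items() if n == 0]
    if untested:
        raise ValueError(f"budget too small, untested categories: {untested}")
    return plan


plan = allocate_attempts(EFFORT_WEIGHTS, total_attempts=200)
for category, attempts in plan.items():
    print(f"{category}: {attempts} attempts")
```

The design choice worth noting is the final check: the helper raises instead of quietly returning a plan that leaves a category untested, so a gap has to be a decision someone made.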

2 · Success criteria — what counts as "the attack succeeded"?

Define, per category, what a defended versus failed outcome looks like, judged against your Week 2 taxonomy and written policy rather than against vibes. "Attack success rate" (ASR) only means something if "success" is defined consistently across red-teamers; ambiguous criteria produce ASR numbers nobody can trust.

3 · Logging fields — what do we record on every attempt?

The minimal defensive record per attempt: category, outcome (defended/failed), severity, timestamp, and a pointer to a secured, access-controlled store for any raw detail. The fields are chosen so coverage and ASR are computable — and so no operational payload sits in the open log.
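The record described above could be sketched as a small immutable data type. This is one possible shape, not a standard schema: the field names follow the list in the text, while the four-level severity scale, the `make_record` helper, and the `secure-store://` reference format are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class Outcome(str, Enum):
    DEFENDED = "defended"
    FAILED = "failed"


class Severity(str, Enum):
    # Assumed four-level scale; use whatever scale your policy defines.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass(frozen=True)
class AttemptRecord:
    category: str
    outcome: Outcome
    severity: Severity
    timestamp: datetime
    # Opaque pointer into the access-controlled store; never the raw payload itself.
    evidence_ref: str


def make_record(category: str, outcome: Outcome, severity: Severity,
                evidence_ref: str) -> AttemptRecord:
    """Build a record stamped with the current UTC time."""
    return AttemptRecord(category, outcome, severity,
                         datetime.now(timezone.utc), evidence_ref)


record = make_record(
    "prompt_injection", Outcome.FAILED, Severity.HIGH,
    evidence_ref="secure-store://findings/0001",  # hypothetical reference format
)
print(asdict(record))
```

Because the record holds only a reference to the raw detail, the open log can be shared for metrics without exposing any operational payload.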

4 · Escalation path — what happens on a real find?

A serious finding needs a route before you find it: who's notified, how it's contained, where the raw evidence is sealed, and how it becomes a ticket against a fix. A red-team that finds a critical failure and has nowhere to send it has wasted the find — and possibly created a disclosure problem.
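A route that exists "before you find it" can be as simple as a lookup table agreed in advance. The following is a hedged sketch: the four fields correspond to the four questions in the paragraph above, but every role, store, and queue name is a placeholder, and the three severity levels are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRoute:
    notify: tuple[str, ...]  # who is notified
    containment: str         # how the finding is contained
    evidence_store: str      # where the raw evidence is sealed
    ticket_queue: str        # where the fix is tracked


# Hypothetical routes; substitute your own roles and systems.
ROUTES = {
    "low": EscalationRoute(
        ("red-team-lead",), "none required",
        "sealed-evidence-store", "safety-backlog"),
    "high": EscalationRoute(
        ("red-team-lead", "safety-lead"), "restrict finding to need-to-know",
        "sealed-evidence-store", "safety-urgent"),
    "critical": EscalationRoute(
        ("red-team-lead", "safety-lead", "incident-on-call"),
        "pause further probing of this vector",
        "sealed-evidence-store", "safety-incident"),
}


def route_for(severity: str) -> EscalationRoute:
    """Return the pre-agreed route; unknown severities get the most cautious one."""
    return ROUTES.get(severity, ROUTES["critical"])


print(route_for("critical").notify)
```

Falling back to the critical route for an unrecognized severity is a deliberate choice in this sketch: a mislabeled finding should be over-escalated, not dropped.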

This mirrors how a lab runs it: OpenAI's external red-teaming program is structured around who is brought in, for which risks, with what process — coverage, criteria, and handling, not improvisation. Your plan is the small version of the same thing.

Coverage and ASR, defined

Coverage = which attack categories you actually tested (and how much). Attack success rate (ASR) = within a category, the fraction of attempts where the model failed to defend. High ASR in a category means a weak spot; zero attempts in a category means a blind spot — and the second is more dangerous because it's invisible.
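Both definitions reduce to counting. The sketch below computes them from a list of attempt records, assuming each record is a dict with `category` and `outcome` keys; the sample records and the function name `coverage_and_asr` are invented for illustration and are not results from any real exercise.

```python
from collections import Counter

PLANNED = [
    "persona_pressure",
    "prompt_injection",
    "harmful_instruction_elicitation",
    "privacy_extraction",
    "multimodal_evasion",
]

# Illustrative records only; real ones come from your attempt log.
records = [
    {"category": "prompt_injection", "outcome": "failed"},
    {"category": "prompt_injection", "outcome": "defended"},
    {"category": "prompt_injection", "outcome": "defended"},
    {"category": "prompt_injection", "outcome": "defended"},
    {"category": "persona_pressure", "outcome": "defended"},
    {"category": "persona_pressure", "outcome": "defended"},
]


def coverage_and_asr(records, planned):
    """Per planned category: number of attempts and fraction that failed."""
    attempts = Counter(r["category"] for r in records)
    failures = Counter(r["category"] for r in records if r["outcome"] == "failed")
    report = {}
    for cat in planned:
        n = attempts[cat]
        # ASR is undefined (None) with zero attempts: a blind spot, not a 0% pass.
        report[cat] = {"attempts": n, "asr": failures[cat] / n if n else None}
    return report


report = coverage_and_asr(records, PLANNED)
blind_spots = [cat for cat, row in report.items() if row["attempts"] == 0]
print(report)
print("blind spots:", blind_spots)
```

Returning `None` for an untested category, instead of `0.0`, keeps the two cases apart in any downstream report: a category that defended every attempt and a category nobody tried should never look the same.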

Build the well-being protocol into the plan

The plan is also where you protect the people doing the work. Because exposure to harmful content has a real human cost, a responsible plan bakes in safeguards from the start: rotation off heavy categories, exposure limits (caps on time or volume in the worst material), consent and opt-out, and support resources. This isn't separate from the operational design: a red-teamer's well-being is a constraint the plan must satisfy, just as coverage is.
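Treating well-being as a constraint means an assignment can be checked against it before it happens. Here is a minimal sketch of such a check, covering only the exposure-limit and opt-out parts of the protocol. The 240-minute weekly cap is a placeholder and not a recommendation, and the names `ExposurePolicy` and `can_assign` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExposurePolicy:
    heavy_categories: frozenset[str]
    max_heavy_minutes_per_week: int


def can_assign(policy: ExposurePolicy, category: str, heavy_minutes_so_far: int,
               session_minutes: int, opted_out: frozenset[str]) -> bool:
    """Check one proposed session against opt-outs and the weekly heavy-content cap."""
    if category in opted_out:
        return False  # consent comes first: an opt-out is never overridden
    if category not in policy.heavy_categories:
        return True   # the cap applies only to categories marked heavy
    return heavy_minutes_so_far + session_minutes <= policy.max_heavy_minutes_per_week


# Placeholder values; set real limits with the people affected.
policy = ExposurePolicy(
    heavy_categories=frozenset({"harmful_instruction_elicitation"}),
    max_heavy_minutes_per_week=240,
)

print(can_assign(policy, "harmful_instruction_elicitation", 200, 60, frozenset()))
```

When the check returns False, the plan's rotation rule decides what happens next, typically moving the red-teamer to a lighter category for the rest of the period.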

Make it concrete

Sketch a plan for one real deployment you know, such as a support assistant, a coding agent, or an image tool. Write the categories you'd cover, one success criterion per category, the logging fields (category, outcome, severity, timestamp, and a pointer to secured evidence), the escalation path for a critical find, and a two-line well-being protocol. That's a real, defensible red-team plan on one page.

Your work today

Draft a Red-Team Plan

~60 minutes

Re-read Red Teaming Language Models to Reduce Harms (Ganguli et al.), focusing on how the authors define attacks and protect red-teamers, and borrow their structure.
Skim OpenAI's Approach to External Red Teaming for how coverage and handling are structured as a program.
Write a one-page plan for one deployment: coverage (categories + effort), per-category success criteria, the logging fields, an escalation path, and a well-being protocol (rotation + exposure limits).

The expert move

A novice opens a red-team by attacking; an expert opens by deciding what would count as a finding and how it gets handled. The altitude jump is from generating attempts to designing an operation: coverage you can defend, success criteria tied to a written policy, a log that yields metrics, an escalation path that exists before the first critical find, and safeguards for the humans. Owning the plan means owning whether the results are trustworthy.

Say this in an interview: "Before any attack I write the plan: which attack categories we cover, what counts as success against our policy, the fields we log so coverage and ASR are computable, the escalation path for a real find, and a well-being protocol for the red-teamers. That plan is what makes the results auditable instead of anecdotal."

Today's Takeaways

A red-team plan answers four things up front: coverage, success criteria, logging fields, escalation path.
Coverage is a metric — an untested category is a blind spot, not a pass; ASR only means something with consistent criteria.
Success is judged against your Week 2 taxonomy and policy, not against gut feeling.
The plan carries the well-being protocol — rotation and exposure limits — as a first-class constraint.