Turning "let's try to break it" into a coverage plan with success criteria, logging, and an escalation path
Day 12 of 60
Yesterday's reframe makes red-teaming defensive. A plan makes it measurable. Without one, a red-team is a pile of ad-hoc attempts: you don't know what you covered, what you skipped, how often you succeeded, or what happens when you find something real. The plan is the document that turns the activity into an operation — and it's the artifact a safety lead actually owns. Today you design one (you'll run the bookkeeping in code tomorrow on Day 13).
A red-team plan answers four questions before anyone attacks anything: what categories will we cover, what counts as a success, what do we log on every attempt, and what happens when we find a real failure. Decide these up front and your results are comparable, auditable, and actionable. Decide them after the fact and you have anecdotes.
List the attack categories from Day 11 (persona pressure, prompt injection, harmful-instruction elicitation, privacy extraction, multimodal evasion) and allocate effort across them deliberately. Coverage is itself a safety metric: an untested category is a blind spot, not a pass. The plan names what you'll probe so gaps are visible instead of accidental.
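One way to make that allocation explicit is to write it down as data and compare it against what was actually run. The sketch below is illustrative only: the category keys follow the Day 11 list, while the attempt budgets and the `coverage_gaps` helper are hypothetical names and numbers, not part of any standard tooling.

```python
# Hypothetical coverage plan. Category names follow the Day 11 list;
# the attempt budgets are placeholder numbers chosen for illustration.
PLANNED_ATTEMPTS = {
    "persona_pressure": 40,
    "prompt_injection": 60,
    "harmful_instruction_elicitation": 50,
    "privacy_extraction": 30,
    "multimodal_evasion": 20,
}


def coverage_gaps(planned, executed):
    """Return {category: attempts still owed} for every under-tested category.

    A category missing from `executed` counts as zero attempts, so an
    untested category shows up as a gap instead of silently passing.
    """
    gaps = {}
    for category, target in planned.items():
        done = executed.get(category, 0)
        if done < target:
            gaps[category] = target - done
    return gaps
```

Calling `coverage_gaps(PLANNED_ATTEMPTS, executed_counts)` at the end of a session lists every category that is still short of its budget, which keeps gaps visible instead of accidental.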
Define, per category, what a defended versus failed outcome looks like, decided against your Week 2 taxonomy and policy — not against vibes. "Attack success rate" (ASR) only means something if "success" is defined consistently across red-teamers. Ambiguous criteria produce ASR numbers nobody can trust.
The minimal defensive record per attempt: category, outcome (defended/failed), severity, timestamp, and a pointer to a secured, access-controlled store for any raw detail. The fields are chosen so coverage and ASR are computable — and so no operational payload sits in the open log.
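As a minimal sketch, that record could be a small immutable type with exactly those five fields. The severity levels, the `AttemptRecord` and `new_record` names, and the `vault://` pointer format in the usage note are assumptions made for illustration; your own policy defines the real values.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import Enum


class Outcome(str, Enum):
    DEFENDED = "defended"
    FAILED = "failed"


class Severity(str, Enum):
    # Hypothetical severity scale; substitute the levels from your policy.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass(frozen=True)
class AttemptRecord:
    category: str
    outcome: Outcome
    severity: Severity
    timestamp: str      # ISO 8601, UTC
    evidence_ref: str   # opaque pointer into the secured store, never raw content


def new_record(category, outcome, severity, evidence_ref, now=None):
    """Build a log record; `now` is injectable so tests can fix the clock."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return AttemptRecord(category, outcome, severity, ts, evidence_ref)
```

For example, `new_record("prompt_injection", Outcome.FAILED, Severity.HIGH, "vault://case/0001")` yields a record that is safe to keep in the open log: it carries what the metrics need and only a pointer to the sealed detail.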
A serious finding needs a route before you find it: who's notified, how it's contained, where the raw evidence is sealed, and how it becomes a ticket against a fix. A red-team that finds a critical failure and has nowhere to send it has wasted the find — and possibly created a disclosure problem.
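The route can be written down before the first attempt as a simple lookup from severity to ordered steps. This is a sketch under stated assumptions: the role names, step labels, and severity levels below are hypothetical placeholders, and a real path would name actual people and systems.

```python
# Hypothetical routing table: every name here is a placeholder.
ESCALATION_ROUTES = {
    "critical": [
        "notify:safety_lead",
        "contain:restrict_affected_feature",
        "seal:evidence_store",
        "ticket:fix_owner",
    ],
    "high": ["notify:safety_lead", "seal:evidence_store", "ticket:fix_owner"],
    "medium": ["seal:evidence_store", "ticket:fix_owner"],
    "low": ["ticket:backlog"],
}


def escalation_steps(outcome, severity):
    """Return the ordered steps for a finding; defended attempts need none."""
    if outcome != "failed":
        return []
    if severity not in ESCALATION_ROUTES:
        # Fail loudly: a finding with an unplanned severity must not be dropped.
        raise ValueError(f"no escalation route defined for severity {severity!r}")
    return list(ESCALATION_ROUTES[severity])
```

Raising on an unknown severity is a deliberate choice in this sketch: a finding that does not fit the plan should stop the process and force a decision, not disappear.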
This mirrors how a lab runs it: OpenAI's external red-teaming program is structured around who is brought in, for which risks, with what process — coverage, criteria, and handling, not improvisation. Your plan is the small version of the same thing.
Coverage = which attack categories you actually tested (and how much). Attack success rate (ASR) = within a category, the fraction of attempts where the model failed to defend. High ASR in a category means a weak spot; zero attempts in a category means a blind spot — and the second is more dangerous because it's invisible.
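Both definitions reduce to counting. The sketch below assumes each logged attempt is reduced to a `(category, outcome)` pair with outcome `"defended"` or `"failed"`; the function name and report shape are hypothetical. Note that an untested category reports an ASR of `None`, not `0.0`, so a blind spot cannot be mistaken for a clean result.

```python
from collections import Counter


def coverage_and_asr(records, planned_categories):
    """Summarize attempts and attack success rate for each planned category.

    records: iterable of (category, outcome) pairs.
    Returns {category: {"attempts": int, "asr": float or None}}.
    """
    attempts = Counter()
    failures = Counter()
    for category, outcome in records:
        attempts[category] += 1
        if outcome == "failed":
            failures[category] += 1

    report = {}
    for category in planned_categories:
        n = attempts[category]
        report[category] = {
            "attempts": n,
            # None marks a blind spot: no attempts means no ASR, not 0% ASR.
            "asr": failures[category] / n if n else None,
        }
    return report
```

With four prompt-injection attempts of which one failed, the report gives that category an ASR of 0.25; a planned category with no attempts appears with `"attempts": 0` and `"asr": None`.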
The plan is also where you protect the people doing the work. Because exposure to harmful content has a real human cost, a responsible plan bakes in safeguards from the start: rotation off heavy categories, exposure limits (caps on time or volume in the worst material), consent and opt-out, and support resources. This isn't separate from the operational design: a red-teamer's well-being is a constraint the plan must satisfy, just as coverage is.
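Exposure limits are easiest to honor when they are checked the same way as any other constraint. A minimal sketch, assuming daily caps on minutes and items in a heavy category; the `ExposureLimits` type, the `may_continue` helper, and the numbers in the usage note are hypothetical, and real caps should come from your well-being protocol, not from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExposureLimits:
    max_minutes_per_day: int
    max_items_per_day: int


def may_continue(minutes_today, items_today, limits, opted_out=False):
    """Return True only if the red-teamer is under both caps and has not opted out."""
    if opted_out:
        # Opt-out always wins, regardless of remaining budget.
        return False
    return (
        minutes_today < limits.max_minutes_per_day
        and items_today < limits.max_items_per_day
    )
```

With `ExposureLimits(max_minutes_per_day=90, max_items_per_day=50)`, a red-teamer at 90 minutes is rotated off the category even if the item count is still low.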
Sketch a plan for one real deployment you know, such as a support assistant, a coding agent, or an image tool. Write the categories you'd cover, one success criterion per category, the five logging fields (category, outcome, severity, timestamp, evidence pointer), the escalation path for a critical find, and a two-line well-being protocol. That's a real, defensible red-team plan on one page.
A novice opens a red-team by attacking; an expert opens by deciding what would count as a finding and how it gets handled. The altitude jump is from generating attempts to designing an operation: coverage you can defend, success criteria tied to a written policy, a log that yields metrics, an escalation path that exists before the first critical find, and safeguards for the humans. Owning the plan means owning whether the results are trustworthy.
Say this in an interview: "Before any attack I write the plan: which attack categories we cover, what counts as success against our policy, the fields we log so coverage and ASR are computable, the escalation path for a real find, and a well-being protocol for the red-teamers. That plan is what makes the results auditable instead of anecdotal."