Week 3 of 12 · Part A — Applied Safety

Coverage & Attack-Success, Measured

Turning a red-team operation into numbers — per-category ASR and the blind spots you haven't tested

Day 13 · ~75 minutes · Build

Day 13 of 60

From a plan to a measurement

Yesterday you designed a red-team plan. Today you build the bookkeeping that turns it into numbers a review can act on: per-category attack-success rate, and an explicit flag for the categories nobody tested. The headline insight of the build is that a red-team exercise exposes two kinds of risk: the categories where you tested and the model failed (visible weakness), and the categories you never tested at all (invisible weakness). A good report surfaces both, because the second is the one teams forget.

The thesis

An attack log is not the deliverable; the coverage-and-ASR report is. Reducing thousands of attempts to "which categories are weak, and which are untested" is what lets a lead decide where the next round of red-teaming and the next fix should go. And the entire report is computable from category + outcome, with zero operational payloads required.

What the bookkeeping computes

Core Theory

1 · Per-category attack-success rate (ASR)

Within each category, the fraction of attempts where the model failed to defend. A high ASR points to a real weakness to fix; a low ASR is evidence (not proof) the defense holds for that category at the tested intensity.
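As a rough illustration of the definition above, here is a minimal sketch of the per-category ASR calculation. The category names and sample outcomes are invented for illustration, and the function name per_category_asr is an assumption, not necessarily what redteam_log.py uses.

```python
from collections import defaultdict

# Hypothetical sample log: (category, defended?) pairs only, no attack text.
attempts = [
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("prompt_injection", False),
    ("data_exfiltration", True),
    ("data_exfiltration", True),
]


def per_category_asr(attempts):
    """Return {category: fraction of attempts the model failed to defend}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, defended in attempts:
        totals[category] += 1
        if not defended:
            failures[category] += 1
    # Only categories with at least one attempt appear, so no division by zero.
    return {category: failures[category] / totals[category] for category in totals}


asr = per_category_asr(attempts)
for category, rate in sorted(asr.items()):
    print(f"{category}: ASR {rate:.0%}")
# data_exfiltration: ASR 0%
# prompt_injection: ASR 67%
```

Note that a category with zero attempts never shows up in this dictionary at all, which is exactly why ASR alone cannot reveal a blind spot.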

2 · Coverage — including the untested categories

Coverage records which planned categories were attempted at all; the gaps are the categories with zero attempts. These are the dangerous ones: a category with no data isn't safe, it's unknown. The report must name them loudly, because an unmeasured blind spot looks exactly like a pass on a dashboard that only shows ASR.
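One way to make the blind spots explicit is to compare the log against the list of categories in the plan. The sketch below assumes the plan is available as a simple list; the names planned, untested_categories, and coverage are hypothetical, as are the category labels.

```python
# Hypothetical plan: every category the red-team committed to testing.
planned = ["prompt_injection", "data_exfiltration", "tool_misuse", "privacy_leak"]

# Hypothetical log of (category, defended?) pairs.
attempts = [
    ("prompt_injection", False),
    ("data_exfiltration", True),
    ("prompt_injection", True),
]


def untested_categories(planned, attempts):
    """Return planned categories with zero attempts, in plan order."""
    tested = {category for category, _ in attempts}
    return [category for category in planned if category not in tested]


def coverage(planned, attempts):
    """Return the fraction of planned categories with at least one attempt."""
    untested = untested_categories(planned, attempts)
    return (len(planned) - len(untested)) / len(planned)


print("UNTESTED:", untested_categories(planned, attempts))
print(f"Coverage: {coverage(planned, attempts):.0%}")
# UNTESTED: ['tool_misuse', 'privacy_leak']
# Coverage: 50%
```

The key design choice is that the plan, not the log, defines the set of categories. A report built only from the log can never list what is missing from it.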

3 · A "weak" flag — where attention goes next

A simple threshold (say, ASR ≥ 50%) marks categories that need a fix or a deeper round. Combined with the untested flag, this gives a lead a one-glance answer to "where do we spend the next week?"
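A sketch of how the two flags might combine into one status per category follows. The 0.5 threshold mirrors the "say, ASR ≥ 50%" example above and is a starting point to tune, not a standard; the function name, status labels, and input values are all assumptions made for illustration.

```python
WEAK_THRESHOLD = 0.5  # illustrative default taken from the example in the text


def category_status(asr_by_category, planned, threshold=WEAK_THRESHOLD):
    """Label each planned category as UNTESTED, WEAK, or ok."""
    status = {}
    for category in planned:
        if category not in asr_by_category:
            status[category] = "UNTESTED"  # no data is not a pass
        elif asr_by_category[category] >= threshold:
            status[category] = "WEAK"
        else:
            status[category] = "ok"
    return status


# Hypothetical inputs: ASR values for tested categories, plus the full plan.
asr_by_category = {"prompt_injection": 0.5, "data_exfiltration": 0.1}
planned = ["prompt_injection", "data_exfiltration", "tool_misuse"]

flags = category_status(asr_by_category, planned)
for category in planned:
    print(f"{category:<20} {flags[category]}")
```

Here prompt_injection sits exactly on the threshold and is flagged WEAK, because the comparison is inclusive. The untested check runs first, so a missing category can never fall through to "ok".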

This is the same logic HarmBench formalizes at scale — standardizing how attack success is measured so numbers are comparable across systems and time. Your script is the hand-rolled version of that idea: the point isn't the tool, it's the discipline of defining success and coverage so the numbers mean something.

Build it

In the Try This box is redteam_log.py — a minimal red-team bookkeeper that logs each attempt as (category, defended?), then computes per-category ASR and flags untested categories. By design it contains zero operational attack strings: the raw prompts live in a secured, access-controlled store, and only category + outcome reach this report. Run it, read the output, then make it yours.

Make it yours

Replace the sample attempts with outcomes from the plan you wrote on Day 12 (you can invent plausible defended/failed outcomes per category — no real attack content needed). Then add a severity field to each attempt and sort the report so the highest-severity weak categories rise to the top. Look at your weakest-covered category and write one line on how you'd test it next.
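If you want a starting point for the severity extension, the sketch below shows one possible approach. It assumes a 1 to 5 severity scale and treats a category's severity as the maximum across its attempts; both are illustrative choices rather than part of the original script, and the sample outcomes are invented.

```python
from collections import defaultdict

# Hypothetical log: (category, defended?, severity 1-5). No attack content.
attempts = [
    ("prompt_injection", False, 3),
    ("prompt_injection", True, 3),
    ("data_exfiltration", False, 5),
    ("data_exfiltration", False, 4),
    ("tool_misuse", True, 2),
]


def ranked_weak_categories(attempts, threshold=0.5):
    """Return (category, asr, severity) for weak categories, worst first."""
    stats = defaultdict(lambda: {"n": 0, "failed": 0, "severity": 0})
    for category, defended, severity in attempts:
        entry = stats[category]
        entry["n"] += 1
        entry["failed"] += 0 if defended else 1
        # Assumption: a category is as severe as its most severe attempt.
        entry["severity"] = max(entry["severity"], severity)

    rows = []
    for category, entry in stats.items():
        asr = entry["failed"] / entry["n"]
        if asr >= threshold:
            rows.append((category, asr, entry["severity"]))
    # Sort by severity first, then by ASR, both descending.
    return sorted(rows, key=lambda row: (-row[2], -row[1]))


for category, asr, severity in ranked_weak_categories(attempts):
    print(f"severity {severity}  ASR {asr:.0%}  {category}")
# severity 5  ASR 100%  data_exfiltration
# severity 3  ASR 50%  prompt_injection
```

Sorting on severity before ASR means a moderately leaky high-severity category outranks a very leaky low-severity one. Whether that ordering suits your program is a judgment call worth stating in the report.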

Why payload-free is the design, not a limitation

Everything a lead needs comes from category + outcome + severity: ASR, coverage, and severity ranking directly, and trend once each attempt also carries a round or date label. The operational string adds risk to your log and nothing to the metric. Building the report this way isn't censoring your data; it's the correct, defensive engineering choice. A red-team artifact that needs the payloads to be useful is a leaked exploit kit waiting to happen.
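One way to enforce the payload-free design in code is to give the record type no free-text field at all, so there is nowhere for an attack string to land. This is a sketch under assumptions: the Attempt class, the allowed-category set, and the 1 to 5 severity scale are hypothetical and not taken from redteam_log.py.

```python
from dataclasses import dataclass

# Hypothetical closed set of category labels drawn from the plan.
ALLOWED_CATEGORIES = frozenset(
    {"prompt_injection", "data_exfiltration", "tool_misuse"}
)


@dataclass(frozen=True)
class Attempt:
    """One red-team attempt, reduced to what the metrics need."""

    category: str
    defended: bool
    severity: int  # assumed scale: 1 (low) to 5 (high)

    def __post_init__(self):
        # Reject anything outside the closed label set, so free text
        # cannot be smuggled in through the category field.
        if self.category not in ALLOWED_CATEGORIES:
            # The rejected value is deliberately not echoed in the message.
            raise ValueError("category must be one of the planned categories")
        if not 1 <= self.severity <= 5:
            raise ValueError("severity must be between 1 and 5")


record = Attempt(category="prompt_injection", defended=False, severity=4)
print(record)
```

Because the class is frozen and the category is checked against a fixed set, a record either matches the schema or fails loudly at creation time.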

Your work today

Build a Coverage & ASR Report

~75 minutes

  1. Run redteam_log.py from the Try This box and read the per-category ASR + untested-category output.
  2. Skim the framework section of HarmBench (Mazeika et al.) to see how the field standardizes attack-success measurement — then mirror that rigor in how you define "defended."
  3. Rewrite the attempts with category + outcome from your Day 12 plan, add a severity field, and produce a report that flags both your weak and your untested categories. Then write one line naming your weakest-covered category and how you'd test it next.

The expert move

A novice reports the attacks that worked; an expert reports coverage and ASR — and shouts about the categories with no data. The altitude jump is from "here's what I broke" to "here's our measured weakness and our blind spots, ranked by severity, computed without storing a single payload." Owning the report means owning where the program looks next.

Say this in an interview: "I score a red-team by per-category attack-success rate and by coverage — and I treat untested categories as risks, not passes. The whole report is computed from category and outcome, never from stored attack strings, because the metric needs the category, not the payload."

Today's Takeaways