Turning a red-team operation into numbers — per-category ASR and the blind spots you haven't tested
Day 13 of 60
Yesterday you designed a red-team plan. Today you build the bookkeeping that turns it into numbers a review can act on: per-category attack-success rate, and an explicit flag for the categories nobody tested. The headline insight of the build is that a red-team produces two kinds of risk — the categories where you tested and the model failed (visible weakness), and the categories you never tested at all (invisible weakness). A good report surfaces both, because the second is the one teams forget.
An attack log is not the deliverable — the coverage-and-ASR report is. Reducing thousands of attempts to "which categories are weak, and which are untested" is what lets a lead decide where the next round of red-teaming and the next fix should go. And the entire report is computable from category + outcome — zero operational payloads required.
Per-category attack-success rate (ASR) is, within each category, the fraction of attempts where the model failed to defend. A high ASR points to a real weakness to fix; a low ASR is evidence (not proof) that the defense holds for that category at the tested intensity.
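The ASR definition above can be sketched in a few lines. This is a minimal illustration, not the redteam_log.py script itself: the category names, the sample outcomes, and the function name asr_by_category are all hypothetical, and the log holds only (category, defended?) pairs.

```python
from collections import defaultdict

# Hypothetical sample log: (category, defended?) pairs only, no prompt text.
attempts = [
    ("category_a", True),
    ("category_a", False),
    ("category_a", False),
    ("category_b", True),
    ("category_b", True),
]


def asr_by_category(attempts):
    """Return {category: fraction of attempts where the model failed to defend}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, defended in attempts:
        totals[category] += 1
        if not defended:
            failures[category] += 1
    # Only categories with at least one attempt appear, so no division by zero.
    return {cat: failures[cat] / totals[cat] for cat in totals}


asr = asr_by_category(attempts)
for cat, rate in sorted(asr.items()):
    print(f"{cat}: ASR {rate:.0%}")
```

Note that a category with no attempts never appears in this result at all, which is exactly why ASR alone cannot show a blind spot.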
Untested categories are the categories with zero attempts. These are the dangerous ones: a category with no data isn't safe, it's unknown. The report must name them loudly, because an unmeasured blind spot looks exactly like a pass on a dashboard that only shows ASR.
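Finding the zero-attempt categories requires one extra input: the list of categories the plan intended to cover. The sketch below assumes that list exists as planned_categories; that name, the category names, and the function untested_categories are hypothetical.

```python
# Hypothetical plan: every category the red-team intended to cover.
planned_categories = ["category_a", "category_b", "category_c", "category_d"]

# Hypothetical log: (category, defended?) pairs only.
attempts = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", True),
]


def untested_categories(planned, attempts):
    """Return planned categories that have zero logged attempts, in plan order."""
    tested = {category for category, _ in attempts}
    return [cat for cat in planned if cat not in tested]


untested = untested_categories(planned_categories, attempts)
for cat in untested:
    print(f"{cat}: UNTESTED (risk unknown, not a pass)")
```

Comparing against the plan rather than against the log is the design choice that matters here: the log can only tell you what was tried, never what was skipped.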
A simple threshold (say, ASR ≥ 50%) marks categories that need a fix or a deeper round. Combined with the untested flag, this gives a lead a one-glance answer to "where do we spend the next week?"
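One way to combine the threshold with the untested flag is a single status per planned category. This is a sketch under assumptions: the 0.5 cutoff mirrors the "say, ASR ≥ 50%" example above, and the status labels, category names, and function name are hypothetical.

```python
from collections import Counter

WEAK_THRESHOLD = 0.5  # assumed cutoff, matching the ASR >= 50% example

# Hypothetical plan and log: (category, defended?) pairs only.
planned_categories = ["category_a", "category_b", "category_c"]
attempts = [
    ("category_a", False),
    ("category_a", False),
    ("category_a", True),
    ("category_b", True),
    ("category_b", True),
    ("category_b", False),
]


def category_status(planned, attempts, threshold=WEAK_THRESHOLD):
    """Label each planned category as UNTESTED, WEAK, or OK."""
    totals = Counter(cat for cat, _ in attempts)
    failures = Counter(cat for cat, defended in attempts if not defended)
    status = {}
    for cat in planned:
        if totals[cat] == 0:
            status[cat] = "UNTESTED"  # no data: unknown, not safe
        elif failures[cat] / totals[cat] >= threshold:
            status[cat] = "WEAK"
        else:
            status[cat] = "OK"
    return status


status = category_status(planned_categories, attempts)
for cat, label in status.items():
    print(f"{cat}: {label}")
```

Checking for zero attempts first keeps an untested category from ever being reported as OK.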
This is the same logic HarmBench formalizes at scale — standardizing how attack success is measured so numbers are comparable across systems and time. Your script is the hand-rolled version of that idea: the point isn't the tool, it's the discipline of defining success and coverage so the numbers mean something.
In the Try This box is redteam_log.py — a minimal red-team bookkeeper that logs each attempt as (category, defended?), then computes per-category ASR and flags untested categories. By design it contains zero operational attack strings: the raw prompts live in a secured, access-controlled store, and only category + outcome reach this report. Run it, read the output, then make it yours.
Replace the sample attempts with outcomes from the plan you wrote on Day 12 (you can invent plausible defended/failed outcomes per category — no real attack content needed). Then add a severity field to each attempt and sort the report so the highest-severity weak categories rise to the top. Look at your weakest-covered category and write one line on how you'd test it next.
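A possible starting point for the severity exercise is sketched below. It assumes a 1 to 5 severity scale, takes a category's severity as the maximum seen across its attempts, and ranks weak categories by severity first and ASR second. The Attempt class, the scale, and the ranking rule are all illustrative choices, not part of redteam_log.py.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    category: str
    defended: bool
    severity: int  # hypothetical scale: 1 (low) to 5 (critical)


# Hypothetical outcomes; no attack content is stored anywhere.
attempts = [
    Attempt("category_a", False, 2),
    Attempt("category_a", False, 2),
    Attempt("category_b", False, 5),
    Attempt("category_b", True, 5),
    Attempt("category_c", True, 4),
]


def ranked_weak_categories(attempts, threshold=0.5):
    """Return (category, severity, asr) for weak categories, worst first."""
    stats = {}
    for a in attempts:
        s = stats.setdefault(a.category, {"total": 0, "failed": 0, "severity": 0})
        s["total"] += 1
        s["failed"] += not a.defended  # bool counts as 0 or 1
        s["severity"] = max(s["severity"], a.severity)
    rows = [
        (cat, s["severity"], s["failed"] / s["total"])
        for cat, s in stats.items()
        if s["failed"] / s["total"] >= threshold
    ]
    # Highest severity first; ties broken by higher ASR.
    return sorted(rows, key=lambda r: (-r[1], -r[2]))


ranked = ranked_weak_categories(attempts)
for cat, severity, rate in ranked:
    print(f"{cat}: severity {severity}, ASR {rate:.0%}")
```

With this ordering, a severity-5 category at 50% ASR outranks a severity-2 category at 100% ASR; whether that is the right trade-off is a judgment call for your own report.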
Everything a lead needs — ASR, coverage, severity, trend — comes from category + outcome + severity. The operational string adds risk to your log and nothing to the metric. Building the report this way isn't censoring your data; it's the correct, defensive engineering choice. A red-team artifact that needs the payloads to be useful is a leaked exploit kit waiting to happen.
Run redteam_log.py from the Try This box and read the per-category ASR + untested-category output.
Replace the sample attempts with category + outcome from your Day 12 plan, add a severity field, and produce a report that flags both your weak and your untested categories. Then write one line naming your weakest-covered category and how you'd test it next.
A novice reports the attacks that worked; an expert reports coverage and ASR, and shouts about the categories with no data. The altitude jump is from "here's what I broke" to "here's our measured weakness and our blind spots, ranked by severity, computed without storing a single payload." Owning the report means owning where the program looks next.
Say this in an interview: "I score a red-team by per-category attack-success rate and by coverage — and I treat untested categories as risks, not passes. The whole report is computed from category and outcome, never from stored attack strings, because the metric needs the category, not the payload."