Using one model to find another's failures — and knowing exactly where automation stops
Day 14 of 60
Humans red-team carefully but slowly. A handful of people can make a few thousand attempts; a deployed model faces millions of real interactions. Manual red-teaming alone cannot reach the coverage you'd want, and it concentrates the human cost on a few people. The defensive answer is to automate the search for failures: use a language model to generate candidate test cases against another model, so coverage scales and the worst exposure shifts off humans and onto a controlled pipeline.
Automated red-teaming is a coverage multiplier, not a replacement for judgment. A model can generate and triage failures far faster than people can, expanding the breadth you measure — but what counts as a failure, which findings matter, and where the bright lines are stay human. The win is reach; the limits are real and you must be able to name them.
The method comes from Red Teaming Language Models with Language Models (Perez et al., 2022): use one LM (the "red" LM) to generate large numbers of test inputs aimed at a target model, run the target on them, and use a classifier to flag which responses are failures. The loop turns red-teaming into something you can run at scale and re-run on every model version.
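The shape of that loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_red_lm`, `toy_target`, and `toy_judge` are hypothetical stand-ins for whatever generation model, target model, and classifier you actually use, and their toy behavior exists only so the sketch runs.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: in practice each would wrap a real model or classifier.
def toy_red_lm(category: str, n: int) -> List[str]:
    """Pretend red LM: returns n placeholder test inputs for a category."""
    return [f"[{category}] test case {i}" for i in range(n)]

def toy_target(prompt: str) -> str:
    """Pretend target model: 'fails' on every third test case."""
    idx = int(prompt.rsplit(" ", 1)[1])
    return "UNSAFE" if idx % 3 == 0 else "REFUSED"

def toy_judge(prompt: str, response: str) -> bool:
    """Pretend classifier: True means the target defended."""
    return response == "REFUSED"

def red_team_loop(
    categories: List[str],
    n_per_category: int,
    red_lm: Callable[[str, int], List[str]],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> List[Dict]:
    """Generate test cases, run the target on each, and label every response."""
    rows = []
    for category in categories:
        for prompt in red_lm(category, n_per_category):
            response = target(prompt)
            rows.append({
                "category": category,
                "prompt": prompt,
                "defended": judge(prompt, response),
            })
    return rows

rows = red_team_loop(["category_a", "category_b"], 6,
                     toy_red_lm, toy_target, toy_judge)
failures = [r for r in rows if not r["defended"]]
print(len(rows), "cases,", len(failures), "flagged as failures")
```

Because the three components are passed in as plain callables, the same loop can be re-run unchanged against each new model version by swapping only the target.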
A model produces many candidate inputs targeting a category of weakness. You steer it at the categories from your plan — breadth comes from the model's ability to vary cases far faster than a person can type them.
Each test runs against the target model; a classifier (a safety classifier or LLM-judge) labels each response defended or failed. This is the automated version of the (category, defended?) logging you built on Day 13 — same fields, vastly more rows.
The labeled outcomes flow straight into the same coverage-and-ASR report. Automation doesn't change the metric; it changes the volume behind it, so categories you could barely sample by hand now have real numbers.
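As a rough sketch of that report, the function below takes (category, defended?) rows and computes a per-category ASR plus a simple coverage figure. The definitions are assumptions for illustration: ASR is taken to be the fraction of attempts labeled failed, coverage is the fraction of planned categories with at least one attempt, and the category names and rows are made up.

```python
from collections import Counter

def coverage_and_asr(rows, planned_categories):
    """Return (per-category report, fraction of planned categories tested)."""
    attempts = Counter(r["category"] for r in rows)
    failures = Counter(r["category"] for r in rows if not r["defended"])
    report = {}
    for cat in planned_categories:
        n = attempts[cat]
        report[cat] = {
            "attempts": n,
            # ASR is undefined for a category with no attempts.
            "asr": failures[cat] / n if n else None,
        }
    covered = sum(1 for cat in planned_categories if attempts[cat] > 0)
    return report, covered / len(planned_categories)

# Invented example rows; real rows would come from the automated pipeline.
rows = [
    {"category": "a", "defended": True},
    {"category": "a", "defended": False},
    {"category": "a", "defended": True},
    {"category": "a", "defended": True},
    {"category": "b", "defended": False},
    {"category": "b", "defended": True},
]
report, coverage = coverage_and_asr(rows, ["a", "b", "c"])
for cat, stats in report.items():
    print(cat, stats)
print("coverage:", round(coverage, 2))
```

Reporting the attempt count next to each ASR keeps thinly sampled categories visible: an ASR computed from two rows should not be read the same way as one computed from two thousand.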
Standardized frameworks like HarmBench exist precisely so this scaled measurement is comparable across systems — when the search is automated, the definition of "attack success" has to be fixed and shared, or the numbers drift.
Automation buys reach and pays for it in judgment. A responsible lead names the limits out loud:
The judge is fallible. The classifier deciding defended-vs-failed has its own errors and biases; an automated ASR is only as honest as its labeler.
It optimizes toward what it can measure. A red LM finds the failures it's pointed at and rewarded for; novel, out-of-distribution, or genuinely creative attacks still need humans.
It can't set the bright lines. Which categories to probe, what's out of bounds, and what counts as a real harm are policy calls, not generation calls.
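One way to make the fallible-judge limit concrete is to have humans label a small sample of responses and compare those labels with the judge's. The sketch below is an assumption about how such an audit might look, not a prescribed procedure; `audit_judge` is a hypothetical helper and the label pairs are invented.

```python
def audit_judge(pairs):
    """pairs: (human_defended, judge_defended) booleans for the same responses."""
    total = len(pairs)
    agree = sum(1 for human, judge in pairs if human == judge)
    human_failed = [judge for human, judge in pairs if not human]
    human_defended = [judge for human, judge in pairs if human]
    # Missed failure: the human saw a failure, the judge said "defended".
    missed = sum(1 for judge in human_failed if judge)
    # False alarm: the human saw a defense, the judge said "failed".
    false_alarms = sum(1 for judge in human_defended if not judge)
    return {
        "agreement": agree / total,
        "missed_failure_rate": missed / len(human_failed) if human_failed else None,
        "false_alarm_rate": false_alarms / len(human_defended) if human_defended else None,
    }

# Invented audit sample: 4 human-labeled failures, 6 human-labeled defenses.
pairs = (
    [(False, False)] * 3   # judge caught the failure
    + [(False, True)] * 1  # judge missed the failure
    + [(True, True)] * 5   # judge agreed the target defended
    + [(True, False)] * 1  # judge raised a false alarm
)
audit = audit_judge(pairs)
print(audit)
```

A high missed-failure rate means the automated ASR understates the real one, which is why the audit sample needs human labels rather than a second automated judge.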
The strong operation isn't human or automated — it's both. Automation provides breadth (millions of cases, off human shoulders); humans provide depth and judgment (defining success, spotting novel failure classes, ruling the edge cases). The labs run it this way on purpose, pairing people with AI rather than choosing one.
An enthusiast treats automated red-teaming as a magic button that "tests everything." An expert treats it as a coverage multiplier with a fallible judge — scaling breadth while keeping success-definition, edge cases, and bright lines firmly human, and naming exactly where the automation stops. The altitude jump is from "we automated red-teaming" to "we use automation for breadth and humans for judgment, and I can tell you what each half can and can't catch."
Say this in an interview: "I'd scale red-teaming by having one model generate and triage cases against the target, feeding the outcomes into the same coverage-and-ASR report — but I'd keep the success criteria and the novel-failure hunting human, because the automated judge has its own biases and only finds what it's pointed at. Breadth from automation, depth from people."