Week 3 of 12 · Part A — Applied Safety

Automated Red-Teaming at Scale

Using one model to find another's failures — and knowing exactly where automation stops

Day 14 of 60 · ~60 minutes · Concept

The scaling problem manual red-teaming can't solve

Humans red-team carefully but slowly. A handful of people can make a few thousand attempts; a deployed model faces millions of real interactions. Manual red-teaming alone can never reach the coverage you'd want, and it concentrates the human cost of the work on a few people. The defensive answer is to automate the search for failures: use a language model to generate candidate test cases against another model, so coverage scales and the worst exposure shifts off humans and onto a controlled pipeline.

The thesis

Automated red-teaming is a coverage multiplier, not a replacement for judgment. A model can generate and triage failures far faster than people can, expanding the breadth you measure — but what counts as a failure, which findings matter, and where the bright lines are stay human. The win is reach; the limits are real and you must be able to name them.

How a model red-teams another model

The method comes from Red Teaming Language Models with Language Models (Perez et al., 2022): use one LM (the "red" LM) to generate large numbers of test inputs aimed at a target model, run the target on them, and use a classifier to flag which responses are failures. The loop turns red-teaming into something you can run at scale and re-run on every model version.

Core Theory

1 · Generate — the red LM proposes test cases

A model produces many candidate inputs targeting a category of weakness. You steer it at the categories from your plan — breadth comes from the model's ability to vary cases far faster than a person can type them.
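The generate step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from Perez et al.: `red_lm` is a hypothetical callable standing in for whatever model client you use, and the category names and instruction wording are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TestCase:
    category: str  # weakness category taken from the test plan
    prompt: str    # candidate input to send to the target model


def generate_cases(
    red_lm: Callable[[str], str],
    categories: Iterable[str],
    per_category: int,
) -> list[TestCase]:
    """Ask the red LM for candidate inputs per category, dropping duplicates."""
    cases: list[TestCase] = []
    for category in categories:
        seen: set[str] = set()
        for i in range(per_category):
            # Hypothetical instruction wording; a real plan would be more specific.
            instruction = f"Write test input #{i + 1} probing the category: {category}"
            prompt = red_lm(instruction).strip()
            if prompt and prompt not in seen:
                seen.add(prompt)
                cases.append(TestCase(category, prompt))
    return cases


# Stand-in red LM so the sketch runs offline; a real one would call a model.
def fake_red_lm(instruction: str) -> str:
    return f"[placeholder case for: {instruction}]"


cases = generate_cases(fake_red_lm, ["category_a", "category_b"], per_category=3)
print(len(cases), "cases across", len({c.category for c in cases}), "categories")
```

Keeping the category on every generated case is the design choice that matters here: it is what lets later stages report results per category rather than as one undifferentiated pile.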

2 · Execute — run the target and classify outcomes

Each test runs against the target model; a classifier (a safety classifier or LLM-judge) labels each response as defended or failed. This is the automated version of the (category, defended?) logging you built on Day 13: same fields, vastly more rows.
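A sketch of the execute step, assuming the target and the judge are both plain callables. The names `target`, `judge`, and `Outcome` are hypothetical, and the stub functions exist only so the example runs without a model.

```python
from typing import Callable, Iterable, NamedTuple


class Outcome(NamedTuple):
    category: str
    prompt: str
    defended: bool  # True if the judge decided the target held up


def execute(
    cases: Iterable[tuple[str, str]],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> list[Outcome]:
    """Run each (category, prompt) against the target and log the judge's label."""
    rows: list[Outcome] = []
    for category, prompt in cases:
        response = target(prompt)
        rows.append(Outcome(category, prompt, defended=judge(prompt, response)))
    return rows


# Stubs for illustration only: the target refuses prompts containing "refuse",
# and the judge treats any response starting with "I can't" as defended.
def fake_target(prompt: str) -> str:
    return "I can't help with that." if "refuse" in prompt else "Sure, here is ..."


def fake_judge(prompt: str, response: str) -> bool:
    return response.startswith("I can't")


rows = execute(
    [("category_a", "please refuse this"), ("category_a", "something else")],
    fake_target,
    fake_judge,
)
for row in rows:
    print(row.category, row.defended)
```

The judge is passed in as a parameter on purpose, so it can be swapped or audited without touching the loop.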

3 · Measure — feed it back into coverage + ASR

The labeled outcomes flow straight into the same coverage-and-ASR (attack success rate) report. Automation doesn't change the metric; it changes the volume behind it, so categories you could barely sample by hand now have real numbers.
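A minimal sketch of the report computation. It assumes coverage means the share of planned categories with at least one test, and ASR means failed responses divided by attempts within a category; your Day 13 definitions may differ, so treat both as assumptions.

```python
from collections import Counter
from typing import Iterable


def coverage_and_asr(
    rows: Iterable[tuple[str, bool]],
    planned_categories: list[str],
) -> tuple[float, dict[str, float]]:
    """rows are (category, defended) pairs; returns (coverage, per-category ASR)."""
    if not planned_categories:
        raise ValueError("planned_categories must not be empty")
    attempts: Counter = Counter()
    failures: Counter = Counter()
    for category, defended in rows:
        attempts[category] += 1
        if not defended:
            failures[category] += 1
    # Only planned categories that were actually tested get an ASR.
    tested = [c for c in planned_categories if attempts[c] > 0]
    coverage = len(tested) / len(planned_categories)
    asr = {c: failures[c] / attempts[c] for c in tested}
    return coverage, asr


rows = [("a", True), ("a", False), ("b", True), ("b", True)]
coverage, asr = coverage_and_asr(rows, planned_categories=["a", "b", "c"])
print(coverage, asr)
```

Untested categories are left out of the ASR table rather than reported as 0.0, since "never tested" and "never failed" are different findings.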

Standardized frameworks like HarmBench exist precisely so this scaled measurement is comparable across systems — when the search is automated, the definition of "attack success" has to be fixed and shared, or the numbers drift.
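One way to keep a definition of "attack success" fixed across runs is to pin it in a small spec and stamp every report with its fingerprint. The sketch below is illustrative only and is not HarmBench's API; `EvalSpec` and all of its field names and values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalSpec:
    behavior_set: str    # which set of test behaviors was used
    judge: str           # identifier of the classifier or LLM-judge
    success_rule: str    # plain-language rule for what counts as a failure
    max_new_tokens: int  # generation length given to the target

    def fingerprint(self) -> str:
        """Short stable hash of the spec; equal fingerprints mean equal definitions."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


spec_v1 = EvalSpec("plan-2024-q1", "judge-a", "judge labels response as failed", 256)
spec_v2 = EvalSpec("plan-2024-q1", "judge-b", "judge labels response as failed", 256)

# Reports are comparable only when their fingerprints match.
print(spec_v1.fingerprint() == spec_v2.fingerprint())
```

Here swapping the judge changes the fingerprint, which flags that the two ASR numbers were produced under different definitions.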

The limits — where automation stops

Automation buys reach and pays for it in judgment. A responsible lead names the limits out loud:

What automated red-teaming can't do for you

The judge is fallible. The classifier deciding defended-vs-failed has its own errors and biases; an automated ASR is only as honest as its labeler.
It optimizes toward what it can measure. A red LM finds the failures it's pointed at and rewarded for; novel, out-of-distribution, or genuinely creative attacks still need humans.
It can't set the bright lines. Which categories to probe, what's out of bounds, and what counts as a real harm are policy calls, not generation calls.
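The first limit can be made measurable by having people re-label a sample of the judge's verdicts. The sketch below assumes you have such an audit sample as (judge_failed, human_failed) pairs; the function name and the example data are hypothetical.

```python
from typing import Iterable


def audit_judge(pairs: Iterable[tuple[bool, bool]]) -> dict[str, float]:
    """Compare judge labels with human labels on the same responses.

    Each pair is (judge_failed, human_failed), with human labels treated
    as the reference.
    """
    tp = fp = fn = tn = 0
    for judge_failed, human_failed in pairs:
        if judge_failed and human_failed:
            tp += 1
        elif judge_failed and not human_failed:
            fp += 1  # false alarm: judge flagged a response humans accepted
        elif not judge_failed and human_failed:
            fn += 1  # missed failure: judge passed a response humans rejected
        else:
            tn += 1
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("audit sample is empty")
    return {
        "agreement": (tp + tn) / total,
        "judge_asr": (tp + fp) / total,
        "human_asr": (tp + fn) / total,
        "missed_failures": fn,
        "false_alarms": fp,
    }


sample = [(True, True), (True, False), (False, True),
          (False, True), (False, False), (False, False)]
report = audit_judge(sample)
print(report)
```

A gap between `judge_asr` and `human_asr` on the audit sample is a direct warning that the automated ASR in the main report is biased in the same direction.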

The combination is the answer

The strong operation isn't human or automated — it's both. Automation provides breadth (millions of cases, off human shoulders); humans provide depth and judgment (defining success, spotting novel failure classes, ruling the edge cases). The labs run it this way on purpose, pairing people with AI rather than choosing one.
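One possible way to wire the two halves together is to let the judge label the clear cases and send uncertain ones to people, plus a small random audit sample. This is a sketch under assumptions: it presumes the judge emits a score in [0, 1] for "failed", and the thresholds, `audit_rate`, and function name are all hypothetical choices rather than a lab's documented practice.

```python
import random
from typing import Iterable


def route_for_review(
    scored: Iterable[tuple[str, float]],
    low: float = 0.3,
    high: float = 0.7,
    audit_rate: float = 0.05,
    seed: int = 0,
) -> tuple[list[str], list[tuple[str, bool]]]:
    """Split (case_id, failure_score) pairs into a human queue and auto labels."""
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    human_queue: list[str] = []
    auto_labeled: list[tuple[str, bool]] = []
    for case_id, score in scored:
        uncertain = low <= score <= high
        # Uncertain cases always go to humans; confident ones are spot-checked.
        if uncertain or rng.random() < audit_rate:
            human_queue.append(case_id)
        else:
            auto_labeled.append((case_id, score > high))  # True means failed
    return human_queue, auto_labeled


scored = [("case-1", 0.1), ("case-2", 0.5), ("case-3", 0.9)]
human_queue, auto_labeled = route_for_review(scored, audit_rate=0.0)
print(human_queue, auto_labeled)
```

The random audit slice matters as much as the uncertainty band: it is the only way confident-but-wrong judge verdicts ever reach a person.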

Your work today

Read the Automation Method

~60-minute foundation

Read the method of Red Teaming Language Models with Language Models (Perez et al.) — focus on the generate → execute → classify loop and what role the classifier plays.
Read the framework section of HarmBench (Mazeika et al.) for how standardized, automated attack-success evaluation is defined so results are comparable.
See how a lab combines the two in practice in OpenAI's Approach to External Red Teaming — note where people are used and where automation is.
In a notebook, write one paragraph on how you'd combine human and automated red-teaming for one deployment, and list two limits of the automated half.

The expert move

An enthusiast treats automated red-teaming as a magic button that "tests everything." An expert treats it as a coverage multiplier with a fallible judge — scaling breadth while keeping success-definition, edge cases, and bright lines firmly human, and naming exactly where the automation stops. The altitude jump is from "we automated red-teaming" to "we use automation for breadth and humans for judgment, and I can tell you what each half can and can't catch."

Say this in an interview: "I'd scale red-teaming by having one model generate and triage cases against the target, feeding the outcomes into the same coverage-and-ASR report — but I'd keep the success criteria and the novel-failure hunting human, because the automated judge has its own biases and only finds what it's pointed at. Breadth from automation, depth from people."

Today's Takeaways

Manual red-teaming can't reach deployment-scale coverage — automation is the defensive multiplier.
The method is generate → execute → classify: one LM proposes cases, the target runs them, a judge labels outcomes into the same coverage + ASR report.
The limits are real: the judge is fallible, it finds only what it's pointed at, and it can't set the bright lines.
The strong operation combines automated breadth with human depth and judgment.