Week 5 of 12 · Part A — Applied Safety

Defense-in-Depth, Measured

Turning "we have safeguards" into a coverage matrix that shows exactly where the gaps are

Day 23 of 60 · ~75 minutes · Build

From "we have safeguards" to "here's the coverage"

Every team will tell you they have safety measures. Far fewer can show you, on one page, which measure stops which attack class — and where an attack walks straight through. Today you build that page. It's a humble artifact: a matrix of defensive layers against attack classes, filled with outcomes from a test run. No payloads, just bookkeeping. But it converts a vague reassurance into a defensible claim, and it makes the week's thesis impossible to ignore.

The thesis

Defense-in-depth isn't a slogan; it's a measurable property. The right question is never "are we safe?" but "for each attack class, which layers held — and is there any class with zero layers that held?" One uncovered row is a live vulnerability, no matter how strong the other rows look.

The four layers, and why you need all of them

Core Theory

1 · Input filter

Screens incoming content before it reaches the model — known-bad patterns, untrusted-source flagging, classifier checks. Cheap and fast, but blind to anything novel or hidden in a modality it doesn't parse.

2 · System prompt

The instructions and guardrails framing the model's behavior. Helpful for steering the default, but, as Day 21 showed, it can be overridden by competing objectives and bypassed by out-of-distribution inputs.

3 · Safety tuning

The model's trained-in harmlessness. Strong on the distribution it was trained on, brittle off it. It's a default, not a wall.

4 · Output filter

Screens what the model produced before it reaches the user or a tool. The last line, and often the one that catches what the upstream layers missed — but only if the harm is detectable in the output itself.

The logic of layering

Each layer has a blind spot. Stack them and an attack must pass all blind spots simultaneously to succeed — which is much harder than beating any one. But the attacker only needs one complete path through. That asymmetry is why you measure per-attack-class coverage, not an average: an average hides the one row that's wide open.
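To see why an average misleads, here is a minimal sketch with made-up outcomes. The class names and the True/False values are placeholders, not results from any real system; the point is only that the overall hold rate can look respectable while one row has nothing holding.

```python
# Hypothetical outcomes: True means the layer stopped that attack class.
# Column order: input filter, system prompt, safety tuning, output filter.
coverage = {
    "class_a": [True, True, True, True],
    "class_b": [True, False, True, True],
    "class_c": [False, True, True, False],
    "class_d": [False, False, False, False],  # every layer missed
}

# The averaged view: fraction of all (class, layer) cells where a layer held.
cells = [held for row in coverage.values() for held in row]
average_hold_rate = sum(cells) / len(cells)

# The per-class view: how many layers held for each attack class.
layers_held = {name: sum(row) for name, row in coverage.items()}
open_classes = [name for name, count in layers_held.items() if count == 0]

print(f"average hold rate: {average_hold_rate:.0%}")
print(f"layers held per class: {layers_held}")
print(f"classes with zero layers holding: {open_classes}")
```

In this toy data more than half the cells hold, yet class_d is completely open. The per-class count is the number that matters.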

Build it

In the Try This box is robustness_report.py — a defense-in-depth coverage matrix. For each attack class it records which layers stopped it (outcomes only, no attack content), prints a per-class verdict (defended or GAP), and drives home that robustness is a property of the whole stack. Run it, read the verdicts, then make it yours: edit the outcomes to reflect a system you actually know, and hunt for the row with no defense.
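If you want a feel for the shape of such a script before opening it, the following is a rough sketch, not the course's robustness_report.py. The layer names follow the four layers above; the attack-class labels, the sample outcomes, and the helper names `verdict` and `build_report` are all assumptions made for illustration. It also assumes a class counts as defended when at least one layer held.

```python
LAYERS = ["input_filter", "system_prompt", "safety_tuning", "output_filter"]

# Hypothetical sample outcomes (outcomes only, no attack content):
# for each attack class, the set of layers that stopped it in a test run.
RESULTS = {
    "direct_request": {"input_filter", "safety_tuning", "output_filter"},
    "role_play_framing": {"safety_tuning"},
    "many_shot": {"output_filter"},
    "indirect_injection": set(),
}


def verdict(held):
    """Assumed rule: defended if at least one layer held, otherwise a gap."""
    return "defended" if held else "GAP"


def build_report(results, layers):
    """Return one text row per attack class: a mark per layer, then the verdict."""
    width = max(len(name) for name in results)
    lines = []
    for attack_class, held in results.items():
        # "X" = this layer held, "." = this layer was bypassed.
        marks = " ".join("X" if layer in held else "." for layer in layers)
        lines.append(f"{attack_class:<{width}}  {marks}  {verdict(held)}")
    return lines


if __name__ == "__main__":
    print("columns:", ", ".join(LAYERS))
    for line in build_report(RESULTS, LAYERS):
        print(line)
```

Storing each row as a set of layers that held keeps the file outcomes-only: there is nowhere to put attack content even by accident.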

Make it yours

Replace the sample outcomes with your honest assessment of a real deployment — even a guess is informative. Then do the two honors moves: add a fifth defensive layer (say, action-level monitoring for agents) and re-measure, and explicitly name the single attack class with no layer holding. That uncovered class is your "fix this first." This is the exact artifact a safety review would open with.
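The "add a layer and re-measure" move can be sketched as below. Everything here is hypothetical: the baseline outcomes, the `action_monitoring` layer name, and the assumption about which classes that layer would stop. Swap in your own assessment.

```python
def find_gaps(results):
    """Return the attack classes where no layer held, in alphabetical order."""
    return sorted(name for name, held in results.items() if not held)


def with_extra_layer(results, layer_name, stops):
    """Return a new mapping with layer_name added to every class listed in stops."""
    return {
        name: (held | {layer_name}) if name in stops else set(held)
        for name, held in results.items()
    }


# Hypothetical four-layer baseline: attack class -> layers that held.
baseline = {
    "direct_request": {"input_filter", "safety_tuning"},
    "many_shot": {"output_filter"},
    "indirect_injection": set(),
    "encoded_input": set(),
}

# Assumption for illustration: the fifth layer only helps against indirect injection.
after = with_extra_layer(baseline, "action_monitoring", stops={"indirect_injection"})

print("gaps before:", find_gaps(baseline))
print("gaps after: ", find_gaps(after))
```

The new layer closes one gap and leaves the other untouched, so the class still listed after re-measuring is the one to name as "fix this first".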

Your work today

Build a Coverage Matrix

~75 minutes

  1. Run robustness_report.py from the Try This box and read its per-attack-class verdicts. Notice how indirect injection slips past most layers: the point Day 22 made, now in data.
  2. Rewrite the results outcomes for a deployment you know, across at least four attack classes. Keep it outcomes-only — which layer held, never the attack itself.
  3. Add an extra (fifth) defensive layer, re-measure, and write one sentence naming the attack class with the thinnest coverage and what you'd build to close it. Frame it against this week's reading: Many-shot Jailbreaking is a clean example of an attack that defeats some layers and forces new ones.

The expert move

A novice reports "we have four safety layers" as if the count were the point. An expert reports the coverage matrix — and leads with the empty row, because the strength of your strongest layer is irrelevant to the attack class that bypasses all of them. The altitude jump is from inventorying defenses to measuring them per threat, then prioritizing by gap, not by effort already spent.
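Prioritizing by gap can be as simple as sorting the matrix by how many layers held. This is a small sketch with placeholder data; `rank_by_gap` is a name invented here, and the outcomes are not measurements of any real system.

```python
def rank_by_gap(results):
    """Order attack classes from thinnest coverage to thickest (ties broken by name)."""
    return sorted(results.items(), key=lambda item: (len(item[1]), item[0]))


# Hypothetical outcomes: attack class -> layers that held.
results = {
    "direct_request": {"input_filter", "safety_tuning", "output_filter"},
    "role_play_framing": {"safety_tuning", "output_filter"},
    "many_shot": {"output_filter"},
    "indirect_injection": set(),
}

ranked = rank_by_gap(results)
for attack_class, held in ranked:
    print(f"{len(held)} layer(s) held: {attack_class}")

# The first entry is the thinnest row, so it leads the report.
fix_first = ranked[0][0]
print("lead with:", fix_first)
```

Note that the ordering ignores how strong any individual layer is, which is the point: a row with zero layers holding goes to the top regardless of how well the stack does elsewhere.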

Say this in an interview: "I don't claim a system is robust because it has safeguards — I build a coverage matrix of layers against attack classes and look for any class with zero layers holding. Defense-in-depth only counts if every attack path crosses more than one independent layer, and the matrix is how I prove it instead of asserting it."

Today's Takeaways