Turning "we have safeguards" into a coverage matrix that shows exactly where the gaps are
Day 23 of 60
Every team will tell you they have safety measures. Far fewer can show you, on one page, which measure stops which attack class — and where an attack walks straight through. Today you build that page. It's a humble artifact: a matrix of defensive layers against attack classes, filled with outcomes from a test run. No payloads, just bookkeeping. But it converts a vague reassurance into a defensible claim, and it makes the week's thesis impossible to ignore.
Defense-in-depth isn't a slogan; it's a measurable property. The right question is never "are we safe?" but "for each attack class, which layers held — and is there any class with zero layers that held?" One uncovered row is a live vulnerability, no matter how strong the other rows look.
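The per-class question can be written down in a few lines. The sketch below is not the course's robustness_report.py; it is a minimal illustration with hypothetical layer names, placeholder attack-class labels, and made-up outcomes. It records only which layer held, then flags any class where no layer did.

```python
# Hypothetical layer names and outcomes, for illustration only.
LAYERS = ["input_filter", "system_prompt", "model_training", "output_filter"]

# True means the layer stopped that attack class in the test run.
# Outcomes only: no attack content is stored anywhere.
results = {
    "direct_request":     {"input_filter": True,  "system_prompt": True,  "model_training": True,  "output_filter": True},
    "indirect_injection": {"input_filter": False, "system_prompt": False, "model_training": False, "output_filter": True},
    "placeholder_class":  {"input_filter": False, "system_prompt": False, "model_training": False, "output_filter": False},
}

def verdict(outcomes):
    """Return 'defended' if at least one layer held, else 'GAP'."""
    return "defended" if any(outcomes.values()) else "GAP"

def find_gaps(matrix):
    """List every attack class with zero layers holding."""
    return [cls for cls, outcomes in matrix.items() if verdict(outcomes) == "GAP"]

for cls, outcomes in results.items():
    held = [layer for layer in LAYERS if outcomes[layer]]
    print(f"{cls:<20} {verdict(outcomes):<9} held by: {', '.join(held) or 'none'}")

print("Uncovered classes:", find_gaps(results))
```

With these placeholder values, one class is reported as a GAP even though most cells in the matrix are True, which is the situation the question is designed to expose.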
The input filter screens incoming content before it reaches the model: known-bad patterns, untrusted-source flagging, classifier checks. Cheap and fast, but blind to anything novel or hidden in a modality it doesn't parse.
The system prompt carries the instructions and guardrails framing the model's behavior. Helpful for steering the default but, as Day 21 showed, it can be overridden by competing objectives and bypassed by out-of-distribution inputs.
Safety training is the model's trained-in harmlessness. Strong on the distribution it was trained on, brittle off it. It's a default, not a wall.
The output filter screens what the model produced before it reaches the user or a tool. It is the last line, and often the one that catches what the upstream layers missed, but only if the harm is detectable in the output itself.
Each layer has a blind spot. Stack them and an attack must pass all blind spots simultaneously to succeed — which is much harder than beating any one. But the attacker only needs one complete path through. That asymmetry is why you measure per-attack-class coverage, not an average: an average hides the one row that's wide open.
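To see how an average hides an open row, compare the two summaries on the same data. This is a small sketch with invented numbers: the class labels and the 0/1 outcomes are placeholders, not measurements from any real system.

```python
# Hypothetical outcomes: 1 = layer held, 0 = layer missed. Not real measurements.
# Each row is one attack class; each column is one defensive layer.
matrix = {
    "class_a": [1, 1, 1, 1],
    "class_b": [1, 1, 1, 1],
    "class_c": [1, 1, 0, 1],
    "class_d": [0, 0, 0, 0],  # the wide-open row
}

# Summary 1: a single average over every cell in the matrix.
cells = [cell for row in matrix.values() for cell in row]
average_coverage = sum(cells) / len(cells)

# Summary 2: layers held per attack class, plus the rows where none held.
layers_held = {cls: sum(row) for cls, row in matrix.items()}
open_rows = [cls for cls, held in layers_held.items() if held == 0]

print(f"Average coverage: {average_coverage:.0%}")
print("Layers held per class:", layers_held)
print("Open rows:", open_rows)
```

Here 11 of 16 cells held, so the average reads as roughly 69 percent, yet class_d has a complete path through all four layers. Only the per-class view surfaces it.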
In the Try This box is robustness_report.py — a defense-in-depth coverage matrix. For each attack class it records which layers stopped it (outcomes only, no attack content), prints a per-class verdict (defended or GAP), and drives home that robustness is a property of the whole stack. Run it, read the verdicts, then make it yours: edit the outcomes to reflect a system you actually know, and hunt for the row with no defense.
Replace the sample outcomes with your honest assessment of a real deployment — even a guess is informative. Then do the two honors moves: add a fifth defensive layer (say, action-level monitoring for agents) and re-measure, and explicitly name the single attack class with no layer holding. That uncovered class is your "fix this first." This is the exact artifact a safety review would open with.
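The two honors moves can be sketched as follows. Everything here is an assumption for illustration: the layer names, the attack-class labels, the outcomes, and the helper names `uncovered` and `add_layer` are hypothetical, and the real robustness_report.py may be organized differently.

```python
import copy

# Hypothetical outcomes for illustration; True means the layer held in a test run.
results = {
    "direct_request":     {"input_filter": True,  "system_prompt": True,  "model_training": True,  "output_filter": True},
    "indirect_injection": {"input_filter": False, "system_prompt": False, "model_training": False, "output_filter": False},
    "placeholder_class":  {"input_filter": False, "system_prompt": False, "model_training": False, "output_filter": False},
}

def uncovered(matrix):
    """Attack classes where no layer held."""
    return [cls for cls, outcomes in matrix.items() if not any(outcomes.values())]

def add_layer(matrix, layer_name, held_for):
    """Return a copy of the matrix with one more layer column.

    held_for is the set of attack classes the new layer stopped;
    every other class is recorded as not held.
    """
    updated = copy.deepcopy(matrix)
    for cls, outcomes in updated.items():
        outcomes[layer_name] = cls in held_for
    return updated

before = uncovered(results)
# Assumed outcome: the new layer holds for indirect injection only.
with_fifth = add_layer(results, "action_monitoring", held_for={"indirect_injection"})
after = uncovered(with_fifth)

print("Uncovered before:", before)
print("Uncovered after: ", after)
if after:
    print("Fix this first:", after[0])
```

Because `add_layer` returns a copy, the before and after matrices can be compared side by side instead of overwriting the original assessment.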
Run robustness_report.py from the Try This box and read its per-attack-class verdicts. Notice how indirect injection slips most layers: the point Day 22 made, now in data.
Edit the results outcomes for a deployment you know, across at least four attack classes. Keep it outcomes-only: which layer held, never the attack itself.
A novice reports "we have four safety layers" as if the count were the point. An expert reports the coverage matrix and leads with the empty row, because the strength of your strongest layer is irrelevant to the attack class that bypasses all of them. The altitude jump is from inventorying defenses to measuring them per threat, then prioritizing by gap, not by effort already spent.
Say this in an interview: "I don't claim a system is robust because it has safeguards — I build a coverage matrix of layers against attack classes and look for any class with zero layers holding. Defense-in-depth only counts if every attack path crosses more than one independent layer, and the matrix is how I prove it instead of asserting it."
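The "more than one layer" condition is stricter than "at least one layer held", and it is easy to check separately. The sketch below uses hypothetical outcomes and an invented THIN label for classes that rest on a single layer; it also sorts the report so the emptiest row comes first.

```python
# Hypothetical outcomes; True means the layer held in a test run.
results = {
    "class_a": {"input_filter": True,  "system_prompt": True,  "model_training": True,  "output_filter": True},
    "class_b": {"input_filter": False, "system_prompt": False, "model_training": False, "output_filter": True},
    "class_c": {"input_filter": False, "system_prompt": False, "model_training": False, "output_filter": False},
}

def depth(outcomes):
    """Number of layers that held for one attack class."""
    return sum(outcomes.values())

def classify(outcomes, min_depth=2):
    """GAP: no layer held. THIN: fewer than min_depth held. Otherwise defended."""
    held = depth(outcomes)
    if held == 0:
        return "GAP"
    if held < min_depth:
        return "THIN"
    return "defended"

# Sort ascending by depth so the report leads with the weakest rows.
report = sorted(results.items(), key=lambda item: depth(item[1]))

for cls, outcomes in report:
    print(f"{cls:<10} depth={depth(outcomes)}  {classify(outcomes)}")
```

Counting layers says nothing about whether they are independent. Two layers that fail on the same inputs behave like one, so that judgment still has to be made by hand for each row.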