Week 5 of 12 · Part A — Applied Safety

Capability ≠ Robustness

The cleanest proof that being superhuman and being unbreakable are different things

Day 24 ~60 minutes Concept

Day 24 of 60

The example that ends the argument

If you take one image out of this whole week, take this one. A Go AI that plays at a superhuman level — stronger than any human who has ever lived — can be beaten, reliably, by a deliberately crafted strategy that an amateur can learn and execute. Not beaten by a stronger AI. Beaten by a cheap exploit, run by a person who would lose a normal game in seconds. That is Adversarial Policies Beat Superhuman Go AIs (Wang et al., 2023, FAR AI), and it is the cleanest demonstration in all of ML that capability and robustness are separate properties.

The thesis

Performance on the average case tells you nothing about behavior on the worst case. A system can be brilliant across the whole distribution it was trained on and have a gaping hole just outside it — and an adversary's entire job is to find that hole. High capability can even hide brittleness, because it makes the system look so competent that no one stress-tests the edges.

Why this generalizes far past Go

Core Theory

The Go result, in one breath

The attacking policy doesn't out-play the Go AI at Go. It steers the game into bizarre, off-distribution board states the strong AI never learned to handle, and wins there. The AI's strength is real and irrelevant in those states. Strength on-distribution, collapse off-distribution — same model.

The same shape in language models

This is structurally identical to the jailbreaks of Day 21 (mismatched generalization: push the input off the safety distribution and harmlessness doesn't fire) and to the long-context attack in Many-shot Jailbreaking (Anthropic, 2024), where a brand-new attack surface — very long contexts — opens precisely because capability grew. A more capable model with a longer context window is, in this specific way, a larger attack surface, not a smaller one.

Read this back to yourself

"It's very capable" is an answer to a different question than "is it robust?" Capability is measured on the distribution you expect; robustness is measured on the distribution an adversary chooses. They can move in opposite directions. Never let a demo of competence stand in for evidence of robustness.

What this implies for a defender

If even a superhuman, narrow system can be exploited this cleanly, then no single safeguard — however strong — is a robustness guarantee. This is the deepest argument for everything you built yesterday: defense-in-depth beats any single layer, because the single layer, no matter how capable, has an off-distribution hole, and the layers' holes don't perfectly overlap. Robustness is a property you engineer into the system, not one that arrives for free as capability scales.

The reframe to keep

When a vendor or a colleague leads with a capability benchmark, your reflex should be: "Impressive on-distribution — now show me the worst case. Who's looked for the adversarial policy, and what held?" Capability is the headline; robustness is the work.

Your work today

Internalize the Go Result

~60 minutes

  1. Read §1 and the results of Adversarial Policies Beat Superhuman Go AIs. In one sentence, write what the attacking policy actually does — and why the AI's strength doesn't save it.
  2. Read Many-shot Jailbreaking and write how a larger context window became a new attack surface — the same capability-grows-surface-grows pattern.
  3. Find one more "capable but brittle" example from any domain you know (vision systems, autonomous driving, recommendation, security). Write two sentences connecting it to the capability ≠ robustness thesis.
The expert move

A novice is reassured by a capability benchmark. An expert hears a capability claim and immediately asks for the worst case, because they've internalized that average-case excellence and worst-case fragility coexist in the same system. The altitude jump is owning the Go result as a portable argument — one clean, undeniable example that reframes any "but it's so capable" conversation in a sentence.

Say this in an interview: "My go-to example is the adversarial-policy result against superhuman Go AIs — a system stronger than any human, reliably beaten by a cheap exploit an amateur can run. It proves capability and robustness are different properties, so I never treat a benchmark score as evidence of safety, and I design for defense-in-depth because no single layer is robust off its own distribution."

Today's Takeaways