The cleanest proof that being superhuman and being unbreakable are different things
Day 24 of 60
If you take one image out of this whole week, take this one. A Go AI that plays at a superhuman level, stronger than any human who has ever lived, can be beaten reliably by a deliberately crafted strategy that an amateur can learn and execute. Not beaten by a stronger AI. Beaten by a cheap exploit, run by a person who would lose a normal game against it decisively. That is "Adversarial Policies Beat Superhuman Go AIs" (Wang et al., 2023, FAR AI), which attacked KataGo, and it is one of the cleanest demonstrations in ML that capability and robustness are separate properties.
Average-case performance puts no bound on worst-case behavior. A system can be brilliant across the whole distribution it was trained on and have a gaping hole just outside it, and an adversary's entire job is to find that hole. High capability can even hide brittleness, because the system looks so competent that no one stress-tests the edges.
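To make the average-case versus worst-case gap concrete, here is a minimal sketch. Everything in it is a hypothetical toy, not anything from the paper: `toy_model` stands in for a strong system with one narrow blind spot, `average_case_score` plays the role of a benchmark that samples inputs at random, and `adversarial_search` plays the role of an attacker who looks for the hole on purpose.

```python
import random


def toy_model(x: float) -> bool:
    """Hypothetical system: correct everywhere except one narrow blind spot."""
    return not (0.4200 <= x < 0.4205)


def average_case_score(model, n_samples: int, seed: int = 0) -> float:
    """Benchmark-style evaluation: fraction correct on randomly sampled inputs."""
    rng = random.Random(seed)
    correct = sum(model(rng.random()) for _ in range(n_samples))
    return correct / n_samples


def adversarial_search(model, grid_size: int):
    """Adversary-style evaluation: scan a fine grid, return the first failing input."""
    for i in range(grid_size):
        x = i / grid_size
        if not model(x):
            return x
    return None  # no failure found at this resolution


avg = average_case_score(toy_model, 10_000)
hole = adversarial_search(toy_model, 100_000)

print(f"average-case score: {avg:.4f}")   # very close to 1.0
print(f"adversary found failure at x = {hole}")
```

The random benchmark reports a score very close to perfect, while the search finds an input on which the model always fails. The benchmark number and the existence of the hole are both true of the same model, which is the point of the Go result in miniature.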
The attacking policy doesn't out-play the Go AI at Go. It steers the game into bizarre, off-distribution board states the strong AI never learned to handle, and wins there. The AI's strength is real and irrelevant in those states. Strength on-distribution, collapse off-distribution — same model.
This has the same structure as the jailbreaks of Day 21 (mismatched generalization: push the input off the safety distribution and harmlessness doesn't fire). It also echoes the long-context attack in Many-shot Jailbreaking (Anthropic, 2024), where a brand-new attack surface, very long contexts, opens precisely because capability grew. A more capable model with a longer context window is, in this specific way, a larger attack surface, not a smaller one.
"It's very capable" is an answer to a different question than "is it robust?" Capability is measured on the distribution you expect; robustness is measured on the distribution an adversary chooses. They can move in opposite directions. Never let a demo of competence stand in for evidence of robustness.
If even a superhuman, narrow system can be exploited this cleanly, then no single safeguard — however strong — is a robustness guarantee. This is the deepest argument for everything you built yesterday: defense-in-depth beats any single layer, because the single layer, no matter how capable, has an off-distribution hole, and the layers' holes don't perfectly overlap. Robustness is a property you engineer into the system, not one that arrives for free as capability scales.
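The claim that the layers' holes don't perfectly overlap can be sketched with plain sets. This is an illustrative toy under stated assumptions: the 100 attack IDs, the three layer names, and the specific holes are all made up for the example and do not describe any real system.

```python
# Hypothetical universe of 100 distinct attacks, identified by number.
attacks = set(range(100))

# Each layer's "hole" is the set of attacks that layer fails to stop (assumed values).
holes = {
    "input_filter": set(range(0, 20)),       # misses attacks 0..19
    "model_refusal": set(range(15, 30)),     # misses attacks 15..29
    "output_classifier": set(range(18, 40)), # misses attacks 18..39
}


def bypass_rate(layer_names) -> float:
    """Fraction of attacks that slip past every listed layer."""
    passed = set(attacks)
    for name in layer_names:
        passed &= holes[name]  # an attack survives only if this layer also misses it
    return len(passed) / len(attacks)


for name in holes:
    print(name, "alone:", bypass_rate([name]))
print("all three stacked:", bypass_rate(list(holes)))
```

In this toy, each layer alone lets through 15 to 22 percent of attacks, while the stack lets through 2 percent: only the attacks that sit in every hole at once. The residual is not zero, which is why stacking layers reduces exposure without turning into a guarantee.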
When a vendor or a colleague leads with a capability benchmark, your reflex should be: "Impressive on-distribution — now show me the worst case. Who's looked for the adversarial policy, and what held?" Capability is the headline; robustness is the work.
A novice is reassured by a capability benchmark. An expert hears a capability claim and immediately asks for the worst case, because they've internalized that average-case excellence and worst-case fragility coexist in the same system. The altitude jump is owning the Go result as a portable argument — one clean, undeniable example that reframes any "but it's so capable" conversation in a sentence.
Say this in an interview: "My go-to example is the adversarial-policy result against superhuman Go AIs: a system stronger than any human, reliably beaten by a cheap exploit an amateur can run. It shows capability and robustness are different properties, so I never treat a benchmark score as evidence of safety, and I design for defense-in-depth because no single layer is robust off its own distribution."