The four pillars — and how this track's 12 weeks sit inside them
Day 4 of 60
AI safety can feel like a pile of disconnected topics — jailbreaks, interpretability, the EU AI Act. Unsolved Problems in ML Safety (Hendrycks et al.) gives the cleanest organizing map: four pillars that everything else slots into. Learn these and you'll always know where a new idea belongs.
Robustness: Does the system hold up under adversarial inputs and distribution shift? This covers jailbreaks, prompt injection, and adversarial examples. Your Week 5.
Monitoring: Can we detect malfunctions, anomalies, and hidden behavior, and understand the model's internals? This covers evaluation, anomaly detection, and interpretability. Your Weeks 4 and 8.
Alignment: Can we get the model to pursue the intended objective rather than a gamed proxy or a learned mis-goal? This covers reward hacking, deceptive alignment, and scalable oversight. Your Weeks 6, 7, and 9.
Systemic safety: How do we handle the organizational, economic, and geopolitical context that AI is deployed into? This covers governance, risk frameworks, and regulation. Your Weeks 10–11.
When you read any new safety paper or news story, your first move is to place it: which pillar? A jailbreak demo is robustness. An interpretability result is monitoring. An alignment-faking paper is alignment. A new regulation is systemic. Placing it tells you what it's actually about — and what it isn't.
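If it helps to see the map as a data structure, here is a minimal sketch in Python. The names `Pillar`, `PILLARS`, and `place` are hypothetical, invented for this illustration, and the keyword matching is only a toy stand-in for the judgment call you make when you place a paper. The topics and week numbers are the ones listed in this lesson.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Pillar:
    """One pillar of the map: its name, example topics, and track weeks."""
    name: str
    topics: Tuple[str, ...]
    weeks: Tuple[int, ...]


# Topics and weeks are copied from this lesson; the structure is an assumption.
PILLARS = (
    Pillar("robustness",
           ("jailbreak", "prompt injection", "adversarial example"), (5,)),
    Pillar("monitoring",
           ("evaluation", "anomaly detection", "interpretability"), (4, 8)),
    Pillar("alignment",
           ("reward hacking", "deceptive alignment", "scalable oversight",
            "alignment faking"), (6, 7, 9)),
    Pillar("systemic safety",
           ("governance", "risk framework", "regulation"), (10, 11)),
)


def place(description: str) -> Optional[Pillar]:
    """Return the first pillar whose topic keyword appears in the description.

    Toy heuristic: lowercases the text and treats hyphens as spaces, so
    "alignment-faking" matches "alignment faking". Returns None if no
    keyword matches.
    """
    text = description.lower().replace("-", " ")
    for pillar in PILLARS:
        if any(topic in text for topic in pillar.topics):
            return pillar
    return None


if __name__ == "__main__":
    for item in ("A jailbreak demo", "An interpretability result",
                 "An alignment-faking paper", "A new regulation"):
        pillar = place(item)
        label = pillar.name if pillar else "unplaced"
        print(f"{item}: {label}")
```

A real paper often touches more than one pillar, so treat the first match as a starting point for the question "which pillar?", not as the answer.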
The four pillars are about how to make systems safe. It's also worth seeing the why at its largest scale. An Overview of Catastrophic AI Risks (Hendrycks et al.) groups the biggest concerns into four sources: malicious use, the AI race (competitive pressure cutting safety corners), organizational risks (accidents from how labs operate), and rogue AIs (loss of control). You don't need to buy every scenario — you need to recognize the categories, because they're the vocabulary of the governance debate in Part C.
Catastrophic-risk framing can tip into sci-fi. The antidote is the discipline you're building: every large claim should connect back to something measurable — a robustness gap, a monitoring blind spot, an alignment failure, a governance hole. If it can't, treat it as speculation, not evidence.
A generalist has opinions about "AI risk." An expert has a taxonomy and instantly files any claim into it — robustness, monitoring, alignment, or systemic — which immediately reveals what kind of evidence would settle it. Having the map is what lets you stay calm and specific in a conversation full of hype.
Say this in an interview: "I organize the field into robustness, monitoring, alignment, and systemic safety. It keeps me precise — when someone raises a concern, I can say which pillar it's in, what evidence would bear on it, and who owns it — instead of treating 'AI safety' as one undifferentiated worry."