Week 1 of 12 · Part A — Applied Safety

The Map of the Field

The four pillars — and how this track's 12 weeks sit inside them

Day 4 · ~60 minutes · Concept

Day 4 of 60

One field, four pillars

AI safety can feel like a pile of disconnected topics: jailbreaks, interpretability, the EU AI Act. Unsolved Problems in ML Safety (Hendrycks et al., 2021) gives a clean organizing map: four pillars that everything else slots into. Learn these and you'll always know where a new idea belongs.

Core Theory

1 · Robustness — withstand a hostile, shifting world

Does the system hold up under adversarial inputs and distribution shift? This is jailbreaks, prompt injection, adversarial examples. Your Week 5.

2 · Monitoring — know what the system is doing

Can we detect malfunctions, anomalies, and hidden behavior — and understand the model's internals? This is evaluation, anomaly detection, and interpretability. Your Weeks 4 and 8.

3 · Alignment — make the system want what we want

Can we get the model to pursue the intended objective rather than a gamed proxy or a learned mis-goal? This is reward hacking, deceptive alignment, scalable oversight. Your Weeks 6, 7, 9.

4 · Systemic safety — manage the broader context

How do we handle the organizational, economic, and geopolitical context that AI is deployed into? This is governance, risk frameworks, regulation. Your Weeks 10–11.

Why the map matters

When you read any new safety paper or news story, your first move is to place it: which pillar? A jailbreak demo is robustness. An interpretability result is monitoring. An alignment-faking paper is alignment. A new regulation is systemic. Placing it tells you what it's actually about — and what it isn't.
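If it helps to see the map as a data structure, here is a minimal sketch in Python. The pillar names, topics, and week numbers come straight from this lesson; the identifiers (`Pillar`, `TRACK_WEEKS`, `TOPIC_TO_PILLAR`, `place`) are hypothetical names chosen for illustration, and the exact-match lookup is a simplifying assumption, not a real classifier.

```python
from enum import Enum
from typing import Optional


class Pillar(Enum):
    ROBUSTNESS = "robustness"
    MONITORING = "monitoring"
    ALIGNMENT = "alignment"
    SYSTEMIC = "systemic safety"


# Weeks of this track that cover each pillar, as listed in the lesson.
TRACK_WEEKS = {
    Pillar.ROBUSTNESS: [5],
    Pillar.MONITORING: [4, 8],
    Pillar.ALIGNMENT: [6, 7, 9],
    Pillar.SYSTEMIC: [10, 11],
}

# Topics the lesson names under each pillar (illustrative, not exhaustive).
TOPIC_TO_PILLAR = {
    "jailbreak": Pillar.ROBUSTNESS,
    "prompt injection": Pillar.ROBUSTNESS,
    "adversarial examples": Pillar.ROBUSTNESS,
    "evaluation": Pillar.MONITORING,
    "anomaly detection": Pillar.MONITORING,
    "interpretability": Pillar.MONITORING,
    "reward hacking": Pillar.ALIGNMENT,
    "deceptive alignment": Pillar.ALIGNMENT,
    "scalable oversight": Pillar.ALIGNMENT,
    "alignment faking": Pillar.ALIGNMENT,
    "governance": Pillar.SYSTEMIC,
    "risk frameworks": Pillar.SYSTEMIC,
    "regulation": Pillar.SYSTEMIC,
}


def place(topic: str) -> Optional[Pillar]:
    """Return the pillar for a known topic, or None if it is not on the map yet."""
    return TOPIC_TO_PILLAR.get(topic.strip().lower())


if __name__ == "__main__":
    for topic in ["Jailbreak", "Interpretability", "Alignment faking", "Regulation"]:
        pillar = place(topic)
        print(f"{topic}: {pillar.value}, weeks {TRACK_WEEKS[pillar]}")
```

A `None` result is useful in its own right: it flags a topic you still have to place by hand, which is exactly the exercise this lesson asks you to practice.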

The wide-angle view: catastrophic risk

The four pillars are about how to make systems safe. It's also worth seeing the why at its largest scale. An Overview of Catastrophic AI Risks (Hendrycks et al.) groups the biggest concerns into four sources: malicious use, the AI race (competitive pressure cutting safety corners), organizational risks (accidents from how labs operate), and rogue AIs (loss of control). You don't need to buy every scenario — you need to recognize the categories, because they're the vocabulary of the governance debate in Part C.

Stay grounded

Catastrophic-risk framing can tip into sci-fi. The antidote is the discipline you're building: every large claim should connect back to something measurable — a robustness gap, a monitoring blind spot, an alignment failure, a governance hole. If it can't, treat it as speculation, not evidence.
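One way to practice this discipline is to write each claim down alongside the measurable thing it connects to. The sketch below is a hypothetical note-taking helper, not an established method: the `Claim` class, the `label` function, and the example claims are all assumptions made for illustration. Only the four anchor kinds come from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional

# The four kinds of measurable anchor named in this lesson.
ANCHOR_KINDS = (
    "robustness gap",
    "monitoring blind spot",
    "alignment failure",
    "governance hole",
)


@dataclass
class Claim:
    text: str
    anchor_kind: Optional[str] = None   # one of ANCHOR_KINDS, if any
    measurement: Optional[str] = None   # what you would actually measure


def label(claim: Claim) -> str:
    """Label a claim 'grounded' only if it names a known anchor and a measurement."""
    if claim.anchor_kind in ANCHOR_KINDS and claim.measurement:
        return "grounded"
    return "speculation"


if __name__ == "__main__":
    # Hypothetical example claims, invented for this sketch.
    claims = [
        Claim(
            text="This model is easy to jailbreak.",
            anchor_kind="robustness gap",
            measurement="attack success rate on a fixed jailbreak prompt set",
        ),
        Claim(text="AI will inevitably escape human control."),
    ]
    for c in claims:
        print(f"{label(c)}: {c.text}")
```

The check is deliberately strict: a claim with an anchor kind but no stated measurement still counts as speculation, because nothing in it says what evidence would settle it.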

Your work today

Place Everything

~60 minutes

Re-skim the four pillars in Unsolved Problems in ML Safety. Write a one-line definition of each in your own words.
Read the abstract and one section of An Overview of Catastrophic AI Risks that interests you.
Take three recent AI headlines and assign each to a pillar. Notice how the exercise sharpens what each story is really claiming.

The expert move

A generalist has opinions about "AI risk." An expert has a taxonomy and files any claim into it (robustness, monitoring, alignment, or systemic), which immediately shows what kind of evidence would settle it. Having the map is what lets you stay calm and specific in a conversation full of hype.

Say this in an interview: "I organize the field into robustness, monitoring, alignment, and systemic safety. It keeps me precise — when someone raises a concern, I can say which pillar it's in, what evidence would bear on it, and who owns it — instead of treating 'AI safety' as one undifferentiated worry."

Today's Takeaways

Four pillars: robustness, monitoring, alignment, systemic safety — everything slots in.
This track maps onto them: W5 robustness; W4/W8 monitoring; W6, W7, W9 alignment; W10–11 systemic.
Catastrophic-risk sources: malicious use, AI race, organizational risk, rogue AIs.
Anchor every big claim to something measurable, or label it speculation.