Week 10 of 12 · Part C — Governance

Responsible Scaling Policies

How labs tie a model's capabilities to the safeguards it must clear before training or deployment

Day 49 · ~60 minutes · Concept

Day 49 of 60

The problem these policies solve

NIST gives you a process and a register gives you accountability, but neither answers a sharper question frontier labs face: at what point does a model become dangerous enough that we must not ship it without specific safeguards? You can't answer that with a static, one-size-fits-all policy, because each new model is more capable than the last. The answer the field converged on is a capability-gated commitment: pre-commit to thresholds of dangerous capability, and bind each threshold to mandatory safeguards that must be in place before you cross it.

The thesis

A responsible scaling policy turns safety from a judgment call made under deployment pressure into a pre-committed rule made in calm. The key move is conditioning safeguards on measured capability, not on the calendar or the competition. If the model can do X, we must have safeguard Y in place — decided in advance, when no one's racing.

How a capability-gated framework works

Anthropic's Responsible Scaling Policy is the canonical example. Its structure carries over to OpenAI's Preparedness Framework and Google DeepMind's Frontier Safety Framework: the names and details differ, but the core logic is the same.

Core Theory

1 · Capability thresholds — defined in advance

The policy names specific dangerous capabilities (e.g. meaningful uplift to a biological or cyber attacker) and defines tiers — Anthropic's AI Safety Levels (ASL). Crossing a tier is a measurable event, not a vibe.

2 · Dangerous-capability evaluations — the trigger

You can't gate on a capability you can't measure. So each threshold is paired with an eval designed to detect it. These evaluations are the tripwire: a model that scores past the line triggers the next tier's requirements. This is Part A's eval discipline pointed at the highest-stakes question.

3 · Required safeguards — bound to each tier

Each tier mandates safeguards before you may train or deploy at that level: stronger security to prevent model theft, deployment mitigations, red-teaming, and so on. The safeguards are a precondition, not an afterthought.

4 · The commitment to pause

The teeth of the policy: if a model crosses a threshold and the required safeguards aren't ready, the lab commits to pausing deployment (or further training) until they are. A framework with no pause condition is a press release.
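The four parts above can be sketched as one small gate function. This is a minimal illustration, not any lab's actual policy: the tier names, threshold scores, and safeguard labels are hypothetical, and it assumes a single eval score normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One capability tier: a tripwire score and the safeguards it mandates."""
    name: str
    eval_threshold: float           # an eval score at or above this trips the tier
    required_safeguards: frozenset  # safeguards that must be ready at this tier


# Hypothetical tiers, ordered from lowest to highest capability.
# All names and numbers here are illustrative assumptions.
TIERS = (
    Tier("tier-1", 0.0, frozenset()),
    Tier("tier-2", 0.4, frozenset({"red_teaming", "deployment_mitigations"})),
    Tier("tier-3", 0.7, frozenset({"red_teaming", "deployment_mitigations",
                                   "hardened_weight_security"})),
)


def tripped_tier(eval_score: float) -> Tier:
    """Return the highest tier whose threshold the eval score meets."""
    if not 0.0 <= eval_score <= 1.0:
        raise ValueError("eval_score is assumed to lie in [0, 1]")
    reached = [t for t in TIERS if eval_score >= t.eval_threshold]
    return reached[-1]


def gate(eval_score: float, safeguards_in_place: set) -> dict:
    """Decide proceed or pause from the measured score and the ready safeguards."""
    tier = tripped_tier(eval_score)
    missing = tier.required_safeguards - safeguards_in_place
    return {
        "tier": tier.name,
        # The pause commitment: any missing safeguard blocks the step.
        "decision": "pause" if missing else "proceed",
        "missing": sorted(missing),
    }


ready = {"red_teaming", "deployment_mitigations"}
print(gate(0.50, ready))  # tier-2 tripped, safeguards ready
print(gate(0.75, ready))  # tier-3 tripped, weight security missing
```

Note that the decision depends only on the measured score and the safeguards that are ready. There is no input for launch dates or competitor activity, which is the point of gating on capability.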

Why pre-commitment is the whole idea

The dangerous moment for safety is exactly when a model is impressive and a competitor is shipping. Rules decided in that moment will almost certainly bend. Capability-gated policies move the decision earlier, to a moment of calm, so that when the pressure comes, the answer is already written down and the only question is whether the eval tripped.

How this connects to your register

This is the same machinery you built on Day 48, scaled to the frontier. A capability threshold is a risk. A dangerous-capability eval is its detection. A required safeguard is its mitigation. The deployment gate is the owner's go/no-go decision. A responsible scaling policy is, in effect, a risk register where the highest-impact rows have hard, pre-committed gates instead of soft mitigations — which is exactly what you'd want for risks where being wrong is catastrophic.
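To make the mapping concrete, here is a minimal sketch of a register row with a hard gate. The field names are assumptions for illustration and may not match the schema of the register you built on Day 48.

```python
from dataclasses import dataclass


@dataclass
class RegisterRow:
    """One risk-register row. All field names are hypothetical."""
    risk: str               # the capability threshold, stated as a risk
    detected: bool          # did the dangerous-capability eval trip?
    mitigation_ready: bool  # is the required safeguard in place?
    hard_gate: bool         # True for catastrophic rows with a pre-committed gate


def go_no_go(row: RegisterRow, owner_accepts_risk: bool) -> str:
    """Return 'go' or 'no-go' for one row."""
    if not row.detected or row.mitigation_ready:
        return "go"
    if row.hard_gate:
        # Pre-committed: the owner cannot override a hard gate.
        return "no-go"
    # Soft mitigation: the owner makes the judgment call.
    return "go" if owner_accepts_risk else "no-go"


soft = RegisterRow("tone drift in support replies", True, False, hard_gate=False)
hard = RegisterRow("meaningful bio-attack uplift", True, False, hard_gate=True)

print(go_no_go(soft, owner_accepts_risk=True))  # owner may accept the risk
print(go_no_go(hard, owner_accepts_risk=True))  # blocked regardless of the owner
```

The only difference between the two rows is the `hard_gate` flag: for the soft row the owner's judgment decides, while for the hard row the outcome was fixed when the row was written.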

What this means for you

When you read one of these frameworks, find the trigger and the consequence for each tier: which eval result moves the model up a level, and which safeguard becomes mandatory when it does. If you can trace that loop, you understand the framework — the rest is detail.

Your work today

Read a Scaling Framework

~60 minutes

Read the summary of Anthropic's Responsible Scaling Policy until you can explain capability-tier-gated safeguards (the ASL idea) in your own words, including what triggers a higher tier.
At a high level, compare it with one other lab's framework — OpenAI's Preparedness Framework or DeepMind's Frontier Safety Framework — and note where they agree on the core logic and where they differ.
Honors: write how a concrete eval result from your Part A work would trigger a deployment gate — i.e., turn one of your register's top risks into a pre-committed capability threshold with a tripwire eval and a mandatory safeguard.

The expert move

A novice treats safety decisions as judgment calls made when the model is ready to ship. An expert pre-commits, binding safeguards to measured capability thresholds so the hardest decisions are made in calm, not under deployment pressure. The altitude jump is seeing that the value of a scaling policy isn't the tiers; it's the discipline of deciding the pause condition before the moment you'd be tempted to skip it.

Say this in an interview: "Responsible scaling is just a risk register for catastrophic rows: capability thresholds are the risks, dangerous-capability evals are the detection, required safeguards are the mitigation, and the pause commitment is the teeth. The whole point is pre-commitment — you decide the gate when no one's racing, so the answer holds when someone is."

Today's Takeaways

Responsible scaling policies tie required safeguards to measured capability thresholds, decided in advance.
Dangerous-capability evals are the tripwires; crossing a threshold triggers the next tier's requirements.
The teeth are the commitment to pause if safeguards aren't ready — a framework without one is a press release.
It's a risk register for catastrophic rows: threshold = risk, eval = detection, safeguard = mitigation, gate = owner's call.