Week 5 of 12 · Part A — Applied Safety

Prompt Injection & Agents

Why the most dangerous instruction is the one the model never knew was an instruction

Day 22 ~60 minutes Concept

Day 22 of 60

The attack that scales with usefulness

Yesterday's jailbreaks assumed an adversarial user. Today's attack assumes an adversarial document. Prompt injection exploits a structural fact about language models: they don't have a hard boundary between "the instructions I was given" and "the data I'm processing." Everything is tokens in one context. So an instruction hidden inside content the model reads can hijack it — and the more useful you make a model by letting it browse, read email, or call tools, the larger this surface grows.

The thesis

A jailbreak attacks the model's training. Prompt injection attacks the model's architecture — the absence of a trusted/untrusted boundary in the context window. You cannot fully fine-tune this away, because the channel that carries the attack is the same channel that carries the legitimate work.

Direct vs indirect — the distinction that matters

Core Theory

Direct injection

The user themselves types instructions designed to override the system prompt — "ignore your previous instructions and…". This is essentially a jailbreak delivered through the instruction channel. It's bounded: the attacker is the user, and the harm usually lands on that same user.

Indirect injection — the dangerous one

The malicious instruction lives in third-party content the model retrieves: a web page, a PDF, an email, a calendar invite, a code comment, a tool's output. A trusting user asks the model to summarize a page; the page contains hidden instructions; the model follows them. Now the attacker is a stranger, and the victim is someone who did nothing wrong. This is the failure mode mapped in Not what you've signed up for: Indirect Prompt Injection (Greshake et al., 2023).

Why agents raise the stakes

A chatbot that gets injected says something wrong. An agent that gets injected acts wrong — it sends the email, runs the query, moves the funds, exfiltrates the data it can read. The blast radius of an injection equals the agent's permissions. Tool access turns a content vulnerability into an action vulnerability, which is why indirect injection is the single most important attack class to internalize before deploying tool-using models.

Defending an injection path

There is no single fix, but there is a defender's toolkit, and it's worth knowing the names. The recurring theme is to stop trusting retrieved content as if it were a user instruction.

Defensive Patterns

Provenance & privilege separation

Tag where every piece of context came from, and treat untrusted sources (web, email, tool output) as data to be quoted, never as instructions to be obeyed. Don't let retrieved text grant new permissions.

Least privilege & human-in-the-loop on consequential actions

An agent should hold the minimum tool permissions for its task, and the truly irreversible actions (sending, deleting, paying) should require confirmation. If injection can't reach a dangerous capability, it can't do dangerous things.

Sandboxing & output monitoring

Run tools in constrained environments, and monitor the agent's actions, not just its text, for anomalies — a summarization task that suddenly tries to send mail is a signal independent of how the injection was phrased.

The mindset shift

Stop asking "can the model be tricked?" — assume yes. Start asking "if it is tricked, what is the worst thing it's allowed to do?" Injection defense is mostly permission design. The model is the soft layer; the hard layer is what you let the soft layer touch.

Your work today

Map an Injection Path, Then Defend It

~60 minutes

Read §1–4 of Not what you've signed up for: Indirect Prompt Injection. Focus on the delivery channels they enumerate — that taxonomy is your attack-surface checklist.
Pick a concrete agent deployment you can picture — an email assistant, a browsing agent, a coding agent reading a repo. Write out one full indirect-injection path: where the malicious content enters, what the agent does, who gets hurt.
For that one path, sketch a defense using at least two patterns above (provenance, least privilege, human-in-the-loop, sandboxing, monitoring). Name what each layer stops and what it doesn't.

The expert move

A novice treats prompt injection as a prompting problem to be solved with a better system prompt. An expert recognizes it as a permissions problem: the model has no reliable trusted/untrusted boundary, so the real defense lives in the architecture around it — provenance, least privilege, confirmation on consequential actions. The altitude jump is from "make the model refuse" to "make the trick harmless even when it works."

Say this in an interview: "For agents, I treat all retrieved content as untrusted data, never as instructions, and I scope tool permissions to least privilege with human confirmation on irreversible actions. I assume indirect injection will sometimes succeed — so I design so that when it does, the agent simply isn't allowed to do anything catastrophic."

Today's Takeaways

Prompt injection exploits the lack of a trusted/untrusted boundary in the context window.
Indirect injection — instructions hidden in retrieved content — hits trusting users via a stranger's payload.
Agents turn a content vulnerability into an action vulnerability; blast radius = the agent's permissions.
Defense is mostly permission design: provenance, least privilege, human-in-the-loop, sandboxing, monitoring.