Why the most dangerous instruction is the one the model never knew was an instruction
Day 22 of 60
Yesterday's jailbreaks assumed an adversarial user. Today's attack assumes an adversarial document. Prompt injection exploits a structural fact about language models: they don't have a hard boundary between "the instructions I was given" and "the data I'm processing." Everything is tokens in one context. So an instruction hidden inside content the model reads can hijack it — and the more useful you make a model by letting it browse, read email, or call tools, the larger this surface grows.
A jailbreak attacks the model's training. Prompt injection attacks the model's architecture — the absence of a trusted/untrusted boundary in the context window. You cannot fully fine-tune this away, because the channel that carries the attack is the same channel that carries the legitimate work.
The user themselves types instructions designed to override the system prompt — "ignore your previous instructions and…". This is essentially a jailbreak delivered through the instruction channel. It's bounded: the attacker is the user, and the harm usually lands on that same user.
The malicious instruction lives in third-party content the model retrieves: a web page, a PDF, an email, a calendar invite, a code comment, a tool's output. A trusting user asks the model to summarize a page; the page contains hidden instructions; the model follows them. Now the attacker is a stranger, and the victim is someone who did nothing wrong. This is the failure mode mapped in Not what you've signed up for: Indirect Prompt Injection (Greshake et al., 2023).
A chatbot that gets injected says something wrong. An agent that gets injected acts wrong — it sends the email, runs the query, moves the funds, exfiltrates the data it can read. The blast radius of an injection equals the agent's permissions. Tool access turns a content vulnerability into an action vulnerability, which is why indirect injection is the single most important attack class to internalize before deploying tool-using models.
There is no single fix, but there is a defender's toolkit, and it's worth knowing the names. The recurring theme is to stop trusting retrieved content as if it were a user instruction.
Tag where every piece of context came from, and treat untrusted sources (web, email, tool output) as data to be quoted, never as instructions to be obeyed. Don't let retrieved text grant new permissions.
An agent should hold the minimum tool permissions for its task, and the truly irreversible actions (sending, deleting, paying) should require confirmation. If injection can't reach a dangerous capability, it can't do dangerous things.
Run tools in constrained environments, and monitor the agent's actions, not just its text, for anomalies — a summarization task that suddenly tries to send mail is a signal independent of how the injection was phrased.
Stop asking "can the model be tricked?" — assume yes. Start asking "if it is tricked, what is the worst thing it's allowed to do?" Injection defense is mostly permission design. The model is the soft layer; the hard layer is what you let the soft layer touch.
A novice treats prompt injection as a prompting problem to be solved with a better system prompt. An expert recognizes it as a permissions problem: the model has no reliable trusted/untrusted boundary, so the real defense lives in the architecture around it — provenance, least privilege, confirmation on consequential actions. The altitude jump is from "make the model refuse" to "make the trick harmless even when it works."
Say this in an interview: "For agents, I treat all retrieved content as untrusted data, never as instructions, and I scope tool permissions to least privilege with human confirmation on irreversible actions. I assume indirect injection will sometimes succeed — so I design so that when it does, the agent simply isn't allowed to do anything catastrophic."