Prompt injection is an untrusted-input problem wearing a new costume
We've spent thirty years learning to separate code from data. LLMs gleefully merge them again.
Actively tended. Revisited often, links forming to other notes.
The oldest sin in software is mixing instructions with data on the same channel. SQL injection, XSS, format strings: different decades, same shape. Prompt injection is that shape again, with one cruel twist: there is no parser to fix.
In SQL we solved injection with prepared statements, a structural boundary the attacker’s data cannot cross. An LLM has no equivalent boundary. The system prompt, the user’s question, and the malicious text inside a fetched webpage all arrive as the same kind of thing: tokens.
What seems to actually help
My current (revisable!) ranking of defenses, strongest first:
- Capability confinement. Assume injection succeeds; make the blast radius boring. The agent that can only read public docs can be fully hijacked and still do no harm. This is the same instinct as cloud-iam-blast-radius: constrain identity, not content.
- Trust labeling at the boundary. Track which context came from where, and gate actions (not generations) on provenance.
- Detection and filtering. Useful, evadeable, never sufficient alone.
The deeper lesson is structural, which is why this note links into the-attackers-mindset-is-systems-thinking: the vulnerability isn’t in the model, it’s in the composition: model + tools + untrusted content sharing one channel.
Open question I’m tending: does this converge on something like an operating system’s user/kernel split for agents, or is the analogy misleading? Seeds welcome.
Paths that lead here
- Cloud IAM: measure blast radius, not policy count · The security of a cloud account isn't the sum of its policies; it's the reachability graph they create.
- The attacker's mindset is systems thinking · Attackers don't break rules; they discover that the rules compose differently than the designers believed.
Where this note points
- Cloud IAM: measure blast radius, not policy count · The security of a cloud account isn't the sum of its policies; it's the reachability graph they create.
- The attacker's mindset is systems thinking · Attackers don't break rules; they discover that the rules compose differently than the designers believed.