Prompt injection is an untrusted-input problem wearing a new costume

The oldest sin in software is mixing instructions with data on the same channel. SQL injection, XSS, format strings: different decades, same shape. Prompt injection is that shape again, with one cruel twist: there is no parser to fix.

In SQL we solved injection with prepared statements, a structural boundary the attacker’s data cannot cross. An LLM has no equivalent boundary. The system prompt, the user’s question, and the malicious text inside a fetched webpage all arrive as the same kind of thing: tokens.

What seems to actually help

My current (revisable!) ranking of defenses, strongest first:

Capability confinement. Assume injection succeeds; make the blast radius boring. The agent that can only read public docs can be fully hijacked and still do no harm. This is the same instinct as cloud-iam-blast-radius: constrain identity, not content.
Trust labeling at the boundary. Track which context came from where, and gate actions (not generations) on provenance.
Detection and filtering. Useful, evadeable, never sufficient alone.

The deeper lesson is structural, which is why this note links into the-attackers-mindset-is-systems-thinking: the vulnerability isn’t in the model, it’s in the composition: model + tools + untrusted content sharing one channel.

Open question I’m tending: does this converge on something like an operating system’s user/kernel split for agents, or is the analogy misleading? Seeds welcome.

Prompt injection is an untrusted-input problem wearing a new costume

What seems to actually help

Paths that lead here

Where this note points