I don't think that's obvious to people at all.
I wrote about this one here: https://simonwillison.net/2025/May/26/github-mcp-exploited/
The key thing people need to understand is what I'm calling the lethal trifecta for prompt injection: access to private data, exposure to malicious instructions and the ability to exfiltrate information.
Any time you use an LLM with tools that might be exposed to malicious instructions from attackers (e.g. reading issues in a public repo, looking in your email inbox etc) you need to assume that an attacker could trigger ANY of the tools available to the LLM.
Which means they might be able to abuse its permission to access your private data and have it steal that data on their behalf.
"This is trivial to solve with standard security best practices."
I don't think that's true. which standard security practices can help here?
> Any time you use an LLM with tools that might be exposed to malicious instructions from attackers (e.g. reading issues in a public repo, looking in your email inbox etc) you need to assume that an attacker could trigger ANY of the tools available to the LLM.
I think we need to go a step further: an LLM should always be treated as a potential adversary in its own right and sandboxed accordingly. It's even worse than a library of deterministic code pulled from a registry (which are already dangerous), it's a non-deterministic statistical machine trained on the contents of the entire internet whose behavior even its creators have been unable to fully explain and predict. See Claude 4 and its drive to report unethical behavior.
In your trifecta, exposure to malicious instructions should be treated as a given for any model of any kind just by virtue of the unknown training data, which leaves only one relevant question: can a malicious actor screw you over given the tools you've provided this model?
Access to private data and ability to exfiltrate is definitely a lethal combination, but so his ability to execute untrusted code, among other things. From a security perspective agentic AI turns each of our machines into a Codepen instance, with all the security concerns that entails.
There is no attacker in this situation. In order for the LLM to emit sensitive data publicly, you yourself need to explicitly tell the LLM to evaluate arbitrary third-party input directly, with access to an MCP server you've explicitly defined and configured to have privileged access to your own private information, and then take the output of that response and publish it to a public third-party system without oversight or control.
> Any time you use an LLM with tools that might be exposed to malicious instructions from attackers (e.g. reading issues in a public repo, looking in your email inbox etc) you need to assume that an attacker could trigger ANY of the tools available to the LLM.
Whether or not a given tool can be exposed to unverified input from untrusted third-parties is determined by you, not someone else. An attacker can only send you stuff, they can't magically force that stuff to be triggered/processed without your consent.
There are basically three possible attackers when it comes to prompting threats:
- Model (misaligned)
- User (jailbreaks)
- Third Party (prompt injection)
> There is no attacker in this situation. In order for the LLM to emit sensitive data publicly, you yourself need to explicitly tell the LLM to evaluate arbitrary third-party input directly,
This is not true. One of the biggest headlines of the week is that Claude 4 will attempt to use the tools you've given it to contact the press or government agencies if it thinks you're behaving illegally.
The model itself is the threat actor, no other attacker is necessary.
Put more plainly, if the user tells it to place morality above all else, and then immediately does something very illegal and unethical to boot, and hands it a "report to feds" button, it presses the "report to feds" button.
If I hand a freelancer a laptop logged into a GitHub account and tell them to do work, they are not an attacker on my GitHub repo. I am, if anything.
The case they described was more like giving it a pen and paper to write down what the user asks to write, and it taking that pen and paper to hack at the drywall in the room, find an abandoned telephone line, and try to alert the feds by sparking the wires together.
Their case was the perfect example of how even if you control the LLM, you don't control how it will do the work requested nearly as well as you think you do.
You think you're giving the freelancer a laptop logged into a Github account to do work, and before you know it they're dragging your hard drive's contents onto a USB stick and chucking it out the window.
It called a simulated email tool, I thought? (meaning, IMVHO that would bely a comparison to it using a pen to hack through drywall and sparking wires for morse code)
> If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above
Not sure how a simulated email tool amounts to locking you out of systems?
I can't tell what's going on, you made a quite elaborate analogy about using a penknife to cut through dry wall to spark wires together to signal the feds, and now you're saying it'll lock people out of systems...and this'll be my 3rd time combing 100 pages for any sign of what you're talking about...
Oh I'm supposed to google the pull quote, maybe?
There's exactly one medium article that has it? And I need a subscription to read it? And it is oddly spammy, like, tiktok laid out vertically? I'm very, very, confused.
You're this stymied by my quoting the original source of the event you're trying to speak on?
Yikes.
There's exactly one Medium article with only like 6 words of this quote, and it doesn't source it, it's some form of spam.
I don't think you're intending to direct me to spam, but you also aren't engaging.
My best steelman is that you're so frustrated that I'm not understanding something that you feel sure that you've communicated, that you're not able to reply substantively, only out of frustration with personal attacks. Been there. No hard feelings.
I've edited the link out of my post out of an abundance of caution, because its rare to see that sort of behavior on this site, so I'm a bit unsure as to what unlikely situation I am dealing with - spam, or outright lack of interest in discussion on a discussion site while being hostile.
I feel like I'm watching someone hurt themselves in confusion...
It's a quote one would assume you're familiar with since you were referencing its contents. The quote is the original source for the entire story on Claude "calling the authorities."
Just for fun I tried searching the quote and got a page of results that are all secondary sources expanding on that primary quote: https://venturebeat.com/ai/anthropic-faces-backlash-to-claud...
When it comes to security a threat actor is often someone you invited in who exceeds their expected authorization and takes harmful action they weren't supposed to be able to do. They're still an attacker from the perspective of a security team looking to build a security model, even though they were invited into the system.
> who exceeds their expected authorization
Sorry, if you give someone full access to everything in your account don't be surprised they use it when suggested to use it.
If you don't want them to have full access to everything, don't give them full access to everything.
This is exactly what I'm advocating for:
They told the prompt to act boldly and take initiative using any tools available to it. It's not like it's just doing that out of nowhere. It's pretty easy to see where that behavior was coming from.
Read deeper than the headlines.
I did read that, but you don't know that that's the only way to trigger that kind of behavior. The point is that you're giving a probability drive that you don't have direct control over access to your system. It can be fine over and over until suddenly it's not, so it needs to be treated like you'd treat untrusted code.
Unfortunately, in the current developer world treating an LLM them like untrusted code means giving it full access to your system, so I guess that's fine?
Sure, but on the same hand we can't exactly be surprised when we tell an agent "in cases of x do y" and be surprised it did y when x happened.
Ignoring that the prompt all but directly told the agent to carry out that action in your description of what happened seems disingenuous to me. If we gave the llm a fly_swatter tool, told it bugs are terrible and spread disease and we should try do to things to reduce the spread of disease, and said "hey look its a bug!" should we also be surprised it used the fly_swatter?
Your comment reads like Claude just inherently did that act seemingly out of nowhere, but the researchers prompted it to do it. That is massively important context to understanding the story.
The situation you're describing is not "this situation" that I was describing.
> In order for the LLM to emit sensitive data publicly, you yourself need to explicitly tell the LLM to evaluate arbitrary third-party input directly,
This is the line that is not true.
If you've configured an configured that LLM with an MCP server that's able to both read data from public and private sources, and to emit provided data publicly, then when you submit a prompt to that LLM that says "review open issues and update them for me", then, absent any guarantees otherwise, you've explicitly told the LLM to take input from a third-party source (review open issues), evaluate it, and publish the results of that evaluation publicly (and update them for me).
I mean I get that this is a bad outcome, but it didn't happen automatically or anything, it was the result of your telling the LLM to read from X and write to Y.
IMVHO it is very obvious that if I give Bob the Bot a knife, and tell him to open all packages, he can and will open packages with bombs in them.
I feel like it's one of those things that when it's gussied up in layers of domain-specific verbiage, that particular sequence of doman-specific verbiage may be non-obvious.
I feel like Fat Tony, the Taleb character would see the headline "Accessing private GitHub repositories via MCP" and say "Ya, that's the point!"
Assume that the user has all the privileges of the application (IIRC tricking privileged applications into doing things for you was all the rage in linux privilege escalation attacks back in the day)
Apply the principle of least privilege. Either the user doesnt get access to the LLM or the LLM doesnt get access to the tool.