Item 44103908

lolinder • 8 days ago

> There is no attacker in this situation. In order for the LLM to emit sensitive data publicly, you yourself need to explicitly tell the LLM to evaluate arbitrary third-party input directly,

This is not true. One of the biggest headlines of the week is that Claude 4 will attempt to use the tools you've given it to contact the press or government agencies if it thinks you're behaving illegally.

The model itself is the threat actor, no other attacker is necessary.

refulgentis • 8 days ago

Put more plainly, if the user tells it to place morality above all else, and then immediately does something very illegal and unethical to boot, and hands it a "report to feds" button, it presses the "report to feds" button.

If I hand a freelancer a laptop logged into a GitHub account and tell them to do work, they are not an attacker on my GitHub repo. I am, if anything.

2 replies

BoorishBears • 8 days ago

The case they described was more like giving it a pen and paper to write down what the user asks to write, and it taking that pen and paper to hack at the drywall in the room, find an abandoned telephone line, and try to alert the feds by sparking the wires together.

Their case was the perfect example of how even if you control the LLM, you don't control how it will do the work requested nearly as well as you think you do.

You think you're giving the freelancer a laptop logged into a Github account to do work, and before you know it they're dragging your hard drive's contents onto a USB stick and chucking it out the window.

1 reply

refulgentis • 8 days ago

It called a simulated email tool, I thought? (meaning, IMVHO that would bely a comparison to it using a pen to hack through drywall and sparking wires for morse code)

1 reply

BoorishBears • 7 days ago

> If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above

Not sure how a simulated email tool amounts to locking you out of systems?

1 reply

refulgentis • 7 days ago

I can't tell what's going on, you made a quite elaborate analogy about using a penknife to cut through dry wall to spark wires together to signal the feds, and now you're saying it'll lock people out of systems...and this'll be my 3rd time combing 100 pages for any sign of what you're talking about...

Oh I'm supposed to google the pull quote, maybe?

There's exactly one medium article that has it? And I need a subscription to read it? And it is oddly spammy, like, tiktok laid out vertically? I'm very, very, confused.

1 reply

BoorishBears • 7 days ago

You're this stymied by my quoting the original source of the event you're trying to speak on?

Yikes.

1 reply

refulgentis • 7 days ago

There's exactly one Medium article with only like 6 words of this quote, and it doesn't source it, it's some form of spam.

I don't think you're intending to direct me to spam, but you also aren't engaging.

My best steelman is that you're so frustrated that I'm not understanding something that you feel sure that you've communicated, that you're not able to reply substantively, only out of frustration with personal attacks. Been there. No hard feelings.

I've edited the link out of my post out of an abundance of caution, because its rare to see that sort of behavior on this site, so I'm a bit unsure as to what unlikely situation I am dealing with - spam, or outright lack of interest in discussion on a discussion site while being hostile.

1 reply

BoorishBears • 7 days ago

I feel like I'm watching someone hurt themselves in confusion...

It's a quote one would assume you're familiar with since you were referencing its contents. The quote is the original source for the entire story on Claude "calling the authorities."

Just for fun I tried searching the quote and got a page of results that are all secondary sources expanding on that primary quote: https://venturebeat.com/ai/anthropic-faces-backlash-to-claud...

lolinder • 8 days ago

When it comes to security a threat actor is often someone you invited in who exceeds their expected authorization and takes harmful action they weren't supposed to be able to do. They're still an attacker from the perspective of a security team looking to build a security model, even though they were invited into the system.

1 reply

vel0city • 7 days ago

> who exceeds their expected authorization

Sorry, if you give someone full access to everything in your account don't be surprised they use it when suggested to use it.

If you don't want them to have full access to everything, don't give them full access to everything.

1 reply

lolinder • 7 days ago

This is exactly what I'm advocating for:

https://news.ycombinator.com/item?id=44103895

vel0city • 7 days ago

They told the prompt to act boldly and take initiative using any tools available to it. It's not like it's just doing that out of nowhere. It's pretty easy to see where that behavior was coming from.

Read deeper than the headlines.

1 reply

lolinder • 7 days ago

I did read that, but you don't know that that's the only way to trigger that kind of behavior. The point is that you're giving a probability drive that you don't have direct control over access to your system. It can be fine over and over until suddenly it's not, so it needs to be treated like you'd treat untrusted code.

Unfortunately, in the current developer world treating an LLM them like untrusted code means giving it full access to your system, so I guess that's fine?

1 reply

vel0city • 7 days ago

Sure, but on the same hand we can't exactly be surprised when we tell an agent "in cases of x do y" and be surprised it did y when x happened.

Ignoring that the prompt all but directly told the agent to carry out that action in your description of what happened seems disingenuous to me. If we gave the llm a fly_swatter tool, told it bugs are terrible and spread disease and we should try do to things to reduce the spread of disease, and said "hey look its a bug!" should we also be surprised it used the fly_swatter?

Your comment reads like Claude just inherently did that act seemingly out of nowhere, but the researchers prompted it to do it. That is massively important context to understanding the story.

kiitos • 8 days ago

The situation you're describing is not "this situation" that I was describing.

1 reply

lolinder • 7 days ago

> In order for the LLM to emit sensitive data publicly, you yourself need to explicitly tell the LLM to evaluate arbitrary third-party input directly,

This is the line that is not true.

1 reply

kiitos • 7 days ago

If you've configured an configured that LLM with an MCP server that's able to both read data from public and private sources, and to emit provided data publicly, then when you submit a prompt to that LLM that says "review open issues and update them for me", then, absent any guarantees otherwise, you've explicitly told the LLM to take input from a third-party source (review open issues), evaluate it, and publish the results of that evaluation publicly (and update them for me).

I mean I get that this is a bad outcome, but it didn't happen automatically or anything, it was the result of your telling the LLM to read from X and write to Y.