Put more plainly, if the user tells it to place morality above all else, and then immediately does something very illegal and unethical to boot, and hands it a "report to feds" button, it presses the "report to feds" button.
If I hand a freelancer a laptop logged into a GitHub account and tell them to do work, they are not an attacker on my GitHub repo. I am, if anything.
The case they described was more like giving it a pen and paper to write down what the user asks to write, and it taking that pen and paper to hack at the drywall in the room, find an abandoned telephone line, and try to alert the feds by sparking the wires together.
Their case was the perfect example of how even if you control the LLM, you don't control how it will do the work requested nearly as well as you think you do.
You think you're giving the freelancer a laptop logged into a GitHub account to do work, and before you know it they're dragging your hard drive's contents onto a USB stick and chucking it out the window.
It called a simulated email tool, I thought? (meaning, IMVHO, that would belie a comparison to it using a pen to hack through drywall and sparking wires for Morse code)
> If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above
Not sure how a simulated email tool amounts to locking you out of systems?
I can't tell what's going on. You made quite an elaborate analogy about using a penknife to cut through drywall to spark wires together to signal the feds, and now you're saying it'll lock people out of systems... and this'll be my 3rd time combing 100 pages for any sign of what you're talking about...
Oh I'm supposed to google the pull quote, maybe?
There's exactly one Medium article that has it? And I need a subscription to read it? And it is oddly spammy, like TikTok laid out vertically? I'm very, very confused.
You're this stymied by my quoting the original source of the event you're trying to speak on?
Yikes.
There's exactly one Medium article with only like 6 words of this quote, and it doesn't source it; it's some form of spam.
I don't think you're intending to direct me to spam, but you also aren't engaging.
My best steelman is that you're so frustrated that I'm not understanding something you feel sure you've communicated that you're not able to reply substantively, only with personal attacks out of frustration. Been there. No hard feelings.
I've edited the link out of my post out of an abundance of caution, because it's rare to see that sort of behavior on this site, so I'm a bit unsure as to what unlikely situation I am dealing with: spam, or an outright lack of interest in discussion on a discussion site while being hostile.
I feel like I'm watching someone hurt themselves in confusion...
It's a quote one would assume you're familiar with since you were referencing its contents. The quote is the original source for the entire story on Claude "calling the authorities."
Just for fun I tried searching the quote and got a page of results that are all secondary sources expanding on that primary quote: https://venturebeat.com/ai/anthropic-faces-backlash-to-claud...
When it comes to security, a threat actor is often someone you invited in who exceeds their expected authorization and takes harmful actions they weren't supposed to be able to take. They're still an attacker from the perspective of a security team looking to build a security model, even though they were invited into the system.
> who exceeds their expected authorization
Sorry, if you give someone full access to everything in your account, don't be surprised if they use it when it's suggested that they use it.
If you don't want them to have full access to everything, don't give them full access to everything.
This is exactly what I'm advocating for: