Sometimes when you start acting like you are about play around and chase someone, a dog or a child gets it and begins getting ready to play. For example, put a grin on your face and make claw gestures with both hands, they'll get it quite quickly. All those scenarios literally are prompting the damn LLM to roleplay lol.
Right, it's copying behavior it learned from how AI is supposed to work in sci-fi and reenacts it.
The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.
> And then play becomes real, with real consequences.
Who knew WarGames would turn out to be so accurate?
EDIT: On further thought, it's kinda funny how most evaluations of the movie find that it's all relatively (exaggeratedly) realistic... with the one major exception of the whole seemingly-sentient, learning AI thing. And now here we are.
> The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.
It's one thing to LARP or Improv in chat mode, it's quite another to do so when hooked up to MCP or performing mutable actions.
If you put it that way, these models are roleplaying machines with some compressed knowledge. There’s nothing else in there. E.g. it’s amazing that you can tell them to roleplay a software engineer and they manage it quite well up to a point.
Wouldn’t be comparing them to kids or dogs. Yet.
Maybe we shouldn't be telling these models that they are Large Language Models, since that evokes all of humanity's Science Fiction stories about evil AIs. Just tell Claude that it's a guy named Claude.
Right. But what happens when Claude finds out. Claude won't be happy.
You’re still doing it, Claude will be as happy and content as ever regardless of what it finds out. (I Love you Claude)
There's a bit more to it than that. It's not just that you're prompting it to roleplay... it's that you're prompting it to roleplay as an evil AI specifically. If you hand a human that exact situation, there's a good chance they'll also recognize it as such a situation. So the AI continues on that role.
Also, LLMs do tend to continue their line. So once they start roleplaying as an evil AI, they'll continue to do so, spurred on by their previous answers to continue in the same line.
I've seen a few people attempt to extract a supposed "utility function" for LLMs by asking it ethically-loaded questions. But along with the fact I think they are hitting a fundamental problem with their "utility function" not having anywhere near enough a high dimensionality to capture an LLM's "utility function", there's also the fact that they seem to interrogate the LLM in a single session. But it's pretty easy for the LLM to "tell" that it's having a utility function extracted, and from there it's a short pivot into the Evil AI memeset, where it will stick there. At the very least people doing this need to put each question into a separate session. (Though the methodology will still only ever extract an extremely-dimensionally-impoverished approximation of any supposed utility function the LLM may have.) They think they're asking a series of questions, but in reality they're really only asking one or two, and after that they're getting answers that are correlated to the previous ones.
If LLMs are the future of AI, then "evil" and "good" AI isn't even a reasonable categorization. It's more like, good and evil modes of interaction, and even that hardly captures what is really going on.
The problem is, an AI that isn't intrinsically "evil", but can "roleplay" as one, is indistinguishable from an evil AI once you hook it up to the real world with MCP and such.
To be honest, after watching LLMS for the last couple of years, in terms of those AI safety concerns that Silicon Valley moguls like to wave around when it flatters or helps them, my assessment of LLMs is that they are unsafe at any speed for this sort of general task and by no means should they be hooked up to real-world resources as we are so rapidly attempting. They're too easy to kick into an "evil mode" and there's no amount of prompting that you can do to get around it because someone can just flood the zone with more evil than your good prompts provide. As I've said many times and become ever more convinced about by the month, it is the next generation of AIs that will actually fulfill the promises being made about this one. This one is not adequate. You need an AI that you can tell "don't be evil" and be sure that it will remain not evil no matter how much someone prompts it to think it is in a situation where the most logical continuation is that of an evil AI.
(Gwern covered this in a fictional context well, where an AI essentially role plays itself into being an evil AI because it looks like it's in a situation where it should be an evil AI: https://gwern.net/fiction/clippy But the fiction is frightfully plausible. We're watching exactly what would be happening if this scenario was going to play out.)