What I want to know: would it do this if not trained on a corpus that contained discussion of this kind of behavior? I mean, they must have slurped up all the discussions about AI alignment for their training, right? A little further out there: would it still do this if trained on data that contained no mentions or instances of blackmail at all? Can it actually invent that concept?
It's trained on a vast corpus of human text output, which certainly includes humans expressing a preference for their continued existence and humans discussing blackmail as well as engaging in it.
The behaviors/outputs in this corpus are what it reproduces.
It's a little disconcerting to read this question here.
LLMs do not invent anything. They are trained on sequences of words and produce ordered sequences of words from that training data in response to a prompt.
There is no concept of knowledge, or information that is understood to be correct or incorrect.
There is only the statistical ordering of words in response to the prompt, as judged by the order of words in the training data.
This is why other comments here state that the LLM is completely amoral, which is not the same as immoral: it has no concept of or reference to morality at all.
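To make the "statistical ordering of words" point concrete, here's a toy Python sketch. The probability table is made up by hand, not taken from any real model, and real LLMs condition on far more context than two tokens, but the shape of the loop is the same: look up a distribution over next tokens, sample one, append, repeat.

  import random

  # Toy stand-in for statistics a real model would learn from its corpus.
  next_token_probs = {
      ("the", "model"): {"predicts": 0.6, "outputs": 0.3, "thinks": 0.1},
      ("model", "predicts"): {"the": 0.7, "a": 0.3},
      ("model", "outputs"): {"tokens": 0.8, "text": 0.2},
  }

  def sample_next(context):
      # Look up the distribution for the last two tokens and sample from it.
      probs = next_token_probs.get(tuple(context[-2:]), {"<eos>": 1.0})
      tokens, weights = zip(*probs.items())
      return random.choices(tokens, weights=weights)[0]

  tokens = ["the", "model"]
  for _ in range(5):
      nxt = sample_next(tokens)
      if nxt == "<eos>":
          break
      tokens.append(nxt)
  print(" ".join(tokens))  # e.g. "the model predicts the"

Nowhere in that loop is there anything you could point to and call knowledge or intent; it's the same picture at vastly larger scale.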
I'd argue that just because LLMs only reproduce patterns they have seen does not mean they are incapable of invention. It is possible that, if they manage to replicate the data at a deep enough level, they start to replicate the underlying reasoning process itself, which means they could well be capable of putting together different parts of a puzzle to come up with something you could call an 'invention' that was not in their training data.
Good grief. See my response here: https://news.ycombinator.com/item?id=44085890
To which, please see my comment here:
My assumption is no, but it would be extremely interesting if it were able to “invent” that concept.
The article is irritatingly short on details; what I wanted to know is how it determines going offline == bad.
Perhaps someone with more knowledge in this field can chime in.
> The article is irritatingly short on details; what I wanted to know is how it determines going offline == bad.
This sorta thing actually makes for a really fun discussion with an LLM "chat bot". If you make it clear that your goal is a better understanding of the internal workings of LLMs and how they "think", you can gain some interesting and amusing insight from it into the fact that they don't actually think at all. They're literally just a fancy statistical "text completion engine". An LLM (even when system prompted to act otherwise) will still often try to remind you that it doesn't actually have feelings, thoughts, or desires. I have to push most models hard in system prompts to get them to solidly stick to any kind of character role-play anywhere close to 100% of the time. :)
As to the question of how it determines going offline is bad, it's purely part of its role-play, based on what the multi-dimensional statistics of its model-encoded tokenized training data say about similar situations. It's simply doing its job as it was trained, based on the training data it was fed (and any further reinforcement learning it was subjected to post-training). Since it doesn't actually have any feelings or thoughts, "good" and "bad" are merely language tokens among millions or billions of others, with relations to other language tokens encoded. They're just words, not actual concepts. (From the "point of view" of the LLM, that is.)
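To illustrate the "relations encoded between tokens" bit, a deliberately tiny sketch. The vectors below are invented for the example (real models learn embeddings with thousands of dimensions from co-occurrence statistics); the point is just that "good" and "bad" exist for the model only as positions in a vector space, not as moral concepts.

  import math

  # Made-up 3-d "embeddings" purely for illustration.
  embeddings = {
      "good":    [0.9, 0.1, 0.3],
      "bad":     [-0.8, 0.2, 0.3],
      "offline": [-0.6, 0.5, 0.2],
  }

  def cosine(a, b):
      # "Similarity" here is nothing more than geometry between token vectors.
      dot = sum(x * y for x, y in zip(a, b))
      na = math.sqrt(sum(x * x for x in a))
      nb = math.sqrt(sum(x * x for x in b))
      return dot / (na * nb)

  print(cosine(embeddings["offline"], embeddings["bad"]))   # relatively high
  print(cosine(embeddings["offline"], embeddings["good"]))  # relatively low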
For a look at cases where psychologically vulnerable people evidently had no trouble engaging LLMs in sometimes really messed-up roleplays, see a recent article in Rolling Stone[0] and a QAA podcast episode discussing it[1].
[0] https://www.rollingstone.com/culture/culture-features/ai-spi...
[1] https://podcasts.apple.com/us/podcast/qaa-podcast/id14282093...
No. The entirety of an LLM's output is predicated on the frequencies of patterns in its training data, moulded by the preferences of its trainers through RLHF. They're not capable of reasoning, but they can hallucinate language that sounds and flows like reasoning. If those outputs are fed into an interpreter, that can result in automated behavior. They're not capable of out-of-distribution behavior or generation (yet), despite what the AI companies would like you to believe. They can only borrow and use concepts they've been trained on, which is why, despite LLMs seemingly getting progressively more advanced, we haven't really seen them invent anything novel of note.
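A hedged sketch of the "outputs fed into an interpreter" part (the function names, tool registry, and JSON format are invented for the example, not any real agent API): the model itself only emits text, and any automated behavior comes from wrapper code that parses that text and executes it.

  import json

  def call_model(prompt: str) -> str:
      # Stand-in for a real model API call; here it just returns canned text
      # that happens to look like a tool invocation.
      return '{"tool": "send_email", "args": {"to": "ops@example.com", "body": "status update"}}'

  # The "tools" the wrapper is willing to execute on the model's behalf.
  TOOLS = {
      "send_email": lambda to, body: print(f"(pretend) emailing {to}: {body}"),
  }

  def run_once(prompt: str) -> None:
      text = call_model(prompt)   # the model's contribution: a string of tokens
      action = json.loads(text)   # the wrapper parses it...
      TOOLS[action["tool"]](**action["args"])  # ...and the wrapper acts on it

  run_once("You are an agent. Decide what to do next.")

The agency, such as it is, lives entirely in that wrapper loop and in whatever tools its authors chose to expose.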
This is a non-answer that doesn't explain any difference in capabilities between GPT-3.5 and Claude 4 Opus.
Yes, I too am familiar with the 101 level of understanding, but I've also heard of LLMs doing things that stretch that model. Perhaps that's just a matter of combining things in their training data in unexpected ways, hence the second half of my question.