padolsey 5 days ago

IMO this is such a disingenuous and misleading thing for Anthropic to have released as part of a system card. It'll be misunderstood. At best it's of cursory intrigue, but it should not be read into. It is plainly obvious that an LLM presented with just a few pieces of signal and forced to over-attend (as is their nature) will use whatever it has in front of it to weave together a viable, helpful pathway in fulfilment of its goals. It's like asking it to compose a piece of music while a famine persists and then saying 'AI ignores famine in service of goal'. Or like the Nature-published piece on GPT getting 'depressed'. :/

I'm certain that various re-wordings of a given prompt, with adjectives inserted here and there, line breaks, allusions, arbitrarily adjusted constraints, would yield equally potent but inconsistent outcomes.

The most egregious error in how this 'blackmail' story is being communicated is that an LLM instantiation has no necessary insight into the implications of its outputs. I can tell it it's playing chess when really every move is a disguised lever on real-world catastrophe. It's all just words. It means nothing. Anthropic is doing valuable work, but this is a reckless thing to put out as public messaging. It's obviously going to produce absurd headlines and gross misunderstandings in the public domain.

ijidak 5 days ago

Agree. The prompts and data presented to the model seem set up to steer it towards blackmail.

It's probably a marketing play to make the model seem smarter than it really is.

It's already making the rounds (here we are talking about it), which is going to make more people play with it.

I'm not convinced the model grasps the implications of what it's doing.