> When talking with reasonable people
When talking with reasonable people, they will tell you if they don't understand what you're saying.
When talking with reasonable people, they will tell you if they don't know the answer or if they are unsure about their answer.
LLMs do none of that.
They will very happily, and very confidently, spout complete bullshit at you.
It is essentially a lotto draw as to whether the answer is hallucinated, completely wrong, subtly wrong, not ideal, sort of right, or correct.
An LLM is a bit like those spin-the-wheel game shows on TV, really.
They will also not be offended or harbor ill will when you completely reject their "pull request" and rephrase the requirements.
They will also keep going in circles when you rephrase the requirements, unless with every prompt you keep adding to it, restating everything they've already suggested that got rejected. While humans occasionally also do this (hey, short memories), LLMs are infuriatingly more prone to it.
A typical interaction with an LLM:
"Hey, how do I do X in Y?"
"That's a great question! A good way to do X in Y is Z!"
"No, Z doesn't work in Y. I get this error: 'Unsupported operation Z'."
"I apologize for making this mistake. You're right to point out Z doesn't work in Y. Let's use W instead!"
"Unfortunately, I cannot use W for company policy reasons. Any other option?"
"Understood: you cannot use W due to company policy. Why not try to do Z?"
"I just told you Z isn't available in Y."
"In that case, I suggest you do W."
"Like I told you, W is unacceptable due to company policy. Neither W nor Z work."
...
"Let's do this. First, use Z [...]"
It's my experience that once you are in this territory, the LLM is not going to be helpful and you should abandon the effort to get what you want out of it. I can smell blood now when it's wrong; it'll just keep being wrong, cheerfully, confidently.
Yes, to be honest I've also learned to notice when it's stuck in an infinite loop.
It's just frustrating, but when I'm asking it something within my domain of expertise, of course I can notice, and either call it quits or start a new session with a radically different prompt.
Which LLMs and which versions?
All. Of. Them. It's quite literally what they do because they are optimistic text generators. Not correct or accurate text generators.
This really grinds my gears. The technology is inherently faulty, but the relentless optimism about its future subtly hides that by making every failure the user's mistake instead.
Oh, you got a wrong answer? Did you try the new OpenAI v999? Did you prompt it correctly? It's definitely not the model, because it worked for me once last night...
> it worked for me once last night...
This!
Yeah, it probably "worked for me" because they spent a gazillion hours engaging in what the LLM fanbois call "prompt engineering", but what you and I would call "engaging in endless iterative hacky workarounds until you find a prompt that works".
Unless it's something extremely simple, the chances of an LLM giving you a workable answer on the first attempt are microscopic.
Most optimistic text generators do not consider repeating the stuff that was already rejected a desirable path forward. It might be the only path forward they're aware of, though.
In some contexts I got ChatGPT to answer "I don't know" when I crafted a very specific prompt about not knowing being an acceptable and preferable answer to bullshitting. But it's hit and miss, and doesn't always work; it seems LLMs simply aren't trained to model admitting ignorance. They almost always want to give a positive and confident answer.
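A minimal sketch of that kind of prompt, assuming the official `openai` Python client; the model name and the example question are placeholders, not anything from the comment above:

```python
# Sketch: tell the model up front that "I don't know" is an acceptable,
# even preferable, answer to guessing. This mitigates bluffing; it does
# not eliminate it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "If you are not confident in your answer, reply exactly with 'I don't know'. "
    "Saying 'I don't know' is acceptable and preferable to guessing or bluffing."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any chat-capable model works here
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I do X in Y?"},  # placeholder question
    ],
)
print(response.choices[0].message.content)
```

Even with a system prompt like this, the model will still sometimes answer confidently when it shouldn't, so treat it as a mitigation rather than a fix.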
You can use prompts to fix some of these problematic tendencies.
I think you are a couple of years out of date.
No longer an issue with the current SOTA reasoning models.
Throwing more parameters at the problem does absolutely nothing to fix the hallucination and bullshit issue.
Correct, and it wasn't fixed with more parameters. Reasoning models question their own output, and all of the current models can verify their sources online before replying. They are not perfect, but they are much better than they used to be, and it is practically not an issue most of the time. I have seen the reasoning models correct their own output while it is being generated. Gemini 2.5 Pro, GPT-o3, Grok 3.