What I find really strange about this is that I use AI a lot as a “smart friend” to work through explanations of things I find difficult. I'm currently preparing for some exams, so I'll often give the AI a document and ask for supporting resources to take the subject further, and it almost always produces something that is plausibly close to a real thing but wrong in the specifics. When you ask for a reference, it is almost invariably a hallucination. So it just amazes me that anyone would stick that in a brief and ship it without checking it even more carefully than they would check the work of a human underling (which they should obviously also check for something this important).
For example, yesterday I got a list of study resources for abstract algebra. Claude referred me to a series by Benedict Gross (which is excellent, btw). It gave me a link to Harvard’s website, but it was a 404, and it was only with further searching that I found the real thing. It also suggested a YouTube playlist by Socratica (again, this exists, but the URL was wrong) and one by Michael Penn (same deal).
Literally every reference was almost right but actually wrong. How does anyone have the confidence to ship a legal brief that an AI produced without checking it thoroughly?
I think it's easy to understand why people overestimate the accuracy and performance of LLM-based output: it's currently being touted as the replacement for human labor in a large number of fields. Outside of software development there are fewer optimistic skeptics and far fewer nuanced takes on the tech.
Casually scrolling through TechCrunch I see over $1B in very recent investments into legal-focused startups alone. You can't push the messaging that the technology to replace humans is here and expect people will also know intrinsically that they need to do the work of checking the output. It runs counter to the massive public rollout of these products which have a simple pitch: we are going to replace the work of human employees.
I take the charitable view that some high-profile people painted themselves into a corner very publicly. I think they estimated that they could work out the kinks as they went, but it's becoming apparent that the large early gains were not sustainable. And now there appear to be some very fundamental limitations to what this architecture can achieve, and everyone involved has basically no option other than to keep doubling down. I expect this to blow up spectacularly pretty soon.
People are lazy. I’m enrolled in a language class in a foreign country right now - so presumably the people taking that class want to actually get good at the language so they can live their lives here - yet a significant portion of students just turn in ChatGPT essays.
And I don’t mean essays edited with ChatGPT, but essays that are clearly verbatim output. When the teacher asks the students to read them out loud to the class, they stumble over words and grammar that are obviously way beyond anything we’ve studied. The utter lack of self-awareness is both funny and really sad.
I’m a programmer who speaks English and Japanese as foreign languages. My colleagues have access to LLMs, and before that Google Translate and DeepL. Yet I am still asked a lot for my cultural opinion on how to approach the Japanese side, even in English. Perhaps this serves as a hint that machine translation is not everything.
There are a lot of shit tier lawyers who are just in it for the money and just barely passed their exams. Given his notoriety, Lindell is scraping the bottom of the barrel with people willing to provide legal services.
Could it be that they just have to attend the class for technical reasons? Also - once the gadgets can translate for free in real time ... you can live in places where you don't speak the language, so maybe they are just prepping for that.
LLMs were originally designed for translation, so it makes sense. We have basically eliminated the need to learn foreign languages for day-to-day use anyway; it's only helpful for high-level professional tasks, close literary study, or prestige.
That’s such a ridiculous statement.
Yep.
> [learning a language is] only helpful for high-level professional tasks, close literary study, or prestige.
This is a person who doesn't understand actual face-to-face communication, like, at all. Even though translation apps are amazing, in a social interaction, there's no getting over the imposition of the halting, hesitant back-and-forth of device-assisted translation. Sure, you can almost always eventually get your point across, but you're never going to set the other party at ease in the same way as speaking their language yourself.
Sure, but how often does that happen in most people’s lives, especially in the US?
When they’re on vacation? Very few people are going to learn a language that they could use for a week or two in a place where people probably speak English better than whatever language you’re attempting anyways.
Obviously there are exceptions.
What this story tells us more than anything is that Lindell cannot convince a competent lawyer to defend him, so what he gets instead are clownshod phonies. Either he’s out of cash, or he’s such a terrible client that nobody with a shred of professional responsibility will take him.
This. And only this.
Throughout the year, my 8th grade social studies teacher would give us lists of things and phrases by decade. Loved those assignments. Totally threw myself into them. Years later, I was working at a prep school and wanted the students to be assigned Billy Joel's We Didn't Start the Fire as a similar assignment (competition?). The teachers thought it was stupid, that the kids wouldn't like it (um, who cares?), and that it was too hard. Their responses only confirmed what I thought about the school (meh) and that it was a sad day when curiosity was a bad thing. (I was the fundraiser for the school so I didn't really interact with the kids a lot, but the ones I knew would have had fun with such a project.)
Anyway, the Lindell lawyers must have gone to this school, or one like it. How is it ever okay to do this and think it's a good idea? And, how the heck did these people pass the Bar?
Edit: List of references in We Didn't Start the Fire https://en.wikipedia.org/wiki/List_of_references_in_We_Didn%.... Gonna blog the references and this post on my little policy blog this week :-)
If you enjoyed Billy Joel’s original, your college-level Medieval History classes are gonna love Hildegard von Blingin’!
> How does anyone have the confidence to ship a legal brief that an AI produced without checking it thoroughly?
It has likely never occurred to them that such checks are necessary. Why would it, if they've never performed such checks and haven't happened to be warned by AI critics?
> For example, yesterday I got a list of study resources for abstract algebra. Claude referred me to a series by Benedict Gross (which is excellent, btw). It gave me a link to Harvard’s website, but it was a 404, and it was only with further searching that I found the real thing. It also suggested a YouTube playlist by Socratica (again, this exists, but the URL was wrong) and one by Michael Penn (same deal).
FWIW, I've found Penn's content to be quite long-winded and poorly edited. The key idea being presented often makes up hardly any of the video's runtime, so I'm just sitting there watching the guy actually write out the steps solving an equation (and making trivial errors, and not always correcting them).
I asked ChatGPT to give Wikipedia links in a table. Not one of the 50+ links was valid.
Which version of GPT? I've found that 4o has actually been quite good at this lately, rarely hallucinating links any more.
Just two days ago, I gave it a list of a dozen article titles from a newspaper website (The Guardian), asked it to look up their URLs and give me a list, and to summarise each article for me, and it made no mistakes at all.
Maybe your task was more complicated in some way, maybe you're not paying for ChatGPT and are on a less capable model, or maybe it's a question of learning how to prompt - I don't know. I just know that for me it's gone from "assume sources cited are bullshit" to "verify each one still, but they're usually correct".
> asked it to look up their URLs and give me a list
Something missing from this conversation is whether we're talking about the raw model or model+tool calls (search). This sounds like tool calls were enabled.
And I do think this is a sign that the current UX of the chatbots is deeply flawed: even on HN we don't seem to interact with the UI toggles for these features often enough for them to be the intuitive answer; instead we still talk about model classes as though that makes the biggest difference in accuracy.
Ah, yes you're right - I didn't clarify this in my original comment, but my anecdote was indeed the ChatGPT interface and using its ability to browse the web[#], not expecting it to pull URLs out of its original training data. Thanks for pointing that out.
But the reason I suggested model as a potential difference between me and the person I replied to, rather than ChatGPT interface vs. plain use of model without bells and whistles, is that they had said their trouble was while using ChatGPT, not while using a GPT model over the API or through a different service.
[#] (Technically I didn't, and never do, have the "search" button enabled in the chat interface, but it's able to search/browse the web without that focus being selected.)
Right, but ChatGPT doesn't always automatically use search. I don't know what mechanisms it uses to decide whether to turn that on (maybe free accounts vs paid makes a difference?) but I rarely see it automatically turn on search, it usually tries to respond directly from weights.
And on the flip side, my local Llama 3 8b does a pretty good job at avoiding hallucinations when it's hooked up to search (through Open WebUI). Search vs no-search seems to me to matter far more than model class.
I'm just specific in my prompting, rather than letting it decide whether or not to search.
These models aren't (yet, at least) clever enough to understand what they do or don't know, so if you're not directly telling them when you want them to go and find specific info rather than guess at it, you're just asking a mystic with a magic ball.
It doesn't add much to the length of prompts, just a matter of getting in the habit of wording things the right way. For the request I gave as my example a couple of comments above, I wrote "Please search for every one of the Guardian articles whose titles I pasted above and give me a list of URLs for them all." whereas if you write "Please tell me the URLs of these Guardian articles" then it may well act as if it knows them already and return bullshit.
Definitely more complicated. I've been playing around with using it to analyze historical data and using it to generate charts. And yes I've tried many different kinds of phrasing. I have experience working with and writing rules based "expert systems" and have a vague idea of how neural networks are used for image recognition. It's a pretty fun game to get useful information out of ChatGPT.
You cannot ask it to have crop yield as a column in a chart and get accurate information.
It only seems reasonable when doing a single list of items. Ask it for two columns of data and it starts making things up. Like bogus Wikipedia links.
You could definitely make the argument I'm using it wrong but this is how people try to use it. I still find this useful because it gives me a start on where to point my research or ask clarifying questions.
It's much better at giving you a list of types of beer and wine that's been produced in history. Just don't trust any of the dates.
If you could share the actual prompts & info you wanted, I would be curious to try it and see if it is indeed too complex for it or if prompting differently would work better. I've had it produce tables with multiple columns, pulling info from different sources for different columns, so that's definitely not a hard limit... I'd be happy to come back to you either with advice on how to do it next time, or with agreement that, having tried it myself, it is indeed ChatGPT and not your prompting that was the problem.
Prompt:
I would like a list of east Indiamen from 1750 to 1800 where you can find how many tons burthen and how many crew. Show as a chart and give me the wikipedia links to the ships. Do not include any ships that do not have wikipedia links.
Here's my customization:
What do you do?:
Software Engineer
What traits should ChatGPT have?:
Show all the options
Be practical above all.
Anything else ChatGPT should know about you?:
I’m an author of science fiction and fantasy.
I like world building for stories.
I know there are hundreds of ways to phrase this, and I could probably trick it into generating the chart first and finding the Wikipedia links second. :) I can’t decide whether I’m more tempted to feed it “Using Metadata to Find Paul Revere” as a prompt or to try to see if it identifies Obra Dinn as an East Indiaman.
Sorry for going off topic here but I've had the same experience.
I'm not sure which update improved 4o so greatly but I get better responses from 4o than from o4-mini, o4-mini-high, and even o3. o4 and o3 have been disappointing lately - they have issues understanding intent, they have issues obeying requests, and it happened multiple times that they forgot the context even though the conversation consisted of only 4 messages without a huge number of tokens. In terms of chain-of-thought models I prefer DeepSeek over any OpenAI model (4.5 research seems great, but it’s just way too expensive).
It's rather disappointing how OpenAI releases new models that seem incredible, and then, to reduce the cost of running them, they slowly slim these models down until they're just not that good anymore.
No need for the apology, and FYI I broadly agree with everything you say (except about 4.5, which I don't actively disagree with I just haven't played with it myself).
> How does anyone have the confidence to ship a legal brief that an AI produced without checking it thoroughly?
Because its makers don't care about precision or correctness. They care only about convincing the people that matter that gaping software bugs are "hallucinations" that can never be fixed 100%, and that that is an acceptable outcome.
For every story about AI-assisted legal briefs with tons of mistakes, there will be 100 PR-driven pieces about how $LATEST_VERSION has passed the bar exam or discovered a miracle drug. There will never be a story about AI successfully arguing a real case, however, because the main goal was only ever selling a vision of labor replacement. Whether or not the replacement can do the job as specified is immaterial.
The media really hypes up the capabilities, so if you're new to it and it spews out something that looks very detailed and plausible, you just think "wow, it worked". Such people have no instinct for the failure modes either. Their reference points would be a paralegal or a computer search tool, where you would only really expect errors of omission: the paralegal wants to keep their job, and search cannot find things that don't exist. In that frame of mind, when you see that the returned document seems to cover the relevant points and makes sense when you skim it, it seems like job done. The public doesn't get that the LLM will just completely bald-faced make stuff up.
Everything you’ve said is correct. Now picture a quiet spread of subtle defects seeping through countless codebases, borne on the euphoria of GenAI-driven “productivity”. When those flaws surface, the coming AI winter will be long and bitter.
> How does anyone have the confidence to ship a legal brief that an AI produced without checking it thoroughly?
They're treating it like they would a paralegal. Typically this means giving a research task and then using their results, but sometimes lawyers will just have them write documents and ship it, so to speak.
This is making me realize that tech bros treat ChatGPT like the 1930s secretary they never got to have.
I use it in much the same way as you, and it's been extremely beneficial. But I also would not dream of signing my name on something that has been independently produced by AI, it's just too often blatantly wrong on specifics.
I think people who do are simply not aware that AI is not deterministic the same way a calculator is. I would feel entirely safe signing my name on a mathematical result produced by a calculator (assuming I trusted my own input).
LLMs are deterministic [0]. An LLM is a pure function that takes a list of tokens and returns a set of token probabilities. To make it "chat" you use the generated probabilities to pick a token, append that token to the list, and run the LLM again. Any randomness is introduced by the external component that picks a token using the probabilities: the sampler. Always picking the most likely token is a valid strategy.
The problem is that all output is a "hallucination", and only some of it coincidentally matches the truth. There's no internal distinction between hallucination and truth.
[0] Theoretically; race conditions in a parallel implementation could add non-determinism.
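To make that loop concrete, here's a rough Python sketch of the idea (purely illustrative: `logits_fn` is a made-up stand-in for the model, not any real library's API). With temperature 0 the sampler just takes the argmax, so the whole pipeline is a deterministic function of the input tokens; any randomness enters only at the sampling step.

    # Minimal sketch of the loop described above (illustrative only).
    # `logits_fn` is a hypothetical stand-in for the model: a pure function
    # mapping a token sequence to a {token: probability} distribution.
    import math
    import random

    def sample_next(probs, temperature=0.0):
        # temperature == 0 -> greedy: always pick the most likely token (deterministic)
        if temperature == 0:
            return max(probs, key=probs.get)
        # temperature > 0 -> reweight as p^(1/T) and sample; this is where
        # the apparent randomness of a chat session comes from
        weights = {t: math.exp(math.log(p) / temperature) for t, p in probs.items() if p > 0}
        total = sum(weights.values())
        r = random.random() * total
        acc = 0.0
        for token, w in weights.items():
            acc += w
            if acc >= r:
                return token
        return token  # floating-point fallback

    def generate(logits_fn, tokens, max_new=50, temperature=0.0, eos="<eos>"):
        # Run the pure model, pick a token, append it, and repeat.
        for _ in range(max_new):
            probs = logits_fn(tokens)
            nxt = sample_next(probs, temperature)
            if nxt == eos:
                break
            tokens.append(nxt)
        return tokens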
True, though in practice speed optimizations and numerical instabilities on the GPU often make LLMs quite non-deterministic.
Which doesn't detract from your main point: there's not a lot of distinction between hallucinations and what we'd consider to be the "real thing." There have been various attempts to measure hallucinations, and we can figure out things like how confident the model is in a particular answer...but there's nothing grounding that answer. Saturate the dataset with the wrong answer and you'll get an overconfident wrong result.
While this is technically correct, everyday use of LLMs involves a non-zero temperature, so they (the whole package that people think of as “AI”) are non-deterministic in practice.
No, hallucinations occur when the LLM is missing information.
That's not correct, and it seems to be based on a common misunderstanding of how LLMs work: the rough idea that when the info the model is being asked for was in the training data, it "looks it up", not unlike software looking up info in a huge database of general knowledge, and that when that lookup fails it falls back to making stuff up. But that's wrong: the models are doing the exact same thing when they're hallucinating as when they're correct, just with a different result.
Hallucinations happen when the string of tokens the model judges most likely turns out to contain incorrect information. That's the case regardless of whether the correct information is "missing" from the training data, or whether the correct information actually would have been output had the model, when selecting the first token of the response, picked the option it considers second best rather than best.
Whether or not a piece of information was in the training set can obviously influence the likelihood of a model hallucinating when asked about the subject, but it can easily hallucinate about stuff that was in the training data, and it can also get things right that weren't in the training data.
If an LLM happens to know the answer to your question, that answer will have the greatest weight, and will therefore become a non-hallucinated output. Otherwise the output will be hallucinated. Note that a hallucination may manifest as an attempt to extrapolate, which may be successful. If you query an LLM with prior knowledge that the LLM doesn’t know the answer, you are guaranteed to receive a hallucinated output.
Or at least this is how I interpret the term.
But that's not how they actually work.
> "If an LLM happens to know the answer to your question, that answer will have the greatest weight"
An LLM doesn’t “know” anything in the way you’re imagining. It doesn’t have stored facts or indexed knowledge to check against; it just has weights learned between token sequences, and it outputs whatever next token is assigned the highest probability given the prompt and prior context. That might happen to produce a correct answer (and people are obviously working hard to make the models produce right answers as often as possible), but it might just as easily produce a plausible-sounding but wrong one, even if the correct information was in the training data. That correct information being there doesn't guarantee it will have the highest weighting at all, let alone the highest weighting in all contexts of previous tokens and at all temperature settings.
You’re right that hallucinations can sometimes look like “extrapolations” that happen to land correctly, but that’s incidental. It’s still doing the same token-by-token probability selection regardless of whether it ends up right or wrong.
Framing it around “missing knowledge” vs “existing knowledge” is misleading intuition. It’s better to think about it in terms of probability distributions over token sequences: the model’s training biases it toward correct sequences more often than incorrect ones, but there’s nothing fundamental in the architecture that guarantees that if the answer was present in training, it will always beat out wrong guesses.
p.s. It's late at night here and I'm about to go to bed, so apologies if I've not explained well in this comment - I gave it to ChatGPT hoping it could tidy things up for me and it just made a way more confusing version so I'm posting it as is :D Let me know if my explanation still isn't clear and I could try again, or answer any questions you have, tomorrow
> An LLM doesn’t “know” anything in the way you’re imagining. It doesn’t have stored facts or indexed knowledge to check against
Neither does your brain and yet you do "know" something.
> but it might just as easily produce a plausible-sounding but wrong one, even if the correct information was in the training data
If the majority of information that was in the LLM's training data said 1 + 1 = 3, the LLM will tell you that 1 + 1 = 3, even if there was some information that said 1 + 1 = 2, and there's nothing wrong with that because the LLM is not supposed to fact-check.
> the model’s training biases it toward correct sequences more often than incorrect ones
No, the model's training biases it toward sequences that appear more frequently.
It's trivial to prove this is wrong: invert relationships it knows about and it fails to answer based on knowledge it previously demonstrated (even with loads of hints)
https://chatgpt.com/share/680dc86c-f0dc-800d-9f04-57ba2f126a...
https://chatgpt.com/share/680dc90b-de28-800d-92b6-f2ef824777...
Note how applying increasing pressure to answer was what caused the hallucination: hallucinations aren't tied to if the model "knows" something.
Once the output tokens don't fall into the start of some variation of "I don't know", the model is going to answer regardless of what it knows.
> If an LLM happens to know the answer to your question
You're missing the point. It doesn't "know" anything. The only thing it can "know" is the statistical relationships between tokens in its dataset. It doesn't "know" anything about the meaning of those tokens. It doesn't even "know" whether it "knows" anything or not. The best it can do is "Here's a recursively generated string of ASCII codes that are statistically likely to follow each other according to the data corpus."
It's Rashomon. It can point you in the right directions a lot of the time, but there's no getting around the fact that you have to double-check its answers with external sources.
> Or at least this is how I interpret the term.
That's not a very useful interpretation because it's not grounded in technical reality.
> It doesn't "know" anything.
The word know is an abstraction I use in order to avoid going into technical details.
> That's not a very useful interpretation because it's not grounded in technical reality.
My interpretation aligns with what people generally mean by hallucination, and it's definitely more useful than saying that any output is hallucination.
The difference is: what people generally mean by hallucination is "LLM said something wrong as if it was right". And what you are adding to that in your previous comments is the concept of whether or not the LLM knows the right answer. Which it never does. That's where your interpretation and the general interpretation differ.
I'm afraid I don't personally see how to explain more clearly, so will just say instead that given multiple people are in this thread telling you your understanding of how LLMs work isn't right, please consider that to at least be a possibility and look into it further rather than digging deeper into your current beliefs.
But then isn't it also technically true that any software including a pseudo-random number generator is deterministic? (Starting with itself, like that sampler you mention?)
And while it might be important in some contexts, like debugging using either the exact same or different seeds, isn't this one where it rather confuses the issue?
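To illustrate what I mean: once the seed is pinned, software built on a PRNG replays the exact same output every run, so in that narrow sense it's deterministic too. A made-up toy example:

    # Made-up example: "random" software is deterministic once the seed is fixed,
    # just like a sampler run with a pinned seed.
    import random

    def noisy_pick(options, seed):
        rng = random.Random(seed)  # PRNG fully determined by the seed
        return rng.choice(options)

    print(noisy_pick(["a", "b", "c"], seed=42))  # same seed -> same output, every run
    print(noisy_pick(["a", "b", "c"], seed=42))  # identical to the previous line
    print(noisy_pick(["a", "b", "c"], seed=7))   # a different seed may pick differently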
Lindell's lawyer claimed that somehow the preliminary copy (before human editing) got submitted to the court - that they actually did the work to fix it, but then slipped up in submitting it.
I could see that, especially with sloppy lawyers in the first place. Or, I could see it being a convenient "the dog ate my homework" excuse.
Having not looked into it, I would guess that his lawyers know they aren’t going to get paid any time soon.