lispisok 3 days ago

I swear every time a new model is released it's great at first, but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output, which also nerfed the really good output. Now I'm wondering if they were quantizing it.

Tiberium 3 days ago

I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?

Kranar 2 days ago

I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.

When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted, then just expect it to be able to handle ever more complex queries, and get disappointed when I hit a new limit.

echelon 2 days ago

Re-run your historical queries, or queries that are similarly shaped.
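
A minimal sketch of what that could look like, assuming the OpenAI Python SDK and a history.json file of saved prompts and old answers (the file name and format are made up for illustration):

    import json
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # history.json: [{"prompt": "...", "old_answer": "..."}, ...]
    with open("history.json") as f:
        history = json.load(f)

    for item in history:
        resp = client.chat.completions.create(
            model="gpt-4-0613",  # pin a dated snapshot, not a floating alias
            messages=[{"role": "user", "content": item["prompt"]}],
            temperature=0,  # cut down on sampling noise between runs
        )
        new_answer = resp.choices[0].message.content
        print("PROMPT:", item["prompt"][:80])
        print("THEN:  ", item["old_answer"][:200])
        print("NOW:   ", new_answer[:200])
        print("-" * 60)

Eyeballing the then/now pairs won't settle anything on its own, but it at least anchors the "it got worse" feeling to something you can reread.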

sakesun 2 days ago

They could cache that :)

echelon 2 days ago

That would make for a very interesting timing attack.
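
Rough sketch of the idea, again assuming an OpenAI-compatible endpoint: fire the same prompt twice and compare wall-clock time. A cached answer should come back suspiciously fast (and byte-identical).

    import time
    from openai import OpenAI

    client = OpenAI()

    def timed(prompt: str):
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model="gpt-4-0613",  # example snapshot name, use whatever you actually query
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content, time.perf_counter() - start

    a, t1 = timed("Explain the birthday paradox in two sentences.")
    b, t2 = timed("Explain the birthday paradox in two sentences.")
    print(f"run 1: {t1:.2f}s, run 2: {t2:.2f}s, identical: {a == b}")

(Network jitter and server-side batching add a lot of noise, so you'd want many repetitions before reading anything into it.)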

throwaway314155 2 days ago

Sounds like a _whole_ thing.

herval 2 days ago

there are definitely measurements (e.g. https://hdsr.mitpress.mit.edu/pub/y95zitmz/release/2 ), but I imagine they're rare because those benchmarks are expensive, so nobody keeps running them all the time?

Anecdotally, it's quite clear that some models are throttled during the day (e.g. Claude sometimes falls back to "concise mode", sometimes with a warning in the app and sometimes without).

You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).

Finally, there are cases where it was confirmed by the company, like GPT-4o's sycophantic tirade, which very clearly impacted its output (https://openai.com/index/sycophancy-in-gpt-4o/)

Deathmax 2 days ago

Your linked article is specifically comparing two different versioned snapshots of a model and not comparing the same model across time.

You've also made the mistake of conflating what's served via API platforms, which are meant to be stable, with frontends, which have no stability guarantees and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle only affected the specific model served via the ChatGPT frontend and never impacted the stable snapshots on the API.

I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.
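
Concretely, the distinction is between dated snapshots and floating aliases. A sketch with the OpenAI Python SDK, using the snapshot names from the article's comparison and a prime-check question in the spirit of its math eval:

    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user", "content": "Is 17077 a prime number?"}]

    # "gpt-4" is a floating alias the provider can repoint at newer snapshots;
    # "gpt-4-0613" is a dated snapshot that is meant to stay fixed.
    for model in ("gpt-4", "gpt-4-0613"):
        resp = client.chat.completions.create(
            model=model, messages=messages, temperature=0
        )
        print(model, "->", resp.choices[0].message.content)

"Nerfing" complaints would only be evidence of tinkering if the dated snapshot's behavior drifted, not the alias's.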

herval 2 days ago

Please read it again. The article is clearly comparing GPT-4 to GPT-4, and GPT-3.5 to GPT-3.5, in March vs June 2023

Deathmax 2 days ago

I did read it, and I even went to their eval repo.

> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.

openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, and openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of each model, not the _same_ model changing over time, which is what people mean when they complain that a model gets "nerfed".

drewnick 2 days ago

I feel this too. I swear some of the coding Claude Code does on weekends is superior to what it does on weekdays. It just has these eureka moments every now and then.

herval 2 days ago

Claude has been particularly bad since they released 4.0. The push to remove 3.7 from Windsurf hasn’t helped either. Pretty evident they’re trying to force people to pay for Claude Code…

Trusting these LLM providers today is as risky as trusting Facebook as a platform back when they were pushing their "OpenSocial" stuff

glitch253 2 days ago

Cursor / Windsurf's degraded functionality is exactly why I created my own system:

https://github.com/mpfaffenberger/code_puppy

cainxinth 2 days ago

I assumed it was because the first week revealed a ton of safety issues that they then "patched" by adjusting the system prompt, and thus using up more inference tokens on things other than the user's request.

bobxmax 2 days ago

My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.

Which is why the base model wouldn't necessarily show differences when you benchmarked them.

colordrops 2 days ago

It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.

85392_school 2 days ago

I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.

tshaddox 2 days ago

Yeah, it’s almost certainly hallucination (by the human user).

JoshuaDavid 2 days ago

I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.

Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
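
For anyone unfamiliar with the term, best-of-n here just means drawing many independent samples and keeping the one some scorer prefers. A rough sketch with the OpenAI Python SDK and a placeholder judge (a real setup would need a verifier, reward model, or test suite to rank candidates):

    from openai import OpenAI

    client = OpenAI()

    def judge(prompt: str, answer: str) -> float:
        """Placeholder scorer for illustration only; answer length is
        obviously not a real quality signal."""
        return float(len(answer))

    def best_of_n(prompt: str, n: int = 8) -> str:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",   # stand-in for the "older" model
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,         # diversity across samples is the whole point
            n=n,                     # n independent completions in one request
        )
        candidates = [c.message.content for c in resp.choices]
        return max(candidates, key=lambda a: judge(prompt, a))

At n=256 (which you'd probably have to batch across several requests) you'd get occasional wins on questions the base model usually fails, at 256x the cost, which is exactly why it would look exciting for a little while.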

nabla9 3 days ago

It seems that at least Google is overselling their compute capacity.

You pay a monthly fee, but Gemini is completely jammed for 5-6 hours while North America is working.

baq 2 days ago

Gemini is simply that good. I’m trying out Claude 4 every now and then and go back to Gemini to fix its mess…

energy123 2 days ago

Gemini is the best model in the world. Gemini is the worst web app in the world. Somehow those two things are coexisting. The web devs in their UI team have really betrayed the hard work of their ML and hardware colleagues. I don't say this lightly - I say this after having paid attention to critical bugs, more than I can count on one hand, that persisted for over a year. They either don't care or are grossly incompetent.

thorum 2 days ago

Try AI Studio if you haven’t already: https://aistudio.google.com/

nabla9 2 days ago

Well said.

Google is the best in pure AI research, in both quality and volume. They have sucked at productization for years, and not just in AI but in other products as well. A real mystery.

energy123 2 days ago

I don't understand why they can't just make it fast and go through the bug reports from a year ago and fix them. Is it that hard to build a box for users to type text into without it lagging for 5 seconds or throwing a bunch of errors?

baq 1 day ago

If it doesn’t make sense, it makes sense. Nobody will get their promo by ‘fixing bugs’.

fasterthanlime 2 days ago

Funny, I have the exact opposite experience! I use Claude to fix Gemini’s mess.

symfoniq 2 days ago

Maybe LLMs just make messes.

hgomersall 2 days ago

I heard that, but I'm getting consistent garbage from Gemini.

dayjah 2 days ago

For code? Use the Context7 MCP.

edzitron 2 days ago

When you say "jammed," how do you mean?

JamesBarney 2 days ago

I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities the new model has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.

There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.

mhitza 3 days ago

That was my suspicion when I first deleted my account: it felt like the output in ChatGPT got worse, and I found it highly suspicious when I saw an errant davinci model keyword in the ChatGPT URL.

Now I'm feeling similarly about their image generation (which is the only reason I created a paid account two months ago), and the output looks more generic by default.

beering 2 days ago

Are you able to quantify how quickly your perception gets skewed by how long you use the models?

mhitza 2 days ago

I can't quantify it for my past experience; that was more than a year ago, and I wasn't using ChatGPT daily at the time either.

This time around it felt pretty stark. I used ChatGPT to create at most 20 different image compositions, and after a couple of good ones at first, the results felt worse. One thing I've noticed recently is that when working on vector art compositions, the results start out more simplistic, and often enough look like clipart thrown together. This wasn't my experience the first time around. Might be temperature tweaks, or changes in their prompt that lead to this effect. Might be some random seed data they use, who knows.

beering 2 days ago

It’s easy to measure the models getting worse, so you should be suspicious that nobody who claims this has scientific evidence to back it up.
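
In that spirit, a longitudinal check really is cheap to set up: a fixed question set with unambiguous answers, a pinned snapshot, and a timestamped log, re-run on a schedule. Sketch only; the questions and file name here are made up:

    import csv
    import datetime
    from openai import OpenAI

    client = OpenAI()

    QUESTIONS = [  # tiny fixed eval set with unambiguous answers
        ("What is 17 * 23?", "391"),
        ("Name the capital of Australia.", "Canberra"),
    ]

    correct = 0
    for question, expected in QUESTIONS:
        resp = client.chat.completions.create(
            model="gpt-4-0613",  # keep the snapshot pinned across runs
            messages=[{"role": "user", "content": question}],
            temperature=0,
        )
        answer = resp.choices[0].message.content
        correct += expected.lower() in answer.lower()

    # append one row per run so you can plot accuracy over time
    with open("model_scores.csv", "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.date.today().isoformat(), "gpt-4-0613", correct, len(QUESTIONS)]
        )

A flat line over months would be the boring but useful result.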

solfox 3 days ago

I have seen this behavior as well.