behnamoh 2 days ago

how do we know it's not a quantized version of o3? what's stopping these firms from announcing the full model so it performs well on the benchmarks, and then gradually quantizing it (first at Q8 so no one notices, then Q6, then Q4, ...)?

I have a suspicion that's how they were able to get gpt-4-turbo so fast. In practice, I found it inferior to the original GPT-4, but the company probably benchmaxxed the hell out of the turbo and 4o versions, so even though they were worse models, users found them more pleasing.

CSMastermind 2 days ago

This is almost certainly what they're doing, and they're rebranding the original o3 model as "o3-pro"

tedsanders 2 days ago

Nope, not what we’re doing.

o3 is still o3 (no nerfing) and o3-pro is new and better than o3.

If we were lying about this, it would be really easy to catch us - just run evals.

(I work at OpenAI.)

fastball 2 days ago

Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally, for my personal use, o3 and G2.5P have been neck-and-neck over the last few months, with responses I have been very happy with.

However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).

This, alongside the news that you've decreased the price of o3 by 80%, really does make it feel like you've quantized the model or knee-capped thinking or something. If you say it is wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.

IanCal 2 days ago

Are you sure you're using the same models? G2.5P updated almost exactly a week ago.

fastball 1 day ago

G2.5P might've updated, but that's not the model where I noticed a difference. o3 seemed noticeably dumber in isolation, not just compared to G2.5P.

But yes, perhaps the answer is that about a week ago I started asking subconsciously harder questions, and G2.5P handled them better because it had just been improved, while o3 had not so it seemed worse. Or perhaps G2.5P has always had more capacity than o3, and I wasn't asking hard enough questions to notice a difference before.

fny 2 days ago

Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.

o4-mini-high, o4-mini, o3, o3-pro, gpt-4o

Oy.

energy123 2 days ago

Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.

tedsanders 2 days ago

o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.

o3 pro is a different thing - it's not just o3 with maximum reasoning effort.
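
In API terms, a rough sketch of the distinction (assuming the OpenAI Python SDK, and assuming o3 accepts reasoning_effort the way other o-series models do; exact availability may vary):

    from openai import OpenAI

    client = OpenAI()

    # Same model, different "accelerator pressure": reasoning effort is a
    # per-request parameter, not a different model.
    for effort in ("low", "medium", "high"):
        resp = client.chat.completions.create(
            model="o3",
            reasoning_effort=effort,
            messages=[{"role": "user", "content": "How many primes are below 100?"}],
        )
        print(effort, resp.choices[0].message.content[:80])

    # o3-pro, by contrast, is a distinct model ID you select explicitly,
    # not a value of reasoning_effort.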

tauntz 2 days ago

Why's it called o3 then if it's a different thing? There's already a rather extreme amount of confusion with the model names and it's not clear _at all_ which model would be "the best" in terms of response quality.

Here's the current state with version numbers as far as I can piece it together (using my best guess at naming of each component of the version identifier. Might be totally wrong tho):

1) prefix (optional): "gpt-", "chatgpt-"

2) family (required): o1, o3, o4, 4o, 3.5, 4, 4.1, 4.5,

3) quality? (optional): "nano", "mini", "pro", "turbo"

4) type (optional): "audio", "search"

5) lifecycle (optional): "preview", "latest"

6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc (I assume the last ones are a date without a year for 2024?)

7) size (optional): "16k"

Some final combinations of these version number components are as small as 1 ("o3") or as large as 6 ("gpt-4o-mini-search-preview-2024-12-17").

Given this mess, I can't blame people assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
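
To illustrate how ad hoc this is, here's a toy parser for the components above (the taxonomy is my guess, so the regex is too; nothing official):

    import re

    # Toy parser for the guessed naming components above (unofficial).
    PATTERN = re.compile(
        r"^(?P<prefix>gpt-|chatgpt-)?"
        r"(?P<family>o1|o3|o4|4o|3\.5|4\.5|4\.1|4)"
        r"(?:-(?P<quality>nano|mini|pro|turbo))?"
        r"(?:-(?P<type>audio|search))?"
        r"(?:-(?P<lifecycle>preview|latest))?"
        r"(?:-(?P<date>\d{4}-\d{2}-\d{2}|\d{4}))?"
        r"(?:-(?P<size>16k))?$"
    )

    for name in ["o3", "gpt-4o-mini-search-preview-2024-12-17",
                 "gpt-3.5-turbo-16k", "o4-mini-high"]:
        m = PATTERN.match(name)
        parts = {k: v for k, v in m.groupdict().items() if v} if m else "no match"
        print(name, "->", parts)

(And "o4-mini-high" doesn't even fit the scheme, which rather proves the point.)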

tedsanders 2 days ago

o3 pro is based on o3 and its style and outputs will be quite similar to o3.

As an analogy, think of it like this:

o3-low ~ Ford Mustang with the accelerator gently pressed

o3-medium ~ Ford Mustang with the accelerator pressed

o3-high ~ Ford Mustang with the accelerator heavily pressed

o3 pro ~ Ford Mustang GT

Even though a Mustang GT is a different car than a Mustang, you don’t give it a totally different name (eg Palomino). The similarity in name signals it has a lot of the same characteristics but a souped up engine. Same for o3 pro.

Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our original unified naming scheme made room for 999 versions, but we didn't make it past 3.

Edit: When I say the Mustang GT is a different car than a Mustang - I mean it literally. If you bought a Mustang GT and someone delivered a Mustang with a different trim, you wouldn't say "great, this is just what I ordered, with the same features/behavior/value." That we call it a different trim is a linguistic choice to signal to consumers that it's very similar, and built on the same production line, but comes with a different engine or different features. Similar to o3 pro.

dwohnitmok 2 days ago

Can you elaborate on what you mean that o3 pro is a GT? In particular I don't understand how to reconcile what you're saying that o3 pro is in some way fundamentally different from o3 (albeit based on o3) with this tweet:

> As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.

https://x.com/OpenAI/status/1932530423911096508

tedsanders 2 days ago

Yeah, I totally get the confusion here. Unfortunately I can't give the recipe behind our models, so there's going to be some irreducible blurriness here, but the following statements are all true:

- o3 pro is based on o3

- o3 pro uses the same underlying model as o3

- o3 pro is similar to o3, but is a distinct thing that's smarter and slower

- o3 pro is not o3 with longer reasoning

In my analogy, o3 pro vs o3 is more than just an input parameter (e.g., not just the accelerator input) but less than a full difference in model (e.g., Ford Mustang vs F150). It's in between, kind of like car trim with the same body but a stronger engine. Imperfect analogy, and I apologize if this doesn't feel like it adds any clarity. At the end of the day, it doesn't really matter how it works - what matters is if people find it worth using.

stonogo 2 days ago

This analogy might work better if the Mustang GT weren't, in fact, the same car as the Mustang. It's just a trim level, not a different car.

energy123 2 days ago

My guess is this comes from an org structure where you have multiple "pods" working on different research. Who comes up with the next shippable model and when that happens is kind of random and the chaotic naming system comes from that. It's just my speculation and could be wildly wrong.

rat9988 2 days ago

That o3 and o3-pro aren't the same thing still makes sense, though.

fragmede 2 days ago

Could someone there maybe possibly use, oh I dunno, ChatGPT and come up with some better product names?

MattDaEskimo 2 days ago

What's with the dropped benchmark performance compared to the original o3 release? It was disappointing not to see o4-mini on it as well.

refulgentis 2 days ago

What dropped benchmark performance?

MattDaEskimo 2 days ago

o3 scores noticeably worse on benchmarks compared to its original announcement benchmarks

refulgentis 2 days ago

Any link / source / anything? You've got quite an opportunity here: an OpenAI employee claiming there's no difference, and you've got something that shows there is.

MattDaEskimo 2 days ago

Yes, the original announcement for o3 and o4-mini:

https://openai.com/index/introducing-o3-and-o4-mini/

o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.

o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.

Then, the new announcement

https://help.openai.com/en/articles/6825453-chatgpt-release-...

o3 scored 90 on AIME 2024 and 81 on GPQA.

o4-mini wasn't measured

---

Codeforces is the same, but they have a footnote that they're using a different dataset due to saturation, and there's still no grounding model to compare with.

meta_ai_x 2 days ago

Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially about something as strategic as nerfing models to save costs.

bn-l 2 days ago

Not quantized?

tedsanders 2 days ago

Not quantized. Weights are the same.

If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].

[1] `chatgpt-4o-latest` being an explicit exception
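
For API callers, the practical upshot looks like this (a rough sketch with the OpenAI Python SDK; the model IDs are the ones mentioned in this thread, and availability may vary by account):

    from openai import OpenAI

    client = OpenAI()

    # Dated snapshot: per the policy above, this should never change silently.
    pinned = client.chat.completions.create(
        model="o3-2025-04-16",
        messages=[{"role": "user", "content": "Hello"}],
    )

    # Explicit exception: this alias is documented to move without warning.
    moving = client.chat.completions.create(
        model="chatgpt-4o-latest",
        messages=[{"role": "user", "content": "Hello"}],
    )

    # The response echoes the resolved model ID, so you can log what you got.
    print(pinned.model, moving.model)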

linsomniac 2 days ago

>we'd release it as a new model with a new name

Speaking of a new name: I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" contest for OpenAI.

thegeomaster 2 days ago

Google could at least learn something from this attitude, given their recent 03-25 -> 05-06 model alias switcharoo with 0 notice :)

johnb231 2 days ago

That is a preview / beta model with no expectation of stability. Google did nothing wrong there. No one should be using a preview model in production.

thegeomaster 2 days ago

Hard disagree. Of course technically they didn't do anything explicitly against the public guidance (the checks and balances would never let them), but naming a model with a date very strongly implies immutability.

It's the same logic as why UB in C/C++ isn't a license for the compiler to do whatever it wants. We're humans and we operate on implications, common-sense assumptions, and trust.

johnb231 2 days ago

The model is labelled as Preview. There are no guarantees of stability or availability for Preview models. Not intended for production workloads.

https://cloud.google.com/products?hl=en#product-launch-stage...

"At Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months."

refulgentis 2 days ago

There hasn't been a non-preview Gemini since...November? The previews follow the same release cadence as everyone else's releases; "preview" is just a magic wand that means the Launchcal (Google's internal signoff tool, i.e. "Wave will never happen again") needs fewer signoffs. Then it got to the point where date-pinned models were getting swapped in, in the name of doing us a favor, and that's a... novel idea, we can both agree at the least.

I bet someone at Google would be a bit surprised to see someone jumping to legalese to act like this...novelty...is inherently due to the preview status, and based on anything more than a sense that there's no net harm done to us if it costs the same and is better.

I'm not sure they're wrong.

But it also leads to a sort of "nobody knows how anything works because we have 2^N configs and 5 bits" situation - for instance, 05-06 was also upgraded to 06-05. Except it wasn't: if you sent variable thinking to 05-06 after the upgrade, it'd fail. (And don't get me started on the 5 different thinking configurations for Gemini 2.5 Flash thinking vs. Gemini 05-06 vs. 06-05 and 0 thinking.)

johnb231 2 days ago

I honestly have no idea what you are trying to say.

It's a preview model - for testing only, not for production. Really not that complicated.

refulgentis 2 days ago

So you don't have anything to contribute beyond, and aren't interested in anything beyond, citing the terms?

Why are you in the comments section of an engineering news site?

(note: beyond your, excuse me while I'm direct now, boorish know-nothing reply, the terms you are citing have nothing to do with the thing people are actually discussing around you, despite your best efforts. It doesn't say "we might swap in a new service, congrats!", nor does it have anything to say about that. Your legalese at most describes why they'd pull 05-06, not forward 05-06 to 06-05. This is a novel idea.)

johnb231 2 days ago

This case was simply a matter of people not understanding the terms of service. There is nothing more to be said. It's that simple. The "engineers" should know that before deploying to prod. Basic competence.

And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.

lcnPylGDnU4H9OF 2 days ago

> And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.

It’s always worth considering that this may be your problem. If you still don’t get it, the only valuable reply is one which asks a question. Also, including “it’s not that complicated” only serves to inflame.

refulgentis 2 days ago

John, do you understand that the thing you're quoting says "We reserve the right to pull things", not "We reserve the right to swap in a new service"?

Do you understand that even if it did say that, that wasn't true either? It was some weird undocumentable half-beast?

I have exactly your attitude about their cavalier use of preview for all things Gemini, and even people's use of the preview models.

But I've also been on this site for 15 years and am a bit wow'd by your interlocution style here -- it's quite rare to see someone flip "the 3P provider swapped the service on us!" into "well they said they could turn it off, of course you should expect it to be swapped for the first time ever!" insert dull sneer about the quality of other engineers

johnb231 2 days ago

How is this so hard to understand? It's a preview service for testing only, not intended for production.

I am done with this thread. We are going around in circles.

refulgentis 2 days ago

Well, no. Well, sure. You're done, but we're not going in circles. It'd just do too much damage to you to have to answer the simple question "Where does the legalese say they can swap in a new service?", so you have to pretend this is circular and just all-so-confusing, de facto, we have to pretend it is confusing and/or obviously wrong to use any Gemini 2+ at all.

It's a cute argument, as I noted, I'm emotionally sympathetic to it even, it's my favorite "get off my lawn." However, I've also been on the Internet long enough to know you write back, at length, when people try anti-intellectualism and why-are-we-even-talking-about-this as interaction.

johnb231 2 days ago

https://cloud.google.com/terms/service-terms

"b. Disclaimer. PRE-GA OFFERINGS ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OR REPRESENTATIONS OF ANY KIND. Pre-GA Offerings (i) may be changed, suspended or discontinued at any time without prior notice to Customer and (ii) are not covered by any SLA or Google indemnity. Except as otherwise expressly indicated in a written notice or Google documentation, (A) Pre-GA Offerings are not covered by TSS, and (B) the Data Location Section above will not apply to Pre-GA Offerings."

0xbadcafebee 2 days ago

There's a very large gulf between "what makes sense to Google" and "what makes sense to Human Beings". I have so many rants about Google's poor treatment of "customers" that they feel like Oracle to me now. Like every time I use them, I'm really just falling prey to my own misguided idea that this time I won't get screwed over.

johnb231 2 days ago

The users aren't random "human beings" in this case. They are professional software developers who are expected to understand the basics. Deploying that model into production shows a lack of basic competence. It is clearly marked "preview" and is for test only.

0xbadcafebee 2 days ago

That may be true, but it doesn't make the customer's claims not true. What Google did was counter-intuitive. That's a fact. Pointing at some fine print and saying "uhh actually, technically it's your stupid human brain that's the problem, not us! we technically are allowed to do anything we want, just look at the fine print!!" does not make things better. We are human beings; we are flawed. That much should be obvious to any human organization. If you don't know how to make things that don't piss off human beings, the problem isn't with the humans.

If the "preview release" you were using was v0.3, and suddenly it started being v0.6 without warning, that would be insane. The only point of providing a version number is to give people an indicator of consistency. The datestamp is a version number. If they didn't want us to expect consistency, they should not have given it a version number. That's the whole point of rolling release branches, they have no version. You don't have "v2.0" of a rolling release, you just have "latest". They fucked up by giving it a datestamp.

This is an extremely old and well-known problem with software interfaces. Either you version it or you don't. If you do version it, and change it, you change the version, and give people dependent on the old version some time to upgrade. Otherwise it breaks things, and that pisses people off. The alternative is not versioning it, which is a signal that there is no consistency to be expected. Any decent software developer should have known all this.

And while I'm at it: what's with the name flip-flopping? In 2014, GCP issued a press release explaining it was no longer using "Preview", but "Alpha" and "Beta" (https://cloudplatform.googleblog.com/2014/10/new-release-pha...). But the link you showed earlier says "Alpha" and "Beta" are now deprecated. But no press release? I guess that's our bad for not constantly reading the fine print and expecting it to revert back to something from 11 years ago.

ant6n 2 days ago

It was definitely annoying when o1 disappeared overnight; my impression is that it was better at some tasks than o3.

csomar 2 days ago

I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).

I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times is one of the things that can slip through. We are not suggesting you are running a different model, but that you are quantizing it so that you can support more people.

This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know the model running, how large it is, what it is running on, etc... None of that is provided, and there is only one reason that I can think of: to be able to reduce resources unnoticed.

rfoo 2 days ago

An (arbitrarily) quantized model is a totally different model, compared to the original.

Reubachi 2 days ago

I'm not sure how, at this point in your online presence, you read someone stating their job as a "brag" rather than what it really is: providing transparency/disclosure before stating their thoughts.

This is HN and not reddit.

"I didn't read the ToS, like everyone else, but my guess..."

Ah, there it is.

mliker 2 days ago

Where are you getting this information? What basis do you have for making this claim? OpenAI, despite its public drama, is still a massive brand and if this were exposed, would tank the company's reputation. I think making baseless claims like this is dangerous for HN

beering 2 days ago

I think Gell-Mann amnesia happens here too, where you can see how wrong HN comments are on a topic you know deeply, but then forget about that when reading the comments on another topic.

behnamoh 2 days ago

> rebranding the original o3 model as "o3-pro"

interesting take, I wouldn't be surprised if they did that.

anticensor 2 days ago

-pro models appear to be a best-of-10 sampling of the original full size model

Szpadel 2 days ago

how do you sample it behind the scenes? usually best-of-X means you generate X outputs and choose the best result.

if you could do this automatically, it would be a game changer, as you could run the top 5 best models in parallel and select the best answer every time

but it's not practical because you are the bottleneck: you have to read all 5 solutions and compare them

anticensor 2 days ago

> if you could do this automatically, it would be a game changer, as you could run the top 5 best models in parallel and select the best answer every time

remember they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent
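
A toy sketch of that idea, with stand-ins for the sampler and reward model (nothing here is a real API, just the shape of best-of-N):

    import random

    def generate(prompt: str, n: int) -> list[str]:
        # Stand-in for sampling n candidate completions at nonzero temperature.
        return [f"candidate {i} for: {prompt}" for i in range(n)]

    def reward_score(prompt: str, completion: str) -> float:
        # Stand-in for an RLHF reward model; here just a random number.
        return random.random()

    def best_of_n(prompt: str, n: int = 10) -> str:
        # Sample n candidates and return the one the reward model likes best.
        candidates = generate(prompt, n)
        return max(candidates, key=lambda c: reward_score(prompt, c))

    print(best_of_n("Why is the sky blue?"))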

joshstrange 2 days ago

I think the idea is they use another/same model to judge all the results and only return the best one to the user.

anticensor 2 days ago

I think the idea is they just feed each to the RLHF reward model used to train the model and return the most rewarded answer.

spott 2 days ago

I believe it is a majority vote kinda thing, rather than a best single result.

lispisok 2 days ago

I swear every time a new model is released it's great at first but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output which also nerfed the really good output. Now I'm wondering if they were quantizing it.

Tiberium 2 days ago

I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?

Kranar 2 days ago

I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.

When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted and then just expect it to be able to handle ever more complex queries and get disappointed when I hit a new limit.

echelon 2 days ago

Re-run your historical queries, or queries that are similarly shaped.

sakesun 2 days ago

They could cache that :)

echelon 2 days ago

That would make for a very interesting timing attack.

throwaway314155 2 days ago

Sounds like a _whole_ thing.

herval 2 days ago

there are definitely measurements (eg https://hdsr.mitpress.mit.edu/pub/y95zitmz/release/2 ) but I imagine they're rare because those benchmarks are expensive, so nobody keeps running them all the time?

Anecdotally, it's quite clear that some models are throttled during the day (eg Claude sometimes falls back to "concise mode" - with and without a warning on the app).

You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).

Finally, there are cases where it was confirmed by the company, like GPT-4o's sycophantic tirade that very clearly impacted its output (https://openai.com/index/sycophancy-in-gpt-4o/)

Deathmax 2 days ago

Your linked article is specifically comparing two different versioned snapshots of a model and not comparing the same model across time.

You've also made the mistake of conflating what's served via API platforms, which are meant to be stable, with frontends, which have no stability guarantees and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle was only on the specific model that's served via the ChatGPT frontend and never impacted the stable snapshots on the API.

I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.

herval 2 days ago

Please read it again. The article is clearly comparing GPT-4 to GPT-4, and GPT-3.5 to GPT-3.5, in March vs June 2023.

Deathmax 2 days ago

I did read it, and I even went to their eval repo.

> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.

openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of the model, not the _same_ model over time, which is what people mean when they complain that a model gets "nerfed".

drewnick 2 days ago

I feel this too. I swear some of the coding Claude Code does on weekends is superior to what it does on weekdays. It just has these eureka moments every now and then.

herval 2 days ago

Claude has been particularly bad since they released 4.0. The push to remove 3.7 from Windsurf hasn’t helped either. Pretty evident they’re trying to force people to pay for Claude Code…

Trusting these LLM providers today is as risky as trusting Facebook as a platform, when they were pushing their “opensocial” stuff

glitch253 2 days ago

Cursor / Windsurf's degraded functionality is exactly why I created my own system:

https://github.com/mpfaffenberger/code_puppy

cainxinth 2 days ago

I assumed it was because the first week revealed a ton of safety issues that they then "patched" by adjusting the system prompt, and thus using up more inference tokens on things other than the user's request.

bobxmax 2 days ago

My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.

Which is why the base model wouldn't necessarily show differences when you benchmarked them.

colordrops 2 days ago

It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.

85392_school 2 days ago

I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.

tshaddox 2 days ago

Yeah, it’s almost certainly hallucination (by the human user).

JoshuaDavid 2 days ago

I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.

Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).

nabla9 2 days ago

It seems that at least Google is overselling their compute capacity.

You pay a monthly fee, but Gemini is completely jammed for 5-6 hours when North America is working.

baq 2 days ago

Gemini is simply that good. I’m trying out Claude 4 every now and then and go back to Gemini to fix its mess…

energy123 2 days ago

Gemini is the best model in the world. Gemini is the worst web app in the world. Somehow those two things are coexisting. The web devs in their UI team have really betrayed the hard work of their ML and hardware colleagues. I don't say this lightly - I say this after having paid attention to critical bugs, more than I can count on one hand, that persisted for over a year. They either don't care or are grossly incompetent.

thorum 2 days ago

Try AI Studio if you haven’t already: https://aistudio.google.com/

nabla9 2 days ago

Well said.

Google is best in pure AI research, both quality and volume. They have sucked at productization for years. Not just AI, but other products as well. Real mystery.

energy123 2 days ago

I don't understand why they can't just make it fast and go through the bug reports from a year ago and fix them. Is it that hard to build a box for users to type text into without it lagging for 5 seconds or throwing a bunch of errors?

baq 1 day ago

If it doesn’t make sense, it makes sense. Nobody will get their promo by ‘fixing bugs’.

fasterthanlime 2 days ago

Funny, I have the exact opposite experience! I use Claude to fix Gemini’s mess.

symfoniq 2 days ago

Maybe LLMs just make messes.

hgomersall 2 days ago

I heard that, but I'm getting consistent garbage from Gemini.

dayjah 2 days ago

For code? Use the context7 mcp.

edzitron 2 days ago

When you say "jammed," how do you mean?

JamesBarney 2 days ago

I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities the new model has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.

There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.

mhitza 2 days ago

That was my suspicion when I first deleted my account, when it felt like ChatGPT's output got worse and I found it highly suspicious when I saw an errant davinci model keyword in the ChatGPT URL.

Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).

beering 2 days ago

Are you able to quantify how quickly your perception gets skewed by how long you use the models?

mhitza 2 days ago

I can't quantify it for my past experience; that was more than a year ago, and I wasn't using ChatGPT daily at the time either.

This time around it felt pretty stark. I used ChatGPT to create at most 20 different image compositions, and after a couple of good ones at first, it felt worse. One thing I've noticed recently is that when working on vector art compositions, the results start out more simplistic and often enough look like clipart thrown together. This wasn't my experience the first time around. Might be temperature tweaks, or changes in their prompt that lead to this effect. Might be some random seed data they use, who knows.

beering 2 days ago

It’s easy to measure the models getting worse, so you should be suspicious that nobody who claims this has scientific evidence to back it up.

solfox 2 days ago

I have seen this behavior as well.

tedsanders 2 days ago

It's the same model, no quantization, no gimmicks.

In the API, we never make silent changes to models, as that would be super annoying to API developers [1]. In ChatGPT, it's a little less clear when we update models because we don't want to bombard regular users with version numbers in the UI, but it's still not totally silent/opaque - we document all model updates in the ChatGPT release notes [2].

[1] chatgpt-4o-latest is an exception; we explicitly update this model pointer without warning.

[2] ChatGPT Release Notes document our updates to gpt-4o and other models: https://help.openai.com/en/articles/6825453-chatgpt-release-...

(I work at OpenAI.)

ctoth 2 days ago

From the announcement email:

> Today, we dropped the price of OpenAI o3 by 80%, bringing the cost down to $2 / 1M input tokens and $8 / 1M output tokens.

> We optimized our inference stack that serves o3—this is the same exact model, just cheaper.

hyperknot 2 days ago

I got 700+ tokens/sec on o3 after the announcement; I suspect it's very much a quantized version.

https://x.com/hyperknot/status/1932476190608036243
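
For anyone who wants to reproduce that kind of number, a rough sketch with the OpenAI Python SDK (assumes o3 is enabled on your account; this counts wall-clock time including reasoning, so it's a ballpark at best):

    import time
    from openai import OpenAI

    client = OpenAI()

    start = time.monotonic()
    stream = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": "Explain RAID 5 in one paragraph."}],
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token counts
    )

    usage = None
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage
    elapsed = time.monotonic() - start

    print(f"{usage.completion_tokens} output tokens in {elapsed:.1f}s "
          f"= {usage.completion_tokens / elapsed:.0f} tok/s")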

dist-epoch 2 days ago

Or maybe they just brought online much faster much cheaper hardware.

az226 2 days ago

Or they are using a speedy add-on decoder.

beering 2 days ago

Do you also have numbers on intelligence before and after?

zackangelo 2 days ago

Is that input tokens or output tokens/s?

carter-0 2 days ago

An OpenAI researcher claims it's the exact same model on X: https://x.com/aidan_mclau/status/1932507602216497608

ants_everywhere 2 days ago

Is this what happened to Gemini 2.5 Pro? It used to be very good, but it's started struggling on basic tasks.

The thing that gets me is it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page and it sometimes takes multiple screenshots of the page to convince it that it's wrong.

SparkyMcUnicorn 2 days ago

The Aider discord community has proposed and disproven the theory that 2.5 Pro became worse, several times, through many benchmark runs.

It had a few bugs here or there when they pushed updates, but it didn't get worse.

ants_everywhere 2 days ago

Gemini is objectively exhibiting new behavior with the same prompts and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.

My question is not whether this is true (it is) but why it's happening.

I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use a/b testing on benchmarks to tell them whether training or architectural changes need to be reverted.

But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly so I'm skeptical that they're the state of the art with respect to bench-marking Gemini.

SparkyMcUnicorn 2 days ago

Gemini 2.5 Pro is the highest ranking model on the aider benchmarks leaderboard.

For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.

I primarily use aider + 2.5 pro for planning/spec files, and occasionally have it do file edits directly. Works great, other than stopping it mid-execution once in a while.

code_biologist 2 days ago

My use case is mostly creative writing.

IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems to be uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.

Entirely anecdotally, 06-05 seems to exactly ride the line of "good enough to be the best, but no better than that" presumably to save costs versus the OG 03-25.

In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.

esafak 2 days ago

Are there any benchmarks that track historical performance?

behnamoh 2 days ago

good question, and I don't know of any, although it's a no-brainer that someone should make one.

a proxy for that may be the anecdotal evidence of users who report back a month later that model X has gotten dumber (it started with gpt-4 and keeps happening, esp. with Anthropic and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.

SparkyMcUnicorn 2 days ago

Aider has one, but it hasn't been updated in months. People kept claiming models were getting worse, but the results proved that they weren't.

__mharrison__ 2 days ago

vitaflo 2 days ago

That Deepseek price is always hilarious to see in these charts.

SparkyMcUnicorn 2 days ago

That's not the one I'm referring to. See my other comments or your sibling comment.

benterix 2 days ago

> users found them more pleasing.

Some users. For me the drop was so huge it became almost unusable for the things I had used it for.

behnamoh 2 days ago

Same here. One of my apps straight out stopped working because the gpt-4o outputs were noticeably worse than the gpt-4 that I built the app based on.

risho 2 days ago

Quantization is a massive efficiency gain for near negligible drop in quality. If the tradeoff is quantization for an 80 percent price drop I would take that any day of the week.

behnamoh 2 days ago

> for near negligible drop in quality

Hmm, that's evidently and anecdotally wrong:

https://github.com/ggml-org/llama.cpp/discussions/4110

spiderice 2 days ago

You may be right that the tradeoff is worth it, but it should be advertised as such. You shouldn't think you're paying for full o3, even if they're heavily discounting it.

code_biologist 2 days ago

I would like the option to pay for the unquantized version. For creative or story writing (D&D campaign materials and such), quantization seems to end up in much weaker word selection and phrasing. There are small semantic missteps that break the illusion that the LLM understands what it's writing. I find it jarring and deeply immersion-breaking. I'd prefer to prototype prompts on a cheaper quantized version, but I want to be able to spend 50 cents an API call to get golden output.

EnPissant 2 days ago

The API lists o3 and o3-2025-04-16 as the same thing with the same price. The date-based models are set in stone.

rfoo 2 days ago

I don't work for OAI so obviously I can't say for them. But we don't do this.

We don't make the hobbyist mistake of randomly YOLO-trying various "quantization" methods that only happen after all training and calling it a day. Quantization was done before it went live.

Bjorkbat 2 days ago

Related: when o3 finally came out, ARC-AGI updated their graph because it didn't perform nearly as well as the version of o3 that "beat" the benchmark.

https://arcprize.org/blog/analyzing-o3-with-arc-agi

beering 2 days ago

The o3-preview test was with very expensive amounts of compute, right? I remember it was north of $10k, so it makes sense it did better.

Bjorkbat 2 days ago

The point remains, though: they crushed the benchmark using a specialized model that you'll probably never have access to, whether personally or through a company.

They inflated expectations and then released to the public a model that underperforms.

throwaway314155 2 days ago

They revealed the price points for running those evaluations. IIRC the "high" level of reasoning cost tens of thousands of dollars if not more. I don't think they really inflated expectations. In fact a lot of what we learned is that ARC-AGI probably isn't a very good AGI evaluation (it claims to not be one, but the name suggests otherwise).

az226 2 days ago

Even classic GPT-4 from March 2023 was quantized to 4.5 bits.

smusamashah 2 days ago

How about testing the same input with the same seed on different dates and comparing the outputs? If it's a different model, it will return different output.

zomnoys 2 days ago

Isn’t this not true since these models run with a non-zero temperature?

smusamashah 2 days ago

You can set the temperature too.
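
Something like this (a rough sketch with the OpenAI Python SDK; assumes a model that accepts seed and temperature, which reasoning models may not, and determinism is only best-effort, so system_fingerprint is the more reliable signal of a backend change):

    from openai import OpenAI

    client = OpenAI()

    # Run this identical request on two different dates and diff the results.
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # pinned snapshot; model choice is illustrative
        messages=[{"role": "user", "content": "List the first five primes."}],
        temperature=0,
        seed=12345,
    )

    # A changed system_fingerprint indicates a serving/config change even
    # when the text happens to match; store both for later comparison.
    print(resp.system_fingerprint)
    print(resp.choices[0].message.content)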

resters 2 days ago

It's probably optimized in some way, but if the optimizations degrade performance, let's hope it is reflected in various benchmarks. One alternative hypothesis is that it's the same model, but in the early days they make it think "harder" and run a meta-process to collect training data for reinforcement learning for use on future models.

SparkyMcUnicorn 2 days ago

It's a bit dated now, but it would be cool if people submitted PRs for this one: https://aider.chat/docs/leaderboards/by-release-date.html

__mharrison__ 2 days ago

Dated? This was updated yesterday https://aider.chat/docs/leaderboards/

SparkyMcUnicorn 2 days ago

My link is to the benchmark results _over time_.

The main leaderboard page that you linked to is updated quite frequently, but it doesn't contain multiple benchmarks for the same exact model.

luke-stanley 2 days ago

I think the API has some special IDs to check for reproducibility of the environment.

jstummbillig 2 days ago

You can just give it a go for very little money (in Windsurf it's 1x right now) and see what it does. There is no room for conspiracy here, because you can simply look at what it does. If you don't like it, neither will others, and then people will not use it. People are obviously very capable of (collectively) forming opinions on models and then voting with their wallets.

segmondy 2 days ago

you don't, so run your own model.