Nope, not what we’re doing.
o3 is still o3 (no nerfing) and o3-pro is new and better than o3.
If we were lying about this, it would be really easy to catch us - just run evals.
(I work at OpenAI.)
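For anyone who wants to check that claim themselves, here's a minimal sketch of what "just run evals" could look like against the API. The prompt set and expected answers are placeholders, not an official eval:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical fixed prompt set; swap in your own questions and expected answers.
EVAL_SET = [
    ("What is 17 * 24?", "408"),
    ("Name the capital of Australia.", "Canberra"),
]

def run_eval(model: str) -> float:
    """Return the fraction of prompts whose response contains the expected answer."""
    correct = 0
    for question, expected in EVAL_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        if expected in (resp.choices[0].message.content or ""):
            correct += 1
    return correct / len(EVAL_SET)

# Run this periodically and log the score; a silent model swap should show up as drift.
print("o3 accuracy:", run_eval("o3"))
```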
Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally, for my personal use, o3 and G2.5P have been neck-and-neck over the last few months, with responses I have been very happy with.
However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).
This, alongside the news that you've decreased the price of o3 by 80%, really makes it feel like you've quantized the model or knee-capped thinking or something. If you say it is wholly unchanged I'll believe you, but I'm not sure how else to explain the (admittedly subjective) performance drop I've experienced.
Are you sure you're using the same models? G2.5P updated almost exactly a week ago.
G2.5P might've updated, but that's not the model where I noticed a difference. o3 seemed noticeably dumber in isolation, not just compared to G2.5P.
But yes, perhaps the answer is that about a week ago I started asking subconsciously harder questions, and G2.5P handled them better because it had just been improved, while o3 had not so it seemed worse. Or perhaps G2.5P has always had more capacity than o3, and I wasn't asking hard enough questions to notice a difference before.
Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.
o4-mini-high, o4-mini, o3, o3-pro, gpt-4o
Oy.
Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.
o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.
o3 pro is a different thing - it's not just o3 with maximum reasoning effort.
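To make the distinction concrete, here's a minimal sketch of how it looks in an API call (assuming the `reasoning_effort` parameter in the current OpenAI Python SDK; the prompt is just an example):

```python
from openai import OpenAI

client = OpenAI()

# Same model name each time; only the reasoning-effort parameter changes.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3",                    # the model
        reasoning_effort=effort,       # the knob: low / medium / high
        messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    )
    print(effort, "->", resp.choices[0].message.content)
```

(o3-pro, by contrast, shows up as its own model name in the API, not as another value of this parameter.)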
Why's it called o3 then if it's a different thing? There's already a rather extreme amount of confusion with the model names and it's not clear _at all_ which model would be "the best" in terms of response quality.
Here's the current state with version numbers as far as I can piece it together (using my best guess at naming of each component of the version identifier. Might be totally wrong tho):
1) prefix (optional): "gpt-", "chatgpt-"
2) family (required): o1, o3, o4, 4o, 3.5, 4, 4.1, 4.5
3) quality? (optional): "nano", "mini", "pro", "turbo"
4) type (optional): "audio", "search"
5) lifecycle (optional): "preview", "latest"
6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc (I assume the last ones are a date without a year for 2024?)
7) size (optional): "16k"
Some final combinations of these version number components are as small as 1 ("o3") or as large as 6 ("gpt-4o-mini-search-preview-2024-12-17").
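For illustration, here's a rough regex over that guessed structure; the component names and ordering are my reverse-engineering, not an official grammar, and it's almost certainly incomplete:

```python
import re

# Guessed grammar for OpenAI model names; for illustration only.
PATTERN = re.compile(
    r"^(?:(?P<prefix>gpt|chatgpt)-)?"
    r"(?P<family>o1|o3|o4|4o|3\.5|4\.5|4\.1|4)"
    r"(?:-(?P<quality>nano|mini|pro|turbo))?"
    r"(?:-(?P<type>audio|search))?"
    r"(?:-(?P<lifecycle>preview|latest))?"
    r"(?:-(?P<date>\d{4}-\d{2}-\d{2}|\d{4}))?"
    r"(?:-(?P<size>16k))?$"
)

for name in ("o3", "gpt-4o-mini-search-preview-2024-12-17", "gpt-3.5-turbo-1106"):
    m = PATTERN.match(name)
    print(name, "->", m.groupdict() if m else "no match")
```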
Given this mess, I can't blame people assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
o3 pro is based on o3 and its style and outputs will be quite similar to o3.
As an analogy, think of it like this:
o3-low ~ Ford Mustang with the accelerator gently pressed
o3-medium ~ Ford Mustang with the accelerator pressed
o3-high ~ Ford Mustang with the accelerator heavily pressed
o3 pro ~ Ford Mustang GT
Even though a Mustang GT is a different car than a Mustang, you don't give it a totally different name (e.g., Palomino). The similarity in name signals that it has a lot of the same characteristics but a souped-up engine. Same for o3 pro.
Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our original unified naming scheme made room for 999 versions, but we didn't make it past 3.
Edit: When I say the Mustang GT is a different car than a Mustang - I mean it literally. If you bought a Mustang GT and someone delivered a Mustang with a different trim, you wouldn't say "great, this is just what I ordered, with the same features/behavior/value." That we call it a different trim is a linguistic choice to signal to consumers that it's very similar, and built on the same production line, but comes with a different engine or different features. Similar to o3 pro.
Can you elaborate on what you mean when you say o3 pro is a GT? In particular, I don't understand how to reconcile your claim that o3 pro is in some way fundamentally different from o3 (albeit based on it) with this tweet:
> As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.
Yeah, I totally get the confusion here. Unfortunately I can't give the recipe behind our models, so there's going to be some irreducible blurriness here, but the following statements are all true:
- o3 pro is based on o3
- o3 pro uses the same underlying model as o3
- o3 pro is similar to o3, but is a distinct thing that's smarter and slower
- o3 pro is not o3 with longer reasoning
In my analogy, o3 pro vs o3 is more than just an input parameter (e.g., not just the accelerator input) but less than a full difference in model (e.g., Ford Mustang vs F-150). It's in between, kind of like a car trim with the same body but a stronger engine. Imperfect analogy, and I apologize if this doesn't feel like it adds any clarity. At the end of the day, it doesn't really matter how it works - what matters is whether people find it worth using.
This analogy might work better if the Mustang GT weren't, in fact, the same car as the Mustang. It's just a trim level, not a different car.
My guess is this comes from an org structure where you have multiple "pods" working on different research. Who comes up with the next shippable model, and when that happens, is kind of random, and the chaotic naming system comes from that. It's just my speculation and could be wildly wrong.
Could someone there maybe possibly use, oh I dunno, ChatGPT and come up with some better product names?
What's with the dropped benchmark performance compared to the original o3 release? It was disappointing not to see o4-mini on it as well.
What dropped benchmark performance?
o3 scores noticeably worse on benchmarks compared to the numbers in its original announcement.
Any link / source / anything? You've got quite an opportunity here: an OpenAI employee claiming there's no difference, and you've got something that shows there is.
Yes, the original announcement for o3 and o4-mini:
https://openai.com/index/introducing-o3-and-o4-mini/
o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.
o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.
Then, the new announcement
https://help.openai.com/en/articles/6825453-chatgpt-release-...
o3 scored 90 on AIME 2024 and 81 on GPQA.
o4-mini wasn't measured
---
Codeforces is the same, but there's a footnote that they're using a different dataset due to saturation, and there's still no baseline model to compare against.
Just because you work at OpenAI doesn't mean you know everything about OpenAI, especially something as strategic as nerfing models to save costs.
Not quantized?
Not quantized. Weights are the same.
If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].
[1] `chatgpt-4o-latest` being an explicit exception
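As a sketch of what that looks like from the API side (model names here are just examples; the response's `model` field reports what actually served the request):

```python
from openai import OpenAI

client = OpenAI()

PINNED = "gpt-4o-2024-08-06"    # dated snapshot: stays the same model
FLOATING = "chatgpt-4o-latest"  # rolling alias: explicitly allowed to be repointed

for model in (PINNED, FLOATING):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi."}],
    )
    # The `model` field on the response shows which snapshot actually handled the call.
    print(model, "->", resp.model)
```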
> we'd release it as a new model with a new name
Speaking of a new name: I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" prompt for OpenAI.
Google could at least learn something from this attitude, given their recent 03-25 -> 05-06 model alias switcheroo with zero notice :)
That is a preview / beta model with no expectation of stability. Google did nothing wrong there. No one should be using a preview model in production.
Hard disagree. Of course, technically they didn't do anything explicitly against the public guidance (the checks and balances would never let them), but naming a model with a date very strongly implies immutability.
It's the same logic as why UB in C/C++ isn't a license for the compiler to do whatever it wants. We're humans, and we operate on implications, common-sense assumptions, and trust.
The model is labelled as Preview. There are no guarantees of stability or availability for Preview models. Not intended for production workloads.
https://cloud.google.com/products?hl=en#product-launch-stage...
"At Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months."
There hasn't been a non-preview Gemini since... November? The previews ship on the same cadence as everyone else's releases; "preview" is just a magic wand that means Launchcal (Google's internal signoff tool, i.e. "Wave will never happen again") needs fewer signoffs. Then it got to the point where new models were getting swapped in under date-pinned names, in the name of doing us a favor, and that's a... novel idea, we can both agree on that at least.
I bet someone at Google would be a bit surprised to see someone jumping to legalese to act like this... novelty... is inherently due to the preview status, and based on anything more than a sense that there's no net harm done to us if it costs the same and is better.
I'm not sure they're wrong.
But it also leads to a sort of "nobody knows how anything works because we have 2^N configs and 5 bits" situation. For instance, 05-06 was also upgraded to 06-05. Except it wasn't: if you sent a variable thinking budget to 05-06 after the upgrade, it would fail. (And don't get me started on the 5 different thinking configurations across Gemini 2.5 Flash thinking vs. Gemini 05-06 vs. 06-05 and 0 thinking.)
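For reference, the kind of call being described looks roughly like this (a sketch assuming the google-genai SDK's thinking_config / thinking_budget parameters; the model names are the date-pinned previews mentioned above):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

for model in ("gemini-2.5-pro-preview-05-06", "gemini-2.5-pro-preview-06-05"):
    try:
        resp = client.models.generate_content(
            model=model,
            contents="Summarize Hamlet in one sentence.",
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_budget=1024),
            ),
        )
        print(model, "->", (resp.text or "")[:80])
    except Exception as exc:
        # The complaint above: after the "upgrade", the two date-pinned names
        # didn't actually behave the same for thinking parameters.
        print(model, "rejected the request:", exc)
```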
I honestly have no idea what you are trying to say.
It's a preview model - for testing only, not for production. Really not that complicated.
So you don't have anything to contribute beyond, and aren't interested in anything beyond, citing the terms?
Why are you in the comments section of an engineering news site?
(Note: beyond your, excuse me while I'm direct now, boorish know-nothing reply, the terms you are citing have nothing to do with the thing people are actually discussing around you, despite your best efforts. They don't say "we might swap in a new service, congrats!", nor do they have anything to say about that. Your legalese at most describes why they'd pull 05-06, not forward 05-06 to 06-05. That is a novel idea.)
This case was simply a matter of people not understanding the terms of service. There is nothing more to be said. It's that simple. The "engineers" should know that before deploying to prod. Basic competence.
And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.
> And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.
It’s always worth considering that this may be your problem. If you still don’t get it, the only valuable reply is one which asks a question. Also, including “it’s not that complicated” only serves to inflame.
John, do you understand that the thing you're quoting says "We reserve the right to pull things", not "We reserve the right to swap in a new service"?
Do you understand that even if it did say that, that wasn't true either? It was some weird undocumentable half-beast?
I have exactly your attitude about their cavalier use of preview for all things Gemini, and even people's use of the preview models.
But I've also been on this site for 15 years and am a bit wowed by your interlocution style here -- it's quite rare to see someone flip "the 3P provider swapped the service on us!" into "well, they said they could turn it off, so of course you should expect it to be swapped for the first time ever!", plus the obligatory dull sneer about the quality of other engineers.
How is this so hard to understand? It's a preview service for testing only, not intended for production.
I am done with this thread. We are going around in circles.
Well, no. Well, sure. You're done, but we're not going in circles. It'd just do too much damage to you to have to answer the simple question "Where does the legalese say they can swap in a new service?", so you have to pretend this is circular and just all-so-confusing; de facto, we have to pretend it is confusing and/or obviously wrong to use any Gemini 2+ at all.
It's a cute argument; as I noted, I'm even emotionally sympathetic to it, it's my favorite "get off my lawn." However, I've also been on the Internet long enough to know you write back, at length, when people try anti-intellectualism and why-are-we-even-talking-about-this as interaction.
https://cloud.google.com/terms/service-terms
"b. Disclaimer. PRE-GA OFFERINGS ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OR REPRESENTATIONS OF ANY KIND. Pre-GA Offerings (i) may be changed, suspended or discontinued at any time without prior notice to Customer and (ii) are not covered by any SLA or Google indemnity. Except as otherwise expressly indicated in a written notice or Google documentation, (A) Pre-GA Offerings are not covered by TSS, and (B) the Data Location Section above will not apply to Pre-GA Offerings."
There's a very large gulf between "what makes sense to Google" and "what makes sense to Human Beings". I have so many rants about Google's poor treatment of "customers" that they feel like Oracle to me now. Like every time I use them, I'm really just falling prey to my own misguided idea that this time I won't get screwed over.
The users aren't random "human beings" in this case. They are professional software developers who are expected to understand the basics. Deploying that model into production shows a lack of basic competence. It is clearly marked "preview" and is for test only.
That may be true, but it doesn't make the customer's claims untrue. What Google did was counter-intuitive. That's a fact. Pointing at some fine print and saying "uhh actually, technically your stupid human brain is the problem, not us! We're technically allowed to do anything we want, just look at the fine print!!" does not make things better. We are human beings; we are flawed. That much should be obvious to any human organization. If you don't know how to make things that don't piss off human beings, the problem isn't with the humans.
If the "preview release" you were using was v0.3, and suddenly it started being v0.6 without warning, that would be insane. The only point of providing a version number is to give people an indicator of consistency. The datestamp is a version number. If they didn't want us to expect consistency, they should not have given it a version number. That's the whole point of rolling release branches, they have no version. You don't have "v2.0" of a rolling release, you just have "latest". They fucked up by giving it a datestamp.
This is an extremely old and well-known problem with software interfaces. Either you version it or you don't. If you do version it, and change it, you change the version, and give people dependent on the old version some time to upgrade. Otherwise it breaks things, and that pisses people off. The alternative is not versioning it, which is a signal that there is no consistency to be expected. Any decent software developer should have known all this.
And while I'm at it: what's with the name flip-flopping? In 2014, GCP issued a press release explaining it was no longer using "Preview", but "Alpha" and "Beta" (https://cloudplatform.googleblog.com/2014/10/new-release-pha...). But the link you showed earlier says "Alpha" and "Beta" are now deprecated. But no press release? I guess that's our bad for not constantly reading the fine print and expecting it to revert back to something from 11 years ago.
It was definitely annoying when o1 disappeared overnight; my impression is that it was better at some tasks than o3.
I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).
I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times is one of the things that can slip through. We are not suggesting you are running a different model, but that you are quantizing it so that you can support more people.
This can't happen with open-weight models, where you load the model, allocate the memory, and run the thing. With OpenAI/Claude, we don't know which model is running, how large it is, what it is running on, etc. None of that is provided, and there is only one reason I can think of: to be able to reduce resources unnoticed.
An (arbitrarily) quantized model is a totally different model, compared to the original.
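A toy sketch of why, using symmetric int8 quantization on random "weights" (this illustrates the arithmetic only; it's not a claim about how any provider actually serves models):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000).astype(np.float32)   # stand-in for a weight tensor

scale = np.abs(w).max() / 127.0                     # symmetric int8 quantization
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale              # what the quantized model computes with

err = np.abs(w - w_deq)
print(f"max abs error:  {err.max():.6f}")
print(f"mean abs error: {err.mean():.6f}")
```

Every forward pass then runs on w_deq, not w, so the outputs drift even though it's nominally "the same" model.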
I'm not totally sure how, at this point in your time online, you read someone stating their job as a "brag" rather than what it really is: providing transparency/disclosure before stating their thoughts.
This is HN, not Reddit.
"I didn't read the ToS, like everyone else, but my guess..."
Ah, there it is.