MattDaEskimo 2 days ago

o3 scores noticeably worse on benchmarks now than in its original announcement

refulgentis 2 days ago

Any link / source / anything? You've got quite an opportunity here: an OpenAI employee is claiming there's no difference, and you have something that shows there is.

MattDaEskimo 2 days ago

Yes, the original announcement for o3 and o4-mini:

https://openai.com/index/introducing-o3-and-o4-mini/

o3 scored 91.6 on AIME 2024 and 83.3 on GPQA

o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA

Then, the new announcement:

https://help.openai.com/en/articles/6825453-chatgpt-release-...

o3 scored 90 on AIME 2024 and 81 on GPQA

o4-mini wasn't measured

---

Codeforces is the same story. They do include a footnote saying they used a different dataset due to saturation, but there's still no baseline model to compare against.