What's with the dropped benchmark performance compared to the original o3 release? It was also disappointing not to see o4-mini included.
What dropped benchmark performance?
o3 scores noticeably worse on benchmarks compared to the numbers in its original announcement.
Any link / source / anything? You've got quite an opportunity here: an OpenAI employee claiming there's no difference, and you have something that shows there is.
Yes, the original announcement for o3 and o4-mini:
https://openai.com/index/introducing-o3-and-o4-mini/
o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.
o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.
Then, the new announcement:
https://help.openai.com/en/articles/6825453-chatgpt-release-...
o3 scored 90 on AIME 2024 and 81 on GPQA.
o4-mini wasn't measured.
---
Codeforces is the same story, though there's a footnote saying they're using a different dataset due to saturation; even so, there's still no baseline model to compare against.