The o3-preview test used a very expensive amount of compute, right? I remember it being north of $10k, so it makes sense that it did better.
The point remains, though: they crushed the benchmark using a specialized model that you’ll probably never have access to, whether personally or through a company.
They inflated expectations and then released a model to the public that underperforms.
They revealed the price points for running those evaluations. IIRC the "high" level of reasoning cost tens of thousands of dollars if not more. I don't think they really inflated expectations. In fact a lot of what we learned is that ARC-AGI probably isn't a very good AGI evaluation (it claims to not be one, but the name suggests otherwise).