Bjorkbat 2 days ago

Related: when o3 finally came out, ARC-AGI updated their graph because it didn’t perform nearly as well as the version of o3 that “beat” the benchmark.

https://arcprize.org/blog/analyzing-o3-with-arc-agi

beering 2 days ago

The o3-preview test used a very expensive amount of compute, right? I remember it was north of $10k, so it makes sense it did better.

Bjorkbat 2 days ago

Point remains though, they crushed the benchmark using a specialized model that you’ll probably never have access to, whether personally or through a company.

They inflated expectations and then released a model to the public that underperforms.

throwaway314155 2 days ago

They revealed the price points for running those evaluations. IIRC the "high" level of reasoning cost tens of thousands of dollars, if not more. I don't think they really inflated expectations. In fact, a lot of what we learned is that ARC-AGI probably isn't a very good AGI evaluation (it claims not to be one, but the name suggests otherwise).