reissbaker 1 day ago

It's not better than full R1; Mistral is using misleading benchmarks. The latest version of R1, R1-0528, is much better: 91.4% on AIME2024 pass@1. Mistral uses the original R1 release from January in their comparisons, presumably because it makes their numbers look more competitive.

That being said, it's still very impressive for a 24B.

What I'm really wondering is why the new R1 isn't beating o3 and 2.5 Pro on every single benchmark.

Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.

Of course, OpenAI, Google, and Anthropic will have released new models by then too...

redman25 1 day ago

It may not have been intentionally misleading. Some benchmarks take a lot of horsepower and time to run. Their release preparation was likely done well in advance, before the new DeepSeek R1 model was even available to test.

reissbaker 1 day ago

AIME24 and the like are pretty cheap to run through any DeepSeek API. Regardless, they didn't even run the R1 benchmarks themselves; they just republished DeepSeek's numbers from January. They could have used the May numbers, but chose not to.