hmottestad 2 days ago

With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering why the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Magistral Small is only 24B and scores 70.7% on AIME2024, while the 32B distill of R1 scores 72.6%. And with majority voting @64, Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU, it's a lot more accessible than the full-blown R1.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...
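Rough sketch of what maj@64 means, just sampling many completions per problem and taking the most common final answer (the sampled answers below are made up):

```python
# Minimal sketch of majority voting ("maj@64"): sample N completions per
# problem and take the most frequent final answer. The sampled answers
# below are made up for illustration.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among the sampled completions."""
    return Counter(answers).most_common(1)[0][0]

sampled = ["204"] * 40 + ["210"] * 15 + ["96"] * 9  # 64 samples for one problem
print(majority_vote(sampled))  # -> "204"
```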

reissbaker 1 day ago

It's not better than full R1; Mistral's benchmark comparison is misleading. The latest version of R1, R1-0528, is much better: 91.4% pass@1 on AIME2024. Mistral compares against the original R1 release from January, presumably because it makes their numbers look more competitive.

That being said, it's still very impressive for a 24B.
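For context, pass@1 and maj@64 aren't directly comparable metrics; pass@k is usually estimated with the unbiased estimator from the HumanEval/Codex paper, sketched here with made-up numbers:

```python
# Unbiased pass@k estimator (HumanEval/Codex paper): probability that at
# least one of k samples is correct, given n samples with c correct overall.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 58 of 64 samples correct gives ~90.6% pass@1.
print(pass_at_k(n=64, c=58, k=1))
```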

> I'm really wondering why the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch; they continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.

Of course, OpenAI, Google, and Anthropic will have released new models by then too...

redman25 1 day ago

It may not have been intentionally misleading. Some benchmarks can take a lot of horsepower and time to run, and their release preparation was likely done well in advance, before the new DeepSeek R1 model was even available to test.

reissbaker 1 day ago

AIME24 etc. are pretty cheap to run using any DeepSeek API. Regardless, they didn't even run the benchmarks for R1 themselves; they just republished DeepSeek's numbers from January. They could have published the ones from May, but chose not to.
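Rough sketch of what "cheap to run" looks like against an OpenAI-compatible endpoint; the base URL, model id, and naive answer extraction below are assumptions for illustration, not anyone's actual eval harness:

```python
# Scoring AIME-style problems over an OpenAI-compatible API (sketch only).
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")  # assumed endpoint

problems = [
    # (problem statement, ground-truth answer) -- placeholder data
    ("What is 2 + 2?", "4"),
]

correct = 0
for question, truth in problems:
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model id
        messages=[{"role": "user",
                   "content": question + "\nReply with only the final integer."}],
    )
    answer = resp.choices[0].message.content.strip()
    correct += (answer == truth)

print(f"pass@1: {correct / len(problems):.1%}")
```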

adventured 2 days ago

It's because DeepSeek was a fast copy. That was the easy part, and it's why they didn't have to use so much compute to get near the top. Going well beyond o3 or 2.5 Pro is drastically more expensive than a fast copy. China's cultural approach to building substantial things produces this sort of outcome regularly; you see the same approach in automobiles, planes, Internet services, industrial machinery, the military, et al. Innovation is very expensive and time consuming; fast copying is more often very inexpensive and rapid. 85% good enough is often good enough, and that additional 10-15% is comically expensive and difficult as you climb.

orbital-decay 1 day ago

This kind of terrible, vague stereotyping about "China", from someone with no clue about the subject, should have no place on HN, but somehow it always creeps in and gets upvoted. DeepSeek is not "China"; they had nobody to copy from. They released their first 7B reasoning model back in April 2024; it was ahead of then-SotA models in math and validated their approach. They did a ton of new things besides training a reasoning model, and likely have more to come, as they have a completely different background than most AI companies. It's more of a cross-pollination of different areas of expertise.

SoMomentary 1 day ago

I thought it had been bandied about that DeepSeek exfiltrated a bunch of data from OpenAI's models, which was then used to train theirs? Did this ultimately prove untrue? My apologies, I don't always keep up on the latest drama in AI circles - maybe that has since been proven wrong.

orbital-decay 1 day ago

Sam Altman threw a fit and claimed this, without providing evidence. He's... not exactly a person to trust blindly. Training on other models' outputs (or at least doing sanity checks against them) is pretty common, but these models seem very different, DS has prior art, and by all signs the claim makes little sense and is hard to believe.

glomgril 1 day ago

one man's exfiltration is another man's distillation `¯\_(ツ)_/¯`

you could say they're playing by a different set of rules, but distilling from the best available model is the current meta across the industry. only they know what fraction of their post-training data is generated from openai models, but personally i'd bet my ass it's greater than zero because they are clearly competent and in their position it would have been dumb to not do this.

however you want to frame it, they have pushed the field forward -- especially in the realm of open-weight models.
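For the curious, distillation at the output level usually just means collecting a stronger model's responses to your prompts and using them as SFT data; a minimal sketch, with a hypothetical teacher endpoint and model id:

```python
# Output-level distillation sketch: save a teacher model's responses as
# supervised fine-tuning data. Endpoint and model id are placeholders.
import json
from openai import OpenAI

teacher = OpenAI(api_key="YOUR_KEY", base_url="https://example.com/v1")  # hypothetical

prompts = ["Explain the pigeonhole principle with one example."]  # toy prompt set

with open("sft_data.jsonl", "w") as f:
    for prompt in prompts:
        resp = teacher.chat.completions.create(
            model="teacher-model",  # placeholder id
            messages=[{"role": "user", "content": prompt}],
        )
        f.write(json.dumps({
            "prompt": prompt,
            "response": resp.choices[0].message.content,
        }) + "\n")
```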

natrys 2 days ago

Not disagreeing with the overarching point but:

> That was the easy part

Is a bit hand-wavy in that it doesn't explain why it's only DeepSeek that can do this "easy" thing, but still not Meta, Mistral, or anyone else really. There are many other players with far more compute than DeepSeek (even inside China, never mind the rest of the world), and I can assure you more or less everyone trains on synthetic data/distillation from whatever bigger model they can access.

refulgentis 1 day ago

They all have. I don't hope to convince you of that; everyone's use case differs. Generally, AIME / prose / code benchmarks that don't involve successive tool calls are used to hide some very dark realities.

IMHO tool calling is by far the most clearly economically valuable function for an LLM, and r1 self-admittedly just...couldn't do it.
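Concretely, tool calling means the model returns a structured call against a declared schema instead of prose; a toy OpenAI-style illustration (the weather tool is made up):

```python
# Toy illustration of OpenAI-style tool calling: the model is shown a JSON
# schema for a function and is expected to emit a structured call to it
# rather than free-form text. The get_weather tool is a made-up example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# A well-behaved model responds with something like:
model_tool_call = {"name": "get_weather", "arguments": '{"city": "Oslo"}'}
print(tools[0]["function"]["name"], model_tool_call["arguments"])
```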

There's a lot of puff out there that's just completely misaligned with reality, e.g. Gemini 2.5 Pro is by far the worst tool caller, Gemini 2.5 Flash thinking is better, and 2.5 Flash (non-thinking) is better still. And either Llama 4 variant beats all the Gemini 2.5s except non-thinking 2.5 Flash.

I'm all for "these differences will net out in the long run"; Google's at least figured out how to micro-optimize for Aider edit formatting without tools. Over the last 3 months, they're up 10% on edit performance. But it's horrible UX to have these specially formatted code blocks in the middle of prose. They desperately need to clean up their absurd tool-calling system. But I've been saying that for a year now, and they don't take it seriously at all.

One of their most visible leads tweeted "hey, what are the best edit formats?" and a day later was tweeting the official guide for doing edits. I'm a Xoogler and that absolutely reeks of BigCo dysfunction - someone realized there was a problem 2 months after release, we've now "fixed" it without training, and now that's the right way to do things. Because if it isn't, well, what would we do? Shrugs

I'm also unsure how much longer it's worth giving a pass on this stuff. Everyone is competing on agentic stuff because that's the golden goose, real automation, and that needs tools. It would be utterly unsurprising to me for Google to keep missing the pain signal on this, unlike Anthropic, which doubled down on it in mid-2024.

As long as I'm dumping info: BFCL is not a good proxy for this quality. Think "converts prose to JSON", not "file reading and editing".

natrys 1 day ago

I don't mind the info dump, but I am struggling to connect the relevance of this to the topic at hand. I mean, focusing on a single specific capability and generalising it to mean "they all have" caught up with DeepSeek across the board (which was the original topic) is a reductive and wild take. Especially when this seems to me more a matter of misaligned incentives than a truly hard problem.

I am not really invested in this niche topic, but I will observe that yes, I agree Llama 4 is really good here. And yet it's a far worse coder and far less intelligent than DeepSeek, and that's not even arguable. So no, it didn't "catch up" any more than you could claim by pointing out that Llama is multimodal but DeepSeek isn't. That's just talking about different things entirely.

Regardless, I do agree BFCL is not the best measure either; Tau-bench is more relevant to the real world. But at the end of the day, most frontier labs are not incentive-aligned to care about this. Meta cares because this is something Zuck personally cares about; Llama models are actually for small businesses solving grunt automation, not for random people coding at home. Players like Salesforce care (xLAM), and even China had GLM before DeepSeek was a thing. DeepSeek might care so long as it looks good on coding benchmarks, but that's pretty much the extent of it.

And I suspect Google doesn't truly care because in the long run they want to build everything themselves. They already have a CodeAssist product around coding, which likely uses a fine-tune of their mainline Gemini models to do something even more specific to their plugin.

There is a possibility that at the frontier, models are struggling to get better in a specific and constrained way without getting worse at other things. It's either that, or even Anthropic has gone rogue, because their Aider scores are way down now from before. How does that make sense if they are supposed to be all-around better at agentic stuff in a tool-agnostic way? Then you realise they now have Claude Code, and it just makes way more economic sense to tie yourself to that and be context-inefficient to your heart's content so that you can burn tokens, instead of being, you know, just generally better.

refulgentis 1 day ago

> I am struggling to connect the relevance of this

> focusing on a single specific capability and

> I am not really invested in this niche topic

Right: I definitely ceded a "but it doesn't matter to me!" argument in my comment.

I sense a little "doth protest too much" in the multiple paragraphs devoted to taking that and extending it into the claim that the underpinning of automation is "irrelevant", "single", "specific", "niche".

This would also be news to DeepSeek, who put a lot of work into launching it in the r1 update a couple of weeks back.

Separately, I assure you it would be news to anyone on the Gemini team that they don't care because they want to own everything. I passed this along via DM and got "I wish :)" in return - my understanding is there's been a fire drill trying to improve it via Aider in the short term.

If we ignore that, and posit there's an upper-management conspiracy to suppress performance, one that's just getting public cover from a lower-upper-management rush to improve scores... I guess that's possible.

Finally, one of my favorite quotes is "when faced with a contradiction, first check your premises" - to your question about why no one can compete with DeepSeek R1 25-01, I'd humbly suggest you may be undergeneralizing, given that even tool calls are "irrelevant" and "niche" to you.

Vetch 1 day ago

I think the point remains that few have been able to catch up to OpenAI. For a while it was just Anthropic, then Google after failing a bunch of times. So if we relax this to LLMs not by OpenAI, Anthropic, or Google, then DeepSeek is really the only one that's managed to reach their quality tier (even though many others have thrown their hats into the ring). We can also get approximate glimpses into which models people actually use by looking at OpenRouter, sorted by Top Weekly.

In the top 10 are models by OpenAI (GPT-4o mini), Google (the Gemini Flashes and Pros), Anthropic (the Sonnets), and DeepSeek. Even if we instead look at top model usage grouped by order of magnitude, the list grows shorter but retains the same companies.

Personally, the models meeting my quality bar are: GPT-4.1, o4-mini, o3, Gemini 2.5 Pro, Gemini 2.5 Flash (not 2.0), Claude Sonnet, DeepSeek, and DeepSeek R1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models from others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute-constrained scenarios, just that only DeepSeek reaches the quality tier of the big 3.

natrys 1 day ago

Interesting presumption about R1 25-01 being what's talked about, you knowledge cut-off does appear to know R1 update two weeks back was a thing, and that it even improved on function calling.

Of course you have to pretend I meant the former, otherwise "they all have" doesn't entirely make sense. Not that it made total sense before either, but if I say your definition of "they" is laughably narrow, I suspect you will go back to your Google contact and confirm that nothing else really exists outside it.

Oh, and do a ctrl-F on "irrelevant", please; perhaps some fact grounding is in order. There was an interesting conversation to be had about the underpinning of automation somehow existing without intelligence (Llama 4), but who has time for that when we can have hallucination go hand in hand with forced agendas (free disclaimer to boot) and projection ("doth protest too much")? Truly unforeseeable.

refulgentis 1 day ago

I don't know what you're talking about, partially because of poor grammar ("you knowledge cut-off does appear") and "presumption" (this was front and center on their API page at the r1 release, and it's in the r1 update notes). I sort of stopped reading there, because I realized you might be referring to me having a "knowledge cut-off", which is bizarre and also hard to understand, and it's unlikely to be a particularly interesting conversation given that and given that the last volley relied on lots of stuff about tool calling being, inter alia, niche.

natrys 1 day ago

> you might be referring to me having a "knowledge cut-off"

Don't forget I also referred to you having "hallucination". In retrospect, likening your logical consistency to an LLM's was premature, because not even GPT-3.5-era models could pull off a gem like:

> You: to your Q about why no one can compete with DeepSeek R1 25-01 blah blah blah

>> Me: ...why would you presume I was talking about 25-01 when 28-05 exists and you even seem to know it?

>>> You: this was front and center on their API page!

Riveting stuff. A few more digs about poor grammar and how many times you stopped reading, and you might even sell the misdirection.

MaxPock 2 days ago

I understand that the French are very innovative, so why isn't their model SOTA?