> I am struggling to connect the relevance of this
> focusing on a single specific capability and
> I am not really invested in this niche topic
Right: I definitely ceded a "but it doesn't matter to me!" argument in my comment.
I sense a little "doth protest too much" in the multiple paragraphs devoted to taking that and extending it into the claim that the underpinning of automation is "irrelevant," "single," "specific," "niche."
This would also be news to DeepSeek, who put a lot of work into launching it in the R1 update a couple of weeks back.
Separately, I assure you, it would be news to anyone on the Gemini team that they don't care because they want to own everything. I passed this along via DM and got "I wish :)" in return - my understanding is there's been a fire drill trying to improve it via AIDER in the short term.
If we ignore that, and posit there's an upper-management conspiracy to suppress performance that's just getting public cover from a lower-upper-management rush to improve scores... I guess that's possible.
Finally, one of my favorite quotes is "when faced with a contradiction, first check your premises" - to your Q about why no one can compete with DeepSeek R1 25-01, I'd humbly suggest you may be undergeneralizing, given that even tool calls are "irrelevant" and "niche" to you.
I think the point remains that few have been able to catch up to OpenAI. For a while it was just Anthropic, then Google, after failing a bunch of times. So, if we relax this to LLMs not by OpenAI, Anthropic, or Google, then DeepSeek is really the only one that's managed to reach their quality tier (even though many others have thrown their hat into the ring). We can also get approximate glimpses into which models people use by looking at OpenRouter, sorted by Top Weekly.
In the top 10 are models by OpenAI (gpt-4o-mini), Google (Gemini Flashes and Pros), Anthropic (Sonnets), and DeepSeek. Even though the company list grows shorter if we instead look at top model usage grouped by order of magnitude, it retains the same companies.
Personally, the models meeting my quality bar are: GPT-4.1, o4-mini, o3, Gemini 2.5 Pro, Gemini 2.5 Flash (not 2.0), Claude Sonnet, DeepSeek, and DeepSeek R1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models by others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute-constrained scenarios, just that only DeepSeek reaches the quality tier of the big 3.
Interesting presumption that R1 25-01 is what's being talked about; you knowledge cut-off does appear to know the R1 update two weeks back was a thing, and that it even improved on function calling.
Of course you have to pretend I meant the former, otherwise "they all have" doesn't entirely make sense. Not that it made total sense before either, but if I say your definition of "they" is laughably narrow, I suspect you'll go back to your Google contact and confirm that nothing else really exists outside it.
Oh, and do a ctrl-F on "irrelevant", please - perhaps some fact-grounding is in order. There was an interesting conversation to be had about the underpinning of automation somehow without intelligence (Llama 4), but who has time for that when we can have hallucination go hand in hand with forced agendas (free disclaimer to boot) and projection ("doth protest too much")? Truly unforeseeable.
I don't know what you're talking about, partially because of poor grammar ("you knowledge cut-off does appear") and "presumption" (this was front and center on their API page at the R1 release, and it's in the R1 update notes). I sort of stopped reading after that, because I realized you might be referring to me having a "knowledge cut-off", which is bizarre and hard to understand, and it's unlikely to be a particularly interesting conversation given that and given that the last volley relied on lots of stuff about tool calling being, inter alia, niche.
> you might be referring to me having a "knowledge cut-off"
Don't forget I also referred to you having "hallucination". In retrospect, likening your logical consistency to an LLM's was premature, because not even gpt-3.5-era models could pull off a gem like:
> You: to your Q about why no one can compete with DeepSeek R1 25-01 blah blah blah
>> Me: ...why would you presume I was talking about 25-01 when 28-05 exists and you even seem to know it?
>>> You: this was front and center on their API page!
Riveting stuff. A few more digs about poor grammar and how many times you stopped reading, and you might even sell the misdirection.