As an occasional user of Mistral, I find their models give generally excellent results, and quickly at that. I think a lot of teams are now overly focused on winning benchmarks while producing worse real-world results.
If so, we need to fix the benchmarks.
I think there's a fundamental limit to benchmarks when it comes to real-world utility. The best option would be more like a user survey.
That's Chatbot Arena: https://lmarena.ai/leaderboard
And that was unfortunately revealed to be largely a vibe check with the whole Llama 4 debacle. But why should we be surprised, really, when users find it far easier to sense whether replies sound human, conversational, and _appear_ knowledgeable than to judge a model that may know more than they do? The Arena worked well in the early ChatGPT days… but now?
Those who try to fix them are fighting alone against huge corps that try to abuse them.