As an occasional user of Mistral, I find their models give generally excellent results, and quickly at that. I think a lot of teams are now overly focused on winning benchmarks while producing worse real-world results.
If so, we need to fix the benchmarks.
I think there's a fundamental limit to benchmarks when it comes to real-world utility. The best option would be more like a user survey.
That's Chatbot Arena: https://lmarena.ai/leaderboard
And that was unfortunately revealed to be largely a vibe check with the whole Llama 4 debacle. But why should we be surprised, really, when users find it far easier to sense whether replies sound human, conversational, and _appear_ knowledgeable than to judge a model that may know more than they do? The Arena worked well in the early ChatGPT days… but now?
Those who try to fix them are fighting alone against huge corps that try to abuse them.