Are there any benchmarks that track historical performance?
Good question, and I don't know of any, although it's a no-brainer that someone should make one.
A proxy for that may be the anecdotal evidence of users who report back a month later that model X has gotten dumber (it started with GPT-4 and keeps happening, especially with Anthropic and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.
Aider has one, but it hasn't been updated in months. People kept claiming models were getting worse, but the results showed that they weren't.
Updated yesterday... https://aider.chat/docs/leaderboards/
That's not the one I'm referring to. See my other comments or your sibling comment.