Are there any benchmarks that track historical performance?
Good question, and I don't know of any, although it's a no-brainer that someone should make one.
A proxy for that may be the anecdotal evidence of users who report back a month later that model X has gotten dumber (it started with GPT-4 and keeps happening, especially with Anthropic and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.
Aider has one, but it hasn't been updated in months. People kept claiming models were getting worse, but the results showed that they weren't.
Updated yesterday... https://aider.chat/docs/leaderboards/
That's not the one I'm referring to. See my other comments or your sibling comment.