SparkyMcUnicorn 2 days ago

The Aider Discord community has proposed the theory that 2.5 Pro became worse, and disproven it several times through many benchmark runs.

It had a few bugs here or there when they pushed updates, but it didn't get worse.

ants_everywhere 2 days ago

Gemini is objectively exhibiting new behavior with the same prompts and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.

My question is not whether this is true (it is) but why it's happening.

I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's plausible, since Google presumably uses A/B testing on benchmarks to decide whether training or architectural changes need to be reverted.

But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly, so I'm skeptical that they're the state of the art with respect to benchmarking Gemini.

SparkyMcUnicorn 2 days ago

Gemini 2.5 Pro is the highest ranking model on the aider benchmarks leaderboard.

For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.
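That pass/fail criterion can be sketched roughly as follows. This is an illustrative harness, not aider's actual benchmark code: the SEARCH/REPLACE markers match aider's documented "diff" edit format, but the function names and the simple compile-then-test flow are assumptions for the sake of the example.

```python
import re

def response_passes(response: str, run_tests) -> bool:
    """Binary pass/fail, no partial credit (illustrative, not aider internals)."""
    # 1. Edit format check: expect a SEARCH/REPLACE-style block.
    if not re.search(r"<<<<<<< SEARCH.*?=======.*?>>>>>>> REPLACE",
                     response, re.DOTALL):
        return False
    # 2. Build check: extract the replacement code and make sure it compiles.
    code = response.split("=======", 1)[1].split(">>>>>>> REPLACE", 1)[0]
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError:
        return False
    # 3. Unit tests must pass; any failure means the whole attempt fails.
    return run_tests(code)
```

The point of the all-or-nothing structure is that a model which drifts on instruction-following (edit format) fails the benchmark just as hard as one that writes buggy code, so regressions in either dimension would show up in the scores.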

I primarily use aider + 2.5 Pro for planning/spec files, and occasionally have it do file edits directly. Works great, other than having to stop it mid-execution once in a while.