Is this what happened to Gemini 2.5 Pro? It used to be very good, but it's started struggling on basic tasks.
The thing that gets me is that it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page, and it sometimes takes multiple screenshots to convince it that it's wrong.
The Aider Discord community has repeatedly proposed the theory that 2.5 Pro became worse, and repeatedly disproven it through benchmark runs.
It had a few bugs here or there when they pushed updates, but it didn't get worse.
Gemini is objectively exhibiting new behavior with the same prompts, and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.
My question is not whether this is true (it is) but why it's happening.
I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's plausible, considering Google probably A/B tests against benchmarks to decide whether training or architectural changes need to be reverted.
But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly, so I'm skeptical that they're state of the art when it comes to benchmarking Gemini.
Gemini 2.5 Pro is the highest-ranking model on the aider benchmark leaderboard.
For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.
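To make concrete how binary that scoring is, here's a minimal Python sketch of a pass/fail harness in that spirit. This is my own illustration, not aider's actual code: the edit-format check, the make targets, and all the names are assumptions.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Result:
    edit_format_ok: bool  # model output parsed as a valid edit block
    build_ok: bool        # edited code built without errors
    tests_ok: bool        # unit tests passed

    @property
    def passed(self) -> bool:
        # Scoring is binary: every stage must succeed, no partial credit.
        return self.edit_format_ok and self.build_ok and self.tests_ok

def parse_edits(model_output: str) -> list[str] | None:
    """Toy stand-in for the edit-format check: require a fenced code block."""
    if "```" not in model_output:
        return None
    return [model_output]

def run_exercise(model_output: str, repo_dir: str) -> Result:
    if parse_edits(model_output) is None:
        return Result(False, False, False)
    # A real harness would apply the parsed edits to the repo here.
    # Assumes the exercise repo exposes build/test targets via make.
    build = subprocess.run(["make", "build"], cwd=repo_dir)
    if build.returncode != 0:
        return Result(True, False, False)
    tests = subprocess.run(["make", "test"], cwd=repo_dir)
    return Result(True, True, tests.returncode == 0)
```

The point being: a model can get subjectively worse at reasoning or writing while still clearing all three of those gates at the same rate.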
I primarily use aider + 2.5 Pro for planning/spec files, and occasionally have it do file edits directly. It works great, other than my having to stop it mid-execution once in a while.
My use case is mostly creative writing.
IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants, and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems uniformly better than 05-06, with far fewer "wow, this thing is dumb as a rock" failure modes, but it's still not as strong as the 03-25 release.
Entirely anecdotally, 06-05 seems to ride exactly the line of "good enough to be the best, but no better than that", presumably to save costs versus the OG 03-25.
In addition, Google is serving something notably different on AI Studio versus the Gemini site/app, maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard, and I do think the AI Studio version is better.