I think benchmark targeting is going to be a serious problem going forward. The recent Nate Silver podcast on poker performance is interesting. Basically, LLMs still suck at playing poker.
Poker tests intelligence. So what gives? One interesting thing is that, for whatever reason, poker performance isn't used as a benchmark in the LLM showdown between big tech companies.
The models have definitely improved in the past few years. I'm skeptical that there's been a "breakthrough", and I'm growing more skeptical of the exponential growth theory. It looks to me like the big tech companies are just throwing huge compute and engineering budgets at the existing transformer tech to improve benchmarks one by one.
I'm sure that if Google allocated 10 engineers and a dozen million dollars to improving Gemini's poker performance, it would increase. The idea behind AGI and the exponential growth hypothesis is that you don't have to do that, because the AI gets smarter in a general sense all on its own.
I think that's generally fair, but this point goes too far:
> improve benchmarks one by one
If you're right about that in the strong sense — that each task needs to be optimised in total isolation — then it would be a longer, slower road to a really powerful humanlike system.
What I think is really happening, though, is that progress on each specific task (e.g. coding) is having large spillover effects on other areas (e.g. helping the models get better at extended verbal reasoning even when they're not writing any code). The AI labs can't do everything at once, so they're focusing on areas where:
- It's easy to generate more data and measure results (coding, maths, etc.)
- There's a relative lack of good data in the existing training corpus (e.g. good agentic reasoning, the kinds of internal monologues that humans rarely write down)
- It would be immediately useful for the models to get better in a targeted way (e.g. agentic tool use; developing great hypothesis-generation instincts in scientific fields like algorithm design, drug discovery, and ML research)
By the time those tasks are optimised, I suspect the spillover effects will be substantial and the models will generally be much more capable.
Beyond that, the labs are all pretty open about the fact that they want to use the resulting coding, reasoning, and research skills to accelerate their own research. If that works (definitely not obvious yet), then finding ways to train a much broader array of skills could go much faster, because that process itself would be increasingly automated.