I don't know how someone could be following the technical progress in detail and hold this view. The progress is astonishing, and the benchmarks are becoming saturated so fast that it's hard to keep track.
Are there plenty of gaps left between here and most definitions of AGI? Absolutely. Nevertheless, how can you be sure those gaps will remain, given how many faculties these models have already excelled at (translation, maths, writing, code, chess, algorithm design, etc.)?
It seems to me like we're down to a relatively sparse list of tasks and skills where the models aren't getting enough training data, or are missing tools and sub-components required to excel. Beyond that, it's just a matter of iterative improvement until 80th percentile coder becomes 99th percentile coder becomes superhuman coder, and ditto for maths, persuasion and everything else.
Maybe we hit some hard roadblocks, but room for those challenges to be hiding seems to be dwindling day by day.
I think benchmark targeting is going to be a serious problem going forward. The recent Nate Silver podcast on poker performance is interesting. Basically, the LLMs still suck at playing poker.
Poker tests intelligence. So what gives? One interesting thing is that, for whatever reason, poker performance isn't used as a benchmark in the LLM showdown between the big tech companies.
The models have definitely improved in the past few years. I'm skeptical that there's been a "breakthrough", and I'm growing more skeptical of the exponential growth theory. It looks to me like the big tech companies are just throwing huge compute and engineering budgets at the existing transformer tech, to improve benchmarks one by one.
I'm sure that if Google allocated 10 engineers and a dozen million dollars to improving Gemini's poker performance, it would increase. The idea behind AGI and the exponential growth hypothesis is that you don't have to do that, because the AI gets smarter in a general sense all on its own.
I think that's generally fair, but this point goes too far:
> improve benchmarks one by one
If you're right about that in the strong sense — that each task needs to be optimised in total isolation — then it would be a longer, slower road to a really powerful humanlike system.
What I think is really happening, though, is that each specific task (eg. coding) is having large spillover effects on other areas (eg. helping the models get better at extended verbal reasoning even when not writing any code). The AI labs can't do everything at once, so they're focusing where:
- It's easy to generate more data and measure results (coding, maths, etc.)
- There's a relative lack of good data in the existing training corpus (eg. good agentic reasoning: the kinds of internal monologues that humans rarely write down)
- It would be immediately useful for the models to get better in a targeted way (eg. agentic tool use; developing great hypothesis-generation instincts in scientific fields like algorithm design, drug discovery and ML research)
By the time those tasks are optimised, I suspect the spillover effects will be substantial and the models will generally be much more capable.
Beyond that, the labs are all pretty open about the fact that they want to use the resulting coding, reasoning and research skills to accelerate their own research. If that works (definitely not obvious yet), then finding ways to train a much broader array of skills could be much faster, because that process itself would be increasingly automated.