> Go ask the operator of a Chinese room to do some math they weren't taught in school, and see if the translation guide helps.
That analogy only holds if LLMs can solve novel problems that provably don't appear in any form in their training material.
They do. Spend some time using a modern reasoning model. There is a class of interesting problems, nestled between trivial ones whose answers can simply be regurgitated and difficult ones that either yield nonsense or involve tool use, that transformer networks can absolutely, incontrovertibly reason about.
Have any LLMs solved any of the big (or even lesser known) unanswered problems in math, physics, computer science?
It may appear that they are solving novel problems but given the size of their training set they have probably seen them. There are very few questions a person can come up with that haven't already been asked and answered somewhere.
Google's AlphaEvolve recently produced a novel matrix multiplication function slightly faster than the previous state of the art that couldn't have been in any training data. While not a hard unsolved problem, I think it's good evidence that an LLM is capable of synthesizing new solutions to problems.
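To make that concrete: these results are about shaving scalar multiplications off the naive algorithm. The textbook example is Strassen's 2x2 scheme, which gets by with 7 multiplications instead of 8; AlphaEvolve's result is of the same flavor at larger sizes. A minimal Go sketch of the classic Strassen step, shown only as the well-known baseline and not Google's new algorithm:

```go
package main

import "fmt"

// strassen2x2 multiplies two 2x2 matrices using Strassen's classic
// scheme: 7 scalar multiplications instead of the naive 8. This is
// the textbook construction, not AlphaEvolve's result.
func strassen2x2(a, b [2][2]float64) [2][2]float64 {
	m1 := (a[0][0] + a[1][1]) * (b[0][0] + b[1][1])
	m2 := (a[1][0] + a[1][1]) * b[0][0]
	m3 := a[0][0] * (b[0][1] - b[1][1])
	m4 := a[1][1] * (b[1][0] - b[0][0])
	m5 := (a[0][0] + a[0][1]) * b[1][1]
	m6 := (a[1][0] - a[0][0]) * (b[0][0] + b[0][1])
	m7 := (a[0][1] - a[1][1]) * (b[1][0] + b[1][1])

	return [2][2]float64{
		{m1 + m4 - m5 + m7, m3 + m5},
		{m2 + m4, m1 - m2 + m3 + m6},
	}
}

func main() {
	a := [2][2]float64{{1, 2}, {3, 4}}
	b := [2][2]float64{{5, 6}, {7, 8}}
	fmt.Println(strassen2x2(a, b)) // [[19 22] [43 50]]
}
```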
Reason about: sure. Independently solve novel ones without extreme amounts of guidance: I have yet to see it.
Granted, for most language and programming tasks, you don’t need the latter, only the former.
99.9% of humans will never solve a novel problem. It's a bad benchmark to use here.
But they will solve a problem novel to them, since they haven't read all of the text that exists.
I agree. But it's worth being somewhat skeptical of ASI scenarios if you can't, for example, give a well-formulated math problem to an LLM and have it solve it. Until we get a Riemann hypothesis calculator (or the equivalent for other hard/old unsolved maths), it's kind of silly to be debating the extreme ends of AI cognition theory.
"I'm taking this talking dog right back to the pound. It completely whiffed on both Riemann and Goldbach. And you should see the buffer overflows in the C++ code it wrote for me."
I have been able to get ChatGPT to synthesize at the edges of two domains in ideaspace, say psychology and economics, but surprisingly it struggled to help me write ODE code in Go. In the first case, I think it actually synthesized. In the latter it couldn't pull enough ideas from the two fields together into one.
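For a sense of scale of what "ODE code in Go" involves, here is a minimal hand-rolled sketch of a fixed-step RK4 integrator. The function names and the harmonic-oscillator example are illustrative assumptions, not the actual problem the model was asked about:

```go
package main

import "fmt"

// rk4Step advances the state y of the ODE dy/dt = f(t, y) by one
// step of size h using the classic fourth-order Runge-Kutta method.
func rk4Step(f func(t float64, y []float64) []float64, t float64, y []float64, h float64) []float64 {
	k1 := f(t, y)
	k2 := f(t+h/2, axpy(y, k1, h/2))
	k3 := f(t+h/2, axpy(y, k2, h/2))
	k4 := f(t+h, axpy(y, k3, h))

	out := make([]float64, len(y))
	for i := range y {
		out[i] = y[i] + h/6*(k1[i]+2*k2[i]+2*k3[i]+k4[i])
	}
	return out
}

// axpy returns y + s*k without modifying y.
func axpy(y, k []float64, s float64) []float64 {
	out := make([]float64, len(y))
	for i := range y {
		out[i] = y[i] + s*k[i]
	}
	return out
}

func main() {
	// Example system: simple harmonic oscillator x'' = -x,
	// written as the first-order system [x, v].
	f := func(t float64, y []float64) []float64 {
		return []float64{y[1], -y[0]}
	}

	y := []float64{1, 0} // x(0) = 1, v(0) = 0
	h := 0.01
	for t := 0.0; t < 3.14159; t += h {
		y = rk4Step(f, t, y, h)
	}
	fmt.Println(y) // x(pi) should be close to -1, v(pi) close to 0
}
```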
How can you distinguish "I think it did something really impressive in the first case but not the second" from "it spat out something that looked interesting in both cases, but in the latter case there was an objective criterion that exposed a lack of true understanding"?
It's famously easier to impress people with soft-sciences speculation than it is to impress the rules of math or compilers.
I think people give training data too much credit. Obviously it's important, but it also isn't a database of knowledge like it's made out to be.
You can see this in riddles that are obviously in the training set, but older or lighter models still get them wrong. Or situations where the model gets them right, but uses a different method than the ones used in the training set.