roboboffin 2 days ago

I think that their point was that the problem is easily solvable by humans without code, and shows the ability to chain steps together to achieve a goal.

jwitthuhn 2 days ago

Is it easily solvable by humans without code? I suspect if you asked a human to write down all the steps in order to solve a Tower of Hanoi with 12 disks they would also give up before completing it. Writing code that produces the correct output is the only realistic way to solve that purely due to the amount of output required.

roboboffin 2 days ago

Not sure why I am being downvoted. I am simply saying that we know there is a defined algorithm for solving Tower of Hanoi, and the source code for it is widely available. So o3 producing the code as an answer demonstrates even less intelligence, as it means the code is either memorized or copied from the internet. I don't see how this point counters the paper at all.

I believe what they are trying to show in that paper is that as the chain of operations grows long (their proxy for complexity), an LLM will inevitably fail. Humans don't have infinite context either, but they can still solve the Tower of Hanoi without needing to resort to pen and paper or to coding.

syntex 2 days ago

I didn't downvote. The problem with the paper is that it asks the model to output all moves for, say, 15 disks: 2^15 - 1 = 32767

32767 moves in a single prompt. That's not testing reasoning. That’s testing whether the model can emit a huge structured output without error, under a context window limit.

The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000.

For me o3 returning Python code isn’t a failure. It’s a smart shortcut. The failure is in the benchmark design. This benchmark just smells.
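
For reference, the kind of Python shortcut being discussed could look roughly like the sketch below. This is not the code o3 returned and not anything from the paper; it is just the textbook recursive solution, with the function names and peg labels ("A", "B", "C") made up for illustration. It mainly shows how short the algorithm is compared with the 32767 moves it produces for 15 disks.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) tuples for the optimal n-disk solution.

    Peg labels are arbitrary illustrative names, not taken from the paper.
    """
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest remaining disk to its destination.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, target, source)


def count_moves(n):
    """Count the moves by actually generating them (should equal 2**n - 1)."""
    return sum(1 for _ in hanoi_moves(n))


if __name__ == "__main__":
    for disks in (3, 8, 15):
        print(disks, count_moves(disks), 2 ** disks - 1)
```

Using a generator means the full move list never has to be held in memory, which is the opposite of what the benchmark asks a model to do in a single response.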

daveguy 2 days ago

> That’s testing whether the model can emit a huge structured output without error, under a context window limit.

Agreed. But to be fair, 1) a relatively simple algorithm can do it, and more importantly 2) a lot of people are trying to build products around doing exactly this (emit large structured output without error).

roboboffin 2 days ago

No worries, I wasn’t saying that to you directly.

I agree 15 disks is very difficult for a human, probably on a sheer stamina level, but I managed to do 8 in about 15 minutes by playing around (i.e. no practice). They do state that there is a massive drop in performance at this point.

teach 2 days ago

Remember that with Towers of Hanoi every extra disk roughly doubles the number of moves required (n disks take 2^n - 1 moves). So 15 disks is about 128x more moves than 8. If you did eight in 15m then fifteen would take you about 32 hours.
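
A quick way to check that estimate, as a rough sketch: it assumes the time per move stays constant, which is a simplification (no fatigue, no mistakes), and the function names are just illustrative. The 8 disks in 15 minutes baseline is the figure quoted upthread.

```python
def min_moves(disks):
    """Minimum number of moves for a Tower of Hanoi with this many disks."""
    return 2 ** disks - 1


def estimated_minutes(disks, known_disks=8, known_minutes=15.0):
    """Scale a known solve time by the ratio of minimum move counts.

    Assumes a constant time per move, which is a simplification.
    """
    return known_minutes * min_moves(disks) / min_moves(known_disks)


if __name__ == "__main__":
    ratio = min_moves(15) / min_moves(8)
    hours = estimated_minutes(15) / 60
    print(f"move ratio: {ratio:.1f}x, estimated time: {hours:.1f} hours")
```

The exact ratio is 32767 / 255, slightly above 128, so the estimate comes out a little over 32 hours.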