> With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.
This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see whether using the official thinking mechanism gives it enough space to produce a different style of output.
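For anyone who wants to try the prompt-level scratchpad version of this, a minimal sketch with the Anthropic Python SDK might look like the snippet below. The model id, tag names, and prompt text are just placeholders I picked for illustration, not anything from the original write-up:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Ask the model to reason in a throwaway <scratchpad> block first,
    # then report only the condensed findings in a separate <report> block.
    system_prompt = (
        "First think step by step inside <scratchpad> tags. "
        "Then write only your condensed findings inside <report> tags."
    )

    message = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model id; swap in whatever you use
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": "Review this diff for bugs: ..."}],
    )

    # The reply contains both blocks; keep only the <report> part downstream.
    print(message.content[0].text)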
Having tried both, I'd say o3 is in a league of its own compared to 3.7 or even Gemini 2.5 Pro. The benchmarks may not show much of a gap, but the difference matters a lot when the task is very complex. What's surprising is that they announced it last November and it was only released about a month ago. (I'm guessing a lot of safety work took time, but no idea.) Can't wait for o4!
All your comment threads from the past months consist of you saying how much better OpenAI products are than the competition, so that doesn't inspire a ton of trust.
Because in my use cases they are? Coding, math, and science research are my primary use cases, and Codex with o3, and o3 itself, consistently outperform the others on complex tasks for me. I can't say a model is better just to appeal to HN. If another model were as good as o3, I'd use it in a second.
I also feel similarly. o3 feels quite distinct in what it is good at compared to other models.
For example, I think 2.5 Pro and Claude 4 are probably better at programming. But for debugging, or not-super-well-defined reasoning tasks, or even just as a better search, o3 is in a league of its own. It feels like it can handle a wider breadth of tasks than other models.
Could you provide some links to relevant work/research on using a "scratchpad" that you liked?
I'm not much of an ML engineer but I can point you to the original chain of thought paper [0] and Anthropic's docs on how to enable their official thinking scratchpad [1].
[0] https://arxiv.org/pdf/2201.11903
[1] https://docs.anthropic.com/en/docs/build-with-claude/extende...
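For reference, turning on that official thinking scratchpad looks roughly like the snippet below. I'm going from memory of [1], so treat the model id and parameter names as assumptions and double-check them against the docs:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model id
        max_tokens=16000,                    # must be larger than the thinking budget
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{"role": "user", "content": "Audit this code for security bugs: ..."}],
    )

    # The response interleaves "thinking" blocks (the scratchpad) with "text"
    # blocks (the final report), so you can drop the former and keep the latter.
    report = "".join(b.text for b in response.content if b.type == "text")
    print(report)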