Davidzheng 2 days ago

o3 probably used to have a HUGE profit margin on inference, so I'd say it's unclear how much optimization was actually done.

programjames 2 days ago

I find it pretty plausible they got an 80% speedup just by making optimized kernels for everything. Even when GPUs say they're being 100% utilized, there are so many improvements to be made, like:

- Carefully interleaving shared memory loading with computation, and the whole kernel with global memory loading.

- Warp shuffling for softmax (see the sketch after this list).

- Avoiding shared memory bank conflicts in matrix multiplication.
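
To make the warp-shuffle point concrete, here's a minimal CUDA sketch (not from the original comment) of a row-wise softmax where each warp reduces its row's max and exp-sum entirely in registers via `__shfl_xor_sync`, instead of bouncing partial results through shared memory. The kernel name, row layout, and one-warp-per-row launch config are illustrative assumptions, not anything OpenAI has described.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__device__ float warp_reduce_max(float v) {
    // Butterfly reduction: after 5 steps every lane holds the warp-wide max.
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, offset));
    return v;
}

__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// One warp per row (blockDim.x == 32); row-major layout is an assumption.
__global__ void softmax_rows(const float* __restrict__ in,
                             float* __restrict__ out,
                             int rows, int cols) {
    int row  = blockIdx.x;
    int lane = threadIdx.x;
    if (row >= rows) return;
    const float* x = in  + (size_t)row * cols;
    float*       y = out + (size_t)row * cols;

    // Pass 1: row max (for numerical stability), reduced in registers.
    float m = -INFINITY;
    for (int c = lane; c < cols; c += 32) m = fmaxf(m, x[c]);
    m = warp_reduce_max(m);

    // Pass 2: sum of exponentials, again via register shuffles.
    float s = 0.f;
    for (int c = lane; c < cols; c += 32) s += expf(x[c] - m);
    s = warp_reduce_sum(s);

    // Pass 3: normalize.
    for (int c = lane; c < cols; c += 32) y[c] = expf(x[c] - m) / s;
}
```

Launched as something like `softmax_rows<<<rows, 32>>>(d_in, d_out, rows, cols);`, the shuffle reductions skip the shared-memory round trip and the extra `__syncthreads()` a block-wide reduction would need.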

I'm sure the guys at ClosedAI have implemented many more optimizations ;). They're probably eventually going to design their own chips or use photonic chips for lower energy costs, but there are still a lot of gains to be made in the software.

Davidzheng 2 days ago

Yes, I agree that it's very plausible. But it's just unclear whether it's more of a business decision or a real downstream effect of engineering optimizations (which I assume are happening every day at OA).