yjftsjthsd-h 1 day ago

> Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models.

The context length alone probably makes it worthwhile even if your models fit in memory, but I'm curious whether it improves tokens/sec even when everything is on the GPU, since in my (very amateur) understanding LLMs tend to be constrained by memory bandwidth?
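
(My very rough mental math for why I'd hope so, with made-up but plausible numbers: at batch size 1, every generated token streams all of the weights through the memory bus, so bandwidth sets an upper bound on decode speed.)

    # Toy bandwidth-bound decode estimate; all numbers are illustrative guesses
    hbm_bandwidth = 1.6e12      # bytes/s, roughly an A100's HBM bandwidth
    weight_bytes = 8e9 * 2      # e.g. an 8B-parameter model in BF16 (2 bytes/param)

    # Each token reads every weight once, so tokens/sec is capped at about:
    print(hbm_bandwidth / weight_bytes)  # ~100 tokens/s upper bound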

brigade 14 hours ago

It does not; the decompression is memory-to-memory, one tensor at a time, so it's actually worse. They claim less than 200 GB/s of decompression throughput on an A100, and their benchmarks suggest it's somewhere between 1.5x and 4x slower at batch size 1, depending on GPU and model. This overhead of course mostly disappears with a large enough batch size.

Other lossless codecs can hit 600 GB/s on the same hardware, so there should be some room for improvement. But the A100's raw memory bandwidth is 1.6 TB/s, so even a faster codec would still fall well short of reading uncompressed weights directly.
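
A crude way to see why the overhead fades with batch size (toy model with assumed numbers; it ignores overlap and activation traffic, so it overstates the batch-1 penalty relative to their measurements): the decompression cost is paid once per forward pass, while compute grows with the batch.

    # Toy model: fixed per-step decompression cost vs. batch-scaled compute.
    # Illustrative assumptions only; treat the trend, not the absolute ratios.
    weight_bytes = 16e9      # e.g. 8B params in BF16
    hbm_bw = 1.6e12          # bytes/s, raw HBM bandwidth
    decomp_bw = 200e9        # bytes/s, the decompression figure claimed above
    flops = 312e12           # A100 BF16 peak FLOP/s

    def step_time(batch, compressed):
        read = (0.7 if compressed else 1.0) * weight_bytes / hbm_bw  # stream weights once
        decomp = weight_bytes / decomp_bw if compressed else 0.0     # decompress once per step
        compute = 2 * 8e9 * batch / flops                            # ~2 FLOPs per param per token
        return read + decomp + compute

    for b in (1, 64, 1024, 4096):
        print(b, step_time(b, True) / step_time(b, False))  # slowdown shrinks toward 1x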

philjohn 1 day ago

My mental model says it might, much like how DoubleSpace in DOS slightly sped up loading data from slow hard drives.

hnuser123456 23 hours ago

If the model is 70% of the size, it will be 1/0.7 ≈ 1.43x the speed.
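
That's the idealized, purely bandwidth-bound case: if you only pay to read the compressed weights and decompression itself were free, the speedup is just the inverse of the compression ratio. A minimal sketch (the 0.7 ratio is an example figure, not a measured one):

    # Ideal speedup if decode is bandwidth-bound and decompression costs nothing
    compression_ratio = 0.7          # compressed size / original size (example)
    print(1 / compression_ratio)     # ~1.43x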