ein0p 1 day ago

Note that this is _way_ slower at the small batch sizes you'd need for interactive use. At batch size 1 it seems to run at about 1/3rd the speed of bf16 (so roughly 1/6th the speed of the fp8 you'd realistically be using, since fp8 moves half the bytes per weight in bandwidth-bound decoding), if figure 5 is to be believed. That's actually a pretty impressive feat in itself if you know anything about GPU kernel programming, but it's much slower nevertheless. For this to work at "wire speed" it'd need hardware support, which takes years to arrive. Their "baseline" elsewhere in the paper is CPU offloading, which is dog slow and can't be made fast because of the PCIe bottleneck.
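A quick sanity check on that ratio (my numbers, assuming batch-size-1 decoding is purely memory-bandwidth-bound, so speed scales with bytes moved per weight):

    # Relative decode speed at batch size 1, assuming bandwidth-bound decoding
    # (speed ~ 1 / bytes read per weight). Illustrative only, not from the paper.
    bf16_bytes = 2.0
    fp8_bytes = 1.0
    this_vs_bf16 = 1.0 / 3.0                 # reading of figure 5: ~1/3rd of bf16
    fp8_vs_bf16 = bf16_bytes / fp8_bytes     # fp8 moves half the bytes -> ~2x bf16
    print(f"vs bf16: {this_vs_bf16:.2f}x, vs fp8: {this_vs_bf16 / fp8_vs_bf16:.2f}x")
    # -> vs bf16: 0.33x, vs fp8: 0.17x (about 1/6th)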

timschmidt 1 day ago

It's perfectly possible to run LLMs quickly on CPUs. An Epyc or Xeon with 12 memory channels achieves memory bandwidth in the same ballpark as a 4090, which is the limiting factor. Engineering-sample Epycs are even available on Aliexpress, in kits with motherboard and RAM, for reasonable prices.
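For rough numbers (my own assumptions: DDR5-4800 RDIMMs, a ~37B-active-parameter model stored at 1 byte per weight, and ignoring whether the weights even fit in a 4090's 24 GB):

    # Back-of-envelope: bandwidth-bound decode throughput at batch size 1.
    # Assumptions are mine, not from the thread: DDR5-4800 across 12 channels,
    # ~37B active parameters read once per generated token at 1 byte/weight.
    channels = 12
    transfers_per_s = 4.8e9          # DDR5-4800: 4.8 GT/s per channel
    bytes_per_transfer = 8           # 64-bit channel
    cpu_bw = channels * transfers_per_s * bytes_per_transfer   # ~461 GB/s
    gpu_bw = 1008e9                  # RTX 4090 spec sheet, ~1 TB/s

    model_bytes = 37e9 * 1           # active params x bytes per weight

    for name, bw in [("12-ch DDR5-4800", cpu_bw), ("RTX 4090", gpu_bw)]:
        print(f"{name}: ~{bw / 1e9:.0f} GB/s -> ~{bw / model_bytes:.0f} tok/s decode ceiling")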

ein0p 1 day ago

Did I say it wasn't? If your context is short and your model is small, it is possible to run LLMs on high-end CPUs that can support 12 channels of high-spec DDR5 RDIMMs. It's not possible to run them as fast as they'd run on a GPU equipped with HBM, though. Nor would it be even remotely as energy efficient. It's also not possible to run LLMs quickly on a CPU if your context is long, because CPUs do not have the requisite FLOPS to process long context quickly.

And before you bring MoE into the conversation: MoE only affects the feedforward part of each transformer block, and the full memory bandwidth and compute savings are only realized at batch size 1, sequence length 1, AKA the most inefficient mode, which nobody other than Ollama users uses in practice. Sequence length 8 (common for speculative decoding) could be using up to 8x37B parameters (assuming you want to run DeepSeek - the strongest available open weights model). A batch size of even 2 with sequence length 8 could touch almost all parameters if you're particularly unlucky. Prompt processing will almost certainly use all parameters, and will slam into the FLOPS wall of your EPYC's ALUs.

So can LLMs (with an emphasis on "Large") be run on CPUs? Yes. Are you going to have a good time running them this way? No.
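To put rough numbers on the MoE point (my sketch, not from the paper: a DeepSeek-V3-ish layer with 256 routed experts and 8 active per token, pretending routing is uniform and independent, which it isn't in practice):

    # Expected fraction of a MoE layer's routed experts touched by N tokens,
    # under the (unrealistic) assumption of uniform, independent routing.
    # Shape is roughly DeepSeek-V3-like: 256 routed experts, 8 active per token.
    # Attention weights and the shared expert are always read on top of this.
    E = 256   # routed experts per layer
    k = 8     # experts activated per token

    def expected_fraction_touched(n_tokens: int) -> float:
        p_untouched = ((E - k) / E) ** n_tokens   # P(one expert dodges every token)
        return 1.0 - p_untouched

    for n in (1, 8, 16, 64):  # seq 1; spec-decode seq 8; batch 2 x seq 8; tiny prefill
        print(f"{n:3d} tokens -> ~{100 * expected_fraction_touched(n):.0f}% of routed experts read")
    # The worst case (tokens picking disjoint experts) is even higher - that's
    # the "up to 8x37B" scenario above.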

timschmidt 22 hours ago

llamafile contains AVX512-specific optimizations for prompt processing to deal with exactly this issue: https://justine.lol/matmul/ (about a 10x speedup over llama.cpp)

Somewhere between 8 and 192 cores I'm sure there's enough AVX512 to get the job done. And we've managed to reinvent Intel's Larrabee / Knights concept.
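For a sense of the scale on the prefill side (my assumptions, not measurements: ~37B active parameters, a 4k-token prompt, ~2 FLOPs per parameter per token, and very ballpark peak-throughput figures that vary a lot by part and precision):

    # Rough prefill (prompt processing) time: FLOPs needed / FLOPs available.
    # All figures here are my own ballpark assumptions, not measurements.
    active_params = 37e9
    prompt_tokens = 4096
    flops_needed = 2 * active_params * prompt_tokens        # ~3e14 FLOPs

    peak_flops = {
        "96-core EPYC, AVX512 bf16 (ballpark)": 20e12,
        "high-end GPU tensor cores (ballpark)": 400e12,
    }
    for name, peak in peak_flops.items():
        print(f"{name}: ~{flops_needed / peak:.0f} s for a {prompt_tokens}-token prompt")

Whether that counts as "getting the job done" obviously depends on how long your prompts are.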

Sadly, the highly optimized AVX512 kernels of llamafile don't support these exotic floats yet as far as I know.

Yes, energy efficiency per query will be terrible compared to a hyperscaler. However, privacy will be perfect, and flexibility will be higher than with other options, since running on the CPU is almost always possible - even with new algorithms and experimental models.

ein0p 22 hours ago

At 192 cores you're way better off buying a Mac Studio, though.

ow5 23 hours ago

Hi! I'm one of the contributors to the paper. We have kernels, not yet released, that can shave decoding latency by >20%.

Also, when we ran streaming experiments with the current kernels, we were a median of ~1.3x slower at inference.

ein0p 22 hours ago

Thanks for chiming in! How do you explain the top-most graph in Figure 5? Am I misreading it?