And when you consider that the usual final step in the pipeline is that a sampler goes ham on the probabilities and just picks some random nonsense, the tolerance for lossy compression is fairly high.
In fact, there's this funny phenomenon where Q4 models occasionally score better than their fp16 counterparts on benchmarks run with top_k=1: the quantization noise makes the outputs slightly less deterministic, so instead of greedily committing to the same wrong token every time, they sometimes stumble past that local maximum into a more correct solution.
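Roughly the effect in a toy sketch (made-up logits and a made-up nudge, nothing from a real model): under greedy decoding a tiny perturbation can flip the argmax, while the full sampling distribution barely moves.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 5-token vocab with two near-tied top candidates (illustrative numbers only).
logits_fp16 = np.array([2.00, 1.98, 0.50, -1.00, -3.00])

# Pretend quantization error nudges each logit by a few hundredths.
logits_q4 = logits_fp16 + np.array([-0.03, 0.03, 0.01, -0.02, 0.02])

# Greedy decoding (top_k=1): the tiny nudge flips the argmax -> a different token.
print("greedy fp16 picks token", int(np.argmax(logits_fp16)))  # 0
print("greedy q4   picks token", int(np.argmax(logits_q4)))    # 1

# With an actual sampler, the two probability distributions are nearly identical,
# so the same nudge is mostly drowned out by the sampling randomness.
print("p fp16:", np.round(softmax(logits_fp16), 3))
print("p q4:  ", np.round(softmax(logits_q4), 3))
```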
We got an oral at ICLR for calling out how shit samplers like top_p and top_k are. Use min_p!
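For anyone who hasn't seen it: min_p keeps only the tokens whose probability is at least min_p times the top token's probability, renormalizes, and samples. Something like this minimal numpy sketch (my own toy version, not any particular library's implementation):

```python
import numpy as np

def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
    """Minimal min_p sampler: keep tokens with prob >= min_p * max prob,
    renormalize, then sample from what's left."""
    rng = rng or np.random.default_rng()
    z = np.exp((logits - logits.max()) / temperature)
    probs = z / z.sum()
    keep = probs >= min_p * probs.max()   # cutoff scales with the model's confidence
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()
    return int(rng.choice(len(probs), p=filtered))

# Confident distribution -> the long tail gets pruned hard;
# a flat distribution would leave many plausible tokens in play.
logits = np.array([5.0, 2.0, 1.5, 1.4, -2.0])
print(min_p_sample(logits, min_p=0.1))
```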
True, yep. I wish more people benchmarked models with more representative sampler settings and then took the average of 5 or 10 responses.