Item 43798656

Wasn't referring to that.

You are saying that people are using quantized models haphazardly and talking about them haphazardly. I'll grant it's not the exact same thing as making them haphazardly, but I think you took the point.

The terms shouldn't be used here. They aren't helpful. You are either getting good results or you are not. It shouldn't be treated differently from further training on dataset d. The weights changed - how much better or worse at task Y did it just get?
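To make "how much better or worse at task Y" concrete, here is a minimal sketch of a paired comparison. It assumes you already have per-example pass/fail results for the same eval set under both sets of weights. The function name `accuracy_delta` and the bootstrap settings are illustrative choices, not anything from a particular library.

```python
import random


def accuracy_delta(baseline_correct, changed_correct, n_boot=2000, seed=0):
    """Paired accuracy difference (changed - baseline) with a bootstrap 95% interval.

    Both inputs are per-example pass/fail results on the SAME eval examples,
    in the same order. All names here are hypothetical.
    """
    if len(baseline_correct) != len(changed_correct) or not baseline_correct:
        raise ValueError("need two non-empty result lists of equal length")

    n = len(baseline_correct)
    # Per-example difference: +1 fixed, -1 regressed, 0 unchanged.
    diffs = [int(c) - int(b) for b, c in zip(baseline_correct, changed_correct)]
    point = sum(diffs) / n

    # Resample examples with replacement to estimate uncertainty in the delta.
    rng = random.Random(seed)
    boots = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot) - 1]
    return point, lo, hi


if __name__ == "__main__":
    base = [True] * 8 + [False] * 2
    changed = [True] * 6 + [False] * 4
    print(accuracy_delta(base, changed))
```

The same function applies whether the weights changed because of quantization or because of further training. With a small eval set the interval will be wide, which is the practical difficulty.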

BoorishBears • 22 hours ago

The term is perfectly fine to use here because choosing a quantization strategy to deploy already has enough variables:

- quality for your specific application

- time to first token

- inter-token latency

- memory usage (varies even for a given bits per weight)

- generation of hardware required to run

Of those, the hardest to measure is consistently "quality for your specific application".
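By contrast, the latency items on that list are cheap to measure. A rough sketch, assuming you have recorded a timestamp when the request was sent and one for each streamed token (the function name `latency_metrics` and its inputs are hypothetical, not tied to any serving stack):

```python
def latency_metrics(request_sent, token_times):
    """Return (time_to_first_token, mean_inter_token_latency) in seconds.

    request_sent: timestamp when the request went out.
    token_times: timestamps at which each streamed token arrived, in order.
    """
    if not token_times:
        raise ValueError("no tokens were received")

    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8]))
```

In practice the timestamps would come from something like `time.perf_counter()` wrapped around a streaming client. Quality has no equally simple probe.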

It's so hard to measure robustly that many will accept significantly worse performance on the other fronts just to avoid having to measure it... which is how you end up with full precision deployments of a 405B-parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...

When people are paying multiples more for compute to side-step a problem, language and technology that allow you to erase it from the equation are valid.
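For a sense of scale on the 405B case, a back-of-the-envelope sketch of weight storage alone. It assumes "full precision" means 16 bits per weight and ignores KV cache, activations, and quantization metadata such as scales, so real usage will be higher and, as noted above, varies even at a fixed bits per weight. The helper `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Lower-bound estimate of weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(bits, "bits:", weight_memory_gb(405e9, bits), "GB")
```

That works out to roughly 810 GB at 16 bits versus about 202.5 GB at 4 bits, before any runtime overhead.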


danielmarkbruce • 22 hours ago

You say that as though people know these things for the full precision deployment and their use case.

Some have the capability to figure it out and can do it for both full precision and quantized. Most don't and cannot.