amanda99 3 days ago

You would use a KV cache to cache a significant chunk of the inference work.

xmprt 2 days ago

Using KV in the caching context is a bit confusing because it usually means key-value in the storage sense of the word (like Redis), but for LLMs it means the key and value tensors. So IIUC, the cache stores the results of the K and V matrix multiplications for a given prompt, and the only computation that still needs to be done is the Q projection and the attention calculation.
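Roughly like this, if you squint (single attention head, plain numpy, all the names are mine):

    import numpy as np

    d = 64                                     # head dimension
    rng = np.random.default_rng(0)

    # fixed projection weights for one attention head
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # prompt tokens: project once and keep K and V around (the "KV cache")
    prompt = rng.standard_normal((10, d))      # 10 token embeddings
    k_cache = prompt @ Wk
    v_cache = prompt @ Wv

    # new token at decode time: only its own Q/K/V rows get computed,
    # then attention runs against the cached keys/values
    x_new = rng.standard_normal((1, d))
    q = x_new @ Wq
    k_cache = np.vstack([k_cache, x_new @ Wk])
    v_cache = np.vstack([v_cache, x_new @ Wv])

    attn = softmax(q @ k_cache.T / np.sqrt(d)) @ v_cache
    print(attn.shape)                          # (1, 64)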

biophysboy 3 days ago

Do you mean that they provide the same answer to verbatim-equivalent questions, and pull the answer out of storage instead of recalculating each time? I've always wondered if they did this.

Traubenfuchs 2 days ago

I bet there is a set of repetitive one- or two-question user requests that makes up a sizeable share of all requests. The models are so expensive to run that 1% would be enough, much less than 1%, even. To make it less obvious they probably have a big set of response variants. I don't see how they would not do this.

They probably also have cheap code or cheap models that normalize requests to increase cache hit rate.
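The lookup could be as dumb as this (purely hypothetical names and prompts, just to illustrate normalize-then-look-up):

    import random

    # hypothetical response cache: normalized prompt -> list of canned variants
    RESPONSE_CACHE = {
        "what is the capital of france": [
            "The capital of France is Paris.",
            "Paris is the capital of France.",
        ],
    }

    def normalize(prompt: str) -> str:
        # cheap normalization to boost hit rate: lowercase, strip punctuation, collapse whitespace
        return " ".join(
            prompt.lower().translate(str.maketrans("", "", "?!.,")).split()
        )

    def run_full_inference(prompt: str) -> str:
        # placeholder for the expensive model call
        return f"(model output for: {prompt})"

    def answer(prompt: str) -> str:
        variants = RESPONSE_CACHE.get(normalize(prompt))
        if variants:
            return random.choice(variants)   # vary the wording so the cache is less obvious
        return run_full_inference(prompt)    # fall back to the expensive path

    print(answer("What is the capital of France??"))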

koakuma-chan 3 days ago

The prompt may be the same but the seed is different every time.

biophysboy 2 days ago

Could you not cache the top k outputs given a provided input token set? I thought the randomness was applied at the end by sampling the output distribution.
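Something like caching the distribution once and only re-running the sampling per request (toy sketch, token ids and probabilities made up):

    import numpy as np

    # hypothetical cache: prompt -> (top-k token ids, their probabilities)
    cached_top_k = {
        "what is 2+2": (np.array([101, 205, 87]),        # token ids
                        np.array([0.90, 0.07, 0.03])),   # probabilities
    }

    def sample_next_token(prompt: str, seed: int) -> int:
        ids, probs = cached_top_k[prompt]
        rng = np.random.default_rng(seed)    # the per-request seed supplies the randomness
        return int(rng.choice(ids, p=probs))

    # same cached distribution, different seeds -> possibly different tokens
    print(sample_next_token("what is 2+2", seed=1))
    print(sample_next_token("what is 2+2", seed=2))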