You would use a KV cache to avoid redoing a significant chunk of the inference work.
Using "KV" in the caching context is a bit confusing because it usually means key-value in the storage sense (like Redis), but for LLMs it means the key and value tensors. So IIUC, the cache stores the results of the K and V matrix multiplications for a given prompt, and the only remaining computation is the Q projection and the attention calculation.
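Something like this toy single-head sketch (plain NumPy, all names and shapes made up for illustration): the prompt's K and V projections are computed once and kept around, and each new token only computes its own q/k/v and attends over the cached tensors.

```python
import numpy as np

d = 64  # head dimension (made up for this sketch)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill: project the whole prompt once and keep K/V around -- the "KV cache".
prompt = rng.standard_normal((10, d))        # stand-in for prompt embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: each new token computes only its own q/k/v and attends over the
# cached keys/values; the prompt's projections are never recomputed.
x_new = rng.standard_normal(d)               # stand-in for the new token's embedding
q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
K_cache = np.vstack([K_cache, k])
V_cache = np.vstack([V_cache, v])
out = attend(q, K_cache, V_cache)            # attention output for the new token
```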
Do you mean that they provide the same answer to verbatim-equivalent questions, and pull the answer out of storage instead of recalculating each time? I've always wondered if they did this.
I bet there is a set of repetitive one- or two-question user requests that makes up a sizeable share of all requests. The models are so expensive to run that even a 1% hit rate would be enough; much less than 1%, really. To make it less obvious they probably have a big set of response variants. I don't see how they would not do this.
They probably also have cheap code or cheap models that normalize requests to increase cache hit rate.
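If they did, I'd imagine something like this toy sketch (purely hypothetical, `run_model` is a stand-in for the actual LLM call): normalize the prompt into a cache key and, on a hit, return one of several stored response variants instead of running the model.

```python
import hashlib
import random
import re

response_cache = {}  # normalized key -> list of response variants (toy in-memory store)

def run_model(prompt):
    # Placeholder for the expensive LLM call.
    return f"(model output for: {prompt})"

def normalize(prompt):
    # Cheap normalization: lowercase, strip punctuation, collapse whitespace.
    # A cheap model could go further (paraphrase detection, embeddings, ...).
    text = re.sub(r"[^\w\s]", "", prompt.lower())
    return re.sub(r"\s+", " ", text).strip()

def cache_key(prompt):
    return hashlib.sha256(normalize(prompt).encode()).hexdigest()

def answer(prompt):
    key = cache_key(prompt)
    if key in response_cache:
        # Hit: serve one of the stored variants instead of running the model.
        return random.choice(response_cache[key])
    reply = run_model(prompt)
    response_cache[key] = [reply]  # in practice you'd store several variants
    return reply
```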
The prompt may be the same but the seed is different every time.
Could you not cache the top k outputs given a provided input token set? I thought the randomness was applied at the end by sampling the output distribution.
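If so, a rough sketch of what that could look like (hypothetical, `forward_pass` is a stand-in for the real model): cache the logits for a given input token sequence, then apply top-k sampling and the seed-dependent randomness afterwards, on the cached distribution.

```python
import numpy as np

VOCAB = 1000
logits_cache = {}  # tuple of input token ids -> logits over the vocab (toy store)

def forward_pass(tokens):
    # Stand-in for the real model; returns fake logits for the sketch.
    return np.random.default_rng(abs(hash(tokens)) % 2**32).standard_normal(VOCAB)

def get_logits(tokens):
    # The expensive forward pass runs only on a cache miss; the sampling
    # randomness is applied afterwards, on the cached distribution.
    if tokens not in logits_cache:
        logits_cache[tokens] = forward_pass(tokens)
    return logits_cache[tokens]

def sample_next(tokens, k=40, temperature=0.8, seed=None):
    logits = get_logits(tokens)
    top = np.argsort(logits)[-k:]              # keep the top-k candidate tokens
    probs = np.exp(logits[top] / temperature)
    probs /= probs.sum()
    # The seed only affects this final draw, not the cached computation.
    return int(np.random.default_rng(seed).choice(top, p=probs))

# Same prefix, different seeds: one cached forward pass, different samples.
print(sample_next((1, 2, 3), seed=0), sample_next((1, 2, 3), seed=1))
```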