biophysboy 2 days ago

Do you mean that they provide the same answer to verbatim-equivalent questions, and pull the answer out of storage instead of recalculating each time? I've always wondered if they did this.

Traubenfuchs 2 days ago

I bet there is a set of repetitive one- or two-question user requests that makes up a sizeable share of all requests. The models are so expensive to run that 1% would be enough; much less than 1%, even. To make it less obvious, they probably have a big set of response variants. I don't see how they would not do this.

They probably also have cheap code or cheap models that normalize requests to increase the cache hit rate.
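The normalization-plus-variants idea above could be sketched roughly like this (purely hypothetical; the functions and cache layout here are made up for illustration, not anything a provider has confirmed doing):

```python
# Hypothetical sketch: normalize prompts before hashing them into a cache key,
# so trivially different requests ("WHAT IS 2+2?" vs "what is 2+2") can hit
# the same cached entry, and serve one of several stored response variants.
import hashlib
import random

cache = {}  # normalized-prompt hash -> list of precomputed response variants

def normalize(prompt: str) -> str:
    # Cheap rule-based normalization (a cheap model could go further,
    # e.g. paraphrase detection).
    return " ".join(prompt.lower().strip().rstrip("?!.").split())

def cached_response(prompt: str):
    key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
    variants = cache.get(key)
    if variants:
        # Pick one of several stored variants so caching is less obvious.
        return random.choice(variants)
    return None  # cache miss: fall through to the real model

# Pre-populate one entry as a demo.
cache[hashlib.sha256(normalize("What is 2+2?").encode()).hexdigest()] = [
    "2 + 2 = 4.",
    "That's 4.",
]
```

With this, `cached_response("  WHAT IS 2+2 ")` hits the same entry as `"What is 2+2?"`, while any unseen prompt returns `None` and would go to the full model.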

koakuma-chan 2 days ago

The prompt may be the same but the seed is different every time.

biophysboy 2 days ago

Could you not cache the top-k outputs for a given input token sequence? I thought the randomness was applied only at the end, by sampling from the output distribution.
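For a single next-token step, that idea would look something like this (a toy sketch under the assumption above; `top_k_distribution` is a made-up stand-in for the expensive forward pass):

```python
# Hypothetical sketch: run the expensive forward pass once per unique prompt,
# cache the resulting top-k next-token distribution, and apply the per-request
# randomness (the seed) only at the final sampling step.
import random

dist_cache = {}  # prompt -> list of (token, probability) pairs

def top_k_distribution(prompt: str):
    # Stand-in for an expensive, deterministic model forward pass.
    return [("Paris", 0.90), ("Lyon", 0.07), ("Nice", 0.03)]

def sample_next_token(prompt: str, seed: int) -> str:
    if prompt not in dist_cache:
        dist_cache[prompt] = top_k_distribution(prompt)  # computed once
    tokens, probs = zip(*dist_cache[prompt])
    # Different seeds give different samples from the same cached distribution.
    return random.Random(seed).choices(tokens, weights=probs, k=1)[0]
```

One caveat: this only covers a single sampling step. Over a long generation, each sampled token feeds back into the model, so the cached-distribution trick would only pay off for the first token (or for whole responses cached end to end, as suggested upthread).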