Item 44239614

visiondude • 3 days ago

It always seemed to me that efficient caching strategies could greatly reduce costs… I wonder if they cooked up something new.

xmprt • 2 days ago

How are LLMs cached? Every prompt would be different so it's not clear how that would work. Unless you're talking about caching the model weights...

5 replies

hadlock • 2 days ago

I've asked it a question not in its dataset three different ways and I see the same three sentences in the response, word for word, which could imply it's caching the core answer. I hadn't seen this behavior before this past week.

1 reply

beering • 1 day ago

Isn’t the simpler explanation that if you ask the same question, there’s a chance you would get the same answer?

In this case you didn’t even get the same answer, you only happened to have one sentence in the answer match.

HugoDias • 2 days ago

This document explains the process very well. It’s a good read: https://platform.openai.com/docs/guides/prompt-caching

2 replies

xmprt • 2 days ago

That link explains how OpenAI uses it, but doesn't really walk through how it's any faster. I thought the whole point of transformers was that inference speed no longer depended on prompt length. So how does caching the prompt help reduce latency if the outputs aren't being cached?

> Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt

1 reply

singron • 2 days ago

> I thought the whole point of transformers was that inference speed no longer depended on prompt length

That's not true at all and is exactly what prompt caching is for. For one, you can at least populate the attention KV cache, whose size (and the work needed to fill it) scales with the prompt size. It's true that if your prompt is larger than the context size, then the prompt size no longer affects inference speed, since the model essentially discards the excess.
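
A back-of-the-envelope sketch of why prompt length matters and what a cached prefix saves. It only counts query-key pairs scored during causal attention and ignores the projections and MLP layers; the function name and the token counts are made up for illustration.

```python
def prefill_attention_pairs(prompt_len: int, cached_prefix_len: int = 0) -> int:
    """Count query-key pairs scored while processing a prompt (causal attention).

    Token i (1-indexed) attends to tokens 1..i, so a cold prefill scores
    1 + 2 + ... + n pairs. Tokens whose K/V are already cached are skipped
    as queries, but they stay visible as keys to the new tokens.
    """
    if not 0 <= cached_prefix_len <= prompt_len:
        raise ValueError("cached prefix must fit inside the prompt")
    total = prompt_len * (prompt_len + 1) // 2
    skipped = cached_prefix_len * (cached_prefix_len + 1) // 2
    return total - skipped


# Hypothetical numbers: a 2048-token prompt whose first 1920 tokens are cached.
cold = prefill_attention_pairs(2048)
warm = prefill_attention_pairs(2048, cached_prefix_len=1920)
print(f"cold: {cold} pairs, warm: {warm} pairs, ratio: {warm / cold:.2f}")
```

The cold cost grows roughly quadratically with prompt length, which is why a long shared prefix is worth caching.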

catlifeonmars • 2 days ago

> OpenAI routes API requests to servers that recently processed the same prompt,

My mind immediately goes to rowhammer for some reason.

At the very least this opens up the possibility of some targeted denial of service

1 reply

xmprt • 2 days ago

Later they mention that they have some kind of rate limiting: if more than ~15 requests per minute are being processed, the request will be sent to a different server. I guess you could deny cache usage, but I'm not sure what isolation they have between different callers, so maybe even that won't work.

2 replies

catlifeonmars • 2 days ago

So the doc mentions you can influence the cache key by passing an optional user parameter. It’s unclear from the doc whether the user parameter is validated or if you can just provide an arbitrary string.

catlifeonmars • 2 days ago

15 requests/min is pretty low. Depending on how large the fleet is, you might end up getting load balanced to the same server, and if it’s round robin then it would be deterministic.
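
To make the discussion concrete, here is a toy sketch of prefix-hash routing with an overflow threshold. Everything in it is an assumption for illustration: the `PrefixRouter` class, the 256-token prefix length, the "next server" overflow rule, and the way `user` is folded into the key. It is not OpenAI's implementation, only one way the behavior described in the thread could work.

```python
import hashlib
from collections import defaultdict, deque


class PrefixRouter:
    """Toy router: same (user, prompt prefix) -> same server, with overflow."""

    def __init__(self, servers, prefix_len=256, max_per_minute=15):
        self.servers = list(servers)
        self.prefix_len = prefix_len          # assumed prefix length, in tokens
        self.max_per_minute = max_per_minute  # the ~15/min figure from the thread
        self.recent = defaultdict(deque)      # cache key -> recent request times

    def cache_key(self, tokens, user=""):
        # An arbitrary `user` string changes the key, and so the home server.
        raw = repr((user, tuple(tokens[: self.prefix_len])))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def route(self, tokens, user="", now=0.0):
        key = self.cache_key(tokens, user)
        home = int(key, 16) % len(self.servers)
        window = self.recent[key]
        # Drop requests that are older than the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        window.append(now)
        if len(window) > self.max_per_minute:
            # Overflow: send the request elsewhere (likely a cache miss).
            return self.servers[(home + 1) % len(self.servers)]
        return self.servers[home]


router = PrefixRouter(["gpu-0", "gpu-1", "gpu-2"])
prompt = list(range(300))  # stand-in token ids
print(router.route(prompt, user="alice", now=0.0))
```

In this sketch the routing is fully deterministic given the key, which is the property the denial-of-service worry hinges on; whether the real system validates `user` or isolates callers is not something the thread establishes.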

amanda99 • 2 days ago

You would use a KV cache to cache a significant chunk of the inference work.

2 replies

xmprt • 2 days ago

Using KV in the caching context is a bit confusing because it usually means key-value in the storage sense of the word (like Redis), but for LLMs, it means the key and value tensors. So IIUC, the cache will store the results of the K and V matrix multiplications for a given prompt, and the only computations left to do for new tokens are their own Q, K, and V projections and the attention calculation.
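
A minimal single-head, single-layer sketch of that idea in plain Python, assuming random toy weights and 4-dimensional "token" vectors (none of this comes from a real model). It checks that reusing cached K and V for the prefix gives the same attention output as recomputing everything.

```python
import math
import random

D = 4  # toy model width (assumption)
rng = random.Random(0)
# Random stand-ins for the learned query, key, and value projection matrices.
Wq, Wk, Wv = ([[rng.gauss(0, 1) for _ in range(D)] for _ in range(D)] for _ in range(3))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Softmax attention of one query over all keys/values seen so far."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w / z * v[j] for w, v in zip(weights, values)) for j in range(D)]


def last_token_output(tokens, kv_cache=None):
    """Attention output for the last token, reusing cached K/V for the prefix."""
    keys, values = kv_cache if kv_cache is not None else ([], [])
    keys, values = list(keys), list(values)
    for x in tokens[len(keys):]:  # only uncached tokens need K and V projections
        keys.append(matvec(Wk, x))
        values.append(matvec(Wv, x))
    q = matvec(Wq, tokens[-1])    # Q is only needed for the token being processed
    return attend(q, keys, values), (keys, values)


prompt = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(6)]
_, cache = last_token_output(prompt[:5])             # "earlier request" fills the cache
cold, _ = last_token_output(prompt)                  # recompute everything
warm, _ = last_token_output(prompt, kv_cache=cache)  # reuse the 5 cached K/V rows
print(all(math.isclose(a, b) for a, b in zip(cold, warm)))
```

In a real multi-layer model the same reuse works because, with causal masking, a prefix token's K and V at every layer depend only on the prefix itself.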

biophysboy • 2 days ago

Do you mean that they provide the same answer to verbatim-equivalent questions, and pull the answer out of storage instead of recalculating each time? I've always wondered if they did this.

2 replies

Traubenfuchs • 2 days ago

I bet there is a set of repetitive one- or two-question user requests that makes up a sizable share of all requests. The models are so expensive to run that 1% would be enough, or even much less than 1%. To make it less obvious they probably have a big set of response variants. I don't see how they would not do this.

They probably also have cheap code or cheap models that normalize requests to increase cache hit rate.
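
To illustrate the kind of thing being speculated about here (not anything a provider is known to do), a toy sketch of request normalization in front of a whole-response cache. The `normalize` rules and the `ResponseCache` class are hypothetical.

```python
import re
import unicodedata


def normalize(prompt: str) -> str:
    """Cheap canonical form: Unicode-normalize, casefold, drop punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFKC", prompt).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


class ResponseCache:
    """Serve stored answers for prompts that normalize to the same key."""

    def __init__(self, generate):
        self.generate = generate  # stand-in for the expensive model call
        self.store = {}
        self.hits = 0

    def ask(self, prompt: str) -> str:
        key = normalize(prompt)
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.generate(prompt)
        return self.store[key]


cache = ResponseCache(generate=lambda p: f"(model answer to {p!r})")
cache.ask("How to cook pasta?")
cache.ask("  how to COOK pasta ")
print(cache.hits)
```

Note that this is a different mechanism from the prompt caching in the linked doc, which caches the processed prompt rather than the response.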

koakuma-chan • 2 days ago

The prompt may be the same but the seed is different every time.

1 reply

biophysboy • 2 days ago

Could you not cache the top k outputs given a provided input token set? I thought the randomness was applied at the end by sampling the output distribution.
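
A toy sketch of what that could look like, assuming a stand-in `toy_model` that returns next-token probabilities (the tokens and numbers are invented). The distribution is computed once per prompt and the seed only affects the final sampling step.

```python
import random


def toy_model(prompt_tokens):
    """Stand-in for an expensive forward pass; returns next-token probabilities."""
    return {"boil": 0.5, "salt": 0.3, "drain": 0.15, "fry": 0.05}


class TopKCache:
    def __init__(self, model, k=3):
        self.model = model
        self.k = k
        self.store = {}
        self.forward_passes = 0

    def top_k(self, prompt_tokens):
        key = tuple(prompt_tokens)
        if key not in self.store:
            self.forward_passes += 1
            probs = self.model(key)
            top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[: self.k]
            total = sum(p for _, p in top)
            # Renormalize so the kept probabilities sum to 1.
            self.store[key] = [(tok, p / total) for tok, p in top]
        return self.store[key]

    def sample(self, prompt_tokens, seed):
        tokens, weights = zip(*self.top_k(prompt_tokens))
        return random.Random(seed).choices(tokens, weights=weights, k=1)[0]


cache = TopKCache(toy_model, k=3)
prompt = ("how", "to", "cook", "pasta")
samples = [cache.sample(prompt, seed=s) for s in range(5)]
print(samples, cache.forward_passes)
```

The catch is that this only covers the first output token: each later token depends on what was sampled before it, so the set of distributions you would need to cache grows with every step.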

koakuma-chan • 2 days ago

A lot of the prompt is always the same: the instructions, the context, the codebase (if you are coding), etc.
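
A small sketch of how much of two requests could be shared, assuming a cache that matches on the longest common token prefix. Whitespace splitting stands in for a real tokenizer, and the instruction text is made up.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical requests: identical instructions, different final question.
instructions = "You are a helpful coding assistant. Answer concisely. " * 20
first = (instructions + "Why does my loop never terminate?").split()
second = (instructions + "How do I reverse a linked list?").split()

reusable = shared_prefix_len(first, second)
print(f"{reusable} of {len(second)} tokens could reuse cached work")
```

This is also why putting the static parts first and the varying parts last matters: a prefix match ends at the first differing token.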

tasuki • 2 days ago

> Every prompt would be different

No? E.g., "how to cook pasta" is probably asked a lot.