> Not to mention most of it basically died when most models started supporting 1M+ context.
Do most models support that much context? I don't think anything close to "most" models support 1M+ context. I'm only aware of Gemini, but I'd love to learn about others.
GPT-4.1 / mini / nano all have a 1M-token context window.
As the context grows, all LLMs appear to turn into idiots, even just at 32k!
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
This paper is slightly outdated by LLM standards -- neither GPT-4.1 nor Gemini 2.5 had been released at the time it was written.
Yes, I mentioned that in the comment in the linked post. I wish someone were running this methodology as an ongoing project, re-testing new models as they're released.
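Not the paper's benchmark, but as a rough sketch of what "accuracy vs. context length" tracking could look like: a needle-in-a-haystack-style probe that buries a known fact in progressively longer distractor text and checks whether the model can still answer a question about it. This assumes the OpenAI Python SDK against an OpenAI-compatible endpoint; filler.txt, the needle/question, and the model name are placeholders.

    # Rough sketch: answer accuracy as a function of context length.
    from openai import OpenAI

    client = OpenAI()
    NEEDLE = "The access code for the archive room is 7421."
    QUESTION = "What is the access code for the archive room?"
    FILLER = open("filler.txt").read()  # any long distractor text (placeholder)

    def accuracy_at(model: str, context_tokens: int, trials: int = 20) -> float:
        hits = 0
        for i in range(trials):
            # ~4 chars per token is a crude budget; bury the needle at varying depths.
            haystack = FILLER[: context_tokens * 4]
            depth = int(len(haystack) * i / trials)
            prompt = haystack[:depth] + "\n" + NEEDLE + "\n" + haystack[depth:]
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt + "\n\n" + QUESTION}],
            )
            if "7421" in (resp.choices[0].message.content or ""):
                hits += 1
        return hits / trials

    for length in (1_000, 8_000, 32_000, 128_000):
        print(length, accuracy_at("gpt-4.1-mini", length))

A real harness would use semantic rather than literal matches (which is the point of the paper's methodology), but even a simple loop like this, run per model release, would make the degradation curve visible.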
Ideally, shouldn't this be a metric included on every model card? It seems crucial.