Item 44111084

rs186 • 7 days ago

This paper is slightly outdated by LLM standards -- GPT-4.1 and Gemini 2.5 hadn't been released at the time.

Yes, I mentioned that in my comment on the linked post. I wish someone were running this methodology as an ongoing project on new models.

Ideally, shouldn't this metric be included on every model card? It seems crucial.