This paper is slightly outdated by LLM standards: GPT-4.1 and Gemini 2.5 hadn't been released at the time.
Yes, I mentioned that in my comment on the linked post. I wish someone were running this methodology as an ongoing project for new models.
Ideally, shouldn't this metric be included on all model cards? It seems crucial.