I don't understand why the benchmark selections are so scattered and limited. It only compares Magistral Medium with Deepseek V3, R1, and the other close weighted Mistral Medium 3. Why did they leave off Magistral Small entirely, alongside comparisons with Alibaba Qwen or the mini versions of o3 and o4?
When they include comparisons, it is always a deliberate decision what to show and, more importantly, what not to show. If they had data that would show better performance compared to those models, there is no reason for them to not emphasize that.