Their benchmarks are interesting. They compare against DeepSeek-V3 (non-reasoning, December release) and DeepSeek-R1 (January release). I feel that comparing against DeepSeek-R1-0528 would be fairer.
For example, R1 scores 79.8 on AIME 2024, while R1-0528 scores 91.4. On AIME 2025, R1 scores 70 and R1-0528 scores 87.5. R1-0528 shows similar gains on GPQA Diamond, LiveCodeBench, and Aider (roughly 10-15 points higher).
I presume that "outdated upon release" benchmarks like these happen because the benchmark and the comparison models were chosen first, before the model was built, and the model's development progress was then measured against that benchmark. It doesn't occur to anyone that the benchmark the engineers had been relying on internally may no longer be a good or useful benchmark for marketing by the time of release. From the inside view, it's just a benchmark: already there, already showing impressive results, a whole-company internal target to hit for months. So why not publish it?
Would also be interesting to compare with DeepSeek-R1-0528-Qwen3-8B (Qwen3-8B post-trained via chain-of-thought distillation from DeepSeek-R1-0528). It scores 86 and 76 on AIME 2024 and 2025 respectively.
Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.
I have the same card in my machine at home; what config are you using to run the model?
Downloaded the GGUF file from unsloth and ran llama-cli from llama.cpp with that file as an argument.
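If you'd rather drive it from Python than from the terminal, the llama-cpp-python binding does roughly the same thing. This is just a minimal sketch, not my exact setup; the model filename and the parameter values are placeholders you'd adjust for your own download and VRAM:

    from llama_cpp import Llama

    # Placeholder path: point this at whatever unsloth quant you downloaded.
    llm = Llama(
        model_path="model-Q6_K_XL.gguf",
        n_gpu_layers=-1,   # offload all layers to the GPU; a sub-8GB quant fits in the 2080 Ti's 11GB
        n_ctx=8192,        # context size, adjust to taste
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello, who are you?"}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])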
IIUC, nowadays the chat template is stored as a Jinja template in the metadata inside the GGUF file itself, along with other config, so llama.cpp can pick it up directly without extra setup.
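If you want to peek at that metadata yourself, the gguf Python package (the gguf-py that ships with llama.cpp) can read it. A sketch, assuming the standard tokenizer.chat_template key; the exact field-access details may differ between gguf versions:

    from gguf import GGUFReader

    reader = GGUFReader("model-Q6_K_XL.gguf")  # placeholder path

    # The chat template is stored as a string under the standard key.
    field = reader.fields["tokenizer.chat_template"]

    # ReaderField keeps raw byte parts; data[] indexes the value part(s).
    raw = field.parts[field.data[0]]
    print(bytes(raw).decode("utf-8"))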