Would also be interesting to compare with DeepSeek-R1-0528-Qwen3-8B (chain-of-thought distilled from DeepSeek-R1-0528 and post-trained onto Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.
Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.
I have the same card in my machine at home. What's your config for running the model?
Downloaded the GGUF file from unsloth and ran llama-cli from llama.cpp with that file as an argument.
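Roughly like this, a minimal sketch (the filename is just a guess at what the unsloth quant is called; use whatever you actually downloaded):

    # -ngl 99 offloads all layers to the GPU; --ctx-size keeps the KV cache small
    # enough to fit alongside the weights in the 2080 Ti's 11 GB of VRAM
    llama-cli -m DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf -ngl 99 --ctx-size 8192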
IIUC, nowadays the GGUF file itself embeds a Jinja chat template in its metadata (under the tokenizer.chat_template key), along with other config.
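You can inspect that metadata yourself; I believe the gguf pip package ships a gguf-dump script (filename carried over from the hypothetical example above):

    pip install gguf
    gguf-dump --no-tensors DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf | grep -i chat_template

And IIRC llama-cli accepts a --jinja flag to render that embedded template instead of a built-in one.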