danielhanchen 2 days ago

I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF

ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL

or

./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99

Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!

Also best to increase Ollama's context length to at least 8K: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. There are some other details at https://docs.unsloth.ai/basics/magistral
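If you'd rather drive it from Python, here's a minimal llama-cpp-python sketch with those settings (the filename glob and the prompt are just examples on my end):

    # Minimal sketch using llama-cpp-python, not official Unsloth/Mistral code.
    # Assumes `pip install llama-cpp-python huggingface_hub`.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="unsloth/Magistral-Small-2506-GGUF",
        filename="*UD-Q4_K_XL*.gguf",  # quant picked to match the tag above
        n_ctx=8192,                    # bump the context, as recommended
        n_gpu_layers=-1,               # offload as many layers as fit
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
        temperature=0.7,               # recommended sampling settings
        top_p=0.95,
    )
    print(out["choices"][0]["message"]["content"])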

ozgune 2 days ago

Their benchmarks are interesting. They're comparing against DeepSeek-V3's (non-reasoning) December release and DeepSeek-R1's January release. I feel that comparing against DeepSeek-R1-0528 would be fairer.

For example, R1 scores 79.8 on AIME 2024, while R1-0528 scores 91.4.

R1 scores 70 on AIME 2025, while R1-0528 scores 87.5. R1-0528 is similarly better on GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

derefr 1 day ago

I presume "outdated upon release" benchmark comparisons like these happen because the benchmark and the comparison models were chosen first, before the model was built, and the model's development progress was measured against that benchmark. It then doesn't occur to anyone to ask whether the benchmark the engineers have been relying on is still a good/useful benchmark for marketing at release. From the inside view, it's just a benchmark that's already there, already showing impressive results, a whole-company internal target they've been chasing for months - so why not publish it?

semi-extrinsic 1 day ago

It would also be interesting to compare with R1-0528-Qwen3-8B (the chain of thought distilled from DeepSeek-R1-0528 and post-trained onto Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025, respectively.

Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.

saratogacx 1 day ago

I have the same card in my machine at home. What's your config for running the model?

semi-extrinsic 1 day ago

I downloaded the GGUF file from unsloth and ran llama-cli from llama.cpp with that file as an argument.

IIUC, nowadays the GGUF file itself carries a Jinja-templated metadata field containing the chat template and other config.
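E.g. you can dump the embedded template with the gguf Python package from the llama.cpp repo (rough sketch; the string-field access mirrors gguf_dump.py and may differ between package versions):

    # Rough sketch using the gguf package that ships with llama.cpp (pip install gguf).
    # String-field access follows gguf_dump.py; details may differ between versions.
    from gguf import GGUFReader

    reader = GGUFReader("Magistral-Small-2506-UD-Q4_K_XL.gguf")  # example path
    field = reader.fields["tokenizer.chat_template"]
    template = bytes(field.parts[-1]).decode("utf-8")  # string value lives in the last part
    print(template[:500])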

danielhanchen 1 day ago

I'm surprised it does so well too - that's pretty cool to see!

danielhanchen 2 days ago

Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They modified GRPO by:

1. Removing the KL divergence term

2. Normalizing by total length (Dr. GRPO style)

3. Normalizing advantages at the minibatch level

4. Relaxing the trust region (rough sketch below)
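A rough PyTorch-style sketch of how those four tweaks slot into a GRPO loss (my own illustration, not Mistral's code - the tensor names and eps values are just placeholders):

    import torch

    # Hypothetical inputs:
    # rewards:  (num_groups, samples_per_group) verifier rewards per sampled completion
    # logp_new / logp_old: (num_groups * samples_per_group, seq_len) per-token log-probs
    # mask:     same shape as logp_*, 1 for generated tokens, 0 for padding
    def grpo_loss(logp_new, logp_old, rewards, mask, eps_low=0.2, eps_high=0.28):
        # Plain GRPO advantage: reward minus the group mean (no critic).
        adv = rewards - rewards.mean(dim=1, keepdim=True)
        # (3) Minibatch normalization: rescale advantages by the minibatch std.
        adv = adv / (adv.std() + 1e-8)
        adv = adv.reshape(-1, 1)  # broadcast one advantage over every token of its completion

        ratio = torch.exp(logp_new - logp_old)
        # (4) Relaxed trust region: a wider upper clip bound (eps_high > eps_low).
        surrogate = torch.min(ratio * adv,
                              torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv)

        # (2) Dr. GRPO style: normalize by total generated length across the whole batch,
        #     rather than averaging per sequence.
        loss = -(surrogate * mask).sum() / mask.sum()
        # (1) No KL penalty: the usual beta * KL(pi_theta || pi_ref) term is simply dropped.
        return loss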

gyrovagueGeist 2 days ago

Does anyone know why they added minibatch advantage normalization (or when it can be useful)?

The paper they cite, "What matters in on-policy RL", claims it doesn't make much difference on their suite of test problems, and mean-of-minibatch normalization doesn't seem theoretically motivated for convergence to the optimal policy.

danielhanchen 1 day ago

Tbh I'm unsure as well - I only took a skim of the paper, so if I find anything I'll post it here!

Onavo 2 days ago

> Removed KL Divergence

Wait, how are they computing the loss?

danielhanchen 2 days ago

Oh sorry, it's the KL term - beta * KL, i.e. they set beta to 0.

The goal of it was to "force" the model not to stray too far from the original checkpoint, but it can hinder the model from learning new things.
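For context, that's the beta-weighted penalty in the usual clipped-surrogate objective (standard GRPO notation, not copied from their paper); setting beta = 0 just drops it:

    J(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\,\hat{A},\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}\big)\Big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
    \qquad r_t(\theta) = \frac{\pi_\theta(o_t \mid q,\, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q,\, o_{<t})}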

trc001 1 day ago

It's become trendy to delete it. I say trendy because many papers delete it without offering any proof that it is meaningless

mjburgess 2 days ago

It's just a penalty term that they delete

monkmartinez 2 days ago

At the risk of dating myself: Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use for templates if not --jinja?

danielhanchen 1 day ago

Oh thanks! Yes, I was gonna bring it up with them! Imo if there's a chat template, --jinja should be the default.

fzzzy 2 days ago

My impression from running the first R1 release locally was that it also does too much thinking.

reissbaker 1 day ago

Magistral Small seems wayyy too heavy-handed with its RL to me:

\boxed{Hey! How can I help you today?}

They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
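For anyone unfamiliar, the verifiable-reward setup typically boils down to something like this naive extractor (my guess at the general shape, not their actual code):

    import re
    from typing import Optional

    def extract_boxed_answer(completion: str) -> Optional[str]:
        # Naive extraction: take the last \boxed{...} span, ignoring nested braces.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        return matches[-1].strip() if matches else None

    def boxed_reward(completion: str, ground_truth: str) -> float:
        # 1.0 if the boxed answer string-matches the reference, else 0.0.
        answer = extract_boxed_answer(completion)
        return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

    # boxed_reward("... so the answer is \\boxed{42}.", "42")               -> 1.0
    # boxed_reward("\\boxed{Hey! How can I help you today?}", "42")         -> 0.0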

It also forgets to <think> unless you use their special system prompt reminding it to.

Honestly a little disappointing. It obviously benchmarks well, but it seems a bit overcooked for non-benchmark usage.

cluckindan 2 days ago

It does not do any thinking. It is a statistical model, just like the rest of them.

boredhedgehog 1 day ago

This kind of comment is the equivalent of going to dog owners' forums, analyzing the word choices in every post, and warning the dog owners about the dangers of anthropomorphizing their pets: an effort as accurate as it is boorish and ineffectual.

cluckindan 1 day ago

Dogs won't be influencing decisions that concern other people nearly as widely.

robmccoll 2 days ago

What are we doing when we think?

cluckindan 1 day ago

Human neurons are not reducible to arithmetic artificial neurons in a statistical model. Do not conflate them.

jeffhuys 1 day ago

Why not, actually?

cluckindan 1 day ago

Because we do not have a complete understanding of human neurons. How are we supposed to accurately model something we cannot directly observe?

TheDong 1 day ago

Do you also complain when someone says "Half-Life 2 has great water physics", responding with "Don't call it physics, we still don't understand all the physical laws of the universe, and they use limited-precision floating point anyway, so it's not water physics, it's just a bunch of math"?

Like, we've agreed that "water physics" and "cloth physics" in 3D graphics refer to a mathematical approximation of something we don't actually understand at the subatomic level (are there strings down there? Who knows).

Can "thinking" in AI not refer to this intentionally false imitation that has a similar observable outward effect?

Like, we're okay saying Minecraft's water has "water physics", so why are we not okay saying "in the AI context, thinking is a term for something that externally looks a bit like a human thinking, even though at a deeper layer it's unrelated"?

Or is thinking special? Is it like "soul", a word we must defend with our lives lest we lose our humanity? If I say "that building's been thinking about falling over for 50 years", did I commit a huge faux pas against my humanity?

autoexec 1 day ago

> Do you also complain when someone says "Half-life 2 has great water-physics"

I would if they said the water in Half-Life 2 was great for quenching your thirst, or that in the near future everyone will only drink water from Half-Life 2 and it will flow from our kitchen taps. However good Half-Life 2 is at approximating what water looks and acts like, it isn't capable of being a beverage and isn't likely to ever become one. Right now a lot of people are going around saying that what passes for AI these days has the ability to reason and that AGI is right around the corner. That's just as obvious a lie and every bit as unlikely, but the more it gets repeated the more people end up falling for it.

It's frustrating because at some point (if it hasn't happened already) you're going to find yourself feeling very thirsty and shocked to discover that the only thing you have access to is Half-Life 2 water, which does nothing for you except make you even more thirsty, since it looks close enough to remind you of the real thing. All because some idiot either fell for the hype, or saved so much money by not supplying you with real water that they don't care how thirsty it leaves you.

The more companies force the use of flawed and unreasoning AI to do things that require actual reasoning the worse your life is going to get. The constant misrepresentation of AI and what it's capable of is accelerating that outcome.

cluckindan 1 day ago

That’s comparing apples to oranges. Nobody is going to be making a real cruise ship based on game water physics simulations.

In such a task, better water simulations are used. We have those because we can directly observe the behavior of water under different conditions. It's okay because the people doing it are explicitly aware that they are using a simulation.

AI will get used in real decisions affecting other people, and the people making those decisions will be influenced by the terminology we choose to use.

inimino 1 day ago

Just because you don't know how does not mean that we can't.

cluckindan 1 day ago

Prove it, then.

otabdeveloper4 1 day ago

We don't know yet. But we do know it's certainly not statistical token prediction.

(People can do statistical token prediction too, but that's called "bullshitting", not "thinking". Thinking is a much wider class of activity.)

LordDragonfang 1 day ago

Do we know that with certainty? Do we actually?

Because my understanding is that how "thinking" works is actually still a total mystery. How is it we know for certain that the analog, electric-potential-based computing done by neurons isn't based on statistical prediction?

Do we have actual evidence of that, or are you just doing "statistical token prediction" yourself?

cluckindan 15 hours ago

You’re reversing the burden of proof, much as religious people often do. Absence of evidence is not evidence of absence, and so on.

LordDragonfang 3 hours ago

I'm not reversing it lol. You're the one making a claim, the burden of evidence is on you.

Absence of evidence is not evidence of absence, but it is still absence of evidence. Making a claim without any is more religious than not. After all, we know humans can't be descended from monkeys!

LordDragonfang 1 day ago

"Thinking" is a term of art referring to the hidden/internal output of "reasoning" models where they output "chain of thought" before giving an answer[1]. This technique and name stem from the early observation that LLMs do better when explicitly told to "think step by step"[2]. Hope that helps clarify things for you for future constructive discussion.

[1] https://arxiv.org/html/2410.10630v1

[2] https://arxiv.org/pdf/2205.11916

bobsomers 1 day ago

We are aware of the term of art.

The point being made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypeople into assuming these models are capable of a lot more than they really are.

That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other purpose, especially a true understanding of what’s going on.

LordDragonfang 1 day ago

If that was their point, it would have been more constructive to actually make it.

To your point, it's only anthropomorphization if you make the anthropocentric assumption that "thinking" refers to something that only humans can do.[1]

And I don't think it confuses laypeople, when literally telling it to "think" achieves results very similar to what it does for humans - it produces output that someone, given it out of context, would easily identify as "thinking out loud", and it improves the accuracy of results like... thinking does.

The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[2] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful lot like a (low-fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference, given enough accuracy.)

I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.

[1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.

[2] https://www.astralcodexten.com/p/janus-simulators

zer00eyz 1 day ago

> It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.

And yet we added a hand-wavy seventh to humanize a piece of technology.

MindTheAbstract 1 day ago

I know this is the terminology, but I'd argue that the activations are the actual thinking. It's probably too late to change that, but I wish people reserved "thinking" for what Anthropic and DeepMind are probing with their mech interp work.

andrepd 1 day ago

It's a misleading "term of art" which is more accurately described as a "term of marketing". Reasoning is precisely what LLMs don't do and it's precisely why they are unsuited to many tasks they are peddled for.

LordDragonfang 1 day ago

How are you defining "reasoning" such that you are confident that LLMs are definitely not doing it? What evidence do you have to that effect? (And are you certain that none of your reasoning applies to humans as well?)

cluckindan 1 day ago

They don’t ”think”.

https://arxiv.org/abs/2503.09211

They don’t ”reason”.

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

They don’t even always output their internal state accurately.

https://arxiv.org/abs/2505.05410

LordDragonfang 23 hours ago

> https://arxiv.org/abs/2503.09211

I am thoroughly unimpressed by this paper. It sets up a vague strawman definition of "thinking" that I'm not aware of anyone using (and makes no claim it applies to humans) and then knocks down the strawman.

It also leans way too heavily on determinism. For one thing, we have no way of knowing whether human brains are deterministic (until we settle whether reality itself is). For another, I doubt you would suddenly reverse your position if we created a LoRA composed of atmospheric noise, so determinism doesn't actually support your real position.

> https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

This one is more substantial, but:

"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. [...] Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. [...] We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."

It starts by saying "we actually don't understand them" (meaning we don't know well enough to give a yes or no) and then proceeds to list flaws that, as I keep saying, can also be applied to most (if not all) humans' ability to reason. Human reasoning also collapses in accuracy above a certain complexity, and humans certainly are observed to fail to use explicit algorithms and to reason inconsistently across puzzles.

So unless your definition of anthropomorphization excludes most humans, this is far from a slam dunk.

> They don’t even always output their internal state accurately.

I have some really bad news about humans for you. I believe (Buddha et al, 500 BCE) is the foundational text on this, but there's been some more recent research (Hume, 1739), (Kierkegaard, 1849)

cluckindan 16 hours ago

Whodathunkit, some people are so infatuated with their simulacra that they choose to fight tooth and nail in defense of the simulation.

My point was congruent with the argument that LLMs are not human and do not possess human-like thinking and reasoning, and you have conveniently demonstrated that.

LordDragonfang 3 hours ago

> My point was congruent with the argument that LLMs are not human and do not possess human-like thinking and reasoning, and you have conveniently demonstrated that.

I mean, they are obviously not humans, that is trivially true, yes.

I don't know what I said that makes you believe I demonstrated they don't possess human-like thinking and reasoning, though, considering I've mostly pointed out ways they seem similar to humans. Can you articulate your point there?

lxe 2 days ago

Thanks for all you do!

danielhanchen 2 days ago

Thanks!

trebligdivad 1 day ago

Nice! I'm running on CPU only, so it's interesting to compare: Magistral-Small-2506_Q8_0.gguf runs at under 2 tokens/s on my 16-core machine, but your UD-IQ2_XXS gets about 5.5 tokens/s, which is fast enough to be useful. It does hallucinate a bit more and loop a little, but it's still actually pretty good for something so small.

danielhanchen 1 day ago

Oh nice! I normally suggest maybe Q4_K_XL to be on the safe side :)

cpldcpu 2 days ago

But this is just the SFT ("distilled") model, not the one optimized with RL, right?

danielhanchen 2 days ago

Oh, I think it's SFT + RL, as mentioned in the paper - they said combining both is actually more performant than RL alone.