danielhanchen 2 days ago

Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They modified GRPO by (rough sketch below):

1. Removed the KL divergence penalty

2. Normalized by total length (Dr. GRPO style)

3. Normalized advantages per minibatch

4. Relaxed the trust region
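
Roughly, the four tweaks together look something like this - a minimal PyTorch sketch assuming a token-level PPO/GRPO-style loss (names like beta, eps_low, eps_high are mine, not Magistral's actual code):

    import torch

    def grpo_style_loss(logprobs, old_logprobs, ref_logprobs, rewards, mask,
                        beta=0.0,        # (1) beta = 0 drops the KL penalty
                        eps_low=0.2,
                        eps_high=0.28):  # (4) wider upper clip = relaxed trust region
        # logprobs / old_logprobs / ref_logprobs / mask: (batch, seq); rewards: (batch,)
        # (3) normalize advantages over the whole minibatch
        # (simplified here; GRPO proper centers each reward against its prompt group)
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        adv = adv.unsqueeze(-1)                          # broadcast over tokens

        ratio = torch.exp(logprobs - old_logprobs)       # importance ratio per token
        surrogate = torch.minimum(
            ratio * adv,
            torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv,
        )

        if beta > 0.0:                                   # optional KL-to-reference penalty
            log_r = ref_logprobs - logprobs
            surrogate = surrogate - beta * (torch.exp(log_r) - log_r - 1.0)

        # (2) Dr. GRPO style: sum token losses and divide by total tokens in the
        # batch instead of averaging each sequence by its own length
        return -(surrogate * mask).sum() / mask.sum()

The wider upper clip (eps_high > eps_low) is one common way to relax the trust region; I'm not 100% sure that matches exactly what Magistral does.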

gyrovagueGeist 2 days ago

Does anyone know why they added minibatch advantage normalization (or when it can be useful)?

The paper they cite, "What matters in on-policy RL", claims it does not make much of a difference on their suite of test problems, and mean-of-minibatch normalization doesn't seem theoretically motivated for convergence to the optimal policy.
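
To make the comparison concrete, here's my own toy contrast of the two choices (not code from either paper):

    import torch

    # 4 sampled completions per prompt; rows = prompts, cols = completions
    rewards = torch.tensor([[1., 0., 1., 1.],    # an "easy" prompt
                            [0., 0., 1., 0.]])   # a "hard" prompt

    # per-group (vanilla GRPO): each prompt's completions are scaled by their
    # own group's mean/std
    per_group = (rewards - rewards.mean(dim=1, keepdim=True)) \
                / (rewards.std(dim=1, keepdim=True) + 1e-8)

    # per-minibatch: one mean/std across every completion in the minibatch,
    # so easy and hard prompts end up on a shared scale
    per_minibatch = (rewards - rewards.mean()) / (rewards.std() + 1e-8)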

danielhanchen 1 day ago

Tbh I'm unsure as well. I only skimmed the paper, so if I find anything I'll post it here!

Onavo 2 days ago

> Removed KL Divergence

Wait, how are they computing the loss?

danielhanchen 2 days ago

Oh sorry, I mean the KL penalty term (- beta * KL), i.e. they set beta to 0.

The goal of it was to "force" the model not to stray too far from the original checkpoint, but it can also hinder the model from learning new things.
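
For reference, the objective has roughly this shape (notation simplified, mine):

    J(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\,\hat{A},\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}\big)\Big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)

With beta = 0 that last term just drops out.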

trc001 2 days ago

It's become trendy to delete it. I say trendy because many papers delete it without offering any proof that it is meaningless

mjburgess 2 days ago

It's just a penalty term that they delete