Item 44237711

cpldcpu • 2 days ago

But this is just the SFT ("distilled") model, not the one optimized with RL, right?

danielhanchen • 2 days ago

Oh, I think it's SFT + RL, as mentioned in the paper. They said combining both is actually more performant than just RL.