cpldcpu 2 days ago

But this is just the SFT ("distilled") model, not the one optimized with RL, right?

danielhanchen 2 days ago

Oh, I think it's SFT + RL, as mentioned in the paper. They said combining both actually performs better than RL alone.