But this is just the SFT ("distilled") model, not the one optimized with RL, right?
Oh, I think it's SFT + RL, as mentioned in the paper. They said combining both actually performs better than RL alone.