Does anyone know why they added minibatch advantage normalization (or when it can be useful)?
The paper they cite, "What matters in on-policy RL", claims it makes little difference on their suite of test problems, and normalizing by the minibatch mean doesn't seem theoretically motivated for convergence to the optimal policy.
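For concreteness, here's a minimal sketch of what I understand per-minibatch advantage normalization to be. The function name `normalize_advantages`, the `eps` value, and the sample numbers are all made up for illustration, and this is not taken from their code:

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Shift and scale one minibatch of advantages to zero mean, unit std."""
    adv = np.asarray(adv, dtype=np.float64)
    # eps (an assumed value) guards against division by zero
    # when all advantages in the minibatch are identical.
    return (adv - adv.mean()) / (adv.std() + eps)

# Hypothetical minibatch in which every advantage is positive.
adv = np.array([1.0, 2.0, 3.0, 6.0])
norm = normalize_advantages(adv)
```

Note that the mean subtraction flips the sign of the below-average entries even though every raw advantage here was positive, which is the part I'm not sure is justified.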
Tbh I'm unsure as well. I only skimmed the paper, so if I find anything I'll post it here!