But reward models are always curated by humans. If you generate a reward model with an LLM, it will contain hallucinations that need to be corrected by humans. But that is what a reward model is for: to correct the hallucinations of LLMs.
So yeah theoretically you could generate reward models with LLMs, but they won't be any good, unless they are curated by other reward models that are ultimately curated by humans.
> But reward models are always curated by humans.
There is no inherent reason why they need to be.
> So yeah theoretically you could generate reward models with LLMs, but they won't be any good, unless they are curated by other reward models that are ultimately curated by humans.
This reasoning is begging the question: the premise is only true if the conclusion is already assumed to be true. It's therefore a logically invalid argument.
There is no inherent reason why this needs to be the case.
Sorry but I don't follow your logic. Are you claiming that reward models that aren't curated by humans perform as well as ones that are?
Then what is a reward model's function according to you?
I'm claiming exactly what I wrote: that there is no inherent reason why a human-curated one needs to be better.
In reinforcement learning and related fields, a _reward model_ is a function that assigns a scalar value (a reward) to a given state, representing how desirable that state is. You're at liberty to have compound states: for example, a trajectory (often called tau) or a state-action pair (typically written s and a).
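For concreteness, here is a minimal Python sketch of that definition: a reward model as a function mapping a state-action pair (or a whole trajectory tau) to a scalar. The linear form, the dimensions, and the random weights are illustrative assumptions, not anything anyone in this thread is proposing; in practice the parameters would be learned, e.g. from preference data.

```python
# Minimal sketch of a reward model: a function from a state-action pair
# (s, a) to a scalar reward. The linear form and random weights are
# purely illustrative; real reward models are learned from data.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 4
ACTION_DIM = 2

# Illustrative parameters of a linear reward model R(s, a) = w . [s; a] + b
w = rng.normal(size=STATE_DIM + ACTION_DIM)
b = 0.0

def reward_model(state: np.ndarray, action: np.ndarray) -> float:
    """Assign a scalar reward to a state-action pair (s, a)."""
    features = np.concatenate([state, action])
    return float(w @ features + b)

def trajectory_return(trajectory: list[tuple[np.ndarray, np.ndarray]]) -> float:
    """Score a whole trajectory tau = [(s_0, a_0), (s_1, a_1), ...]
    by summing the per-step rewards."""
    return sum(reward_model(s, a) for s, a in trajectory)

# Example usage with random states and actions
s = rng.normal(size=STATE_DIM)
a = rng.normal(size=ACTION_DIM)
print("R(s, a) =", reward_model(s, a))

tau = [(rng.normal(size=STATE_DIM), rng.normal(size=ACTION_DIM)) for _ in range(5)]
print("R(tau) =", trajectory_return(tau))
```

Nothing in this definition says where the parameters come from, which is the point of the disagreement above: human curation is one way to fit them, not part of what a reward model is.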