Reward model
Definition: A reward model is a model trained to predict a quality score for a response, mimicking human preferences, in order to guide the training of another model.
It acts as an automatic judge at the heart of RLHF: the main model is optimized to earn high scores from it. The quality of this reward model directly drives the quality of the resulting alignment.