As introduced, Reinforcement Learning from Human Feedback (RLHF) aligns a language model's behavior with human preferences. The first major step in this process is training a separate model, the Reward Model (RM), denoted $r_\phi(x, y)$. Its objective is to learn a function that takes a prompt $x$ and a generated response $y$ as input and outputs a scalar value representing how much a human would likely prefer that response. Essentially, the reward model acts as a learned proxy for human judgment.
Instead of directly using human feedback during the computationally intensive LLM fine-tuning phase (which would be slow and impractical), we first distill human preferences into the reward model. This RM can then provide dense feedback signals during the subsequent policy optimization stage (using algorithms like PPO), guiding the LLM $\pi_\theta(y \mid x)$ towards generating outputs that score highly according to the learned preference function.
Training the reward model requires a specialized dataset consisting of human preferences. While asking humans for absolute quality scores (e.g., rating a response from 1 to 10) is possible, it often suffers from inconsistency and poor calibration across different annotators and prompts.
A more common and often more reliable approach is to collect comparison data. In this setup, for a given prompt $x$, multiple responses $(y_1, y_2, \ldots, y_k)$ are generated by one or more versions of the language model. Human annotators are then asked to rank these responses from best to worst, or more simply, to choose the single best response among a pair.
This comparison process yields data points typically structured as tuples $(x, y_w, y_l)$, where $y_w$ is the preferred ("winning") response and $y_l$ is the less preferred ("losing") response for the prompt $x$. Compiling a large dataset of these comparisons, $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}$, forms the foundation for training the reward model.
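For concreteness, a single comparison record can be represented as a small structure holding the prompt, the preferred response, and the rejected response. The sketch below shows one possible Python layout; the field names `prompt`, `chosen`, and `rejected` are illustrative conventions assumed here, not part of any fixed standard.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One human comparison: (x, y_w, y_l)."""
    prompt: str    # x
    chosen: str    # y_w, the preferred ("winning") response
    rejected: str  # y_l, the less preferred ("losing") response

# A tiny illustrative dataset D of comparison tuples (contents are made up for illustration).
preference_data = [
    PreferencePair(
        prompt="Explain photosynthesis in one sentence.",
        chosen=(
            "Photosynthesis is the process by which plants convert light, "
            "water, and CO2 into sugars and oxygen."
        ),
        rejected="Photosynthesis is when plants eat sunlight.",
    ),
    # ... many more (x, y_w, y_l) tuples collected from annotators
]
```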
Diagram illustrating the typical workflow for generating human preference data used in reward model training.
The architecture for the reward model often mirrors the base language model being fine-tuned. A common practice is to start with the pre-trained weights of the LLM (or a smaller version for efficiency) and replace or append a final linear layer. This new layer is trained to output a single scalar value (the reward score) instead of predicting the next token probabilities.
Initializing the RM from the pre-trained LLM is advantageous because the model already possesses a strong understanding of language structure, semantics, and context captured in the prompt x and response y. The training process then focuses on adapting this understanding to predict the specific human preference signal represented in the comparison data.
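A minimal sketch of this architecture, assuming PyTorch and the Hugging Face `transformers` library, might look as follows. The base checkpoint name and the choice to read the reward from the last non-padding token's hidden state are illustrative assumptions; implementations vary in how they pool the sequence.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Pretrained LM backbone with a scalar value head implementing r_phi(x, y)."""

    def __init__(self, base_model_name: str = "gpt2"):  # base checkpoint is an illustrative choice
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        # New head: maps a hidden state to a single scalar reward instead of token logits.
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state             # (batch, seq_len, hidden)
        # Use the hidden state of the last non-padding token as the sequence summary.
        last_token_idx = attention_mask.sum(dim=1) - 1        # (batch,)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        summary = hidden_states[batch_idx, last_token_idx]    # (batch, hidden)
        return self.reward_head(summary).squeeze(-1)          # (batch,) scalar rewards
```

Because the backbone is initialized from the pretrained LLM, only the new head starts from random weights; the rest of the model begins with its existing language understanding and is adapted during reward model training.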
The core idea is to train the RM parameters $\phi$ such that the preferred response $y_w$ consistently receives a higher score than the rejected response $y_l$ for the same prompt $x$. This is typically framed as a classification or ranking problem.
A widely used objective function is based on the Bradley-Terry model, which models the probability that $y_w$ is preferred over $y_l$:
$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Here, $\sigma$ is the sigmoid function. The training objective is to maximize the likelihood of the observed human preferences in the dataset $\mathcal{D}$. This translates to minimizing the negative log-likelihood loss:
$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

This loss function encourages the reward model $r_\phi$ to output a larger difference between the scores of the winning and losing responses. The training proceeds using standard gradient-based optimization methods like Adam.
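Assuming the `RewardModel` sketch above and batches in which each prompt has been tokenized together with its chosen and rejected response, the loss can be written in a few lines of PyTorch; `F.logsigmoid` is a numerically stable way to compute $\log\sigma(\cdot)$. The batch tensor names below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Negative log-likelihood under the Bradley-Terry model:
    L(phi) = -E[ log sigma( r_phi(x, y_w) - r_phi(x, y_l) ) ].
    """
    reward_chosen = reward_model(chosen_ids, chosen_mask)        # r_phi(x, y_w), shape (batch,)
    reward_rejected = reward_model(rejected_ids, rejected_mask)  # r_phi(x, y_l), shape (batch,)
    # Maximizing the score margin between winning and losing responses minimizes this loss.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Sketch of a single training step with Adam (batch tensors assumed to be prepared elsewhere):
# optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-5)
# loss = pairwise_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```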
Once a sufficiently accurate reward model is trained, it serves as the objective function for the next stage: fine-tuning the language model's policy using reinforcement learning.