To train a reward model (RM) effectively, we need a learning objective that aligns the model's outputs with the collected human preference data. Since our data consists of pairwise comparisons for a given prompt $x$, indicating a preference for a "winning" response $y_w$ over a "losing" response $y_l$, we aim to train the RM to assign a higher scalar score to $y_w$ than to $y_l$.
The standard approach draws inspiration from probabilistic choice models like the Bradley-Terry model. We model the probability that humans prefer yw over yl given the prompt x as a function of the difference between the reward model's scores for each response. Specifically, we use the logistic function (sigmoid, σ) to map the score difference to a probability:
$$
P(y_w \succ y_l \mid x) = \sigma\big(RM_\theta(x, y_w) - RM_\theta(x, y_l)\big)
$$

Here:

- $RM_\theta(x, y)$ is the scalar score the reward model, with parameters $\theta$, assigns to response $y$ given prompt $x$.
- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic (sigmoid) function, mapping the score difference to a probability in $(0, 1)$.
- $y_w \succ y_l$ means that $y_w$ is preferred over $y_l$.
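As a minimal sketch of this probability, assuming a PyTorch setting where `score_w` and `score_l` stand in for the scalar outputs $RM_\theta(x, y_w)$ and $RM_\theta(x, y_l)$:

```python
import torch

def preference_probability(score_w: torch.Tensor, score_l: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style probability that y_w is preferred over y_l,
    computed from the reward model's scalar scores for each response."""
    # The sigmoid maps the score difference from (-inf, inf) to (0, 1).
    return torch.sigmoid(score_w - score_l)

# Example: a score gap of 1.5 in favor of y_w
p = preference_probability(torch.tensor(2.0), torch.tensor(0.5))
print(p)  # tensor(0.8176): y_w is preferred with ~82% probability
```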
The goal of training is to find the parameters $\theta$ that maximize the likelihood of observing the preferences expressed in our dataset $D = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}$. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood of the preferences. For a single preference pair $(x, y_w, y_l)$, the negative log-likelihood loss is:
$$
\mathcal{L}(\theta; x, y_w, y_l) = -\log P(y_w \succ y_l \mid x)
$$

Substituting the probability expression, we get:
$$
\mathcal{L}(\theta; x, y_w, y_l) = -\log \sigma\big(RM_\theta(x, y_w) - RM_\theta(x, y_l)\big)
$$

This is often referred to as the pairwise logistic loss.
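In code, this per-pair loss is a one-liner. The sketch below assumes PyTorch and uses `F.logsigmoid`, which is numerically more stable than composing `log` and `sigmoid` directly:

```python
import torch
import torch.nn.functional as F

def pairwise_logistic_loss(score_w: torch.Tensor, score_l: torch.Tensor) -> torch.Tensor:
    """Per-pair loss: -log sigma(score_w - score_l).

    F.logsigmoid avoids the underflow that -torch.log(torch.sigmoid(...))
    can hit when the score difference is strongly negative.
    """
    return -F.logsigmoid(score_w - score_l)

# Example: the preferred response scores 2.0, the rejected one 0.5
loss = pairwise_logistic_loss(torch.tensor(2.0), torch.tensor(0.5))
print(loss)  # tensor(0.2014), i.e. -log(0.8176)
```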
To train the model, we minimize the average loss over the entire preference dataset $D$:

$$
\mathcal{L}_{\text{total}}(\theta) = -\frac{1}{|D|} \sum_{(x, y_w, y_l) \in D} \log \sigma\big(RM_\theta(x, y_w) - RM_\theta(x, y_l)\big)
$$

Minimizing this loss function using gradient-based optimization methods (like Adam) encourages the reward model to assign a higher score to the preferred response $y_w$ than to the less preferred response $y_l$. The larger the difference $RM_\theta(x, y_w) - RM_\theta(x, y_l)$, the lower the loss for that specific pair, pushing the model toward correctly ranking responses according to human judgments.
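Putting the pieces together, one training step over a mini-batch of preference pairs might look like the sketch below, assuming a hypothetical `reward_model` that maps tokenized (prompt, response) pairs to scalar scores and a `batch` dictionary with assumed keys `prompt_ids`, `chosen_ids`, and `rejected_ids`:

```python
import torch
import torch.nn.functional as F

def reward_model_training_step(reward_model, optimizer, batch):
    """One gradient step on a mini-batch of preference pairs (x, y_w, y_l).

    Assumes `reward_model(prompt_ids, response_ids)` returns one scalar
    score per (prompt, response) pair in the batch.
    """
    score_w = reward_model(batch["prompt_ids"], batch["chosen_ids"])    # RM_theta(x, y_w), shape (B,)
    score_l = reward_model(batch["prompt_ids"], batch["rejected_ids"])  # RM_theta(x, y_l), shape (B,)

    # Batch estimate of L_total: mean of -log sigma(score_w - score_l)
    loss = -F.logsigmoid(score_w - score_l).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (hypothetical hyperparameters):
# optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-5)
```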
The following diagram illustrates the computation flow for a single preference pair during training:
Computation flow for calculating the pairwise logistic loss for a single preference sample (x,yw,yl). The reward model RMθ computes scalar scores for both responses given the prompt. The difference in scores is passed through a sigmoid function, and the negative logarithm of this probability forms the loss contribution for this sample. This loss is then used to update the model parameters θ via backpropagation.
This training objective directly translates the pairwise human preferences into a gradient signal that shapes the reward model. A well-trained RM, optimized using this objective, provides the crucial reward signal needed for the subsequent reinforcement learning phase, guiding the language model towards generating responses that align better with human preferences.