Supervised Fine-Tuning aligns a model to mimic demonstrations, but it doesn't explicitly teach the model what makes one response better than another according to human values. To instill this sense of quality, we need a different approach. Instead of asking humans for absolute scores for generated text, which can be difficult, inconsistent, and subjective, RLHF relies on a more intuitive form of feedback: pairwise comparisons.
Humans are generally better at stating which of two options they prefer rather than assigning a precise numerical score to each. Think about judging essays, art, or even simple choices. It's often easier to say "A is better than B" than it is to assign A a score of 8.5 and B a score of 6.2 with confidence and consistency. Learning from preferences leverages this human capability.
The core idea is that human preferences between pairs of responses, given the same prompt, implicitly reveal an underlying reward function. If a human consistently prefers response y1 over response y2 for a given prompt x, it suggests that y1 possesses more of the desired qualities (helpfulness, harmlessness, accuracy, etc.) than y2. We assume there's a latent scalar function, the Reward Model (RM), denoted RM(x,y), that assigns a score reflecting these desired qualities. The preference y1≻y2 (read as "y1 is preferred over y2") suggests that RM(x,y1)>RM(x,y2).
Our goal is to train a model, typically a neural network based on the same architecture as the language model being tuned, to approximate this latent reward function. This RM takes the prompt and a response as input and outputs a scalar value representing the predicted human preference score.
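To make this concrete, here is a minimal PyTorch sketch of one common design: a transformer backbone (the same architecture as the language model being tuned) wrapped with a small linear head that maps the final token's hidden state to a scalar score. The backbone output interface and the last-token pooling choice are illustrative assumptions, not the only way to build an RM.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Wraps a transformer backbone and maps its final hidden state to a scalar score."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # e.g. same architecture as the LM being tuned
        self.score_head = nn.Linear(hidden_size, 1)   # scalar reward head

    def forward(self, input_ids, attention_mask):
        # Assumes a Hugging Face-style backbone that returns .last_hidden_state
        # of shape (batch, seq_len, hidden_size).
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool on the last non-padding token of each sequence (one common choice,
        # assuming right padding).
        last_idx = attention_mask.long().sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        # Scalar reward RM(x, y) for each (prompt, response) sequence.
        return self.score_head(pooled).squeeze(-1)
```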
How do we train such a model using only comparison data? We frame the learning problem probabilistically. Models like the Bradley-Terry model (or variations) provide a mathematical link between pairwise comparisons and the underlying scores. As introduced in the chapter overview, we model the probability that a human prefers response y1 over y2 given prompt x as a function of the difference between their respective reward model scores:
P(y1 ≻ y2 | x) = σ(RM(x, y1) − RM(x, y2))

Here, σ is the sigmoid function, σ(z) = 1 / (1 + e^(−z)).
Let's unpack this formula:

- RM(x, y1) − RM(x, y2) is the margin by which the reward model rates y1 above y2 for the prompt x.
- The sigmoid σ squashes that margin into a probability between 0 and 1.
- If the two scores are equal, the margin is 0 and σ(0) = 0.5: the model predicts no preference either way.
- The larger the margin in favor of y1, the closer the predicted probability of y1 ≻ y2 gets to 1; a large margin in favor of y2 pushes it toward 0.

The short example after this list makes these behaviors concrete.
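A quick numerical check, using plain Python and made-up score values, shows how the preference probability behaves as the score margin changes:

```python
import math

def preference_probability(score_preferred: float, score_other: float) -> float:
    """P(y1 > y2 | x) = sigmoid(RM(x, y1) - RM(x, y2))."""
    diff = score_preferred - score_other
    return 1.0 / (1.0 + math.exp(-diff))

# Illustrative reward scores, not outputs of a real model.
print(preference_probability(2.0, 2.0))   # equal scores -> 0.5 (no preference)
print(preference_probability(3.0, 1.0))   # +2 margin    -> ~0.88
print(preference_probability(6.0, 1.0))   # +5 margin    -> ~0.99
```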
This probabilistic framing allows us to use standard machine learning techniques, specifically binary cross-entropy loss, to train the RM. The training data consists of tuples (x,y1,y2), where y1 is the preferred ("winning") response and y2 is the non-preferred ("losing") response according to human labelers. The model learns to adjust its parameters such that it assigns higher scores to winning responses and lower scores to losing responses for a given prompt.
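In code, this pairwise objective is compact. The sketch below assumes a `reward_model` that maps a batch of tokenized (prompt + response) sequences to scalar scores, as in the architecture sketch above; with the preferred response treated as the positive class, the binary cross-entropy reduces to −log σ of the score difference.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, batch_winner, batch_loser):
    """Pairwise preference loss: -log sigmoid(score_winner - score_loser).

    batch_winner / batch_loser hold the tokenized (prompt + response) tensors
    for the preferred and non-preferred responses, respectively.
    """
    score_winner = reward_model(**batch_winner)   # RM(x, y1), shape (batch,)
    score_loser = reward_model(**batch_loser)     # RM(x, y2), shape (batch,)
    # Binary cross-entropy with target 1 for "winner preferred":
    # -log sigmoid(score_winner - score_loser), averaged over the batch.
    return -F.logsigmoid(score_winner - score_loser).mean()
```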
Figure: a prompt and two responses lead to a human preference label, which serves as the training target for the Reward Model. The RM scores both responses; the difference between the scores, passed through the sigmoid, gives the predicted probability that one response is preferred over the other.
Learning from preferences offers several advantages over directly specifying or regressing onto absolute reward scores:

- Comparative judgments are easier and faster for human labelers than assigning calibrated numeric scores.
- Preferences tend to be more consistent across labelers and over time, since they do not require everyone to share the same scoring scale.
- Pairwise comparisons can capture nuanced qualities such as helpfulness, harmlessness, and accuracy that are hard to express as an explicit numeric rubric.
By training a model to predict these pairwise preferences, we create a reward signal that reflects complex, nuanced human judgments. This learned RM becomes the objective function that the language model will later be optimized against using reinforcement learning techniques like PPO, guiding it towards generating responses that align better with human expectations. The next sections delve into the practicalities of gathering this preference data and training the reward model itself.