Once human preference data has been collected, typically in the form of pairwise comparisons (prompt, chosen response, rejected response), the next step in the Reinforcement Learning from Human Feedback (RLHF) pipeline is to train a Reward Model (RM). The purpose of the RM is to learn a function that maps a prompt and a potential response to a scalar value, representing the degree to which that response aligns with human preferences for the given prompt. This learned reward function will later serve as the objective signal for fine-tuning the language model using reinforcement learning.
The architecture for the reward model is often closely related to the base language model being aligned. A common practice is to start with the Supervised Fine-Tuned (SFT) model itself and modify its final layer. Instead of predicting the next token distribution, the RM's head is adapted to output a single scalar value.
Specifically, the input to the RM is the concatenation of the prompt $x$ and a candidate response $y$. This combined sequence is processed by the transformer architecture. The hidden state corresponding to the final token of the sequence (often the end-of-sequence token) is then passed through a linear layer (the reward head) to produce the scalar reward score.
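To make this concrete, here is a minimal sketch of what such a model might look like in PyTorch, assuming a Hugging Face-style backbone that returns last_hidden_state; the RewardModel and RewardOutput names and the .rewards field are illustrative choices (matching the loss snippet later in this section) rather than a specific library API:

import torch
import torch.nn as nn
from dataclasses import dataclass

@dataclass
class RewardOutput:
    rewards: torch.Tensor  # one scalar reward per sequence, shape (batch_size,)

class RewardModel(nn.Module):
    """Illustrative wrapper: a transformer backbone plus a scalar reward head."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # e.g., the SFT transformer body
        self.reward_head = nn.Linear(hidden_size, 1)  # linear layer producing the reward

    @property
    def device(self):
        return next(self.parameters()).device

    def forward(self, input_ids, attention_mask=None, **kwargs):
        # Hidden states for every token: (batch_size, seq_len, hidden_size)
        outputs = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask, **kwargs
        )
        hidden_states = outputs.last_hidden_state

        # Index of the final non-padding token in each sequence
        # (assumes right padding when an attention mask is provided)
        if attention_mask is not None:
            last_idx = attention_mask.sum(dim=1) - 1
        else:
            last_idx = torch.full(
                (input_ids.size(0),), input_ids.size(1) - 1, device=input_ids.device
            )
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        final_hidden = hidden_states[batch_idx, last_idx]    # (batch_size, hidden_size)

        # Project the final hidden state down to a single scalar per sequence
        rewards = self.reward_head(final_hidden).squeeze(-1)  # (batch_size,)
        return RewardOutput(rewards=rewards)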
Using the SFT model as the base for the RM offers significant advantages: the RM starts with the same language understanding and exposure to the prompt and response distribution as the model it will score, so training only needs to teach it preference scoring rather than language modeling from scratch.
Let $r_\theta(x, y)$ denote the scalar reward output by the RM with parameters $\theta$ for prompt $x$ and response $y$.
The RM is trained on the collected preference dataset $D = \{(x^{(i)}, y_c^{(i)}, y_r^{(i)})\}_{i=1}^{N}$, where $y_c$ is the response preferred by humans (chosen) and $y_r$ is the response deemed less preferable (rejected) for prompt $x$. The objective is to train the RM such that it assigns a higher score to the chosen response compared to the rejected one for the same prompt:
$$r_\theta(x, y_c) > r_\theta(x, y_r)$$
This is typically framed as a binary classification problem on pairs of responses. A common approach adapts the Bradley-Terry model, which models the probability that $y_c$ is preferred over $y_r$. This probability can be modeled using the difference in their reward scores passed through a logistic sigmoid function $\sigma(z) = 1/(1 + e^{-z})$:
$$P(y_c \succ y_r \mid x) = \sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)$$
The RM is trained by minimizing the negative log-likelihood of the human preferences in the dataset $D$. The loss function becomes:
$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_c, y_r) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right)\right]$$
This loss encourages the difference $r_\theta(x, y_c) - r_\theta(x, y_r)$ to be large and positive, effectively maximizing the probability of correctly classifying the preferred response according to the human labels. Sometimes a margin term is added, but this basic form is widely used.
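To get a feel for how this loss behaves, the short sketch below evaluates the per-pair loss for a few hypothetical reward gaps $r_\theta(x, y_c) - r_\theta(x, y_r)$:

import torch
import torch.nn.functional as F

# Hypothetical reward gaps r(x, y_c) - r(x, y_r) for three preference pairs
reward_gaps = torch.tensor([2.0, 0.0, -1.0])

# Per-pair loss: -log(sigmoid(gap))
per_pair_loss = -F.logsigmoid(reward_gaps)
print(per_pair_loss)  # tensor([0.1269, 0.6931, 1.3133])
# A large positive gap yields a small loss, a zero gap costs log(2) ≈ 0.693,
# and a negative gap (the RM prefers the rejected response) is penalized most.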
During training, each data point $(x, y_c, y_r)$ requires two forward passes through the RM: one for the prompt concatenated with the chosen response ($x \oplus y_c$) and one for the prompt concatenated with the rejected response ($x \oplus y_r$).
Here's a simplified PyTorch snippet illustrating the loss calculation within a training step:
import torch
import torch.nn.functional as F

# Assume 'reward_model' is your RM instance (e.g., a Transformer with a
# scalar head), 'tokenizer' is your tokenizer instance, and 'batch'
# contains tuples of (prompt, chosen_response, rejected_response) strings.

def compute_rm_loss(reward_model, tokenizer, batch):
    """Computes the pairwise ranking loss for a batch of preference data."""
    prompts, chosen_responses, rejected_responses = batch

    # Tokenize and prepare inputs for chosen responses
    chosen_inputs = tokenizer(
        [p + c for p, c in zip(prompts, chosen_responses)],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024,  # Example max length
    )
    # Move tensors to the correct device
    chosen_inputs = {
        k: v.to(reward_model.device) for k, v in chosen_inputs.items()
    }

    # Tokenize and prepare inputs for rejected responses
    rejected_inputs = tokenizer(
        [p + r for p, r in zip(prompts, rejected_responses)],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024,  # Example max length
    )
    rejected_inputs = {
        k: v.to(reward_model.device) for k, v in rejected_inputs.items()
    }

    # Get reward scores from the model. The forward pass is assumed to
    # return one scalar score per sequence via a '.rewards' attribute.
    chosen_rewards = reward_model(**chosen_inputs).rewards
    rejected_rewards = reward_model(**rejected_inputs).rewards

    # Pairwise ranking loss: -log(sigmoid(chosen_rewards - rejected_rewards))
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss

# --- In your training loop ---
# optimizer.zero_grad()
# loss = compute_rm_loss(reward_model, tokenizer, batch_data)
# loss.backward()
# optimizer.step()
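Expanding the commented loop above, a fuller training loop might look like the sketch below; preference_loader (a DataLoader yielding (prompts, chosen, rejected) string tuples), num_epochs, and the learning rate are illustrative assumptions, not fixed choices:

from torch.optim import AdamW

# Illustrative hyperparameters and data loader (assumptions, not prescriptions)
optimizer = AdamW(reward_model.parameters(), lr=1e-5)
num_epochs = 1

reward_model.train()
for epoch in range(num_epochs):
    for batch_data in preference_loader:  # yields (prompts, chosen, rejected)
        optimizer.zero_grad()
        loss = compute_rm_loss(reward_model, tokenizer, batch_data)
        loss.backward()
        # Gradient clipping is a common stabilizer for RM training
        torch.nn.utils.clip_grad_norm_(reward_model.parameters(), max_norm=1.0)
        optimizer.step()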
Evaluating the RM's performance is important before using it in the RL phase. The primary metric is accuracy on a held-out set of preference pairs. This measures how often the RM correctly predicts the human-preferred response, i.e., assigns $r_\theta(x, y_c) > r_\theta(x, y_r)$. Accuracies typically range from 65% to 80%, depending on the task difficulty, data quality, and model capacity.
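This accuracy can be computed with the same interface as the loss function above: a pair counts as correct when the chosen response receives the higher score. The sketch below assumes the same reward_model and tokenizer as before and a hypothetical eval_loader yielding (prompts, chosen, rejected) string tuples:

import torch

@torch.no_grad()
def compute_rm_accuracy(reward_model, tokenizer, eval_loader):
    """Fraction of held-out pairs where the chosen response scores higher."""
    reward_model.eval()
    num_correct, num_total = 0, 0
    for prompts, chosen, rejected in eval_loader:
        chosen_inputs = tokenizer(
            [p + c for p, c in zip(prompts, chosen)],
            return_tensors="pt", padding=True, truncation=True, max_length=1024,
        )
        rejected_inputs = tokenizer(
            [p + r for p, r in zip(prompts, rejected)],
            return_tensors="pt", padding=True, truncation=True, max_length=1024,
        )
        chosen_inputs = {k: v.to(reward_model.device) for k, v in chosen_inputs.items()}
        rejected_inputs = {k: v.to(reward_model.device) for k, v in rejected_inputs.items()}

        chosen_rewards = reward_model(**chosen_inputs).rewards
        rejected_rewards = reward_model(**rejected_inputs).rewards

        # A pair is classified correctly when the chosen response outscores the rejected one
        num_correct += (chosen_rewards > rejected_rewards).sum().item()
        num_total += chosen_rewards.size(0)
    return num_correct / num_total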
Figure: Basic workflow for training the reward model using preference data and a pairwise ranking loss.
Beyond accuracy, qualitative analysis is useful. Examining cases where the RM strongly agrees or disagrees with human judgments can reveal biases or weaknesses in the model. It's also helpful to check if the reward scores correlate with other intuitive metrics of quality, like response length, coherence, or helpfulness, although these correlations might be weak.
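As one hypothetical diagnostic of this kind, the snippet below computes the Pearson correlation between RM scores and response length, assuming you have already collected a list of rewards and the corresponding response strings over an evaluation set:

import numpy as np

# 'rewards' and 'responses' are assumed to have been collected over an
# evaluation set: scalar RM scores and the corresponding response strings.
lengths = np.array([len(tokenizer.encode(r)) for r in responses])
scores = np.array(rewards)

# Pearson correlation between reward and response length. A strong positive
# value can indicate a length bias (the RM rewarding verbosity), which is
# worth investigating further.
corr = np.corrcoef(scores, lengths)[0, 1]
print(f"Reward-length correlation: {corr:.2f}")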
Successfully training a reward model is a significant step in the RLHF process. A well-trained RM provides the reward signal needed to guide the LLM during the subsequent reinforcement learning phase, steering it towards generating outputs that better align with desired characteristics such as helpfulness, honesty, and harmlessness.