After collecting human preference data, typically in the form of triples (x,yw,yl) where yw is the preferred ("winning") response and yl is the rejected ("losing") response to a prompt x, the next step in the RLHF pipeline is to train a reward model, rθ(x,y). This model's purpose is to learn a scalar function that reflects the human preferences captured in the dataset. Essentially, we want rθ(x,yw)>rθ(x,yl) for the observed preference pairs. This learned reward function will later guide the fine-tuning of the LLM policy.
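For concreteness, a single preference record can be represented as a simple structure like the sketch below. The field names 'prompt', 'chosen', and 'rejected' are illustrative choices; they match the batch format assumed in the training code later in this section.

# One preference example (x, yw, yl) as a plain Python dict.
# Field names are illustrative and match the dataloader format used below.
preference_example = {
    "prompt": "Explain in one sentence why the sky is blue.",                      # x
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter most, so the sky looks blue.",  # yw
    "rejected": "The sky is blue because it reflects the color of the ocean.",     # yl
}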
The architecture of the reward model is a significant design choice. A prevalent and effective approach is to leverage the power of pre-trained language models themselves.
LLM-Based Architecture: The most common strategy involves initializing the reward model using the weights of the same pre-trained LLM that will eventually be fine-tuned (or a model from the same family, perhaps differing in size). The core idea is to adapt this pre-trained model to predict preference scores instead of generating text.
The prompt and response are typically concatenated into a single input sequence, for example: prompt_text [SEP] response_text [EOS]. The final hidden state corresponding to the last token (often the [EOS] token) is extracted. This vector representation, which summarizes the entire input sequence (prompt and response), is then passed through a linear layer that projects it down to a single scalar value. This scalar output represents the predicted reward rθ(x,y).
Figure: A typical reward model architecture using a pre-trained LLM base, processing concatenated prompt and response, and outputting a scalar reward via a linear head attached to the final token's hidden state.
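A minimal sketch of this architecture is shown below. The class name RewardModel, the use of Hugging Face's AutoModel, and the SimpleNamespace return type are assumptions made for illustration (the SimpleNamespace simply exposes a .rewards attribute so the sketch lines up with the training code later in this section), not a prescribed implementation.

import torch
import torch.nn as nn
from types import SimpleNamespace
from transformers import AutoModel

class RewardModel(nn.Module):
    """Illustrative reward model: pre-trained transformer base plus a scalar head."""
    def __init__(self, base_model_name):
        super().__init__()
        # Pre-trained LLM backbone producing hidden states for the full sequence.
        self.base = AutoModel.from_pretrained(base_model_name)
        # Linear head projecting the final token's hidden state to a single scalar reward.
        self.reward_head = nn.Linear(self.base.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, **kwargs):  # **kwargs absorbs extra tokenizer fields
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state                 # (batch, seq_len, hidden)
        # Hidden state of the last non-padding token (assumes right padding).
        last_token_idx = attention_mask.sum(dim=1) - 1             # (batch,)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        last_hidden = hidden_states[batch_idx, last_token_idx]     # (batch, hidden)
        rewards = self.reward_head(last_hidden)                    # (batch, 1)
        # Return an object with a 'rewards' attribute, matching the training sketch below.
        return SimpleNamespace(rewards=rewards)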
Initialization: Starting with pre-trained weights allows the reward model to benefit from the language understanding capabilities learned during the LLM's initial training. During reward model training, one might choose to fine-tune all the parameters of the base LLM or freeze the majority of the base layers and only train the final layers and the reward head. Fine-tuning the entire model can lead to better adaptation but is more computationally expensive and risks catastrophic forgetting. Freezing most layers is faster but might limit the model's ability to learn nuanced preferences.
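A minimal sketch of the frozen-base option, assuming a model with base and reward_head submodules like the illustrative RewardModel above:

# Freeze the pre-trained base entirely; train only the reward head.
for param in reward_model.base.parameters():
    param.requires_grad = False
for param in reward_model.reward_head.parameters():
    param.requires_grad = True

# Give the optimizer only the parameters that remain trainable.
optimizer = torch.optim.AdamW(
    (p for p in reward_model.parameters() if p.requires_grad),
    lr=1e-5,  # illustrative learning rate
)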
Alternative Architectures: While using the target LLM's architecture is standard, alternatives exist. One could train a smaller, independent transformer model or even use different model types. However, these approaches might struggle to capture the same level of linguistic nuance required to accurately judge the quality of outputs generated by a large base model. Using a model from the same family generally ensures architectural compatibility and leverages relevant pre-trained knowledge.
Given the preference data (x,yw,yl), the reward model is trained to assign a higher score to yw than to yl. The standard objective function for this is a pairwise ranking loss based on the Bradley-Terry model, which is commonly used for modeling preferences between pairs of items.
The loss function aims to maximize the probability that the difference in rewards between the chosen and rejected responses aligns with the human label. It is typically formulated as the negative log-likelihood of the preferences:
$$
L(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]
$$
Let's break down this formula:
(x,yw,yl)∼D: a preference triple (prompt, chosen response, rejected response) sampled from the collected preference dataset D.
rθ(x,yw)−rθ(x,yl): the margin by which the model scores the chosen response above the rejected one.
σ: the logistic sigmoid function, which converts this margin into the Bradley-Terry probability that yw is preferred over yl.
Negative log-likelihood: taking −log of that probability turns maximizing it into a loss to minimize, averaged over the dataset.
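To build intuition for how this loss behaves, the short check below evaluates it at a few illustrative reward margins:

import torch
import torch.nn.functional as F

# Chosen response scored well above the rejected one: small loss.
print(-F.logsigmoid(torch.tensor(2.0)))   # ~0.127
# Model cannot tell the two responses apart: loss = log(2).
print(-F.logsigmoid(torch.tensor(0.0)))   # ~0.693
# Model ranks them the wrong way around: large loss, strong corrective gradient.
print(-F.logsigmoid(torch.tensor(-2.0)))  # ~2.127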
During training, for each triplet (x,yw,yl) in a batch, the reward model performs two forward passes: one for (x,yw) to get rθ(x,yw) and one for (x,yl) to get rθ(x,yl). The loss is computed based on the difference, and gradients are backpropagated to update the model parameters θ.
Here's a conceptual sketch of the training step using PyTorch-like pseudocode:
import torch
import torch.nn.functional as F
# Assume:
# reward_model: Model taking tokenized input_ids and attention_mask, returning an output with a 'rewards' tensor (one scalar per example).
# tokenizer: The tokenizer corresponding to the reward_model base.
# optimizer: An optimizer like AdamW.
# dataloader: Provides batches of {'prompt': [...], 'chosen': [...], 'rejected': [...]}
def train_reward_model_step(batch, reward_model, tokenizer, optimizer, device):
    """Performs a single training step for the reward model."""
    prompts = batch['prompt']
    chosen_responses = batch['chosen']
    rejected_responses = batch['rejected']

    # Prepare inputs for chosen and rejected responses
    chosen_texts = [p + tokenizer.sep_token + r + tokenizer.eos_token for p, r in zip(prompts, chosen_responses)]
    rejected_texts = [p + tokenizer.sep_token + r + tokenizer.eos_token for p, r in zip(prompts, rejected_responses)]

    # Tokenize (handle padding and truncation appropriately)
    chosen_encodings = tokenizer(chosen_texts, padding=True, truncation=True, return_tensors="pt").to(device)
    rejected_encodings = tokenizer(rejected_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    # Forward passes to get rewards
    # Assuming reward_model returns an output whose 'rewards' attribute holds one scalar per example
    rewards_chosen = reward_model(**chosen_encodings).rewards      # Shape: (batch_size, 1)
    rewards_rejected = reward_model(**rejected_encodings).rewards  # Shape: (batch_size, 1)

    # Squeeze the last dimension so the rewards can be compared elementwise
    rewards_chosen = rewards_chosen.squeeze(-1)      # Shape: (batch_size,)
    rewards_rejected = rewards_rejected.squeeze(-1)  # Shape: (batch_size,)

    # Calculate pairwise loss
    # loss = -log(sigmoid(chosen_reward - rejected_reward))
    loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

    # Optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()
# --- Training Loop ---
# for epoch in range(num_epochs):
# for batch in dataloader:
# loss_val = train_reward_model_step(batch, reward_model, tokenizer, optimizer, device)
# # Log loss_val, handle checkpoints, etc.
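Once training converges, the model can be used to score new prompt/response pairs. The helper below is a sketch under the same assumptions as above (the function name score_response is hypothetical; it reuses the concatenated input format and the .rewards interface):

@torch.no_grad()
def score_response(prompt, response, reward_model, tokenizer, device):
    """Return the scalar reward r_theta(x, y) for one prompt/response pair."""
    reward_model.eval()
    text = prompt + tokenizer.sep_token + response + tokenizer.eos_token
    encoding = tokenizer(text, truncation=True, return_tensors="pt").to(device)
    reward = reward_model(**encoding).rewards  # Shape: (1, 1)
    return reward.item()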
This training process yields a reward model rθ(x,y) that ideally captures the nuances of human preference regarding the helpfulness, honesty, and harmlessness of LLM responses. Once sufficiently trained and evaluated, this model becomes a crucial component in the next stage: fine-tuning the LLM policy using reinforcement learning algorithms like PPO, where rθ(x,y) serves as the reward signal.