Alright, let's translate the theory of reward modeling into practice. This section provides a hands-on guide to training your own Reward Model (RM) using human preference data. We'll leverage common libraries like Hugging Face's Transformers and TRL (Transformer Reinforcement Learning) to streamline the process. The goal is to build a model that takes a prompt and a response and outputs a scalar score representing how likely a human is to prefer that response.
Before writing code, ensure your environment is set up with PyTorch or TensorFlow, the Transformers library, Datasets, and TRL.
pip install torch transformers datasets trl accelerate bitsandbytes
(Replace torch with tensorflow if you use TensorFlow elsewhere in your workflow, but note that TRL itself is built on PyTorch, so this section assumes a PyTorch environment.)
We'll work with a dataset structured around pairwise preferences. A typical entry contains:
- A prompt.
- A chosen response (the one preferred by humans).
- A rejected response (the one dispreferred).

Datasets like Anthropic's HH-RLHF or subsets available on the Hugging Face Hub (e.g., trl-internal-testing/hh-rlhf-trl-style) follow this structure. For this example, let's assume we have loaded such a dataset into a Hugging Face Dataset object.
from datasets import load_dataset
# Load a sample dataset (replace with your actual dataset)
# This example uses a small subset for demonstration
dataset = load_dataset("trl-internal-testing/hh-rlhf-trl-style", split="train[:1%]")
# Explore the structure
print(dataset[0])
# Expected output structure: {'prompt': '...', 'chosen': '...', 'rejected': '...'}
The RM typically uses a pre-trained language model backbone (e.g., distilbert-base-uncased, roberta-base, or even larger models depending on your needs and resources) with a regression head. This head is usually a single linear layer added on top of the base model's output. It maps the final hidden state representation of the input sequence (prompt + response) to a single scalar value: the reward score.
Diagram illustrating the Reward Model architecture. Input prompt and response are processed by a pre-trained language model backbone, and a linear head outputs a single scalar reward score.
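To make this concrete, here is a minimal conceptual sketch of that structure. The class name and first-token pooling are illustrative choices, not part of the recipe below; later in this section we get the same backbone-plus-linear-head shape directly from AutoModelForSequenceClassification with num_labels=1.

import torch.nn as nn
from transformers import AutoModel

class SimpleRewardModel(nn.Module):
    """Conceptual sketch: a pre-trained backbone plus a single scalar head."""

    def __init__(self, backbone_name="distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # One linear layer maps the sequence representation to a scalar reward
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Use the first token's final hidden state as the sequence representation
        pooled = outputs.last_hidden_state[:, 0]
        return self.reward_head(pooled)  # Shape: (batch_size, 1)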
The RM needs to process both the chosen and rejected responses in the context of the prompt. We format the input as prompt + response and tokenize it. Since the model processes one pair (chosen, rejected) at a time to compute the loss, we need to tokenize both variations for each example.
The TRL library provides utilities that simplify this, but let's understand the core idea: we need a function that takes a data entry and returns tokenized versions for both the chosen and rejected paths.
from transformers import AutoTokenizer
import torch
# Choose a base model for your RM
model_name = "distilbert-base-uncased" # Use a small model for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Ensure padding token is set if not already present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
def preprocess_function(examples):
    # Tokenize pairs of (prompt + chosen_response) and (prompt + rejected_response).
    # With batched=True each field arrives as a list, so concatenate element-wise.
    chosen_texts = [p + c for p, c in zip(examples['prompt'], examples['chosen'])]
    rejected_texts = [p + r for p, r in zip(examples['prompt'], examples['rejected'])]
    tokenized_chosen = tokenizer(
        chosen_texts,
        truncation=True,
        padding="max_length",  # Or 'longest' if using dynamic padding
        max_length=512,        # Adjust max_length as needed
    )
    tokenized_rejected = tokenizer(
        rejected_texts,
        truncation=True,
        padding="max_length",  # Or 'longest'
        max_length=512,
    )
    # The RewardTrainer expects columns named 'input_ids_chosen', 'attention_mask_chosen', etc.
    return {
        'input_ids_chosen': tokenized_chosen['input_ids'],
        'attention_mask_chosen': tokenized_chosen['attention_mask'],
        'input_ids_rejected': tokenized_rejected['input_ids'],
        'attention_mask_rejected': tokenized_rejected['attention_mask'],
    }
# Apply preprocessing
# Use remove_columns to keep only what the RewardTrainer needs
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)
print("Sample tokenized features:", tokenized_dataset[0].keys())
# Expected: dict_keys(['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'])
The TRL library offers a convenient RewardTrainer class, analogous to the standard Transformers Trainer, but specifically designed for RM training using the pairwise preference loss. The workflow has four steps:
1. Load the base model with num_labels=1 for the scalar reward output.
2. Configure the run via TrainingArguments (or TRL's RewardConfig).
3. Instantiate the RewardTrainer with the model, tokenizer, and tokenized dataset.
4. Call its train() method.
from transformers import AutoModelForSequenceClassification, TrainingArguments
from trl import RewardTrainer, RewardConfig
# 1. Load the model
# Use AutoModelForSequenceClassification, as the RM head is similar to a classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
# Optionally move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# 2. Configure Training Arguments (RewardConfig inherits from TrainingArguments)
# Using RewardConfig for specific RM settings, though TrainingArguments works too.
training_args = RewardConfig(
    output_dir="./reward_model_output",
    num_train_epochs=1,             # Adjust epochs based on dataset size and convergence
    per_device_train_batch_size=4,  # Adjust based on GPU memory
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    report_to="none",               # Disable wandb/tensorboard reporting for simplicity
    remove_unused_columns=False,    # Columns were already handled in preprocessing
    evaluation_strategy="no",       # Add an evaluation dataset and strategy if needed
    save_strategy="epoch",
    logging_steps=10,               # Log training loss every 10 steps
    max_length=512,                 # Important: must match the preprocessing max_length
)
# 3. Instantiate RewardTrainer
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Pass eval_dataset here if you have one
    # peft_config=None,  # Optional: configure PEFT (e.g., LoRA) here
)
# 4. Train the model
print("Starting reward model training...")
train_results = trainer.train()
print("Training finished.")
# Save the final model
trainer.save_model("./reward_model_final")
tokenizer.save_pretrained("./reward_model_final")
print("Model and tokenizer saved to ./reward_model_final")
Behind the scenes, the RewardTrainer implements the loss function discussed earlier. For each pair in the batch, it computes a scalar score for the prompt + chosen sequence and for the prompt + rejected sequence, then minimizes $-\log \sigma\big(\text{RM}(\text{prompt}, \text{chosen}) - \text{RM}(\text{prompt}, \text{rejected})\big)$, which pushes the model to assign a higher score to the chosen response than to the rejected one.

While we skipped evaluation in the example for brevity, it matters in practice. A common metric is accuracy: given a held-out set of preference pairs (prompt, chosen, rejected), how often does the trained RM correctly assign a higher score to the chosen response?
$$\text{Accuracy} = \frac{1}{|\text{EvalSet}|} \sum_{(\text{prompt},\, c,\, r) \in \text{EvalSet}} \mathbb{I}\big[\text{RM}(\text{prompt}, c) > \text{RM}(\text{prompt}, r)\big]$$

where $\mathbb{I}[\cdot]$ is the indicator function (1 if true, 0 if false).
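To see what this metric means concretely, here is a minimal sketch that scores a handful of pairs with the trained model and tokenizer from above. Using a slice of the training set as a stand-in for a proper held-out split is only for illustration.

import torch

def rm_score(prompt, response):
    # Score a single (prompt + response) sequence with the trained RM
    inputs = tokenizer(
        prompt + response,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

eval_pairs = dataset.select(range(50))  # Stand-in for a real held-out preference set
correct = sum(
    rm_score(ex['prompt'], ex['chosen']) > rm_score(ex['prompt'], ex['rejected'])
    for ex in eval_pairs
)
print(f"Pairwise accuracy: {correct / len(eval_pairs):.3f}")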
You can implement this by creating an eval_dataset using the same preprocess_function, passing it to the RewardTrainer, and setting evaluation_strategy in TrainingArguments. The trainer will then report the accuracy during training.
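For example, if your preference dataset also exposes a held-out split (the test split below is an assumption; adjust the name to your data), the earlier setup changes roughly like this:

# Hypothetical evaluation setup, reusing preprocess_function from above
eval_split = load_dataset("trl-internal-testing/hh-rlhf-trl-style", split="test[:1%]")
eval_dataset = eval_split.map(
    preprocess_function,
    batched=True,
    remove_columns=eval_split.column_names,
)

training_args = RewardConfig(
    output_dir="./reward_model_output",
    evaluation_strategy="steps",     # Evaluate periodically during training
    eval_steps=50,
    per_device_eval_batch_size=4,
    remove_unused_columns=False,
    max_length=512,
)

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=eval_dataset,       # RewardTrainer reports pairwise accuracy on this set
)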
Monitor the training loss and evaluation accuracy: the loss should decrease steadily, and the accuracy should climb well above 0.5, the chance level for pairwise comparisons.
Illustrative plot showing decreasing training loss and increasing evaluation accuracy during reward model training.
You now have a trained Reward Model saved to disk. This model encapsulates the learned human preferences from your dataset. It's ready to serve as the objective function in the next stage of the RLHF pipeline: fine-tuning the language model policy using Reinforcement Learning (specifically, PPO in our case). The scores generated by this RM will guide the policy model towards generating responses that align better with human expectations.
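Once saved, the RM can be loaded like any sequence classification model and used to score new responses. Here is a minimal sketch, assuming the files saved to ./reward_model_final above; the prompt and response strings are purely illustrative.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

rm_tokenizer = AutoTokenizer.from_pretrained("./reward_model_final")
rm_model = AutoModelForSequenceClassification.from_pretrained("./reward_model_final")
rm_model.eval()

prompt = "How can I make my Python script run faster?"
response = "Profile it first to find the real bottleneck, then optimize only that part."

# Score the (prompt + response) sequence formatted the same way as during training
inputs = rm_tokenizer(prompt + response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    reward = rm_model(**inputs).logits[0].item()

print(f"Reward score: {reward:.4f}")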