Alright, let's put the pieces together. Having explored the individual stages of Supervised Fine-Tuning (SFT), Reward Modeling (RM), and RL Fine-Tuning with PPO, this practical exercise focuses on integrating them into a functional, albeit simplified, end-to-end loop. The goal here isn't to achieve state-of-the-art alignment but to demonstrate the mechanics of how these components interact within a single training process, using tools like the Hugging Face TRL library.
We assume you have access to:

- A small pre-trained language model (e.g., gpt2 or distilbert-base-uncased). For simplicity, we might even use the base model directly as our starting "SFT" policy in this minimal example.
- A reward model (RM) that takes a (query, response) pair (or just the response, depending on its training) and outputs a scalar score.
- The transformers, torch (or tensorflow), and trl libraries installed.

First, we need to load our models and configure the PPO trainer. We'll use placeholder names; replace them with your actual model paths or Hugging Face identifiers.
import torch
from transformers import AutoTokenizer, pipeline
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# 1. Configuration - Minimal PPO settings for demonstration
ppo_config = PPOConfig(
    model_name="gpt2",             # Or your specific SFT model path
    learning_rate=1.41e-5,
    batch_size=4,                  # Small batch size for illustration
    mini_batch_size=2,
    gradient_accumulation_steps=1,
    log_with="tensorboard",        # Optional: for logging
    kl_penalty="kl",               # Use KL penalty
    target_kl=0.1,                 # Target KL divergence
    init_kl_coef=0.2,              # Initial KL coefficient
    adap_kl_ctrl=True,             # Use adaptive KL control
    ppo_epochs=4,                  # Number of optimization epochs per batch
    seed=0,
)
# 2. Load Models and Tokenizer
# Policy Model (Actor/Critic): Initialize from SFT/base model
# AutoModelForCausalLMWithValueHead combines LM head and value head
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
# Reference Model (for KL divergence): Keep a copy of the initial policy
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ppo_config.model_name)
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
# Ensure pad token is set for the tokenizer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Device holding the policy parameters; used below for reward scores and query tensors
device = next(policy_model.parameters()).device
# Reward Model (RM): Load separately, assuming it's a text-classification style pipeline
# Replace with your actual RM loading mechanism
# Example: Using a pipeline for simplicity
reward_model_name = "path/to/your/reward/model" # Replace!
# Note: This might require a custom pipeline or direct model loading depending on your RM
try:
    # Simplistic example assuming a compatible sentiment/reward pipeline
    reward_pipe = pipeline("text-classification", model=reward_model_name, device=device)
    print("Reward model loaded via pipeline.")

    # Define a function to get the scalar score
    def get_reward_score(texts):
        # Process texts, potentially format them as (query, response) if needed
        # This depends highly on your reward model's input format
        # Assuming RM outputs a list of dicts like [{'label': 'POSITIVE', 'score': 0.9}]
        results = reward_pipe(texts, return_all_scores=True)  # Adapt based on your pipeline
        # Extract the desired score (e.g., score for "POSITIVE" or a specific score index)
        # This extraction logic is highly dependent on your RM's output
        scores = []
        for result in results:
            # Example: find score for a specific label, or assume first score is reward
            # Adjust this logic based on your reward model structure
            score = 0.0  # Default score
            if isinstance(result, list):  # Handle varying pipeline outputs
                for label_score in result:
                    if label_score['label'] == 'POSITIVE':  # Example label
                        score = label_score['score']
                        break
            elif isinstance(result, dict):
                score = result.get('score', 0.0)  # Simplistic fallback
            scores.append(torch.tensor(score, device=device))
        return scores
except Exception as e:
    print(f"Warning: Could not load reward model pipeline '{reward_model_name}'. Using dummy rewards. Error: {e}")

    # Fallback to a dummy reward function if RM loading fails
    def get_reward_score(texts):
        # Dummy reward: score based on length (just for demonstration)
        return [torch.tensor(len(text) / 100.0, device=device) for text in texts]
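# Alternative sketch (not part of the original example): if your RM was trained with a
# sequence-classification head, you could skip the pipeline and score texts directly.
# Everything below is illustrative and assumes a single-logit reward head; adapt it to your RM.
#
# from transformers import AutoModelForSequenceClassification
# rm_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
# rm_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name).to(device)
# rm_model.eval()
#
# def get_reward_score(texts):
#     inputs = rm_tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
#     with torch.no_grad():
#         logits = rm_model(**inputs).logits  # shape: (batch, num_labels)
#     # Treat the first (or only) logit as the scalar reward for each text
#     return [logits[i, 0] for i in range(len(texts))]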
# 3. Initialize PPOTrainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    # Dataset can be omitted for this manual loop example
    # value_model requires separate setup if not using AutoModelForCausalLMWithValueHead
)
print("Setup complete. Ready for the simplified RLHF loop.")
Now, let's execute a few steps of the RLHF loop. We'll manually provide queries, generate responses, get rewards, and perform the PPO update.
# Define some example queries
queries = [
"Explain the concept of KL divergence in simple terms:",
"Write a short poem about a robot learning:",
"What are the main stages of RLHF?",
"Suggest a name for a friendly AI assistant:",
]
# Tokenize queries as 1-D tensors, the form expected by ppo_trainer.generate and ppo_trainer.step
query_tensors = [tokenizer.encode(q, return_tensors="pt").squeeze(0).to(device) for q in queries]
# Generation settings for the policy model
generation_kwargs = {
"min_length": -1, # Allow stopping early
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.pad_token_id,
"max_new_tokens": 64, # Limit response length for demonstration
}
# Run a few PPO steps (e.g., 2 steps)
num_steps = 2
for step in range(num_steps):
print(f"\n--- PPO Step {step + 1} ---")
# 1. Rollout: Generate responses from the policy model
response_tensors = []
for query_tensor in query_tensors:
# Generate response; response includes query and generated part
response = ppo_trainer.generate(query_tensor.squeeze(0), **generation_kwargs)
response_tensors.append(response.squeeze())
# Decode responses for reward calculation and logging
decoded_responses = [tokenizer.decode(r.squeeze(), skip_special_tokens=True) for r in response_tensors]
# 2. Reward Calculation: Score the generated responses using the RM
# Format texts for the reward model if necessary (e.g., combining query + response)
# This example assumes RM scores the full generated text including the prompt
reward_texts = decoded_responses
rewards = get_reward_score(reward_texts) # List of tensor scalars
# 3. PPO Optimization Step
# Prepare inputs for ppo_trainer.step
# query_tensors need to be List[torch.Tensor]
# response_tensors need to be List[torch.Tensor]
# rewards need to be List[torch.Tensor] (scalar reward per sample)
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
# 4. Logging
print(f"Query examples: {[q[:50] + '...' for q in queries]}")
print(f"Response examples: {[r[len(q):][:80] + '...' for q, r in zip(queries, decoded_responses)]}")
print(f"Mean reward: {torch.mean(torch.stack(rewards)).item():.4f}")
if 'ppo/kl' in stats:
print(f"KL Divergence: {stats['ppo/kl']:.4f}")
if 'ppo/loss/policy' in stats:
print(f"Policy Loss: {stats['ppo/loss/policy']:.4f}")
if 'ppo/loss/value' in stats:
print(f"Value Loss: {stats['ppo/loss/value']:.4f}")
# Optional: Log detailed stats if using a logger like TensorBoard
# ppo_trainer.log_stats(stats, queries, response_tensors, rewards)
print("\nSimplified RLHF loop finished.")
The following diagram illustrates the flow of data within one iteration of this simplified loop:
Data flow in a single PPO step for RLHF. Queries prompt the policy model to generate responses, which are then scored by the reward model. The PPO trainer uses queries, responses, and reward scores to compute losses and update the policy model's parameters.
In this simplified execution, you should observe the following: the ppo_trainer.step function returns a dictionary of statistics. Monitor the mean reward printed each step; ideally, it should trend upwards if the policy learns to generate responses favored by the RM. Keep an eye on the KL divergence (objective/kl) to ensure the policy doesn't deviate too drastically from the original reference model, preventing collapse. The policy and value losses (ppo/loss/policy, ppo/loss/value) indicate the optimization progress.

This practical exercise strips down the RLHF process to its core loop: generate, score, update. It highlights the interaction points between the policy model, the reward model, and the PPO algorithm managed by the trainer. While real-world RLHF involves much larger scales, sophisticated data handling, careful hyperparameter tuning, and distributed training, this hands-on example provides a tangible feel for the underlying mechanism connecting the components discussed throughout this chapter. Building upon this foundation, you can scale up the implementation, integrate proper dataset handling (a sketch follows below), and refine the configuration for more substantial alignment tasks.