Reinforcement Learning from Human Feedback (RLHF) provides a structured approach to steer Large Language Models (LLMs) towards desired behaviors, such as helpfulness, honesty, and harmlessness, by incorporating human judgments directly into the training loop. While the concept is straightforward, the implementation involves a multi-stage pipeline that combines elements of supervised learning, preference modeling, and reinforcement learning. Understanding this workflow is fundamental to applying RLHF effectively.
The typical RLHF pipeline, illustrated below, consists of three primary stages, although variations exist:
Figure: High-level overview of the RLHF pipeline stages and their interactions. SFT is often used but optional. The SFT model frequently serves as the reference policy ($\pi_{\text{ref}}$) during RL fine-tuning.
Let's examine each stage in more detail.
The process typically begins with a capable pre-trained LLM. This model has learned general language patterns from vast amounts of text data but may not be specifically adapted to follow instructions or exhibit desired behaviors like helpfulness.
Often, an intermediate step called Supervised Fine-Tuning (SFT) is performed. In this stage, the pre-trained model is fine-tuned on a smaller, high-quality dataset of input prompts and desired outputs (demonstrations). This dataset is curated to exemplify the target behavior (e.g., helpful instruction following, specific conversational styles).
The SFT stage aims to adapt the model to the distribution of prompts it will likely encounter and teach it the basic format and style of desired responses. The resulting model from this stage, let's call it $\pi_{\text{SFT}}$, often serves as the starting point for the subsequent RLHF stages. Specifically, it becomes the initial policy for RL fine-tuning and the reference model ($\pi_{\text{ref}}$) used for regularization during RL training. While beneficial, SFT is not strictly mandatory for RLHF; one could potentially start reward modeling and RL fine-tuning directly from the base pre-trained model.
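To make the SFT step concrete, here is a minimal sketch of supervised fine-tuning with a standard causal language modeling loss. The checkpoint name ("gpt2" as a stand-in for the base model), the tiny in-memory demonstration list, and the single-example loop are illustrative assumptions; a real pipeline would use a proper dataset, batching, prompt masking, and a training framework.

```python
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the pre-trained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

# Hypothetical (prompt, desired response) demonstration pairs.
demonstrations = [
    ("Summarize: The cat sat on the mat.", "A cat rested on a mat."),
]

model.train()
for prompt, response in demonstrations:
    # Standard causal-LM objective over the concatenated prompt + response tokens.
    # (Many implementations mask the prompt tokens out of the loss; omitted here.)
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The resulting checkpoint would then play the role of $\pi_{\text{SFT}}$ in the later stages.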
The core idea of RLHF is to learn human preferences. This is achieved by training a separate model, the reward model ($r_\theta$), to predict which LLM output a human would prefer.
Data Collection: Prompts $x$ are sampled (often from the same distribution used for SFT), the current model generates two or more candidate responses per prompt, and human annotators compare the candidates, indicating which one they prefer. This produces a dataset $\mathcal{D}$ of comparisons $(x, y_w, y_l)$, where $y_w$ is the preferred (winning) response and $y_l$ the less preferred (losing) one.
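As a concrete illustration, a single comparison record might look like the following. The field names are hypothetical; real preference datasets use a variety of schemas.

```python
# One preference (comparison) record: a prompt, the annotator-preferred response (y_w),
# and the less preferred response (y_l).
preference_example = {
    "prompt": "Explain in one sentence why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths "
              "scatter the most, so the sky appears blue.",          # y_w
    "rejected": "The sky is blue because it reflects the ocean.",    # y_l
}
```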
Model Architecture: The reward model typically shares the same architecture as the LLM being trained (or uses the base pre-trained architecture). The final layer, however, is modified to output a single scalar value representing the predicted reward (preference score) instead of token probabilities.
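A minimal sketch of such an architecture, assuming a Hugging Face transformers backbone, right-padded inputs, and the last non-padding token's hidden state as the sequence summary (other pooling choices are possible):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """A transformer backbone with its token head replaced by a scalar value head."""

    def __init__(self, base_model_name: str):
        super().__init__()
        # Backbone typically shares the architecture of the LLM being aligned.
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        # Single scalar output instead of a vocabulary-sized softmax.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Use the hidden state of the final non-padding token (assumes right padding).
        last_token_idx = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_token_idx]
        return self.value_head(summary).squeeze(-1)  # shape: (batch,)
```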
Training Objective: The reward model $r_\theta(x, y)$ is trained to assign a higher score to the preferred response $y_w$ than to the less preferred response $y_l$. A common approach uses a loss function derived from the Bradley-Terry model, which models the probability that $y_w$ is preferred over $y_l$:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

where $\sigma$ is the sigmoid function. The model is trained by minimizing the negative log-likelihood of the human preference labels over the dataset:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$

This loss encourages the reward model $r_\theta$ to widen the gap between the scores of the winning and losing responses.
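This loss is straightforward to implement; below is a minimal PyTorch sketch, assuming the reward model produces one scalar per sequence as in the class sketched above.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    # logsigmoid is numerically more stable than log(sigmoid(...)).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example usage with hypothetical batch tensors from the RewardModel above:
# r_w = reward_model(chosen_ids, chosen_mask)        # scores for preferred responses
# r_l = reward_model(rejected_ids, rejected_mask)    # scores for rejected responses
# loss = bradley_terry_loss(r_w, r_l)
# loss.backward()
```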
The quality and calibration of this reward model are significant factors influencing the success of the entire RLHF process. We'll discuss challenges in reward modeling later in this chapter.
With a trained reward model $r_\theta$ intended to capture human preferences, the final stage uses reinforcement learning to fine-tune the LLM policy $\pi_\phi$ (initialized from $\pi_{\text{SFT}}$) to maximize the rewards assigned by $r_\theta$.
The RL Setup: The policy $\pi_\phi$ is the LLM being fine-tuned (initialized from $\pi_{\text{SFT}}$). A prompt $x$ sampled from the prompt dataset acts as the initial state, the generated response $y$ (produced token by token) is the action sequence, and the trained reward model $r_\theta(x, y)$ supplies a scalar reward once the full response has been generated.
Optimization Process: An RL algorithm, most commonly Proximal Policy Optimization (PPO), is used to update the policy parameters $\phi$. Each iteration generally involves sampling a batch of prompts, generating responses with the current policy $\pi_\phi$, scoring each prompt-response pair with $r_\theta$, combining that score with the KL penalty described below, and performing one or more PPO update steps on $\phi$ before repeating with a fresh batch.
PPO Objective with KL Penalty: A critical aspect of RLHF is preventing the policy $\pi_\phi$ from deviating too drastically from the original SFT model $\pi_{\text{SFT}}$ (or the base pre-trained model if SFT was skipped). Over-optimizing solely for the learned reward $r_\theta$ can lead to generating nonsensical or repetitive text that happens to score highly (a phenomenon known as "reward hacking") or to catastrophic forgetting of general language capabilities. To mitigate this, a penalty term based on the Kullback-Leibler (KL) divergence between the current policy $\pi_\phi$ and the reference policy $\pi_{\text{ref}}$ (usually $\pi_{\text{SFT}}$) is added to the reward. The objective optimized by PPO in RLHF is typically:

$$\text{Objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi}}\Big[r_\theta(x, y) - \beta\, \mathrm{KL}\big(\pi_\phi(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big)\Big]$$

Here, $D_{\pi_\phi}$ is the distribution of prompt-response pairs obtained by sampling prompts $x$ and generating responses $y \sim \pi_\phi(y \mid x)$. The coefficient $\beta$ controls the strength of the KL penalty. This term encourages the policy $\pi_\phi$ to maximize the reward predicted by $r_\theta$ while staying relatively close to the behavior of the reference model $\pi_{\text{ref}}$, preserving language fluency and mitigating reward hacking.
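As a sketch of how this penalty is often applied in practice, the snippet below estimates the KL term from the log-probabilities of the sampled response tokens under the current policy and the frozen reference model, then subtracts it from the reward-model score. The tensor shapes and the simple sum-over-tokens estimator are assumptions for illustration; real implementations frequently apply the penalty per token and fold it into the advantage computation.

```python
import torch
import torch.nn.functional as F

def kl_shaped_rewards(policy_logits: torch.Tensor,
                      ref_logits: torch.Tensor,
                      response_ids: torch.Tensor,
                      scalar_reward: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Per-sequence reward with a KL penalty toward the reference policy.

    Assumed shapes (for illustration only):
      policy_logits, ref_logits: (batch, seq_len, vocab)
      response_ids:              (batch, seq_len)  sampled response tokens
      scalar_reward:             (batch,)          reward model scores r_theta(x, y)
    """
    logp_policy = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    # Gather log-probabilities of the tokens that were actually sampled.
    taken = response_ids.unsqueeze(-1)
    lp_pi = logp_policy.gather(-1, taken).squeeze(-1)   # (batch, seq_len)
    lp_ref = logp_ref.gather(-1, taken).squeeze(-1)     # (batch, seq_len)
    # Sample-based KL estimate: sum of (log pi_phi - log pi_ref) over the response.
    kl_estimate = (lp_pi - lp_ref).sum(dim=-1)          # (batch,)
    return scalar_reward - beta * kl_estimate
```

The shaped rewards returned here would then feed into the PPO update in place of the raw reward-model scores.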
The output of this stage is the final aligned LLM, $\pi_{\phi^*}$, which has been tuned to generate responses that are preferred according to the learned reward model, balanced by the need to maintain coherence and stay close to the initial capabilities learned during pre-training and SFT.
This three-stage process (SFT -> Reward Modeling -> RL Fine-Tuning) forms the backbone of many successful RLHF implementations. Subsequent sections will explore the details, challenges, and variations associated with each stage, particularly reward modeling and PPO optimization.