While Supervised Fine-Tuning (SFT) provides a strong foundation for instruction following, aligning a model more closely with complex human preferences often requires Reinforcement Learning from Human Feedback (RLHF). The RLHF process involves training a model based on which outputs humans judge to be better, allowing for optimization towards qualities like helpfulness and harmlessness that are difficult to capture solely through supervised examples. The standard RLHF pipeline typically consists of three main stages, building upon each other:
- Initial Model Preparation (Pre-training & SFT): Start with a capable pre-trained language model. This model is then fine-tuned using supervised learning on a dataset of high-quality prompt-response pairs (as discussed in Chapter 25). This SFT step adapts the model to follow instructions and generate responses in the desired style and format, and the resulting SFT model serves as the starting point for the subsequent RLHF stages.
- Reward Model (RM) Training: The core idea is to learn a model that can predict human preferences.
- Data Collection: Select a set of diverse prompts. For each prompt, generate several responses using the SFT model (or multiple model variants). Human labelers are then presented with pairs (or sometimes more) of these responses and asked to choose which one they prefer based on criteria like helpfulness, accuracy, or safety.
- RM Architecture: The reward model (RM) is typically another language model (often initialized from the SFT model or a smaller pre-trained model) whose final layer is replaced with a linear layer outputting a single scalar value representing the predicted preference score. It takes a prompt and a response as input and outputs this score.
- Training Objective: The RM is trained on the collected pairwise preference data. Given a prompt $p$ and two responses $y_w$ (winner) and $y_l$ (loser), where $y_w$ was preferred over $y_l$ by a human, the RM is trained to assign a higher score to $y_w$ than to $y_l$. A common loss function is the pairwise ranking loss:

$$\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(p,\,y_w,\,y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(p, y_w) - r_\theta(p, y_l)\right)\right)\right]$$

Here, $r_\theta(p, y)$ is the scalar score output by the reward model with parameters $\theta$ for prompt $p$ and response $y$, $\sigma$ is the sigmoid function, and $D$ is the dataset of human preferences. This objective maximizes the probability that the preferred response $y_w$ receives a higher score.
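To make the RM architecture and training objective concrete, here is a minimal PyTorch sketch: a scalar value head placed on top of an assumed transformer backbone, together with the pairwise ranking loss above. The backbone interface, hidden size, and tensor shapes are illustrative assumptions, not a specific library's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Minimal reward model sketch: an encoder backbone plus a scalar value head.

    `backbone` is assumed to map (input_ids, attention_mask) to hidden states of
    shape [batch, seq_len, hidden_size]; any transformer with that interface works.
    """

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        # The usual LM head is replaced by a single linear layer producing one scalar.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids, attention_mask)       # [batch, seq, hidden] (assumed)
        # Score each sequence using the hidden state of its last non-padding token.
        last_token_idx = attention_mask.long().sum(dim=1) - 1   # [batch]
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_token_idx]         # [batch, hidden]
        return self.value_head(last_hidden).squeeze(-1)         # [batch] scalar scores


def pairwise_ranking_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """L_RM = -E[log sigmoid(r(p, y_w) - r(p, y_l))], averaged over the batch.

    `chosen_scores` holds r_theta(p, y_w) and `rejected_scores` holds r_theta(p, y_l)
    for the same prompts; logsigmoid is used for numerical stability.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Each training step scores the preferred and rejected responses for a batch of prompts with the same RM and takes a gradient step on this loss. Note that the objective only pushes the score gap $r_\theta(p, y_w) - r_\theta(p, y_l)$ to be positive; the scores themselves have no absolute target values.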
- Reinforcement Learning (RL) Fine-Tuning: The trained RM now acts as a proxy for human preferences, providing reward signals to fine-tune the SFT model further using an RL algorithm, typically Proximal Policy Optimization (PPO).
- RL Setup: The SFT model acts as the initial policy ($\pi_{\mathrm{SFT}}$) in the RL framework. The action space consists of the tokens the model can generate, and the state is the prompt together with the tokens generated so far.
- Optimization Loop: The process iteratively samples prompts $p$ from the dataset. The current RL policy ($\pi_{\mathrm{RL}}$) generates a response $y$, and the RM assigns a reward $r = r_\theta(p, y)$ to the generated response. The PPO algorithm then uses this reward signal to update the weights of the policy $\pi_{\mathrm{RL}}$.
- KL Divergence Penalty: A significant challenge in RL fine-tuning is that the policy might learn to generate outputs that maximize the RM score but deviate significantly from the original SFT model's distribution, potentially leading to nonsensical or repetitive text ("reward hacking"). To mitigate this, PPO in RLHF incorporates a Kullback-Leibler (KL) divergence penalty term into the reward function or objective. The objective aims to maximize the RM score while staying close to the initial SFT policy:
$$\text{Objective} = \mathbb{E}_{(p,\,y)\sim \pi_{\mathrm{RL}}}\left[r_\theta(p, y) - \beta \cdot \mathrm{KL}\!\left(\pi_{\mathrm{RL}}(\cdot \mid p)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid p)\right)\right]$$

The term $\mathrm{KL}\left(\pi_{\mathrm{RL}}(\cdot \mid p)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid p)\right)$ measures the divergence between the token distributions predicted by the RL policy and the original SFT policy for prompt $p$. The hyperparameter $\beta$ controls the strength of this penalty, preventing the RL policy from straying too far from the learned distribution of the SFT model and thus maintaining coherence and language quality.
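As an illustration of how this penalty is typically applied, the sketch below combines the RM score with an estimate of the KL term computed from per-token log-probabilities of the sampled response under the RL policy and the frozen SFT policy. Function and argument names are illustrative, not taken from a specific library.

```python
import torch


def kl_penalized_reward(
    rm_score: torch.Tensor,         # scalar r_theta(p, y) for the full response
    policy_logprobs: torch.Tensor,  # log pi_RL(y_t | p, y_<t) per generated token, shape [T]
    sft_logprobs: torch.Tensor,     # log pi_SFT(y_t | p, y_<t) for the same tokens, shape [T]
    beta: float = 0.1,              # strength of the KL penalty
) -> torch.Tensor:
    """Sequence-level reward used during PPO fine-tuning.

    The KL divergence is approximated along the sampled response as the sum of
    log pi_RL(y_t) - log pi_SFT(y_t) over the generated tokens, so responses that
    drift far from the SFT model's distribution receive a lower effective reward.
    """
    kl_estimate = (policy_logprobs - sft_logprobs).sum()
    return rm_score - beta * kl_estimate
```

In many PPO implementations the penalty is distributed along the trajectory instead: each generated token receives $-\beta\,(\log \pi_{\mathrm{RL}}(y_t \mid \cdot) - \log \pi_{\mathrm{SFT}}(y_t \mid \cdot))$ as a per-token reward, with the RM score added at the final token; the aggregate quantity being optimized is the same objective shown above.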
The following diagram illustrates this multi-stage process:
The three stages of the standard RLHF pipeline: Supervised Fine-Tuning (SFT), Reward Model (RM) training based on human preferences, and Reinforcement Learning (PPO) fine-tuning guided by the RM and a KL penalty against the original SFT model.
This pipeline allows the model to learn from comparative human feedback, refining its ability to generate responses that align with desired attributes better than can be achieved through SFT alone. The following sections will detail the practical aspects of collecting preference data, training the reward model, and implementing the PPO fine-tuning step.