While supervised fine-tuning (SFT) excels at teaching models to follow specific instructions based on demonstrated examples, achieving alignment with more nuanced human preferences like helpfulness, honesty, and harmlessness often requires a different approach. These desired qualities can be difficult to capture solely through input-output examples. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play. RLHF provides a framework for optimizing language models based on preferences expressed by humans, moving beyond simple imitation towards optimizing for desired behavioral characteristics.
RLHF is a multi-stage process designed to fine-tune a language model using rewards derived from human judgments. It allows us to steer the model's behavior more directly than SFT alone. The core idea is to first train a separate 'reward model' that learns to predict which responses humans prefer, and then use this reward model within a reinforcement learning loop to update the language model itself.
The typical RLHF workflow consists of three main phases:
- Initial Model Preparation (Supervised Fine-tuning - SFT): Although not strictly part of RLHF itself, the process usually starts with a pre-trained language model that has already undergone supervised fine-tuning on a high-quality instruction dataset. This initial SFT step yields a strong baseline model, πSFT, that follows instructions reasonably well and serves as the starting point for the subsequent RLHF stages.
- Reward Model (RM) Training:
- Goal: The objective is to train a model, rϕ(x,y), parameterized by ϕ, that takes a prompt x and a generated response y as input and outputs a scalar score representing human preference for that response. A higher score indicates a more preferred response.
- Data Collection: This requires collecting human preference data. Typically, for a given prompt x, several responses (y1,y2,...,yk) are generated by the SFT model. Human labelers are then asked to rank these responses from best to worst or, more commonly, to perform pairwise comparisons, selecting the preferred response (yw, winner) over another (yl, loser). This creates a dataset D of preference tuples (x,yw,yl).
- Training: The reward model is often initialized from the SFT model's weights, with the final token prediction layer replaced by a regression head that outputs a scalar score. It is then trained on the preference dataset D. A common objective maximizes the score margin between preferred and rejected responses, using a loss based on the Bradley-Terry model for pairwise comparisons (a code sketch of this loss appears after this list):
$$L(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$
Here, σ is the sigmoid function. This loss encourages the reward model rϕ to assign a significantly higher score to the winning response yw than to the losing response yl.
- RL Fine-tuning (Policy Optimization):
- Goal: The final stage uses reinforcement learning to fine-tune the SFT language model (now considered the policy, πθ) to generate responses that maximize the expected reward predicted by the trained reward model rϕ.
- Process: The RL loop works as follows:
a. A prompt x is sampled from a dataset (often the SFT dataset or a custom prompt set).
b. The current LLM policy πθ(y∣x) generates a response y.
c. The reward model rϕ(x,y) assigns a reward score to the generated response.
d. This reward signal is used to update the parameters θ of the LLM policy πθ.
- Optimization Algorithm (PPO): Proximal Policy Optimization (PPO) is the most commonly used algorithm for this stage. Directly maximizing the reward rϕ(x,y) can lead the policy πθ to generate repetitive or nonsensical text that "hacks" the reward model (reward hacking), or to drift too far from the coherent language generation learned during pre-training and SFT. To counter this, the RLHF objective adds a term that penalizes large deviations of the current policy πθ from the original SFT policy πSFT. The objective optimized with PPO typically looks like this:
$$\text{Objective}(\theta) = \mathbb{E}_{x \sim D_{\text{prompt}},\, y \sim \pi_\theta(y \mid x)}\left[r_\phi(x, y) - \beta\, \mathrm{KL}\big(\pi_\theta(y \mid x)\,\|\,\pi_{\text{SFT}}(y \mid x)\big)\right]$$
Here, E denotes expectation, Dprompt is the distribution of prompts, KL represents the Kullback-Leibler divergence between the policies, and β is a coefficient controlling the strength of the KL penalty. This objective encourages the policy πθ to achieve high rewards from rϕ while staying relatively close to the distribution defined by the initial SFT model πSFT, thus maintaining general language capabilities and mitigating catastrophic forgetting or policy collapse. A minimal sketch of this KL-penalized reward appears after this list.
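To make the reward-model objective concrete, here is a minimal PyTorch sketch of the Bradley-Terry loss above. The function name bradley_terry_loss and the toy score tensors are illustrative assumptions; in practice the scores would come from the SFT backbone with its scalar regression head.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor,
                       rejected_scores: torch.Tensor) -> torch.Tensor:
    # L(phi) = -E[ log sigma( r_phi(x, y_w) - r_phi(x, y_l) ) ]
    # logsigmoid is used for numerical stability.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scalar scores the reward model assigned to four (prompt, winner)
# and (prompt, loser) pairs. Training pushes chosen scores above rejected ones.
chosen = torch.tensor([1.2, 0.3, 0.8, 2.1])
rejected = torch.tensor([0.4, 0.5, -0.2, 1.0])
print(bradley_terry_loss(chosen, rejected))  # smaller when the margins are larger
```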
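Similarly, the KL-penalized objective from the PPO stage can be sketched as a per-sample reward-shaping step, assuming the per-token log-probabilities of the current policy and the frozen SFT reference are available for the sampled response. The function name kl_penalized_reward, the β value, and the toy tensors are illustrative assumptions rather than any specific library's API; the single-sample log-probability ratio serves as a Monte Carlo estimate of the KL term.

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        sft_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # r_phi(x, y) - beta * KL(pi_theta || pi_SFT), with the KL term estimated
    # from the summed per-token log-probability ratio of the sampled response.
    kl_estimate = (policy_logprobs - sft_logprobs).sum()
    return rm_score - beta * kl_estimate

# Toy usage: a 5-token response whose policy log-probs have drifted slightly
# from the SFT reference; the penalty discounts the reward model's score.
rm_score = torch.tensor(1.8)
policy_lp = torch.tensor([-1.1, -0.7, -2.0, -0.4, -1.3])
sft_lp = torch.tensor([-1.3, -0.8, -1.9, -0.6, -1.5])
print(kl_penalized_reward(rm_score, policy_lp, sft_lp))
```

In many RLHF setups it is this shaped reward, rather than the raw reward-model score, that the PPO update maximizes, which is what keeps πθ close to πSFT in practice.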
The overall RLHF process can be visualized as follows:
Figure: Workflow of Reinforcement Learning from Human Feedback (RLHF), showing the progression from an initial SFT model to training a reward model based on human preferences, and finally using PPO to fine-tune the language model policy guided by the reward model.
RLHF represents a powerful technique for aligning LLMs with complex, subjective, or safety-critical human values that are hard to specify through examples alone. However, it introduces significant complexity compared to SFT. Collecting high-quality human preference data is expensive and time-consuming. The performance of the final model is highly dependent on the quality and calibration of the reward model, and the RL training process itself can be unstable and sensitive to hyperparameter choices. Issues like reward hacking, where the policy finds loopholes in the reward model, require careful monitoring and mitigation, often involving the KL penalty term in PPO.
In the context of advanced fine-tuning strategies, RLHF is often employed after initial instruction tuning (SFT) to further refine model behavior, making it more helpful, less prone to generating harmful content, and better aligned with user intentions. It complements techniques like multi-task learning by providing a mechanism for optimizing towards holistic behavioral goals defined by human preferences rather than specific task formats.