Having established the alignment challenge and the limitations of purely supervised methods, let's outline the standard Reinforcement Learning from Human Feedback (RLHF) process. This multi-stage approach is designed to steer Large Language Models (LLMs) towards generating outputs that better align with human preferences and instructions. It typically involves three distinct phases:
Supervised Fine-Tuning (SFT): This initial phase adapts a pre-trained LLM to the target domain or style using a curated dataset of high-quality prompts and desired responses. Think of this as teaching the model the basic format and style expected. The result is an "SFT model" which serves as the starting point for the subsequent reinforcement learning phase. While helpful, SFT alone often struggles to capture the full spectrum of human preferences, especially regarding complex instructions, safety, or subtle qualities like helpfulness. We discussed its limitations earlier.
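To make the SFT objective concrete, here is a minimal sketch of the loss computation, assuming a Hugging Face style causal language model whose output exposes `.logits`, and batches where prompt positions in `labels` are masked with `-100`. The function name `sft_loss` and the batch layout are illustrative, not a specific library API.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy, computed only on the desired response tokens.

    `labels` is a copy of `input_ids` with prompt positions set to -100,
    so the model is trained to reproduce the curated responses.
    """
    outputs = model(input_ids=input_ids)
    logits = outputs.logits[:, :-1, :]   # predict token t+1 from position t
    targets = labels[:, 1:]              # shift labels to align with predictions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,               # ignore masked prompt tokens
    )
```

In practice, libraries handle this shifting and masking internally, but the underlying objective is exactly this standard language-modeling loss restricted to the response tokens.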
Reward Modeling (RM): Since we want to optimize the model based on human preferences, but cannot query humans interactively during RL training due to cost and latency, we first train a separate model to predict human preferences. This is the Reward Model (RM). To train it, we collect human feedback data. Typically, humans are shown several responses generated by the SFT model (or other models) for a given prompt and asked to rank them or choose the best one. This preference data (e.g., pairs of responses where one is preferred over the other) is used to train the RM. The RM takes a prompt and a generated response as input and outputs a scalar reward score, which ideally correlates with how likely a human would prefer that response.
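The most common training objective for the RM is a pairwise (Bradley-Terry style) loss: the score of the preferred response should exceed the score of the rejected one. The sketch below assumes a hypothetical `reward_model` that maps a tokenized prompt-plus-response sequence to a single scalar per example; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss for reward model training.

    Pushes the score of the human-preferred (chosen) response above the
    score of the rejected response for the same prompt.
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen >> rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Minimizing this loss encourages the scalar scores to reflect the human ranking, which is all the downstream RL phase needs from the RM.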
RL Fine-Tuning (PPO): In the final phase, the SFT model (now acting as the initial policy) is further fine-tuned using reinforcement learning. The RL environment is straightforward: the policy receives prompts, generates responses, and the RM scores each response to produce a reward. The goal is to adjust the policy model's parameters to maximize the expected reward predicted by the RM, effectively teaching the model to generate responses that the RM scores highly (and thus, humans are likely to prefer). Proximal Policy Optimization (PPO) is commonly used here. A significant element in this phase is applying a constraint, often based on Kullback-Leibler (KL) divergence, between the updated policy and the original SFT policy. This KL penalty prevents the RL process from deviating too drastically from the SFT model's learned knowledge and distribution, helping maintain language coherence and preventing the policy from "over-optimizing" against the reward model in unrealistic ways (reward hacking). A sketch of how this penalty is folded into the reward follows below.
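A common way to apply the KL constraint is to subtract a KL-based penalty from the RM score before passing it to PPO. The sketch below is a simplified illustration, assuming per-token log-probabilities of the sampled response under the current policy and the frozen SFT model; the names `shaped_reward`, `policy_logprobs`, and `sft_logprobs` are assumptions for this example, not part of any particular library.

```python
import torch

def shaped_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Combine the RM score with a KL penalty toward the frozen SFT policy.

    rm_score:         scalar RM score per sequence, shape (batch,)
    policy_logprobs:  log-probs of sampled response tokens under the policy,
                      shape (batch, seq_len)
    sft_logprobs:     log-probs of the same tokens under the frozen SFT model
    beta:             strength of the KL penalty
    """
    # Per-token KL estimate on the sampled tokens: log pi(a) - log pi_sft(a)
    kl_per_token = policy_logprobs - sft_logprobs
    kl_penalty = beta * kl_per_token.sum(dim=-1)   # per-sequence penalty
    return rm_score - kl_penalty                   # reward passed to PPO
```

Tuning `beta` trades off reward maximization against staying close to the SFT distribution; too small a value invites reward hacking, too large a value keeps the policy effectively unchanged.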
This sequence forms the core RLHF pipeline. Each stage builds upon the previous one, starting with a generally capable model, training a preference predictor, and finally optimizing the model against that predictor.
Figure: The three primary stages of the RLHF process, Supervised Fine-Tuning (SFT), Reward Modeling (RM), and RL Fine-Tuning with PPO, showing the flow of models and data between these stages.
Subsequent chapters will detail the implementation specifics, challenges, and variations associated with each of these stages, starting with a closer look at the SFT phase. Understanding this overall structure provides the necessary context for the technical deep dives that follow.