Supervised Fine-Tuning (SFT) serves as the foundational first step in the standard Reinforcement Learning from Human Feedback (RLHF) workflow. While a pre-trained Large Language Model (LLM) possesses broad knowledge, it often lacks the specific instruction-following capabilities, desired output formatting, or conversational style needed for alignment tasks. SFT addresses this gap by adapting the base LLM using a curated dataset of high-quality prompt-response pairs, often called demonstration data.
Think of the SFT phase as giving the model its initial training on how to behave in the target context. It's not about learning preferences yet; it's about learning the basic rules of the game – how to respond to prompts, what format to use, and, where required, which persona to adopt (e.g., a helpful assistant).
The primary technical role of SFT is to produce an initial policy, typically denoted as πSFT, which serves as the starting point for the subsequent Reinforcement Learning (RL) phase. The RL algorithm, usually Proximal Policy Optimization (PPO) in this context, doesn't start optimizing from the raw pre-trained model's parameters. Instead, it begins with the parameters of the SFT model.
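To make this concrete, here is a minimal sketch of a single SFT training step under common assumptions: a Hugging Face-style causal LM, pre-tokenized prompt and response tensors, and a standard optimizer. The function name sft_step and the variable names are illustrative, not part of any specific library.

```python
# Minimal sketch of one SFT step: next-token cross-entropy on the demonstrated
# response, with prompt tokens masked out of the loss. `model` is assumed to be
# a causal LM whose forward pass returns `.logits` (as Hugging Face models do).
import torch
import torch.nn.functional as F

def sft_step(model, prompt_ids, response_ids, optimizer):
    # Concatenate prompt and response into a single sequence of shape (1, T).
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    # Labels: copy of the inputs, with prompt positions set to -100 so the
    # loss is computed only on the demonstrated response tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = -100

    logits = model(input_ids).logits  # (1, T, vocab_size)

    # Standard causal-LM objective: predict token t+1 from positions <= t.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training over the full demonstration dataset, the resulting weights
# define πSFT and are used to initialize the RL policy (and, frozen, to serve
# as the KL reference discussed later).
```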
This initialization offers significant advantages:

- The policy already produces responses in the expected format and style, so reward model scores are meaningful from the very first PPO updates.
- RL training converges faster and more sample-efficiently than it would if it started from the raw pre-trained parameters.
- Optimization is more stable, because the policy begins close to the reference behavior that the KL penalty (discussed below) anchors it to.
The diagram below illustrates how the SFT model fits into the overall process:
Figure: Flow diagram of the three RLHF stages. The SFT model (πSFT) produced in Stage 1 initializes the RL policy (π0) and often serves as the reference policy for the KL penalty in Stage 3. It can also generate initial responses for the human comparison data collected in Stage 2.
Beyond just initializing the weights, the SFT phase establishes a behavioral baseline. The demonstration data teaches the model:

- How to interpret and follow the kinds of instructions found in the target prompts.
- The expected structure and formatting of responses (see the formatting sketch below).
- The desired tone, style, or persona, such as that of a helpful assistant.
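As a concrete example of the formatting aspect, the sketch below renders one demonstration pair into the chat layout an SFT model might be trained on. It assumes a tokenizer with a defined chat template; the model path, system prompt, and message contents are illustrative placeholders, not values from this article.

```python
# Hypothetical example: turn one prompt-response demonstration into the
# rendered chat format used for SFT. Assumes the tokenizer ships a chat
# template; the model path and messages are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/base-model")  # placeholder

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize RLHF in one sentence."},
    {"role": "assistant", "content": "RLHF fine-tunes a language model using human preference feedback."},
]

# Renders the conversation with the role markers and special tokens the
# model will see both during SFT and later at inference time.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```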
Crucially, the SFT model (πSFT) also typically serves as the reference policy for the KL divergence penalty used in the PPO algorithm during the RL phase. The PPO objective function includes a term like β⋅DKL(πRL∣∣πSFT), where πRL is the policy being trained. This KL term penalizes the RL policy for diverging too far from the initial SFT policy on a per-token basis.
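As an illustration of how this penalty is commonly applied, the sketch below shapes per-token rewards with an approximate KL term computed from the two policies' log-probabilities of the sampled tokens. The inputs, the default β value, and the convention of adding the reward model score only at the final token are assumptions for the example, not a prescription.

```python
# Sketch of per-token reward shaping with a KL penalty against the SFT policy.
# Uses the common approximation KL ≈ log πRL(token) - log πSFT(token) for the
# tokens actually sampled. Inputs and beta are illustrative.
import torch

def shaped_rewards(rl_logprobs, sft_logprobs, reward_model_score, beta=0.1):
    """
    rl_logprobs:        (T,) log-probs of the sampled tokens under the RL policy
    sft_logprobs:       (T,) log-probs of the same tokens under the frozen SFT policy
    reward_model_score: scalar score from the reward model for the full response
    """
    # Treat the log-probs as fixed signals when building rewards; gradients
    # flow through the PPO loss, not through the reward itself.
    per_token_kl = (rl_logprobs - sft_logprobs).detach()  # (T,)

    # Each token is penalized by beta * KL; the scalar reward-model score is
    # typically added only at the final token of the response.
    rewards = -beta * per_token_kl
    rewards[-1] = rewards[-1] + reward_model_score
    return rewards
```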
Why is this important?

- Stability: PPO updates stay in a region where the policy still behaves sensibly, avoiding large, destructive jumps in behavior.
- Guarding against reward hacking: without the constraint, the policy can drift toward degenerate or repetitive outputs that exploit weaknesses in the reward model while scoring highly.
- Preserving quality: keeping the policy close to πSFT helps retain the fluency, formatting, and general capabilities learned during pre-training and SFT.
While not its primary role, the SFT model is often used in the data generation process for the subsequent reward modeling stage. To collect human preference data, prompts are fed to one or more models (often including the SFT model itself) to generate multiple candidate responses. Human labelers then compare these responses (e.g., choosing the better one in a pair). Therefore, having a competent SFT model facilitates the creation of relevant and diverse candidate responses needed to train an effective reward model.
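A minimal sketch of this candidate-generation step is shown below, using the Hugging Face generate API with sampling to produce several diverse completions per prompt. The model path, prompt, and sampling hyperparameters are placeholders chosen for illustration.

```python
# Sketch: sample several candidate responses from the SFT model for one prompt,
# to be compared or ranked by human labelers. Paths and sampling settings are
# illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")   # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/sft-model")

prompt = "Explain the purpose of the KL penalty in RLHF in two sentences."
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling with temperature/top-p and multiple return sequences yields diverse
# candidates for the same prompt.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=200,
    num_return_sequences=4,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```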
In summary, the SFT phase is not merely a preliminary step but a foundational component of the RLHF process. It adapts the LLM to the target domain, initializes the policy for efficient RL training, establishes the desired behavioral format and style, and provides the reference point (πSFT) essential for stable and effective PPO optimization via the KL divergence constraint. A well-executed SFT phase significantly streamlines the subsequent, more complex stages of reward modeling and reinforcement learning.