Supervised Fine-Tuning (SFT) is often the first step in adapting a pre-trained Large Language Model (LLM) towards desired behaviors. By training the model on a dataset of high-quality prompt-response pairs (demonstrations), SFT teaches the model to follow instructions, adopt a specific style, or perform tasks illustrated in the examples. It's effective for imparting foundational capabilities and aligning the model with explicit, well-defined tasks where a "correct" output can be clearly demonstrated.
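For concreteness, the sketch below shows a minimal version of the SFT objective: token-level cross-entropy against the demonstrated response, with prompt and padding tokens masked out of the loss. The tensor shapes and the `response_mask` convention are illustrative assumptions, not tied to any particular library or model.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor,
             target_ids: torch.Tensor,
             response_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the demonstrated response tokens only.

    logits:        (batch, seq_len, vocab) next-token predictions from the LM
    target_ids:    (batch, seq_len) demonstration token ids, already shifted so
                   position t is the target for the logits at position t
    response_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for prompt
                   and padding tokens (only the response is supervised)
    """
    vocab = logits.size(-1)
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    # Average the loss over response tokens only.
    return (per_token * response_mask).sum() / response_mask.sum().clamp(min=1)
```

The key point for what follows is that this objective pushes the model towards one specific demonstrated string per prompt; it has no notion of one response being better than another.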
However, relying solely on SFT for achieving comprehensive alignment with human intent and values runs into significant limitations. These shortcomings are primary motivators for employing Reinforcement Learning from Human Feedback (RLHF).
Scalability and Coverage of Demonstrations
Creating a high-quality SFT dataset that covers the sheer breadth and depth of human expectations is a formidable challenge. Consider the vast space of potential user prompts and the nuances required in responses:
- Cost and Effort: Generating ideal, human-written responses for every conceivable scenario, including edge cases and complex reasoning tasks, is prohibitively expensive and time-consuming.
- Vast Input Space: LLMs can be prompted in countless ways. An SFT dataset, no matter how large, can only represent a small fraction of this space. The model may perform well on inputs similar to its training data but fail unpredictably on out-of-distribution prompts or slight variations of familiar ones.
- Implicit Knowledge: Many aspects of desired behavior (e.g., common sense, avoiding subtle social biases) are hard to capture exhaustively through explicit demonstrations alone. SFT might teach the model what to say in specific instances but not the underlying why.
Think of SFT as teaching grammar and vocabulary rules by example. While essential, it doesn't automatically teach someone how to write insightful analysis, compelling narratives, or ethically sound arguments in novel situations. That requires a different kind of learning signal.
Difficulty in Defining "Goodness" Objectively
For many alignment goals, specifying a single, perfect "gold standard" response for SFT is difficult, if not impossible:
- Subjectivity: What constitutes the "best" response often depends on context, individual user preferences, or cultural norms. Is a concise answer better than a detailed one? Is a cautious tone preferable to a confident one? SFT forces a choice, potentially leading to a model that satisfies some users but not others.
- Multiple Objectives: Desired LLM behavior often involves balancing multiple, sometimes competing, objectives: being helpful, harmless, honest, engaging, concise, etc. Crafting a single SFT demonstration that perfectly optimizes all these factors is challenging.
- Comparative Ease: Humans often find it much easier to compare two outputs and state a preference (e.g., "Response A is more helpful than Response B") than to author Response A from scratch. SFT cannot directly use this comparative preference signal, which is often richer and easier to elicit for complex tasks.
For example, asking an LLM to "explain deep learning to a 5-year-old" could yield several reasonable, creative, yet different responses. SFT would typically train the model towards one specific example, whereas a preference-based approach could learn the qualities that make any such explanation good (simplicity, use of analogy, accuracy).
Figure: Comparison of the data signal used in SFT versus the preference data central to RLHF. SFT relies on absolute examples, while RLHF learns from relative comparisons.
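To make the contrast concrete, the sketch below places the two data formats side by side, along with a Bradley-Terry-style pairwise loss that a reward model could minimize over preference pairs. The field names (`prompt`, `chosen`, `rejected`) and the helper function are illustrative assumptions, not a specific library's schema.

```python
import torch
import torch.nn.functional as F

# SFT learns from absolute demonstrations: one "gold" response per prompt.
sft_example = {
    "prompt": "Explain deep learning to a 5-year-old.",
    "response": "It's like teaching a robot by showing it lots of pictures...",
}

# Preference data instead records a relative judgment between two candidate
# responses to the same prompt.
preference_example = {
    "prompt": "Explain deep learning to a 5-year-old.",
    "chosen": "Imagine a puppy learning tricks: the more examples it sees...",
    "rejected": "Deep learning optimizes multi-layer parametric function...",
}

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: push the scalar reward assigned to the
    chosen response above the reward assigned to the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

Note that the pairwise loss only requires a relative judgment; no single "gold" response ever has to be authored.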
Specifying Complex or Implicit Goals
Alignment goals like "act harmlessly," "be honest," or "avoid generating misinformation" are notoriously difficult to specify comprehensively through SFT demonstrations alone.
- Abstract Principles: Harmlessness isn't just about avoiding explicitly toxic content; it involves subtle aspects like avoiding harmful stereotypes, refusing dangerous instructions gracefully, and not providing misleading information that could cause indirect harm. Demonstrating all facets of such abstract principles via input-output pairs is impractical.
- Negative Constraints: Many alignment goals are easier to state as what the model shouldn't do than to demonstrate exhaustively. While SFT can include examples of refusals, it struggles to generalize the underlying reason for refusal to novel harmful prompts.
- Superficial Mimicry: The model might learn surface-level patterns from SFT data without internalizing the intended principle. For instance, it might learn to use hedging language ("It appears that...") because such phrases appeared in "honest" examples, but apply it inappropriately, failing to be genuinely truthful or admit uncertainty when needed.
Overfitting and Loss of Generality
Intensive SFT on a specific dataset can lead to the model overfitting to the style, tone, and specific knowledge contained within those demonstrations.
- Mode Collapse: The model might lose some of its original generative diversity or creativity, tending to produce outputs that closely resemble the SFT examples even when more varied responses would be appropriate.
- Brittleness: While performing well on tasks seen during SFT, the model might become less robust or perform poorly when faced with slightly different tasks or phrasing.
These limitations highlight that while SFT is a valuable tool for basic adaptation, it's insufficient for achieving the deep, reliable, and generalizable alignment required for advanced LLMs interacting with humans in open-ended ways. The need to incorporate a broader, more scalable, and nuanced signal of human preference motivates the move towards methods like RLHF, which leverage comparative feedback to guide the model towards more desirable behaviors. The next chapters will detail how this preference signal is collected, modeled, and used within a reinforcement learning framework.