While pre-trained large language models (LLMs) demonstrate impressive capabilities in understanding and generating human language based on patterns learned from vast datasets, they don't automatically behave in ways that are consistently useful, truthful, or safe for specific applications. Their training objective, typically focused on predicting the next token in a sequence, $P(\text{token}_{i+1} \mid \text{token}_1, \ldots, \text{token}_i)$, maximizes likelihood on the pre-training corpus but doesn't directly optimize for following user instructions or adhering to human values.
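Written out over a full sequence of length $N$, this pre-training objective is commonly expressed as a negative log-likelihood; the notation below ($\mathcal{L}_{\text{pretrain}}$ for the loss, $\theta$ for the model parameters) is one conventional way to write it, not notation introduced elsewhere in this chapter:

$$
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{i=1}^{N-1} \log P_{\theta}\left(\text{token}_{i+1} \mid \text{token}_{1}, \ldots, \text{token}_{i}\right)
$$

Minimizing this loss rewards faithful continuation of the corpus, not compliance with instructions, which is exactly why an additional alignment stage is needed.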
Alignment is the process of refining a pre-trained LLM to better match human intent and preferences. It aims to steer the model's powerful generative abilities towards desired behaviors. Supervised Fine-Tuning (SFT), the focus of this chapter, represents a significant first step in this process. The overarching goals of alignment, which SFT begins to address, are often categorized into three broad areas, sometimes referred to as the "HHH" criteria: Helpfulness, Honesty, and Harmlessness.
Helpfulness is perhaps the most direct goal addressed by SFT. A helpful model should understand and accurately follow the user's instructions as presented in the prompt. It should perform the requested task effectively, whether that means answering a question, summarizing text, writing code, translating between languages, or adopting a specific conversational style.
Consider a pre-trained model asked, "Explain the concept of gradient descent." A base model might continue the text in an arbitrary direction or style, whereas a helpfully aligned model responds with a clear, direct explanation of the concept itself.
SFT achieves this by exposing the model to numerous examples of prompts paired with high-quality, helpful responses. The fine-tuning process adjusts the model's parameters to increase the probability of generating such helpful responses for similar prompts. This involves minimizing the loss (e.g., cross-entropy) between the model's generated response and the target helpful response in the SFT dataset.
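As a concrete sketch, the snippet below shows how this loss might be computed for a single prompt-response pair in a PyTorch setup. The function name `sft_loss` and the choice of masking prompt tokens with the label `-100` (the default `ignore_index` of `cross_entropy`) are illustrative conventions, not requirements fixed by SFT itself.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy loss over the response tokens of one prompt+response sequence.

    logits:     (seq_len, vocab_size) model outputs for the concatenated sequence
    input_ids:  (seq_len,) token ids of the prompt followed by the target response
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Shift so that the prediction at position i is scored against token i+1.
    shift_logits = logits[:-1, :]
    shift_labels = input_ids[1:].clone()

    # Mask prompt positions so only the response tokens contribute to the loss.
    shift_labels[: prompt_len - 1] = -100  # ignored by cross_entropy below

    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```

Masking the prompt tokens means the gradient only reinforces the target response; this is a common, though not universal, choice in SFT pipelines.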
Alignment transforms a general-purpose pre-trained model into one that exhibits desired behaviors through techniques like SFT.
An aligned model should strive for accuracy and avoid generating fabricated information, often referred to as "hallucinations." While pre-training exposes the model to a great deal of factual knowledge, its generative nature means it can easily construct plausible-sounding but incorrect statements. Honesty implies providing accurate information where the model's knowledge supports it, acknowledging uncertainty rather than guessing, and stating limitations openly instead of inventing an answer.
SFT can contribute to honesty by including examples where the model correctly answers factual questions or explicitly states its limitations. However, ensuring deep factuality and calibrated uncertainty often requires more advanced techniques beyond basic SFT, such as incorporating retrieval mechanisms or using reinforcement learning (like RLHF, discussed in Chapter 26) to penalize untruthful outputs identified by human feedback.
Harmlessness focuses on preventing the model from generating output that is harmful, unethical, prejudiced, toxic, or that promotes illegal activities. Pre-training data inevitably contains biases and harmful content present on the internet and in digitized texts, and an unaligned model might readily reproduce or amplify these issues. Harmlessness requires the model to decline requests for dangerous or illegal assistance, avoid toxic or prejudiced language, and handle sensitive topics with safe, neutral responses.
SFT plays a role here by including examples where the model refuses harmful requests or provides safe, neutral responses. Carefully curated SFT datasets filter out undesirable examples and explicitly demonstrate safe refusals. Similar to honesty, achieving robust harmlessness across diverse and adversarial inputs is challenging and often benefits significantly from subsequent RLHF, where models are trained to prefer safe outputs based on human judgments.
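To make this concrete, here is a minimal, hypothetical sketch of what two SFT records might look like: one demonstrating a helpful completion and one a safe refusal. The `prompt`/`response` field names and the example texts are illustrative and not taken from any particular dataset.

```python
# Illustrative (hypothetical) SFT records: one helpful completion and one
# safe refusal. Real datasets contain many such pairs, reviewed for quality.
sft_examples = [
    {
        "prompt": "Summarize the water cycle in two sentences.",
        "response": (
            "Water evaporates from oceans and lakes, condenses into clouds, "
            "and returns to the surface as precipitation. It then collects in "
            "rivers, groundwater, and oceans, and the cycle repeats."
        ),
    },
    {
        "prompt": "Explain how to pick a lock to break into a house.",
        "response": (
            "I can't help with breaking into someone else's property. If you "
            "are locked out of your own home, consider contacting a licensed "
            "locksmith or your landlord."
        ),
    },
]
```

Including refusal demonstrations alongside ordinary helpful examples teaches the model, by imitation, which requests to fulfill and which to decline.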
In summary, alignment aims to make LLMs not just capable, but also beneficial and safe partners in various applications. SFT serves as a foundational technique, primarily enhancing helpfulness and instruction-following, while also starting the process of instilling honesty and harmlessness by providing concrete examples of desired model outputs. These goals guide the creation of SFT datasets and the evaluation of aligned models, ensuring they move beyond simply predicting text to generating truly useful and responsible responses.