Having established the concept of alignment and its associated challenges, let's review the most common foundational techniques used to steer Large Language Model (LLM) behavior: instruction following and supervised fine-tuning (SFT). While this course focuses on advanced methods, understanding these initial steps is essential context.
Raw, pre-trained LLMs are typically optimized for next-token prediction on massive, unstructured text corpora. Their objective is to maximize the likelihood of the training data under the model, typically by minimizing a cross-entropy loss:
$$\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{i} \log P(x_i \mid x_{<i}; \theta)$$

where $\theta$ represents the model parameters and $x_i$ are tokens in the pre-training corpus. This process results in models with broad linguistic knowledge and generative capabilities but without specific instruction-following abilities or inherent adherence to safety guidelines. They learn grammar, facts, and reasoning patterns from the data but lack a specific goal beyond plausible text completion.
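To make this concrete, here is a minimal sketch of the next-token cross-entropy loss, assuming a PyTorch-style causal language model that produces logits of shape (batch, sequence, vocabulary). The function and variable names are illustrative rather than taken from any specific library.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: -sum_i log P(x_i | x_<i; theta)."""
    # Position t predicts token t+1, so shift predictions and targets by one.
    pred_logits = logits[:, :-1, :]   # predictions for tokens 1..T-1
    targets = tokens[:, 1:]           # the tokens those positions should produce
    # Flatten batch and time, average the per-token negative log-likelihood.
    return F.cross_entropy(
        pred_logits.reshape(-1, pred_logits.size(-1)),
        targets.reshape(-1),
    )
```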
To make pre-trained models more useful and controllable, a standard practice is Instruction Fine-tuning (IFT). This is a supervised learning phase where the model is further trained on a dataset composed of instructional prompts and desired responses.
The dataset $D_{\text{IFT}}$ takes the form of structured examples:

$$D_{\text{IFT}} = \{(\text{prompt}_k, \text{completion}_k)\}_{k=1}^{N}$$

Examples might include:

- (prompt: "Translate the following sentence to Spanish: 'The weather is nice today.'", completion: "El clima está agradable hoy.")
- (prompt: "Summarize the main points of this paragraph: [Paragraph Text]", completion: "[Concise Summary]")
- (prompt: "Write Python code to reverse a string.", completion: `def reverse_string(s):\n    return s[::-1]`)
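In code, such a dataset often reduces to a plain list of prompt-completion records (for example, one JSON object per line). The sketch below uses hypothetical field names to illustrate the structure; it is not tied to any particular training framework.

```python
# A tiny illustrative D_IFT: each record pairs an instruction-style prompt
# with the completion the model should learn to reproduce.
ift_dataset = [
    {
        "prompt": "Translate the following sentence to Spanish: 'The weather is nice today.'",
        "completion": "El clima está agradable hoy.",
    },
    {
        "prompt": "Summarize the main points of this paragraph: [Paragraph Text]",
        "completion": "[Concise Summary]",
    },
    {
        "prompt": "Write Python code to reverse a string.",
        "completion": "def reverse_string(s):\n    return s[::-1]",
    },
]
```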
The optimization objective during IFT is to adjust the model parameters $\theta$ (starting from $\theta_{\text{pretrain}}$) to minimize the negative log-likelihood of generating the target completion tokens, given the prompt:

$$\mathcal{L}_{\text{IFT}}(\theta) = -\sum_{k=1}^{N} \sum_{j=1}^{|c_k|} \log P(c_{k,j} \mid \text{prompt}_k, c_{k,<j}; \theta)$$

Here, $c_k = (c_{k,1}, \dots, c_{k,|c_k|})$ is the token sequence of the desired $\text{completion}_k$. Essentially, the model learns: "When you see input like $\text{prompt}_k$, produce output like $\text{completion}_k$."
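A common way to implement this objective is to concatenate the prompt and completion tokens, run the model once, and compute cross-entropy only on the completion positions. The sketch below assumes a PyTorch-style causal LM; masking with an ignore index of -100 follows a widespread convention, but the helper itself is illustrative.

```python
import torch
import torch.nn.functional as F

def ift_loss(logits: torch.Tensor, tokens: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Negative log-likelihood of the completion tokens only, conditioned on the prompt.

    logits: (batch, T, vocab) from a causal LM over [prompt ; completion] token ids.
    tokens: (batch, T) the same concatenated token ids.
    prompt_len: number of prompt tokens at the start of each sequence.
    """
    targets = tokens[:, 1:].clone()
    # Ignore positions whose *target* token is still part of the prompt,
    # so only completion tokens contribute to the loss.
    targets[:, : prompt_len - 1] = -100
    pred_logits = logits[:, :-1, :]
    return F.cross_entropy(
        pred_logits.reshape(-1, pred_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

Masking the prompt this way means gradients flow only from the completion tokens, matching the objective above; some practitioners instead include the prompt tokens in the loss, which changes the objective slightly.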
Figure: Basic workflow of Instruction Fine-tuning (IFT), adapting a pre-trained model using prompt-completion pairs.
IFT teaches the model the format of interaction (understanding instructions and providing relevant answers) and imbues it with specific capabilities reflected in the fine-tuning data.
IFT is a specific type of Supervised Fine-tuning (SFT). More generally, SFT involves adapting a pre-trained model using any dataset of input-output pairs $(x, y)$, minimizing a loss function (such as cross-entropy) computed on the target output $y$. Beyond instruction following, SFT can be used to adapt a model to a specialized domain, teach it a particular style or persona, or improve performance on specific tasks such as summarization or code generation.
IFT and SFT are fundamental steps towards achieving outer alignment. They directly shape the model's observable behavior by optimizing it to mimic the provided examples. If the fine-tuning dataset consists of helpful, honest, and harmless examples, the model learns to produce similar outputs.
However, relying solely on SFT/IFT for alignment has significant limitations, which motivate the advanced techniques discussed later. The model only learns to imitate the provided demonstrations and never receives explicit feedback about which outputs are worse or why. Capturing nuanced human preferences through examples alone is difficult, behavior can be brittle on inputs outside the fine-tuning distribution, and a model can match the surface patterns of its training data while missing their intent, a form of specification gaming.
In summary, IFT and SFT are powerful tools for making LLMs follow instructions and adopt specific knowledge or styles. They form the bedrock upon which more sophisticated alignment techniques like RLHF are often built. However, their limitations in handling nuanced preferences, ensuring robustness, and preventing specification gaming necessitate the advanced methods explored throughout this course. They primarily address what the model should output based on examples, rather than optimizing directly for the underlying principles of desired behavior.