Supervised Fine-tuning (SFT) is a common technique for adapting a pre-trained Large Language Model (LLM) to follow instructions or specialize in a particular domain. The core idea is straightforward: provide the model with examples of desired input-output behavior and train it to replicate that behavior using standard supervised learning. However, the effectiveness of SFT hinges significantly on how you structure and format the training data. The model needs unambiguous signals to understand what part of the input is the prompt (or instruction) and what part is the expected response it should learn to generate.
At its heart, SFT requires data presented as pairs of (prompt, completion). The model is trained to generate the completion when given the corresponding prompt. How you define these prompts and completions, and how you stitch them together, is critical.
While various custom formats exist, most SFT datasets converge on a few common structures, often stored in formats like JSON Lines (JSONL), where each line is an independent JSON object representing one training example.
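Because each JSONL line is a standalone JSON object, a dataset can be read one line at a time. The following is a minimal sketch using only the standard library; the helper name `read_jsonl` and the in-memory sample data are illustrative assumptions, not part of any particular framework.

```python
import io
import json

def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which line is malformed so it can be fixed in the dataset
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err

# In-memory stand-in for a dataset file (hypothetical contents)
sample = io.StringIO(
    '{"prompt": "Translate to French: \'Hello, world!\'", "completion": "Bonjour le monde!"}\n'
    '{"prompt": "Summarize: ...", "completion": "..."}\n'
)
examples = list(read_jsonl(sample))
print(len(examples), examples[0]["completion"])
```

Failing loudly on a malformed line is a design choice here; silently skipping bad records is also common but can hide data problems.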
Simple Prompt-Completion Pairs: This is the most basic format, suitable for tasks where the input is a single piece of text and the output is its continuation or transformation.
{"prompt": "Translate to French: 'Hello, world!'", "completion": "Bonjour le monde!"}
{"prompt": "Summarize the following text: [Long text input]...", "completion": "[Concise summary]..."}
During training, the prompt and completion are often concatenated, sometimes with a separator token, and the model learns to predict the tokens belonging to the completion.
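The concatenation step might look like the sketch below. The newline separator and the `</s>` end-of-sequence string are assumptions chosen for illustration; the right values depend on the base model's tokenizer. The function also returns the character offset where the completion begins, which is useful later when deciding which tokens contribute to the loss.

```python
def build_training_text(prompt, completion, separator="\n", eos_token="</s>"):
    """Join prompt and completion; return the full text and where the completion begins.

    separator and eos_token are illustrative defaults, not universal values.
    """
    prefix = prompt + separator
    full_text = prefix + completion + eos_token
    return full_text, len(prefix)

text, completion_start = build_training_text(
    "Translate to French: 'Hello, world!'", "Bonjour le monde!"
)
# Everything from completion_start onward is what the model should learn to produce
print(repr(text[completion_start:]))
```

Appending the end-of-sequence token to the completion teaches the model where a response should stop.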
Instruction-Following Format: To explicitly train models to follow instructions, datasets often break down the input further, separating the instruction itself from any specific input data it should operate on. This helps the model generalize better to new instructions.
{"instruction": "Identify the main sentiment.", "input": "The movie was fantastic!", "output": "Positive"}
{"instruction": "Write a short story about a brave knight.", "input": "", "output": "Sir Gideon adjusted his helmet, the dragon's roar echoing in the valley..."}
{"instruction": "Extract the email addresses.", "input": "Contact us at info@example.com or support@example.org.", "output": "info@example.com, support@example.org"}
When preparing this data for the model, these fields are typically combined into a single prompt string using a predefined template. For example:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
The model is then trained to generate the text that follows the "### Response:" marker. The specific template structure can vary, but consistency within the dataset is important.
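Applying such a template can be sketched as a small formatting function. The template with an input section mirrors the one shown above; the shorter variant used when `input` is empty is an assumption (a common convention, not something the text prescribes), as are the names `format_example`, `TEMPLATE_WITH_INPUT`, and `TEMPLATE_NO_INPUT`.

```python
TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
# Assumed variant for examples whose "input" field is empty
TEMPLATE_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(example):
    """Return (prompt_text, target_text) for one instruction-style record."""
    template = TEMPLATE_WITH_INPUT if example.get("input") else TEMPLATE_NO_INPUT
    prompt = template.format(
        instruction=example["instruction"], input=example.get("input", "")
    )
    return prompt, example["output"]

prompt, target = format_example(
    {"instruction": "Identify the main sentiment.",
     "input": "The movie was fantastic!",
     "output": "Positive"}
)
print(prompt + target)
```

Keeping the prompt and the target as separate return values makes it straightforward to tell the two apart when labels are built.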
Chat and Conversational Format:
For training chat models or assistants, the data needs to represent multi-turn dialogues. This is often structured as a list of turns, each with a designated role (e.g., user, assistant, system).
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]}
{"messages": [
{"role": "user", "content": "Write a python function to calculate factorial."},
{"role": "assistant", "content": "```python\ndef factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)\n```"}
]}
Processing this format for SFT involves serializing the conversation history into a single linear sequence. This often requires special tokens to delineate turns and roles. For instance, a model might expect input formatted like:
<|im_start|>system\nYou are...<|im_end|>\n<|im_start|>user\nWhat is...?<|im_end|>\n<|im_start|>assistant\nThe capital is...<|im_end|>
The model is then trained to predict only the tokens corresponding to the assistant messages.
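A rough sketch of this serialization in plain Python follows. It assumes the ChatML-style markers from the example above and splits the sequence into segments flagged as target or non-target, so that only assistant content is later used for the loss. The function name `serialize_chatml` and the choice to treat the closing `<|im_end|>` and newline as part of the target are assumptions; in practice, Hugging Face tokenizers provide `apply_chat_template` to produce the format a given model expects.

```python
def serialize_chatml(messages):
    """Return a list of (text, is_target) segments for one conversation."""
    segments = []
    for message in messages:
        header = f"<|im_start|>{message['role']}\n"
        body = f"{message['content']}<|im_end|>\n"
        if message["role"] == "assistant":
            # The role header is context; the content and end marker are the target
            segments.append((header, False))
            segments.append((body, True))
        else:
            segments.append((header + body, False))
    return segments

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
segments = serialize_chatml(conversation)
full_text = "".join(text for text, _ in segments)
target_text = "".join(text for text, is_target in segments if is_target)
print(full_text)
```

Including the end marker in the target is what lets the fine-tuned model learn to end its turn.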
Many LLMs rely on special tokens to structure the input sequence effectively during both pre-training and fine-tuning. These tokens act as delimiters, signaling the boundaries between instructions, user input, model responses, or conversational turns. Examples include:
<s> and </s>: Beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens.
[INST] and [/INST]: Used by models like Llama 2-Chat to encapsulate user instructions.
<|im_start|> and <|im_end|>: Used by the ChatML format (common with OpenAI models and adapted elsewhere) to mark the beginning and end of messages, often paired with role identifiers (system, user, assistant).
<|user|>, <|assistant|>, <|endoftext|>, [SEP], [CLS]: Various other tokens used by different models.
It is essential to format your SFT data using the specific template and special tokens that the base model expects. Using the wrong format or omitting required special tokens can significantly degrade performance, because the model fails to correctly interpret the structure of the input it receives. Always consult the documentation or model card for the base LLM you are fine-tuning.
A fundamental aspect of SFT is ensuring the model only learns to predict the target completion or response, not the prompt or instruction text it was given. If the model were trained to predict the entire concatenated sequence (prompt + completion), the loss calculation would include errors made predicting the prompt tokens. This is undesirable; we want the model's gradients to be based solely on its ability to generate the desired output.
This is achieved through loss masking. During the training process, when calculating the cross-entropy loss, the loss values corresponding to the prompt tokens are ignored. A common practice in frameworks like PyTorch is to assign a special ignore index (e.g., -100) to the labels (target token IDs) that correspond to the prompt tokens. The loss function then automatically skips these positions.
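To make the effect of the ignore index concrete, here is a small pure-Python sketch of a masked cross-entropy. It is an illustration of the idea, not how a framework implements it; the function name `masked_cross_entropy` and the toy logits are assumptions. In PyTorch, `torch.nn.CrossEntropyLoss` offers the same behavior through its `ignore_index` argument, which defaults to -100.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary id)
    labels: list of target ids, with ignore_index at masked positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # prompt position: contributes nothing to the loss
        max_score = max(scores)
        # Numerically stable log-sum-exp for the softmax normalizer
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[label]
        count += 1
    return total / count if count else 0.0

# Toy vocabulary of size 3; the first position is a masked prompt token
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [IGNORE_INDEX, 2, 1]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Changing the scores at the masked position leaves the loss unchanged, which is exactly the property loss masking is meant to provide.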
Consider a simplified example:
Prompt: Translate: Hello
Completion: Bonjour
Combined input string (conceptual): Translate: Hello Bonjour
Tokenized IDs (illustrative): [101, 8991, 102, 156, 205, 103], where 156 stands for "Hello" and 205 for "Bonjour"
Target labels for loss calculation (with masking): [-100, -100, -100, -100, 205, 103]
Here, -100 indicates that the loss should not be computed for the tokens corresponding to "Translate:", "Hello", and any initial separator/special tokens. The loss is calculated only on the model's predictions for the token "Bonjour" and the subsequent end token.
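The label construction from the example above can be sketched as follows, using the same illustrative token IDs. The helper name `build_labels` is hypothetical, and the sketch assumes the prompt and completion have already been tokenized separately.

```python
IGNORE_INDEX = -100

def build_labels(prompt_ids, completion_ids, ignore_index=IGNORE_INDEX):
    """Concatenate token ids and mask every prompt position in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [ignore_index] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Illustrative ids: special token, "Translate:", separator, "Hello" | "Bonjour", end token
prompt_ids = [101, 8991, 102, 156]
completion_ids = [205, 103]
input_ids, labels = build_labels(prompt_ids, completion_ids)
print(input_ids)
print(labels)
```

One caveat: with subword tokenizers, tokenizing the prompt and completion separately can yield a different token boundary than tokenizing the concatenated string, so it is worth checking that the prompt length used for masking matches the tokens the model actually sees.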
Figure: Conceptual flow showing how raw instruction/input/output data is transformed into a formatted string, tokenized, and then how labels are masked to ensure the model only learns from the target response tokens during Supervised Fine-tuning.
In a typical SFT pipeline using libraries like Hugging Face's transformers and datasets, the steps are roughly:
Load the raw examples into a Dataset object.
Format each example into a single string using the prompt template.
Tokenize the formatted string to obtain input_ids.
Create a labels array by copying the tokenized input_ids and replacing the IDs corresponding to the prompt section with the ignore index (e.g., -100).
Careful attention to these formatting steps is essential for successful SFT. Consistency in formatting, correct use of special tokens, and proper loss masking are foundational elements for effectively adapting LLMs to specific tasks and instruction-following behaviors.
© 2025 ApX Machine Learning