Once you have curated or generated a dataset of high-quality instruction-response pairs, the next step is to format this data in a way the model can effectively learn from during Supervised Fine-Tuning (SFT). The core idea of SFT is to teach the model to generate a desired output (the completion) when given a specific input (the prompt). Proper formatting ensures the model understands the task structure and focuses its learning on generating the correct completion.
At its simplest, each SFT example consists of two parts: a prompt (the input given to the model) and a completion (the desired output).
The model is trained to predict the tokens of the completion sequentially, given the prompt. Consider a straightforward question-answering example:
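As a minimal sketch, such a pair could be represented as follows. The question, the answer, and the `Question:`/`Answer:` template are illustrative assumptions, not a prescribed format.

```python
# Hypothetical question-answering pair; the template is an illustrative assumption.
example = {
    "prompt": "Question: What is the capital of France?\nAnswer:",
    "completion": " Paris",
}

# For training, the two parts are joined into a single sequence;
# the model learns to continue the prompt with the completion.
training_text = example["prompt"] + example["completion"]
print(training_text)
```

The completion starts with a single space so that the joined text reads naturally after `Answer:`.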
During training, the model processes the prompt and learns to associate it with the target completion. The exact text used for the prompt can vary significantly depending on the task and the desired interaction style. It might include explicit instructions, examples (in few-shot scenarios), or conversational history.
The structure needs to adapt to the specific behavior you want the model to learn.
For tasks where the model should follow an explicit command, the prompt clearly states the instruction, often followed by the input data.
# Example 1: Summarization
Prompt: "Summarize the following article:\n\n[Article text here...]\n\nSummary:"
Completion: "[Concise summary of the article]"
# Example 2: Code Generation
Prompt: "Write a Python function that calculates the factorial of a number.\n```python\n"
Completion: "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)\n```"
To train conversational agents, the format must represent the back-and-forth nature of dialogue. This often involves using special tokens or markers to delineate user and assistant turns.
# Example Dialogue Format using special role tokens
Prompt: "<|USER|> Hello, can you explain photosynthesis?\n<|ASSISTANT|>"
Completion: " Photosynthesis is the process used by plants, algae, and cyanobacteria to convert light energy into chemical energy..."
# Example Multi-turn Dialogue Format
Prompt: "<|USER|> What's the weather like in London?\n<|ASSISTANT|> It's currently cloudy and 15°C in London.\n<|USER|> What about tomorrow?\n<|ASSISTANT|>"
Completion: " Tomorrow's forecast for London is partly sunny with a high of 18°C."
Using distinct markers like <|USER|> and <|ASSISTANT|> helps the model learn the conversational structure and identify whose turn it is to speak.
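One way to produce such examples is to render a list of messages into a prompt and a completion. The sketch below assumes the role markers shown above; the helper name `build_dialogue_example` and the `role`/`content` message schema are hypothetical.

```python
# Role markers follow the dialogue examples; the helper and message schema are hypothetical.
ROLE_TOKENS = {"user": "<|USER|>", "assistant": "<|ASSISTANT|>"}

def build_dialogue_example(messages):
    """Split a conversation into (prompt, completion).

    `messages` is a list of {"role", "content"} dicts whose last entry is the
    assistant reply to learn. Everything before it becomes the prompt.
    """
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("The final message must be the assistant's reply.")
    *history, target = messages
    lines = [f"{ROLE_TOKENS[m['role']]} {m['content']}" for m in history]
    # End the prompt with the assistant marker so the model knows it is its turn.
    prompt = "\n".join(lines + [ROLE_TOKENS["assistant"]])
    completion = " " + target["content"]
    return prompt, completion

messages = [
    {"role": "user", "content": "What's the weather like in London?"},
    {"role": "assistant", "content": "It's currently cloudy and 15°C in London."},
    {"role": "user", "content": "What about tomorrow?"},
    {"role": "assistant", "content": "Tomorrow's forecast for London is partly sunny with a high of 18°C."},
]
prompt, completion = build_dialogue_example(messages)
print(prompt)
print(completion)
```

Generating every example through one function like this keeps the placement of markers, spaces, and newlines identical across the dataset.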
For tasks requiring reasoning, the completion itself might include the intermediate steps leading to the final answer. This teaches the model how to arrive at the solution.
Prompt: "Question: John has 5 apples. He buys 3 more boxes, each containing 4 apples. How many apples does he have in total?\nAnswer:"
Completion: " John starts with 5 apples. He buys 3 boxes * 4 apples/box = 12 apples. In total, he has 5 + 12 = 17 apples. The final answer is 17."
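Ending each reasoning completion with a fixed phrase also makes the data easy to validate. The sketch below assumes the phrase "The final answer is" from the example above; the helper `extract_final_answer` is hypothetical.

```python
import re

completion = (
    " John starts with 5 apples. He buys 3 boxes * 4 apples/box = 12 apples."
    " In total, he has 5 + 12 = 17 apples. The final answer is 17."
)

def extract_final_answer(text):
    """Return the number after 'The final answer is' as a string, or None if absent."""
    match = re.search(r"The final answer is\s+(-?\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None

# Recompute the arithmetic described in the reasoning steps.
expected = 5 + 3 * 4
answer = extract_final_answer(completion)
print(answer, answer == str(expected))
```

A check like this can flag examples whose stated answer disagrees with a known reference before they reach training.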
To help the model clearly distinguish between the prompt and the completion within the concatenated input sequence, special tokens are often used. These can be pre-defined tokens from the model's tokenizer (like [SEP], </s>, <|endoftext|>) or custom tokens added specifically for SFT (like <|PROMPT|>, <|COMPLETION|>, <|END_OF_TURN|>).
Consider a simplified example using generic tokens:
import torch
from transformers import AutoTokenizer
# Assume tokenizer is already loaded
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Example tokenizer
# Add special tokens if needed (check if they exist first)
special_tokens_dict = {'sep_token': '<|SEP|>', 'pad_token': '<|PAD|>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
# Remember to resize model embeddings if new tokens were added
# (num_added_toks > 0), e.g. model.resize_token_embeddings(len(tokenizer))
prompt = "Translate to French: Hello world"
completion = " Bonjour le monde"
# Simple formatting using a separator token
formatted_text = (
    f"{prompt}{tokenizer.sep_token}"
    f"{completion}{tokenizer.eos_token}"
)
# Tokenize the formatted text
tokenized_input = tokenizer(formatted_text, return_tensors="pt")
print("Formatted Text:", formatted_text)
print("Token IDs:", tokenized_input['input_ids'])
# Output might look like (token IDs depend on the exact tokenizer):
# Formatted Text: Translate to French: Hello world<|SEP|> Bonjour le monde<|endoftext|>
# Token IDs: tensor([[ 14685, 284, 10607, 35, 995, 11858, 50257,
#                      40195, 259, 813, 50256]])
The choice of separator affects how the model segments the input during training and inference. Consistency in applying these tokens across the dataset is important.
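A critical aspect of SFT training is ensuring that the loss is calculated only on the completion tokens. The model should learn to predict the desired output, not to predict the prompt itself (which it already received as input). This is typically achieved by masking the labels: the label at each prompt position is set to a value the loss function ignores. The attention mask is not the tool for this; it controls which tokens the model attends to (for example, excluding padding), not which positions contribute to the loss.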
When preparing the batch for the model, the input IDs will contain the concatenated prompt and completion tokens. The labels (targets for the loss function) are usually a shifted version of the input IDs. We need to tell the loss function (e.g., CrossEntropyLoss) to ignore the losses calculated for the prompt tokens.
import torch
import torch.nn.functional as F
# Assume tokenized_input contains prompt + separator + completion + eos
# input_ids = tokenized_input['input_ids'] # Shape: [batch_size, sequence_length]
# Example: A single sequence's IDs
# Prompt: "Q: Why sky blue? <|SEP|>" -> IDs [10, 20, 30, 40, 50] (length 5)
# Completion: " Scattering <|EOS|>" -> IDs [60, 70, 80] (length 3)
# Concatenated: [10, 20, 30, 40, 50, 60, 70, 80] (length 8)
input_ids = torch.tensor([[10, 20, 30, 40, 50, 60, 70, 80]])
# For a causal LM, the logits at position t predict the token at position t+1.
# Unmasked labels start as a copy of input_ids:
#   [10, 20, 30, 40, 50, 60, 70, 80]
# Masking the prompt positions with -100 (the ignore_index) gives:
#   [-100, -100, -100, -100, -100, 60, 70, 80]
# The one-position shift is applied when computing the loss (see the
# "Calculate loss" step). Hugging Face causal LM classes typically apply
# this shift internally when `labels` are passed to the model.
labels = torch.tensor([[-100, -100, -100, -100, -100, 60, 70, 80]])  # Mask prompt tokens
# Assume model_output has shape [batch_size, sequence_length, vocab_size]
# Example dummy output logits
vocab_size = 100
sequence_length = input_ids.shape[1]
model_output_logits = torch.randn(1, sequence_length, vocab_size)
# Calculate loss
# Shift so that the logits at position t are scored against the token at t+1
shift_logits = model_output_logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
# Reshape for CrossEntropyLoss: needs (N, C) and (N)
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100)  # -100 is the default ignore index
loss = loss_fct(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1)
)
print("Calculated Loss (only on completion tokens):", loss.item())
In this snippet, setting the label values corresponding to the prompt tokens to -100 (the default ignore_index for PyTorch's CrossEntropyLoss) ensures that these positions do not contribute to the loss calculation or gradient updates. Only the model's predictions for the completion tokens ([60, 70, 80]) are penalized.
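In practice the masked labels are usually built from the prompt and completion token IDs rather than written by hand. The following is a minimal sketch in plain Python; the helper name `build_inputs_and_labels` is hypothetical, and it assumes the prompt and completion were tokenized separately.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def build_inputs_and_labels(prompt_ids, completion_ids):
    """Concatenate token IDs and mask the prompt portion of the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

prompt_ids = [10, 20, 30, 40, 50]   # prompt + separator
completion_ids = [60, 70, 80]       # completion + EOS
input_ids, labels = build_inputs_and_labels(prompt_ids, completion_ids)
print(input_ids)  # [10, 20, 30, 40, 50, 60, 70, 80]
print(labels)     # [-100, -100, -100, -100, -100, 60, 70, 80]
```

Tokenizing the two parts separately keeps the boundary unambiguous, at the cost that some tokenizers may split text slightly differently than they would on the joined string.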
Typically, the prompt and completion are concatenated into a single sequence fed to the model, often separated by a special token, as shown previously. The model then processes this entire sequence.
A common structure for SFT input sequences. The prompt and completion are concatenated, often with separator and end-of-sequence tokens. Loss is typically calculated only on the completion part.
Maintaining absolute consistency in formatting across all examples in your SFT dataset is essential. Inconsistent use of whitespace, newlines, or special tokens can confuse the model and hinder its ability to learn the desired pattern. Choose a format and apply it uniformly.
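A lightweight check run over the dataset can catch such inconsistencies early. The sketch below is one possible approach; the function name `find_format_issues` and the specific conventions it enforces (prompt ends with the assistant marker, completion starts with exactly one space, no trailing whitespace) are assumptions chosen for illustration.

```python
def find_format_issues(example, prompt_suffix="<|ASSISTANT|>"):
    """Return a list of human-readable formatting problems for one example."""
    issues = []
    prompt, completion = example["prompt"], example["completion"]
    if not prompt.endswith(prompt_suffix):
        issues.append(f"prompt does not end with {prompt_suffix!r}")
    if prompt != prompt.rstrip():
        issues.append("prompt has trailing whitespace")
    if not completion.startswith(" ") or completion.startswith("  "):
        issues.append("completion should start with exactly one space")
    if completion != completion.rstrip():
        issues.append("completion has trailing whitespace")
    return issues

good = {"prompt": "<|USER|> Hi\n<|ASSISTANT|>", "completion": " Hello!"}
bad = {"prompt": "<|USER|> Hi\n<|ASSISTANT|> ", "completion": "Hello!\n"}

print(find_format_issues(good))
for issue in find_format_issues(bad):
    print("-", issue)
```

The exact rules matter less than applying the same rules to every example.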
SFT datasets are often stored in formats like JSON Lines (.jsonl), where each line is a JSON object representing one example.
{"prompt": "Classify the sentiment: 'This movie was fantastic!'\nSentiment:", "completion": " Positive"}
{"prompt": "Write a short poem about the moon.\nPoem:", "completion": " Silver disk in velvet night,\nCasting shadows, soft and light"}
{"prompt": "<|USER|> What is the boiling point of water in Celsius?\n<|ASSISTANT|>", "completion": " The boiling point of water is 100 degrees Celsius"}
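Reading and writing this format needs only the standard library. The sketch below is a minimal example; the helper names `write_jsonl` and `read_jsonl` and the file name `sft_data.jsonl` are hypothetical.

```python
import json
import os
import tempfile

examples = [
    {"prompt": "Classify the sentiment: 'This movie was fantastic!'\nSentiment:",
     "completion": " Positive"},
    {"prompt": "Write a short poem about the moon.\nPoem:",
     "completion": " Silver disk in velvet night,\nCasting shadows, soft and light"},
]

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Load all non-empty lines as JSON objects."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "sft_data.jsonl")
write_jsonl(path, examples)
loaded = read_jsonl(path)
print(len(loaded), loaded == examples)
```

Because newlines inside strings are escaped as `\n`, a multi-line completion such as the poem still occupies a single line in the file.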
Alternatively, structured formats might separate instruction, input, and output:
{
  "instruction": "Translate the following English text to Spanish.",
  "input": "The weather is nice today.",
  "output": " Hace buen tiempo hoy."
}
The chosen structure should map clearly onto the prompt-completion format used during tokenization and training.
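As an illustration of that mapping, the sketch below converts an instruction/input/output record into a prompt-completion pair. The function name `to_prompt_completion` and the `Response:` template are assumptions; any template works as long as it is applied uniformly.

```python
def to_prompt_completion(record):
    """Map an instruction/input/output record to a prompt/completion pair."""
    instruction = record["instruction"].strip()
    input_text = record.get("input", "").strip()
    if input_text:
        prompt = f"{instruction}\n\n{input_text}\n\nResponse:"
    else:
        # Some records carry no separate input; the instruction stands alone.
        prompt = f"{instruction}\n\nResponse:"
    # Normalize to exactly one leading space so all completions start the same way.
    completion = " " + record["output"].strip()
    return {"prompt": prompt, "completion": completion}

record = {
    "instruction": "Translate the following English text to Spanish.",
    "input": "The weather is nice today.",
    "output": " Hace buen tiempo hoy.",
}
pair = to_prompt_completion(record)
print(repr(pair["prompt"]))
print(repr(pair["completion"]))
```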
A practical consideration is the maximum sequence length supported by the model. If the concatenated prompt and completion exceed this limit, you need a truncation strategy. Common approaches include truncating the prompt, truncating the completion, or discarding over-length examples entirely.
The best strategy depends on the specific task and the relative importance of the prompt versus the completion content.
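The trade-off can be made concrete with a small helper that operates on token IDs. This is a sketch under stated assumptions: the function name `truncate_example` and the three strategy names are hypothetical, chosen to illustrate cutting the prompt, cutting the completion, or dropping the example.

```python
def truncate_example(prompt_ids, completion_ids, max_length, strategy="prompt_left"):
    """Fit prompt + completion into max_length tokens, or return None to drop it.

    Hypothetical strategies:
      "prompt_left":      drop the earliest prompt tokens, keep the completion intact
      "completion_right": keep the prompt, cut the end of the completion
      "drop":             discard over-length examples entirely
    """
    total = len(prompt_ids) + len(completion_ids)
    if total <= max_length:
        return prompt_ids, completion_ids
    if strategy == "drop":
        return None
    if strategy == "prompt_left":
        keep = max_length - len(completion_ids)
        if keep <= 0:
            return None  # the completion alone does not fit
        return prompt_ids[-keep:], completion_ids
    if strategy == "completion_right":
        keep = max_length - len(prompt_ids)
        if keep <= 0:
            return None  # the prompt alone fills the window
        return prompt_ids, completion_ids[:keep]
    raise ValueError(f"Unknown strategy: {strategy}")

prompt_ids = list(range(10))
completion_ids = [100, 101, 102]
print(truncate_example(prompt_ids, completion_ids, max_length=8))
print(truncate_example(prompt_ids, completion_ids, max_length=12, strategy="completion_right"))
```

Note that cutting the completion on the right also removes its end-of-sequence token, so the model sees a target that never terminates; for that reason, trimming the prompt or dropping the example is often the safer choice when the completion matters most.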
By carefully formatting your SFT data, clearly delineating prompts and completions, and ensuring consistency, you provide the model with the structured input it needs to effectively learn aligned behaviors like instruction following and helpful dialogue.
© 2025 ApX Machine Learning