Alright, let's put theory into practice. Having discussed the importance of Supervised Fine-Tuning (SFT) for initializing our policy model, we will now walk through the steps to actually perform SFT on a pre-trained language model. This practical exercise uses libraries from the Hugging Face ecosystem, specifically transformers, datasets, and trl (Transformer Reinforcement Learning), which significantly simplify the process.
Our goal is to take a general-purpose Large Language Model (LLM) and fine-tune it on a dataset of high-quality prompt-response pairs. This adapted model will then serve as a better starting point for the subsequent reinforcement learning phase.
First, ensure you have the necessary libraries installed (pip install transformers datasets trl torch accelerate bitsandbytes). We begin by importing the required components; we will use PyTorch in these examples.
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
BitsAndBytesConfig # Optional: for quantization
)
from trl import SFTTrainer
import os
# Optional: Configure GPU usage if available
# os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Set to your desired GPU ID
We need a pre-trained language model to fine-tune. For demonstration purposes, we'll use a smaller, publicly available model like gpt2. In a real-world scenario, you would likely use a larger, more capable model that matches your target application's requirements. We also load the corresponding tokenizer, which is responsible for converting text into numerical representations the model understands. To manage memory, especially with larger models, we can use quantization techniques like bitsandbytes. This is optional but often helpful.
# Model identifier from Hugging Face Hub
base_model_id = "gpt2" # Replace with your model, e.g., "NousResearch/Llama-2-7b-hf"
# Optional: Quantization configuration
# bnb_config = BitsAndBytesConfig(
# load_in_4bit=True,
# bnb_4bit_quant_type="nf4",
# bnb_4bit_compute_dtype=torch.bfloat16
# )
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Set pad token if not set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Load the model
model = AutoModelForCausalLM.from_pretrained(
base_model_id,
# quantization_config=bnb_config, # Uncomment if using quantization
device_map="auto", # Automatically loads model across available GPUs/CPU
trust_remote_code=True
)
model.config.use_cache = False # Recommended for fine-tuning
SFT requires a dataset where each example represents a high-quality interaction. The trl library's SFTTrainer is flexible, but a common format is a dataset with a column (e.g., named 'text') containing the full prompt and response, often structured with special tokens.
Let's assume we have a dataset (perhaps loaded from Hugging Face Hub or a local file) where each entry looks like this:
{"text": "### Prompt: Explain the concept of photosynthesis in simple terms.\n### Response: Photosynthesis is the process plants use to turn sunlight, water, and air into food (sugar) for themselves, releasing oxygen as a byproduct."}
{"text": "### Prompt: Write a short poem about a rainy day.\n### Response: Silver drops on window pane,\nWhispering secrets of the rain.\nGray clouds drift in skies above,\nA cozy day for hearth and love."}
You can load datasets with the datasets library, either from local files as shown above or directly from the Hugging Face Hub. For this example, we'll use the databricks/databricks-dolly-15k instruction-following dataset, taking only a small slice of it to keep the demonstration quick.
# Example using a public dataset from the Hugging Face Hub
# Make sure dataset has a 'text' column or adapt using formatting_func
dataset_name = "databricks/databricks-dolly-15k"
# For demonstration, we'll just use a small part of the training set
dataset = load_dataset(dataset_name, split="train[:500]") # Using first 500 examples
# --- Data Preprocessing ---
# databricks/databricks-dolly-15k ships with 'instruction', 'context',
# 'response', and 'category' columns rather than a single 'text' column.
# SFTTrainer expects one text field (or a formatting function), so if
# 'text' is missing we build it from the other columns using the
# instruction/response format shown above.
if 'text' not in dataset.column_names:
def create_text_column(example):
# Simple concatenation, adjust based on actual Dolly format if needed
instr = example.get('instruction', '')
resp = example.get('response', '')
ctx = example.get('context', '')
if ctx:
return {"text": f"### Instruction:\n{instr}\n\n### Context:\n{ctx}\n\n### Response:\n{resp}"}
else:
return {"text": f"### Instruction:\n{instr}\n\n### Response:\n{resp}"}
dataset = dataset.map(create_text_column, remove_columns=dataset.column_names)
print("Sample dataset entry:")
print(dataset[0]['text'])
We define the training parameters using transformers.TrainingArguments. These control aspects like the learning rate, batch size, number of training epochs, saving frequency, and logging.
# Define output directory for saving checkpoints and final model
output_dir = "./sft_model_output"
training_args = TrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=2, # Adjust based on GPU memory
gradient_accumulation_steps=4, # Effective batch size = batch_size * grad_accum
learning_rate=2e-5,
logging_steps=20, # Log metrics every 20 steps
num_train_epochs=1, # Number of passes through the dataset
max_steps=-1, # Set to >0 to override epochs
save_strategy="epoch", # Save checkpoint at the end of each epoch
# save_steps=50, # Or save every N steps
report_to="tensorboard", # Or "wandb", "none"
fp16=True, # Use mixed precision (requires compatible GPU)
# bf16=True, # Use BF16 (requires Ampere+ GPU) - Choose one
optim="paged_adamw_8bit", # Memory-efficient optimizer, esp. with quantization
lr_scheduler_type="cosine", # Learning rate schedule
warmup_ratio=0.03, # Warmup steps for scheduler
)
Note: hyperparameters like the learning rate, batch size, and number of epochs strongly influence the result. Finding good values often requires experimentation with your specific model, dataset, and hardware; the values above are illustrative.
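As a quick sanity check on these values: with per_device_train_batch_size=2 and gradient_accumulation_steps=4, the effective batch size is 2 × 4 = 8 sequences per optimizer step (assuming a single GPU), so one epoch over our 500-example subset amounts to roughly 500 / 8 ≈ 63 optimizer steps, or about three logging events at logging_steps=20.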
Now we instantiate the SFTTrainer from the trl library. It orchestrates the fine-tuning process, handling data collation, tokenization, padding, and the training loop itself.
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset,
dataset_text_field="text", # The column name containing formatted prompt/response
max_seq_length=512, # Maximum sequence length for tokenization
packing=False, # Optional: pack multiple short examples into one sequence
# Set packing=True for potential speedup on datasets
# with many short sequences. Requires careful dataset prep.
)
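One caveat on versions: in more recent trl releases, the dataset-related arguments shown here (dataset_text_field, max_seq_length, packing) are passed through a trl.SFTConfig object used in place of TrainingArguments rather than directly to SFTTrainer. If the constructor above raises an unexpected-keyword error, check the documentation for your installed trl release.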
With everything configured, starting the training is straightforward. The trainer handles the optimization loop, gradient updates, logging, and checkpoint saving according to the TrainingArguments.
print("Starting SFT training...")
trainer.train()
print("Training finished.")
During training, monitor the logs (e.g., in the console or TensorBoard/WandB if configured) to observe the training loss. A decreasing loss generally indicates that the model is learning from the SFT dataset.
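With report_to="tensorboard" set, you can also watch the loss curve live by running tensorboard --logdir ./sft_model_output in a separate terminal (assuming TensorBoard is installed; the Trainer writes its event files under the output directory by default).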
After training completes, save the fine-tuned model weights and tokenizer configuration. This saved model is the output of the SFT phase.
# Define path for the final saved model
final_model_path = os.path.join(output_dir, "final_sft_model")
print(f"Saving final SFT model to {final_model_path}")
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path) # Save tokenizer along with the model
print("Model saved successfully.")
Let's perform a simple test to see how the fine-tuned model responds compared to the base model (qualitatively). We load our saved SFT model and generate text for a sample prompt.
from transformers import pipeline
# Load the fine-tuned model
print("Loading fine-tuned model for inference...")
sft_pipe = pipeline("text-generation", model=final_model_path, tokenizer=final_model_path, device_map="auto")
# Define a sample prompt using the format the model expects
# (Match the format used in your SFT dataset)
prompt_text = "### Instruction:\nExplain the main benefit of using version control systems like Git.\n\n### Response:\n"
print(f"\nGenerating response for prompt:\n{prompt_text}")
# Generate response
# Adjust generation parameters as needed
output = sft_pipe(prompt_text, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7)
print("\nGenerated Response:")
print(output[0]['generated_text'])
# Clean up GPU memory if needed
del sft_pipe
del model
# torch.cuda.empty_cache() # Uncomment if running into OOM issues later
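For a rough qualitative comparison, you can also generate from the original base model with the same prompt, since SFT's effect is easiest to see side by side. A minimal sketch reusing the variables defined above (the raw gpt2 output will likely be rambling, which is exactly the contrast we expect):
# Load the original, non-fine-tuned model for comparison
base_pipe = pipeline("text-generation", model=base_model_id, tokenizer=base_model_id, device_map="auto")
base_output = base_pipe(prompt_text, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7)
print("\nBase Model Response:")
print(base_output[0]['generated_text'])
# Free memory once the comparison is done
del base_pipe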
Examine the generated response. Does it follow the instruction? Is the style consistent with the examples in your SFT dataset? This quick check provides an initial assessment of the SFT process outcome. More rigorous evaluation would involve metrics and potentially human assessment, as discussed in the previous section ("Evaluating SFT Model Performance") and later in the course (Chapter 7).
You have now successfully executed the Supervised Fine-Tuning phase. The resulting model (final_sft_model in our example) has been adapted to better understand the desired task structure and output style based on the provided demonstrations. This fine-tuned model is now ready to serve as the initial policy network for the next stages of the RLHF pipeline: training a reward model and performing reinforcement learning optimization.