Alright, let's put theory into practice. Having discussed the importance of Supervised Fine-Tuning (SFT) for initializing our policy model, we will now walk through the steps to actually perform SFT on a pre-trained language model. This practical exercise uses libraries from the Hugging Face ecosystem, specifically transformers, datasets, and trl (Transformer Reinforcement Learning), which significantly simplify the process.

Our goal is to take a general-purpose Large Language Model (LLM) and fine-tune it on a dataset of high-quality prompt-response pairs. This adapted model will then serve as a better starting point for the subsequent reinforcement learning phase.

Setting the Stage: Imports and Configuration

First, ensure you have the necessary libraries installed (pip install transformers datasets trl torch accelerate bitsandbytes). We begin by importing the required components. We will use PyTorch in these examples.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig  # Optional: for quantization
)
from trl import SFTTrainer
import os

# Optional: Configure GPU usage if available
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set to your desired GPU ID
```

Step 1: Load the Base Model and Tokenizer

We need a pre-trained language model to fine-tune. For demonstration purposes, we'll use a smaller, publicly available model like gpt2. In a practical scenario, you would likely use a larger, more capable model that matches your target application's requirements. We also load the corresponding tokenizer, which is responsible for converting text into numerical representations the model understands.

To manage memory, especially with larger models, we can use quantization techniques like bitsandbytes. This is optional but often helpful.

```python
# Model identifier from Hugging Face Hub
base_model_id = "gpt2"  # Replace with your model, e.g., "NousResearch/Llama-2-7b-hf"

# Optional: Quantization configuration
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    # quantization_config=bnb_config,  # Uncomment if using quantization
    device_map="auto",  # Automatically loads model across available GPUs/CPU
    trust_remote_code=True
)
model.config.use_cache = False  # Recommended for fine-tuning
```

Step 2: Prepare the Dataset

SFT requires a dataset where each example represents a high-quality interaction. The trl library's SFTTrainer is flexible, but a common format is a dataset with a column (e.g., named 'text') containing the full prompt and response, often structured with special tokens.

Let's assume we have a dataset (perhaps loaded from Hugging Face Hub or a local file) where each entry looks like this:

```json
{"text": "### Prompt: Explain the concept of photosynthesis in simple terms.\n### Response: Photosynthesis is the process plants use to turn sunlight, water, and air into food (sugar) for themselves, releasing oxygen as a byproduct."}
{"text": "### Prompt: Write a short poem about a rainy day.\n### Response: Silver drops on window pane,\nWhispering secrets of the rain.\nGray clouds drift in skies above,\nA cozy day for hearth and love."}
```

You can load datasets using the datasets library.
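If your examples live in a local JSONL file like the one above, the built-in "json" loader can read them directly. Here is a minimal sketch; the file name sft_data.jsonl is a hypothetical placeholder, not a file created elsewhere in this chapter.

```python
from datasets import load_dataset

# Load prompt/response pairs from a local JSONL file.
# "sft_data.jsonl" is a hypothetical path; replace it with your own file.
local_dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

print(local_dataset[0]["text"])  # Each record already provides the 'text' column
```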
For this example, we will load a public dataset and format it for SFTTrainer. A popular choice for demonstration is the databricks/databricks-dolly-15k dataset of instruction-following examples.

```python
# Example using a public dataset (requires 'pip install pyarrow')
# Make sure the dataset has a 'text' column or adapt it using a formatting function
dataset_name = "databricks/databricks-dolly-15k"

# For demonstration, we'll just use a small part of the training set
dataset = load_dataset(dataset_name, split="train[:500]")  # Using first 500 examples

# --- Simple Data Preprocessing Example (if needed) ---
# If your dataset columns are named differently (e.g., 'instruction', 'response'),
# you might need to format them into a single 'text' field.
# def format_dolly(example):
#     # Adjust formatting based on your specific dataset structure
#     if example.get("context"):
#         return f"### Instruction:\n{example['instruction']}\n\n### Context:\n{example['context']}\n\n### Response:\n{example['response']}"
#     else:
#         return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
# Apply formatting if the dataset doesn't have a 'text' field ready
# dataset = dataset.map(lambda x: {"text": format_dolly(x)})
# --------------------------------------------------------

# The Dolly dataset has 'instruction', 'context', and 'response' columns rather than
# a single 'text' column. SFTTrainer can often handle this directly if specified,
# but we build a 'text' column here for clarity. If the dataset already has 'text',
# this step is skipped.
if 'text' not in dataset.column_names:
    def create_text_column(example):
        # Simple concatenation, adjust based on actual Dolly format if needed
        instr = example.get('instruction', '')
        resp = example.get('response', '')
        ctx = example.get('context', '')
        if ctx:
            return {"text": f"### Instruction:\n{instr}\n\n### Context:\n{ctx}\n\n### Response:\n{resp}"}
        else:
            return {"text": f"### Instruction:\n{instr}\n\n### Response:\n{resp}"}

    dataset = dataset.map(create_text_column, remove_columns=dataset.column_names)

print("Sample dataset entry:")
print(dataset[0]['text'])
```

Step 3: Configure Training Arguments

We define the training parameters using transformers.TrainingArguments. These control aspects like the learning rate, batch size, number of training epochs, saving frequency, and logging.

```python
# Define output directory for saving checkpoints and final model
output_dir = "./sft_model_output"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,   # Adjust based on GPU memory
    gradient_accumulation_steps=4,   # Effective batch size = batch_size * grad_accum
    learning_rate=2e-5,
    logging_steps=20,                # Log metrics every 20 steps
    num_train_epochs=1,              # Number of passes through the dataset
    max_steps=-1,                    # Set to >0 to override epochs
    save_strategy="epoch",           # Save checkpoint at the end of each epoch
    # save_steps=50,                 # Or save every N steps
    report_to="tensorboard",         # Or "wandb", "none"
    fp16=True,                       # Use mixed precision (requires compatible GPU)
    # bf16=True,                     # Use BF16 (requires Ampere+ GPU) - Choose one
    optim="paged_adamw_8bit",        # Memory-efficient optimizer, esp. with quantization
    lr_scheduler_type="cosine",      # Learning rate schedule
    warmup_ratio=0.03,               # Warmup steps for scheduler
)
```

Note: Hyperparameters like the learning rate, batch size, and number of epochs have a significant impact on results. Finding optimal values often requires experimentation based on your specific model, dataset, and hardware. The values above are illustrative.
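To make the batch-size comment concrete, the effective batch size implied by the arguments above can be computed as follows (a minimal sketch; the single-GPU assumption is ours):

```python
# Effective batch size seen by the optimizer per update step
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # Assumption: adjust for your hardware

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # -> 8
```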
Step 4: Initialize the SFT Trainer

Now we instantiate the SFTTrainer from the trl library. It orchestrates the fine-tuning process, handling data collation, tokenization, padding, and the training loop itself.

```python
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",  # The column name containing the formatted prompt/response
    max_seq_length=512,         # Maximum sequence length for tokenization
    packing=False,              # Optional: pack multiple short examples into one sequence.
                                # Set packing=True for potential speedup on datasets with
                                # many short sequences. Requires careful dataset prep.
)
```

Step 5: Start Fine-Tuning

With everything configured, starting the training is straightforward. The trainer will handle the optimization loop, gradient updates, logging, and saving checkpoints according to the TrainingArguments.

```python
print("Starting SFT training...")
trainer.train()
print("Training finished.")
```

During training, monitor the logs (e.g., in the console or TensorBoard/WandB if configured) to observe the training loss. A decreasing loss generally indicates that the model is learning from the SFT dataset.

Step 6: Save the Final Model

After training completes, save the fine-tuned model weights and tokenizer configuration. This saved model is the output of the SFT phase.

```python
# Define path for the final saved model
final_model_path = os.path.join(output_dir, "final_sft_model")

print(f"Saving final SFT model to {final_model_path}")
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)  # Save tokenizer along with the model
print("Model saved successfully.")
```

Step 7: Quick Inference Check

Let's perform a simple test to see, qualitatively, how the fine-tuned model responds compared to the base model. We load our saved SFT model and generate text for a sample prompt.

```python
from transformers import pipeline

# Load the fine-tuned model
print("Loading fine-tuned model for inference...")
sft_pipe = pipeline(
    "text-generation",
    model=final_model_path,
    tokenizer=final_model_path,
    device_map="auto"
)

# Define a sample prompt using the format the model expects
# (match the format used in your SFT dataset)
prompt_text = "### Instruction:\nExplain the main benefit of using version control systems like Git.\n\n### Response:\n"

print(f"\nGenerating response for prompt:\n{prompt_text}")

# Generate a response; adjust generation parameters as needed
output = sft_pipe(prompt_text, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7)

print("\nGenerated Response:")
print(output[0]['generated_text'])

# Clean up GPU memory if needed
del sft_pipe
del model
# torch.cuda.empty_cache()  # Uncomment if running into OOM issues later
```

Examine the generated response. Does it follow the instruction? Is the style consistent with the examples in your SFT dataset? This quick check provides an initial assessment of the SFT process outcome. More rigorous evaluation would involve metrics and potentially human assessment, as discussed in the previous section ("Evaluating SFT Model Performance") and later in the course (Chapter 7).
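As one slightly more quantitative check, you could compute the average loss (and perplexity) of the fine-tuned model on a few held-out formatted examples. The snippet below is an illustrative sketch rather than part of the pipeline above; the held_out_texts list is a hypothetical placeholder you would replace with your own held-out data.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical held-out examples in the same "### Instruction / ### Response" format
held_out_texts = [
    "### Instruction:\nName one benefit of unit tests.\n\n### Response:\nThey catch regressions early.",
]

eval_model = AutoModelForCausalLM.from_pretrained(final_model_path, device_map="auto")
eval_tokenizer = AutoTokenizer.from_pretrained(final_model_path)

losses = []
eval_model.eval()
with torch.no_grad():
    for text in held_out_texts:
        enc = eval_tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(eval_model.device)
        # For causal LMs, passing labels=input_ids yields the shifted next-token loss
        out = eval_model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

avg_loss = sum(losses) / len(losses)
print(f"Average loss: {avg_loss:.3f}, perplexity: {math.exp(avg_loss):.2f}")
```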
Moving Forward

You have now successfully executed the Supervised Fine-Tuning phase. The resulting model (final_sft_model in our example) has been adapted to better understand the desired task structure and output style based on the provided demonstrations. This fine-tuned model is now ready to serve as the initial policy network for the next stages of the RLHF pipeline: training a reward model and performing reinforcement learning optimization.
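As a preview of that hand-off, here is a minimal sketch of how the saved SFT checkpoint might be loaded as the starting policy for PPO-style training using trl's AutoModelForCausalLMWithValueHead. Exact class and trainer APIs depend on your trl version, so treat this as an assumption to verify rather than part of the SFT recipe above; the RL stage itself is covered in later chapters.

```python
from trl import AutoModelForCausalLMWithValueHead

# Load the SFT checkpoint as the initial policy, with a value head
# added on top for PPO-style optimization.
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(final_model_path)

# A frozen copy of the same checkpoint is typically kept as a reference model
# to constrain the policy (e.g., via a KL penalty) during RL optimization.
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(final_model_path)
```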