Let's put the theory of Low-Rank Adaptation (LoRA) into practice. In this hands-on section, we will fine-tune a pre-trained transformer model on a downstream task using the peft library from Hugging Face. This approach significantly reduces the number of trainable parameters compared to full fine-tuning, making the process faster and less memory-intensive, without substantial compromises in performance for many tasks.
We will walk through the essential steps: setting up the environment, preparing the data, configuring LoRA, training the adapter, and performing inference using the fine-tuned model.
First, ensure you have the necessary libraries installed. We'll primarily use transformers for the base model and training utilities, peft for implementing LoRA, datasets for data handling, and accelerate to simplify running PyTorch code on any infrastructure. Because we load the base model in 8-bit later in this section, bitsandbytes is also required.
pip install transformers datasets peft accelerate torch bitsandbytes
Now, let's import the required modules and define our base model checkpoint. For this example, we'll use a relatively small sequence-to-sequence model, google/flan-t5-small, and fine-tune it on a summarization task. Using a smaller model makes the process quicker and accessible even without high-end GPUs.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
# Define the base model checkpoint
model_checkpoint = "google/flan-t5-small"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load the base model. We use device_map="auto" to leverage accelerate for placing layers across devices.
# We also load in 8-bit for further memory saving, compatible with LoRA.
# Note: 8-bit loading is optional but useful for larger models.
# If not using 8-bit, remove load_in_8bit and prepare_model_for_kbit_training
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, load_in_8bit=True, device_map="auto")
# Prepare the model for k-bit training (if using quantization)
# This step is needed when loading models in 8-bit or 4-bit
model = prepare_model_for_kbit_training(model)
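If your transformers version warns about passing load_in_8bit directly to from_pretrained, the equivalent is to pass an explicit BitsAndBytesConfig. This is a sketch of that alternative loading path, assuming a recent transformers release with bitsandbytes installed:
from transformers import BitsAndBytesConfig
# Equivalent 8-bit loading via an explicit quantization config
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)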
We need a dataset suitable for our chosen task: summarization. The samsum dataset, containing dialogues and their summaries, is a good choice. We'll load it using the datasets library and preprocess it. For efficiency, we'll only use a small fraction of the dataset for this demonstration.
# Load the dataset
dataset_name = "samsum"
dataset = load_dataset(dataset_name, split="train[:1%]") # Using only 1% for demo
dataset = dataset.train_test_split(test_size=0.1) # Create train/test splits
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
# Example: Train dataset size: 132
# Example: Test dataset size: 15
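Before preprocessing, it can help to look at one raw record. The fields used below (dialogue and summary) are the ones we tokenize in the next step:
# Inspect one raw example from the training split
example = dataset["train"][0]
print("Dialogue:", example["dialogue"][:200], "...")
print("Summary:", example["summary"])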
# Preprocessing function
max_input_length = 512
max_target_length = 128
def preprocess_function(examples):
    # Add prefix for T5 models
    inputs = ["summarize: " + doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    # Replace tokenizer.pad_token_id in the labels by -100 to ignore padding in the loss calculation
    model_inputs["labels"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in model_inputs["labels"]
    ]
    return model_inputs
# Apply preprocessing
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Remove columns not needed for training
tokenized_datasets = tokenized_datasets.remove_columns(["id", "dialogue", "summary"])
print(f"Columns in tokenized dataset: {tokenized_datasets['train'].column_names}")
# Example: Columns in tokenized dataset: ['input_ids', 'attention_mask', 'labels']
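As an optional sanity check, you can decode one preprocessed example to confirm the "summarize: " prefix is present and that padded label positions were replaced with -100:
# Decode a single preprocessed example to verify inputs and labels
sample = tokenized_datasets["train"][0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True)[:200])
label_ids = [t for t in sample["labels"] if t != -100]  # Drop ignored positions before decoding
print(tokenizer.decode(label_ids, skip_special_tokens=True))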
We also need a data collator to assemble batches and handle padding consistently, making sure label padding uses -100 so it is ignored by the loss. DataCollatorForSeq2Seq is designed for sequence-to-sequence tasks.
# Create data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,  # Important: ensure labels are padded correctly
    pad_to_multiple_of=8      # Optional: optimizes hardware usage
)
This is where we define how LoRA will modify the base model. We use the LoraConfig class from the peft library. Its main parameters are:
- r: The rank of the low-rank matrices (A and B). A smaller r means fewer trainable parameters but might capture less task-specific information. Common values range from 4 to 32.
- lora_alpha: The scaling factor for the LoRA updates. It's often set equal to r or 2*r. The update is scaled by lora_alpha / r.
- target_modules: A list of module names within the base model where the LoRA matrices will be injected. For T5 models, targeting the query (q) and value (v) projections in the self-attention mechanism is standard practice. You can find these names by inspecting model.named_modules(), as shown in the snippet after the configuration block below.
- lora_dropout: Dropout applied to the LoRA layers.
- bias: Specifies which biases to train. "none" is common, freezing all original biases and not adding new ones.
- task_type: Defines the model type and task. For flan-t5, it's TaskType.SEQ_2_SEQ_LM.
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                             # Rank of the update matrices
    lora_alpha=32,                    # Scaling factor
    target_modules=["q", "v"],        # Apply LoRA to query and value projections
    lora_dropout=0.05,                # Dropout probability
    bias="none",                      # Do not train biases
    task_type=TaskType.SEQ_2_SEQ_LM   # Task type for sequence-to-sequence models
)
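If you are adapting a different base model and are unsure which names to pass to target_modules, a quick inspection of module names helps. The following sketch simply lists the distinct submodule name suffixes of the loaded model; for flan-t5 this includes q, k, v, and o among others:
# List distinct module-name suffixes to identify candidate LoRA target names
module_suffixes = sorted({
    name.split(".")[-1]
    for name, _ in model.named_modules()
    if name and not name.split(".")[-1].isdigit()  # Skip the root module and numeric layer indices
})
print(module_suffixes)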
Now, we apply the LoRA configuration to our base model using get_peft_model.
# Get the PEFT model
peft_model = get_peft_model(model, lora_config)
# Print the number of trainable parameters
peft_model.print_trainable_parameters()
# Example output: trainable params: 884,736 || all params: 77,822,464 || trainable%: 1.13685...
Notice the significant reduction! We are only training around 1% of the total parameters. This drastically reduces memory requirements and speeds up training compared to updating all 77 million parameters of flan-t5-small.
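You can reproduce those numbers directly from the model's parameters if you want to verify them yourself:
# Recompute the trainable-parameter ratio by hand
trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"trainable: {trainable:,} || all: {total:,} || trainable%: {100 * trainable / total:.4f}")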
Diagram: LoRA injects trainable low-rank matrices (A and B) alongside the frozen pre-trained weight matrix (W); only A and B are updated during training.
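The mechanism in the diagram can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not how peft implements it internally: the frozen weight W stays untouched while a trainable low-rank update B·A, scaled by lora_alpha / r, is added to the layer's output.
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Minimal illustration: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base_linear: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # Freeze W (and its bias)
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))  # Zero init: no update at start
        self.scaling = lora_alpha / r

    def forward(self, x):
        # W x + (lora_alpha / r) * B A x -- only lora_A and lora_B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
Wrapping an existing nn.Linear this way leaves the original weights frozen, so only lora_A and lora_B would ever appear in an optimizer; peft performs this injection for every module listed in target_modules.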
We use the standard Trainer from the transformers library. The setup is almost identical to full fine-tuning, but the Trainer will automatically handle the PEFT model, only updating the LoRA parameters.
# Define Training Arguments
output_dir = "flan-t5-small-samsum-lora"
training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,   # Automatically find a suitable batch size
    learning_rate=1e-3,          # Higher learning rate typical for LoRA
    num_train_epochs=3,          # Number of training epochs
    logging_strategy="epoch",    # Log metrics every epoch
    save_strategy="epoch",       # Save checkpoint every epoch
    # evaluation_strategy="epoch",  # Evaluate every epoch if eval data is available
    report_to="none",            # Disable reporting to wandb/tensorboard for this example
    # fp16=torch.cuda.is_available(),  # Use fp16 for faster training if supported
)
# Create Trainer instance
trainer = Trainer(
    model=peft_model,                          # Pass the PEFT model
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],   # Optional: Pass eval dataset
    data_collator=data_collator,
    tokenizer=tokenizer,
)
# Disable the KV cache during training; it is only useful for generation
peft_model.config.use_cache = False
# Start training
print("Starting LoRA training...")
trainer.train()
print("Training finished.")
Training should be significantly faster and require less GPU memory than fully fine-tuning the flan-t5-small model.
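Since we passed an eval_dataset to the Trainer, you can also compute metrics on the held-out split. With the plain Trainer this reports the evaluation loss rather than summarization metrics such as ROUGE:
# Evaluate the adapter on the held-out split (reports eval loss)
eval_metrics = trainer.evaluate()
print(eval_metrics)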
After training, we save the trained LoRA adapter weights. Importantly, this saves only the adapter parameters (matrices A and B for each targeted module), not the entire base model. This makes the saved artifact very small.
# Define path to save the adapter
adapter_path = f"{output_dir}/final_adapter"
# Save the adapter weights
peft_model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path) # Save tokenizer alongside adapter
print(f"LoRA adapter saved to: {adapter_path}")
# You can check the size of the saved adapter - it should be relatively small (MBs).
# For example, using: !ls -lh {adapter_path}
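If you prefer to check the size from Python rather than with a shell command, a small sketch:
import os

# Sum the file sizes in the adapter directory to confirm it is only a few MB
adapter_bytes = sum(
    os.path.getsize(os.path.join(adapter_path, f))
    for f in os.listdir(adapter_path)
    if os.path.isfile(os.path.join(adapter_path, f))
)
print(f"Adapter size on disk: {adapter_bytes / 1e6:.2f} MB")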
To use the fine-tuned model for inference, we first load the original base model and then load the LoRA adapter weights on top of it.
from peft import PeftModel, PeftConfig
# Load the base model again (if not already in memory)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load the PEFT model with the saved adapter
lora_model = PeftModel.from_pretrained(base_model, adapter_path)
lora_model = lora_model.to("cuda" if torch.cuda.is_available() else "cpu") # Ensure model is on correct device
lora_model.eval() # Set model to evaluation mode
# Prepare a sample input from the test set (or any new dialogue)
sample_idx = 5
dialogue = dataset['test'][sample_idx]['dialogue']
reference_summary = dataset['test'][sample_idx]['summary']
input_text = "summarize: " + dialogue
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(lora_model.device)
print("Dialogue:")
print(dialogue)
print("\nReference Summary:")
print(reference_summary)
# Generate summary using the LoRA model
with torch.no_grad():
    outputs = lora_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9)
generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated Summary (LoRA):")
print(generated_summary)
# Optional: Compare with base model generation. Note that loading the adapter above
# injected LoRA layers into base_model in place, so use disable_adapter() to get
# the original (unadapted) behavior.
# with torch.no_grad(), lora_model.disable_adapter():
#     base_outputs = lora_model.generate(input_ids=input_ids, max_new_tokens=100)
# base_summary = tokenizer.decode(base_outputs[0], skip_special_tokens=True)
# print("\nGenerated Summary (Base Model):")
# print(base_summary)
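To reuse these inference steps on new dialogues, you could wrap them in a small helper. summarize_dialogue below is a hypothetical convenience function built on the objects defined above, not part of the peft API:
def summarize_dialogue(dialogue: str, max_new_tokens: int = 100) -> str:
    """Hypothetical helper: summarize one dialogue with the LoRA-adapted model."""
    inputs = tokenizer("summarize: " + dialogue, return_tensors="pt").to(lora_model.device)
    with torch.no_grad():
        out = lora_model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(summarize_dialogue(dataset["test"][0]["dialogue"]))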
You should observe that the summary generated by the LoRA-adapted model is more aligned with the summarization task than the output from the original base model (which might just repeat parts of the input or give irrelevant responses without fine-tuning).
For deployment scenarios where you don't need to switch between different adapters frequently, you can merge the LoRA weights directly into the base model's weights. This creates a standard transformers model that incorporates the fine-tuning adjustments. After merging, the peft library is no longer needed for inference.
# Merge the adapter weights into the base model
# merged_model = lora_model.merge_and_unload()
# Now 'merged_model' is a standard transformers model with the LoRA updates applied.
# It can be saved and loaded like any regular Hugging Face model.
# merged_model.save_pretrained(f"{output_dir}/final_merged_model")
# tokenizer.save_pretrained(f"{output_dir}/final_merged_model")
# Note: After merging, the model size increases back to the original base model size,
# as the low-rank updates are now part of the main weight matrices.
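Once saved, the merged model can be reloaded without peft. A short sketch, assuming you ran the (commented) merge and save steps above:
# Load the merged model back as a plain transformers model (no peft required)
# merged_path = f"{output_dir}/final_merged_model"
# reloaded = AutoModelForSeq2SeqLM.from_pretrained(merged_path)
# reloaded_tokenizer = AutoTokenizer.from_pretrained(merged_path)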
This practical exercise demonstrated how to apply LoRA for efficient fine-tuning. You successfully adapted a pre-trained model using significantly fewer trainable parameters, configured the LoRA parameters, trained the adapter, and performed inference. Experimenting with different ranks (r), lora_alpha values, and target_modules can help optimize performance for specific tasks and datasets.