Let's put the theory of Low-Rank Adaptation (LoRA) into practice. In this hands-on section, we will fine-tune a pre-trained transformer model on a downstream task using the peft library from Hugging Face. This approach significantly reduces the number of trainable parameters compared to full fine-tuning, making the process faster and less memory-intensive, without substantial compromises in performance for many tasks.
We will walk through the essential steps: setting up the environment, preparing the data, configuring LoRA, training the adapter, and performing inference using the fine-tuned model.
First, ensure you have the necessary libraries installed. We'll primarily use transformers for the base model and training utilities, peft for implementing LoRA, datasets for data handling, and accelerate to simplify running PyTorch code on any infrastructure. Because we load the base model in 8-bit later in this section, bitsandbytes is also required.
pip install transformers datasets peft accelerate torch bitsandbytes
Now, let's import the required modules and define our base model checkpoint. For this example, we'll use a relatively small sequence-to-sequence model, google/flan-t5-small, and fine-tune it on a summarization task. Using a smaller model makes the process quicker and accessible even without high-end GPUs.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
# Define the base model checkpoint
model_checkpoint = "google/flan-t5-small"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load the base model. We use device_map="auto" to leverage accelerate for placing layers across devices.
# We also load in 8-bit for further memory saving, compatible with LoRA.
# Note: 8-bit loading is optional but useful for larger models.
# If not using 8-bit, remove load_in_8bit and prepare_model_for_kbit_training
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, load_in_8bit=True, device_map="auto")
# Prepare the model for k-bit training (if using quantization)
# This step is needed when loading models in 8-bit or 4-bit
model = prepare_model_for_kbit_training(model)
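If your transformers version warns about passing load_in_8bit directly to from_pretrained, the equivalent is to pass an explicit BitsAndBytesConfig. This is a sketch of that alternative loading path, assuming a recent transformers release with bitsandbytes installed:
from transformers import BitsAndBytesConfig
# Equivalent 8-bit loading via an explicit quantization config
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)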
We need a dataset suitable for our chosen task: summarization. The samsum dataset, containing dialogues and their summaries, is a good choice. We'll load it using the datasets library and preprocess it. For efficiency, we'll only use a small fraction of the dataset for this demonstration.
# Load the dataset
dataset_name = "samsum"
dataset = load_dataset(dataset_name, split="train[:1%]") # Using only 1% for demo
dataset = dataset.train_test_split(test_size=0.1) # Create train/test splits
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
# Example: Train dataset size: 132
# Example: Test dataset size: 15
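Before preprocessing, it can help to look at one raw record. The fields used below (dialogue and summary) are the ones we tokenize in the next step:
# Inspect one raw example from the training split
example = dataset["train"][0]
print("Dialogue:", example["dialogue"][:200], "...")
print("Summary:", example["summary"])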
# Preprocessing function
max_input_length = 512
max_target_length = 128
def preprocess_function(examples):
    # Add prefix for T5 models
    inputs = ["summarize: " + doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    # Replace tokenizer.pad_token_id in the labels by -100 to ignore padding in the loss calculation
    model_inputs["labels"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in model_inputs["labels"]
    ]
    return model_inputs
# Apply preprocessing
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Remove columns not needed for training
tokenized_datasets = tokenized_datasets.remove_columns(["id", "dialogue", "summary"])
print(f"Columns in tokenized dataset: {tokenized_datasets['train'].column_names}")
# Example: Columns in tokenized dataset: ['input_ids', 'attention_mask', 'labels']
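As an optional sanity check, you can decode one preprocessed example to confirm the "summarize: " prefix is present and that padded label positions were replaced with -100:
# Decode a single preprocessed example to verify inputs and labels
sample = tokenized_datasets["train"][0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True)[:200])
label_ids = [t for t in sample["labels"] if t != -100]  # Drop ignored positions before decoding
print(tokenizer.decode(label_ids, skip_special_tokens=True))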
We also need a data collator to assemble batches and handle padding consistently, making sure label padding uses -100 so it is ignored by the loss. DataCollatorForSeq2Seq is designed for sequence-to-sequence tasks.
# Create data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,  # Important: ensure labels are padded correctly
    pad_to_multiple_of=8      # Optional: optimizes hardware usage
)
This is where we define how LoRA will modify the base model. We use the LoraConfig class from the peft library. Its main parameters are:
- r: The rank of the low-rank matrices (A and B). A smaller r means fewer trainable parameters but might capture less task-specific information. Common values range from 4 to 32.
- lora_alpha: The scaling factor for the LoRA updates. It's often set equal to r or 2*r. The update is scaled by lora_alpha / r.
- target_modules: A list of module names within the base model where the LoRA matrices will be injected. For T5 models, targeting the query (q) and value (v) projections in the self-attention mechanism is standard practice. You can find these names by inspecting model.named_modules(), as shown in the snippet after the configuration block below.
- lora_dropout: Dropout applied to the LoRA layers.
- bias: Specifies which biases to train. "none" is common, freezing all original biases and not adding new ones.
- task_type: Defines the model type and task. For flan-t5, it's TaskType.SEQ_2_SEQ_LM.
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                             # Rank of the update matrices
    lora_alpha=32,                    # Scaling factor
    target_modules=["q", "v"],        # Apply LoRA to query and value projections
    lora_dropout=0.05,                # Dropout probability
    bias="none",                      # Do not train biases
    task_type=TaskType.SEQ_2_SEQ_LM   # Task type for sequence-to-sequence models
)
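If you are adapting a different base model and are unsure which names to pass to target_modules, a quick inspection of module names helps. The following sketch simply lists the distinct submodule name suffixes of the loaded model; for flan-t5 this includes q, k, v, and o among others:
# List distinct module-name suffixes to identify candidate LoRA target names
module_suffixes = sorted({
    name.split(".")[-1]
    for name, _ in model.named_modules()
    if name and not name.split(".")[-1].isdigit()  # Skip the root module and numeric layer indices
})
print(module_suffixes)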
Now, we apply the LoRA configuration to our base model using get_peft_model.
# Get the PEFT model
peft_model = get_peft_model(model, lora_config)
# Print the number of trainable parameters
peft_model.print_trainable_parameters()
# Example output: trainable params: 884,736 || all params: 77,822,464 || trainable%: 1.13685...
Notice the significant reduction! We are only training around 1% of the total parameters. This drastically reduces memory requirements and speeds up training compared to updating all 77 million parameters of flan-t5-small.
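You can reproduce those numbers directly from the model's parameters if you want to verify them yourself:
# Recompute the trainable-parameter ratio by hand
trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"trainable: {trainable:,} || all: {total:,} || trainable%: {100 * trainable / total:.4f}")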
Diagram: LoRA injects trainable low-rank matrices (A and B) alongside the frozen pre-trained weight matrix (W); only A and B are updated during training.
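The mechanism in the diagram can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not how peft implements it internally: the frozen weight W stays untouched while a trainable low-rank update B·A, scaled by lora_alpha / r, is added to the layer's output.
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Minimal illustration: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base_linear: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # Freeze W (and its bias)
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))  # Zero init: no update at start
        self.scaling = lora_alpha / r

    def forward(self, x):
        # W x + (lora_alpha / r) * B A x -- only lora_A and lora_B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
Wrapping an existing nn.Linear this way leaves the original weights frozen, so only lora_A and lora_B would ever appear in an optimizer; peft performs this injection for every module listed in target_modules.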
We use the standard Trainer from the transformers library. The setup is almost identical to full fine-tuning, but the Trainer will automatically handle the PEFT model, only updating the LoRA parameters.
# Define Training Arguments
output_dir = "flan-t5-small-samsum-lora"
training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,   # Automatically find a suitable batch size
    learning_rate=1e-3,          # Higher learning rate typical for LoRA
    num_train_epochs=3,          # Number of training epochs
    logging_strategy="epoch",    # Log metrics every epoch
    save_strategy="epoch",       # Save checkpoint every epoch
    # evaluation_strategy="epoch",  # Evaluate every epoch if eval data is available
    report_to="none",            # Disable reporting to wandb/tensorboard for this example
    # fp16=torch.cuda.is_available(),  # Use fp16 for faster training if supported
)
# Create Trainer instance
trainer = Trainer(
    model=peft_model,                          # Pass the PEFT model
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],   # Optional: Pass eval dataset
    data_collator=data_collator,
    tokenizer=tokenizer,
)
# Disable the KV cache during training; it is only useful for generation
peft_model.config.use_cache = False
# Start training
print("Starting LoRA training...")
trainer.train()
print("Training finished.")
Training should be significantly faster and require less GPU memory than fully fine-tuning the flan-t5-small model.
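Since we passed an eval_dataset to the Trainer, you can also compute metrics on the held-out split. With the plain Trainer this reports the evaluation loss rather than summarization metrics such as ROUGE:
# Evaluate the adapter on the held-out split (reports eval loss)
eval_metrics = trainer.evaluate()
print(eval_metrics)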
After training, we save the trained LoRA adapter weights. Importantly, this saves only the adapter parameters (matrices A and B for each targeted module), not the entire base model. This makes the saved artifact very small.
# Define path to save the adapter
adapter_path = f"{output_dir}/final_adapter"
# Save the adapter weights
peft_model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path) # Save tokenizer alongside adapter
print(f"LoRA adapter saved to: {adapter_path}")
# You can check the size of the saved adapter - it should be relatively small (MBs).
# For example, using: !ls -lh {adapter_path}
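If you prefer to check the size from Python rather than with a shell command, a small sketch:
import os

# Sum the file sizes in the adapter directory to confirm it is only a few MB
adapter_bytes = sum(
    os.path.getsize(os.path.join(adapter_path, f))
    for f in os.listdir(adapter_path)
    if os.path.isfile(os.path.join(adapter_path, f))
)
print(f"Adapter size on disk: {adapter_bytes / 1e6:.2f} MB")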
To use the fine-tuned model for inference, we first load the original base model and then load the LoRA adapter weights on top of it.
from peft import PeftModel, PeftConfig
# Load the base model again (if not already in memory)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load the PEFT model with the saved adapter
lora_model = PeftModel.from_pretrained(base_model, adapter_path)
lora_model = lora_model.to("cuda" if torch.cuda.is_available() else "cpu") # Ensure model is on correct device
lora_model.eval() # Set model to evaluation mode
# Prepare a sample input from the test set (or any new dialogue)
sample_idx = 5
dialogue = dataset['test'][sample_idx]['dialogue']
reference_summary = dataset['test'][sample_idx]['summary']
input_text = "summarize: " + dialogue
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(lora_model.device)
print("Dialogue:")
print(dialogue)
print("\nReference Summary:")
print(reference_summary)
# Generate summary using the LoRA model
with torch.no_grad():
    outputs = lora_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9)
generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated Summary (LoRA):")
print(generated_summary)
# Optional: Compare with base model generation. Note that loading the adapter above
# injected LoRA layers into base_model in place, so use disable_adapter() to get
# the original (unadapted) behavior.
# with torch.no_grad(), lora_model.disable_adapter():
#     base_outputs = lora_model.generate(input_ids=input_ids, max_new_tokens=100)
# base_summary = tokenizer.decode(base_outputs[0], skip_special_tokens=True)
# print("\nGenerated Summary (Base Model):")
# print(base_summary)
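To reuse these inference steps on new dialogues, you could wrap them in a small helper. summarize_dialogue below is a hypothetical convenience function built on the objects defined above, not part of the peft API:
def summarize_dialogue(dialogue: str, max_new_tokens: int = 100) -> str:
    """Hypothetical helper: summarize one dialogue with the LoRA-adapted model."""
    inputs = tokenizer("summarize: " + dialogue, return_tensors="pt").to(lora_model.device)
    with torch.no_grad():
        out = lora_model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(summarize_dialogue(dataset["test"][0]["dialogue"]))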
You should observe that the summary generated by the LoRA-adapted model is more aligned with the summarization task than the output from the original base model (which might just repeat parts of the input or give irrelevant responses without fine-tuning).
For deployment scenarios where you don't need to switch between different adapters frequently, you can merge the LoRA weights directly into the base model's weights. This creates a standard transformers model that incorporates the fine-tuning adjustments. After merging, the peft library is no longer needed for inference.
# Merge the adapter weights into the base model
# merged_model = lora_model.merge_and_unload()
# Now 'merged_model' is a standard transformers model with the LoRA updates applied.
# It can be saved and loaded like any regular Hugging Face model.
# merged_model.save_pretrained(f"{output_dir}/final_merged_model")
# tokenizer.save_pretrained(f"{output_dir}/final_merged_model")
# Note: After merging, the model size increases back to the original base model size,
# as the low-rank updates are now part of the main weight matrices.
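Once saved, the merged model can be reloaded without peft. A short sketch, assuming you ran the (commented) merge and save steps above:
# Load the merged model back as a plain transformers model (no peft required)
# merged_path = f"{output_dir}/final_merged_model"
# reloaded = AutoModelForSeq2SeqLM.from_pretrained(merged_path)
# reloaded_tokenizer = AutoTokenizer.from_pretrained(merged_path)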
This practical exercise demonstrated how to apply LoRA for efficient fine-tuning. You successfully adapted a pre-trained model using significantly fewer trainable parameters, configured the LoRA parameters, trained the adapter, and performed inference. Experimenting with different ranks (r), lora_alpha values, and target_modules can help optimize performance for specific tasks and datasets.