Let's translate the theoretical understanding of LoRA into a practical fine-tuning exercise. This section provides a hands-on walkthrough using the Hugging Face transformers and peft libraries, which offer convenient abstractions for applying LoRA to pre-trained models. We will load a model, configure LoRA, apply it, and prepare for training.
This practical assumes you have a working Python environment with torch, transformers, peft, and datasets installed.
pip install -q torch transformers datasets peft accelerate bitsandbytes
First, we import necessary libraries and load a pre-trained model and a dataset. For demonstration purposes, we'll use GPT-2, a common causal language model, and a small subset of the ELI5 dataset, which contains question-answer pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from datasets import load_dataset
# Load Model and Tokenizer
model_name = "gpt2" # Example: Use 'gpt2' or another suitable causal LM
base_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Set pad token for batching
# Load and Prepare Dataset (using a small subset for demonstration)
dataset = load_dataset("eli5", split="train_asks[:1000]")  # Use a small slice
dataset = dataset.flatten()  # Flatten nested fields so 'answers.text' becomes a top-level column

# Basic preprocessing: Tokenize the text
def preprocess_function(examples):
    # Concatenate question and first answer for causal LM training
    texts = [q + " " + a[0] for q, a in zip(examples["title"], examples["answers.text"])]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)
print("Base model loaded:", model_name)
print("Dataset loaded and tokenized.")
The core of applying LoRA via the peft library lies in the LoraConfig. This object specifies how LoRA should be integrated into the base model. We need to define the rank r, the scaling factor α, and which layers LoRA should target.
As discussed previously:

- r: The rank of the decomposition BA. A smaller r means fewer trainable parameters but might limit adaptation capacity. Typical values range from 4 to 64.
- lora_alpha: The scaling factor applied to the LoRA output, which enters the forward pass as (α/r)·BAx. It controls the magnitude of the adaptation relative to the original weights. A common practice is setting lora_alpha equal to or double the value of r.
- target_modules: A list of module names (or regex patterns) within the base model where LoRA matrices should replace or augment linear layers. For GPT-2, common targets are the attention projection layers (e.g., c_attn) or feed-forward network layers. You can inspect base_model.named_modules() to find suitable names; see the inspection sketch after the configuration block below.
- task_type: Specifies the task objective, influencing how PEFT structures might interact with model heads. For GPT-2 fine-tuning, TaskType.CAUSAL_LM is appropriate.

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank r
    lora_alpha=32,                 # Scaling factor alpha
    target_modules=["c_attn"],     # Apply LoRA to GPT-2's fused query/key/value projection
    lora_dropout=0.05,             # Dropout probability for LoRA layers
    bias="none",                   # Do not train bias parameters
    task_type=TaskType.CAUSAL_LM   # Task type for causal language modeling
)
print("LoRA Configuration:")
print(lora_config)
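If you are unsure which names to pass to target_modules, a short inspection loop like the one below can help. This is a sketch that assumes the GPT-2 base_model loaded above; GPT-2 implements its projections as transformers' Conv1D layers, so we look for both Linear and Conv1D submodules.

# Inspect candidate target modules for LoRA
from transformers.pytorch_utils import Conv1D

candidate_names = set()
for name, module in base_model.named_modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        # Keep only the final attribute name, e.g. 'c_attn', 'c_proj', 'c_fc'
        candidate_names.add(name.split(".")[-1])
print("Candidate target module names:", sorted(candidate_names))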
With the LoraConfig defined, applying it to the base model is straightforward using get_peft_model. This function modifies the model architecture according to the configuration, freezing the original weights and inserting the trainable LoRA adapters.
Let's compare the number of trainable parameters before and after applying LoRA.
# Calculate original trainable parameters
original_params = sum(p.numel() for p in base_model.parameters() if p.requires_grad)
print(f"Original trainable parameters: {original_params:,}")
# Apply LoRA configuration to the base model
lora_model = get_peft_model(base_model, lora_config)
# Calculate LoRA trainable parameters
lora_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(f"LoRA trainable parameters: {lora_params:,}")
# Calculate the reduction
reduction = (original_params - lora_params) / original_params * 100
print(f"Parameter reduction: {reduction:.2f}%")
# Print trainable modules for verification
lora_model.print_trainable_parameters()
You should observe a significant reduction (often >99%) in the number of trainable parameters. The output of print_trainable_parameters confirms that only the newly added LoRA components (lora_A and lora_B) require gradients.
Comparison of trainable parameters in full fine-tuning versus LoRA fine-tuning. LoRA significantly reduces the parameter count by only training small adapter matrices while keeping the base model frozen.
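To see exactly which tensors will receive gradient updates, you can also list the trainable parameter names directly. The short sketch below assumes the lora_model created above.

# List the names of all trainable parameters (should only be LoRA matrices)
trainable_names = [name for name, param in lora_model.named_parameters() if param.requires_grad]
print(f"Number of trainable tensors: {len(trainable_names)}")
for name in trainable_names[:6]:
    print(name)  # e.g. ...attn.c_attn.lora_A.default.weight and the matching lora_B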
Now, we can set up a standard training process using the transformers.Trainer. The key difference is that we pass the lora_model (the PEFT-modified model) instead of the base_model. The Trainer will automatically handle the optimization, focusing only on the trainable LoRA parameters.
# Define Training Arguments
output_dir = "./lora_gpt2_eli5_results"
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,                # Keep it short for demonstration
    per_device_train_batch_size=4,     # Adjust based on your GPU memory
    logging_steps=50,
    save_steps=200,
    learning_rate=2e-4,                # Typical learning rate for LoRA
    fp16=torch.cuda.is_available(),    # Use mixed precision if available
    # Add other arguments as needed: weight_decay, warmup_steps, etc.
)
# Data Collator for Causal LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# Initialize Trainer
trainer = Trainer(
    model=lora_model,                  # Use the PEFT model
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
print("Trainer initialized. Starting training...")
Initiate the training process. The Trainer will handle the forward and backward passes, updating only the LoRA weights (A and B). The base model weights remain unchanged.
# Start training
trainer.train()
print("Training finished.")
During training, observe the logs. The loss should decrease, indicating that the LoRA adapters are learning to adapt the model's behavior for the target task (in this case, mimicking the question-answer style of ELI5).
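As a quick post-training check, you can confirm that no base-model weights were ever marked trainable. This sketch simply audits the requires_grad flags on the trained lora_model.

# Confirm that only LoRA parameters were trainable during the run
frozen, trainable = 0, 0
for name, param in lora_model.named_parameters():
    if param.requires_grad:
        trainable += 1
        assert "lora_" in name, f"Unexpected trainable parameter: {name}"
    else:
        frozen += 1
print(f"Frozen tensors: {frozen}, trainable LoRA tensors: {trainable}")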
After training, you don't save the entire model. Instead, you save only the trained LoRA adapter weights. This is one of the primary advantages of PEFT methods like LoRA – the resulting artifact is very small.
# Define path to save the adapter
adapter_path = f"{output_dir}/final_adapter"
# Save the LoRA adapter weights
lora_model.save_pretrained(adapter_path)
print(f"LoRA adapter saved to: {adapter_path}")
# You can verify the small size of the saved adapter directory
# !ls -lh {adapter_path}
The saved directory (adapter_path) will contain files like adapter_model.safetensors (or adapter_model.bin in older peft versions), holding the LoRA weights, and adapter_config.json, recording the LoRA configuration used. Its size will be in the megabytes range, dramatically smaller than the full base model checkpoint (hundreds of megabytes for GPT-2, gigabytes for larger models).
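To put a number on the size difference, a small sketch like the one below sums the file sizes in the adapter directory (assuming the save above completed).

import os

# Sum the sizes of all files saved in the adapter directory
adapter_bytes = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk(adapter_path)
    for f in files
)
print(f"Adapter size on disk: {adapter_bytes / 1024**2:.2f} MB")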
To use the fine-tuned model for inference, you load the original base model first and then apply the saved LoRA adapter weights.
# Load the base model again (or use the one already in memory)
base_model_reloaded = AutoModelForCausalLM.from_pretrained(model_name)
# Load the PEFT model by applying the saved adapter weights on top of the frozen base
inference_model = PeftModel.from_pretrained(base_model_reloaded, adapter_path)
# Ensure the model is in evaluation mode and on the correct device
inference_model.eval()
if torch.cuda.is_available():
    inference_model.to("cuda")
print("Base model loaded and LoRA adapter applied for inference.")
# Example Inference (Optional)
prompt = "What is the main cause of climate change?"
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
# Generate text
with torch.no_grad():
    outputs = inference_model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n--- Example Generation ---")
print("Prompt:", prompt)
print("Generated:", generated_text)
print("------------------------")
This completes a basic cycle of applying LoRA: configuring, wrapping the model, training, saving the adapter, and loading it for use. You've effectively fine-tuned a large language model by only training a tiny fraction of its parameters, demonstrating the efficiency of LoRA. The following sections and chapters will build upon this foundation, exploring more advanced configurations, variants like QLoRA, and evaluation techniques.