Now that we have explored the theoretical underpinnings of LoRA and QLoRA in the preceding sections, let's transition to practical implementation. This hands-on exercise will guide you through fine-tuning a large language model using both LoRA and QLoRA techniques. You will gain direct experience with configuring these methods, executing the training process, and observing the efficiency gains compared to full fine-tuning. We assume you are operating in an environment with appropriate GPU resources and have installed the necessary libraries: `transformers`, `peft`, `accelerate`, `datasets`, and `bitsandbytes`.
First, ensure your environment is correctly configured. The `peft`, `accelerate`, and `bitsandbytes` libraries are central to implementing LoRA and, particularly, QLoRA; if they are not already available, `pip install transformers peft accelerate datasets bitsandbytes` is the usual starting point.
We will start by loading a pre-trained base model. For this exercise, let's consider a model like `meta-llama/Llama-2-7b-hf` or a similarly sized transformer. Accessing certain models may require authentication with platforms like the Hugging Face Hub.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import transformers

# Define the base model ID
model_id = "meta-llama/Llama-2-7b-hf"  # Or another suitable model

# Use authentication token if required for the model
# from huggingface_hub import login
# login()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set pad token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load a sample dataset (e.g., instruction tuning)
# Replace 'databricks/databricks-dolly-15k' with your target dataset
data = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")  # Using a subset for speed

# Preprocess the data
def format_instruction(sample):
    # Adjust formatting based on the chosen dataset structure
    return f"""### Instruction:
{sample['instruction']}
### Context:
{sample['context']}
### Response:
{sample['response']}
"""

data = data.map(lambda sample: tokenizer(format_instruction(sample), truncation=True, max_length=512, padding="max_length"))

print("Environment setup and data preparation complete.")
```
This initial setup loads the tokenizer and a sample dataset, preparing it for the fine-tuning process. The specific dataset and preprocessing function (`format_instruction`) should be adapted to your target task (e.g., summarization, question answering, instruction following).
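As one illustration of such an adaptation (a sketch, not part of the required setup): many samples in `databricks/databricks-dolly-15k` have an empty `context` field, so you might omit that section when it is blank. The helper name below is hypothetical.

```python
# Illustrative variant: skip the "### Context:" section for samples with no context,
# which is common in databricks-dolly-15k. Reuses format_instruction from above.
def format_instruction_optional_context(sample):
    if sample.get("context"):
        return format_instruction(sample)
    return f"""### Instruction:
{sample['instruction']}
### Response:
{sample['response']}
"""
```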
LoRA adapts the model by introducing low-rank matrices into specified layers, typically the attention mechanism's linear projections. We configure this using `LoraConfig`.
The `LoraConfig` object defines how LoRA is applied:

- `r`: The rank of the update matrices. A smaller `r` means fewer trainable parameters. Common values range from 8 to 64.
- `lora_alpha`: A scaling factor for the LoRA updates, often set to `2 * r`.
- `target_modules`: A list of module names within the base model where LoRA matrices will be injected (e.g., `['q_proj', 'v_proj']` for the query and value projections in attention).
- `lora_dropout`: Dropout probability applied to the LoRA layers.
- `bias`: Specifies how biases are handled (`'none'`, `'all'`, or `'lora_only'`). Typically set to `'none'`.
- `task_type`: The type of task (e.g., `"CAUSAL_LM"`).
```python
# Load the base model (ensure sufficient VRAM)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # Automatically distribute across available GPUs
    torch_dtype=torch.float16,  # Use float16 for reduced memory
)

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the update matrices
    lora_alpha=32,                        # Scaling factor alpha
    target_modules=["q_proj", "v_proj"],  # Apply LoRA to query and value projections
    lora_dropout=0.05,                    # Dropout probability
    bias="none",                          # Do not train biases
    task_type="CAUSAL_LM",                # Task type
)

# Wrap the base model with PeftModel
model = get_peft_model(model, lora_config)

# Print trainable parameters percentage
model.print_trainable_parameters()
# Example output (exact numbers vary with r, target modules, and model size), roughly:
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
```
The `print_trainable_parameters` method provides immediate feedback on the parameter efficiency achieved. Notice how small the fraction of trainable parameters is compared to the total.
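You can sanity-check that figure by hand: each targeted projection gains two matrices, A (r x d_in) and B (d_out x r), so the adapter size follows from the rank, the projection shapes, and the number of layers. The sketch below assumes Llama-2-7B dimensions (hidden size 4096, 32 decoder layers, square `q_proj`/`v_proj`); adjust for your model.

```python
# Rough sanity check of the trainable LoRA parameter count (Llama-2-7B shapes assumed).
hidden_size = 4096      # input and output width of q_proj and v_proj in Llama-2-7B
num_layers = 32
r = 16
modules_per_layer = 2   # q_proj and v_proj

params_per_module = r * hidden_size + hidden_size * r   # A (r x d) plus B (d x r)
total_lora_params = params_per_module * modules_per_layer * num_layers
print(total_lora_params)  # 8,388,608 under these assumptions
```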
We can now proceed with training using the `transformers.Trainer`. It handles the PEFT model automatically, ensuring only the LoRA parameters are updated.
```python
# Define training arguments
training_args = transformers.TrainingArguments(
    output_dir="./lora_finetuned_model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,  # Adjust epochs based on dataset size and convergence
    logging_steps=10,
    save_steps=50,
    fp16=True,           # Use mixed precision training
    # Add other relevant arguments like evaluation strategy, weight decay, etc.
)

# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=data,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Start training
print("Starting LoRA fine-tuning...")
trainer.train()
print("LoRA fine-tuning finished.")

# Save the trained LoRA adapter
lora_adapter_path = "./lora_adapter"
model.save_pretrained(lora_adapter_path)
print(f"LoRA adapter saved to {lora_adapter_path}")
```
To perform inference, load the original base model and then apply the saved LoRA adapter weights.
```python
from peft import PeftModel

# Load the base model again (if not already in memory)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the LoRA adapter
model_with_adapter = PeftModel.from_pretrained(base_model, lora_adapter_path)
model_with_adapter.eval()  # Set model to evaluation mode

# Example inference
prompt = "### Instruction:\nWhat are the main benefits of LoRA?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model_with_adapter.device)

with torch.no_grad():
    outputs = model_with_adapter.generate(**inputs, max_new_tokens=100)

print("Generated response (LoRA):")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
QLoRA builds upon LoRA by quantizing the base model to 4-bit precision using `bitsandbytes`. This dramatically reduces the memory footprint during both loading and fine-tuning, making it possible to fine-tune much larger models on consumer-grade hardware.
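To get a feel for the difference, here is a rough back-of-the-envelope estimate for the base weights alone; it ignores activations, gradients, optimizer state, and quantization overhead, so treat the numbers as indicative only.

```python
# Rough estimate of base-model weight memory (weights only; activations,
# optimizer state, and quantization constants are not included).
params = 6.74e9                    # roughly 7B parameters
fp16_gb = params * 2 / 1024**3     # 2 bytes per parameter  -> about 12.6 GB
int4_gb = params * 0.5 / 1024**3   # 0.5 bytes per parameter -> about 3.1 GB
print(f"fp16 weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{int4_gb:.1f} GB")
```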
The primary difference lies in how the base model is loaded. We use `BitsAndBytesConfig` to specify the 4-bit quantization parameters.
```python
# Define quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # Use NF4 (Normal Float 4) data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute dtype for faster training
    bnb_4bit_use_double_quant=True,         # Use double quantization for extra memory savings
)

# Load the base model with quantization config
qlora_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Important for distributing quantized models
)

# Prepare the quantized model for k-bit training
qlora_model = prepare_model_for_kbit_training(qlora_model)

# Define LoRA configuration (can be the same as before)
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # May need adjustment based on model architecture and observed stability
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the quantized model with PeftModel
qlora_model = get_peft_model(qlora_model, qlora_config)

# Print trainable parameters
qlora_model.print_trainable_parameters()
```
The `prepare_model_for_kbit_training` function performs necessary adjustments, such as casting layer norms and the language model head to float32 for stability. The `LoraConfig` remains similar, but it is now applied to the 4-bit quantized base model.
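If you want to observe these adjustments, a quick inspection of parameter dtypes (an illustrative check, not a required step) shows which parts of the wrapped model remain in higher precision:

```python
# Illustrative check: group the prepared model's parameters by dtype.
from collections import Counter

dtype_counts = Counter(str(p.dtype) for _, p in qlora_model.named_parameters())
print(dtype_counts)
# Expect a mix: packed 4-bit base weights alongside float32 norms/head and LoRA parameters.
```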
The training and inference procedures are identical to those used for LoRA. You can reuse the same `transformers.Trainer` setup and inference code. The key benefit is the significantly lower memory consumption during the `trainer.train()` call.
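One simple way to quantify that saving (a sketch, assuming a single CUDA device) is to read PyTorch's peak-memory counter around the training call and compare the figure between the LoRA and QLoRA runs:

```python
# Sketch: record peak GPU memory around a training run (single CUDA device assumed).
import torch

torch.cuda.reset_peak_memory_stats()
# ... run trainer.train() or qlora_trainer.train() here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during training: {peak_gb:.1f} GB")
```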
```python
# Reuse or redefine training arguments
qlora_training_args = transformers.TrainingArguments(
    output_dir="./qlora_finetuned_model",
    per_device_train_batch_size=4,  # Potentially increase batch size due to lower memory usage
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=50,
    fp16=False,  # QLoRA uses the bf16 compute dtype specified in BitsAndBytesConfig
    bf16=True,   # Enable bf16 training
    # Add other relevant arguments
)

# Initialize the Trainer for QLoRA
qlora_trainer = transformers.Trainer(
    model=qlora_model,
    args=qlora_training_args,
    train_dataset=data,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Start QLoRA training
print("Starting QLoRA fine-tuning...")
qlora_trainer.train()
print("QLoRA fine-tuning finished.")

# Save the QLoRA adapter
qlora_adapter_path = "./qlora_adapter"
qlora_model.save_pretrained(qlora_adapter_path)
print(f"QLoRA adapter saved to {qlora_adapter_path}")

# Inference follows the same pattern as LoRA, loading the adapter onto the quantized base model
# Ensure the base model is loaded with the same BitsAndBytesConfig used for training
```
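For completeness, here is a sketch of that inference path, assuming the training run above and the same `bnb_config`; it mirrors the LoRA inference code, only with the quantized base model.

```python
# Sketch: load the 4-bit base model with the same quantization config,
# then attach the saved QLoRA adapter for generation.
from peft import PeftModel

quantized_base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # must match the config used during training
    device_map="auto",
)
qlora_inference_model = PeftModel.from_pretrained(quantized_base, qlora_adapter_path)
qlora_inference_model.eval()

prompt = "### Instruction:\nWhat are the main benefits of QLoRA?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(qlora_inference_model.device)
with torch.no_grad():
    outputs = qlora_inference_model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```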
After training adapters using both methods, a rigorous evaluation is necessary. As a first check, compare the output of `print_trainable_parameters()` for both configurations: the counts should be identical if the `LoraConfig` is the same, highlighting that QLoRA's memory saving comes from base-model quantization, not from fewer adapter parameters.
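A quick programmatic check (a sketch, assuming both `model` and `qlora_model` are still in memory from the runs above) makes the same point:

```python
# Count trainable parameters directly; the LoRA and QLoRA adapters should match
# when both were built from the same LoraConfig.
def count_trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print("LoRA trainable params: ", count_trainable(model))
print("QLoRA trainable params:", count_trainable(qlora_model))
```

Beyond the parameter counts, consider visualizing these trade-offs: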
Illustrative comparison showing potential VRAM usage and trainable parameter percentage for Full Fine-Tuning, LoRA, and QLoRA. Actual values depend heavily on the model size, hardware, and batch size. Note that Full Fine-Tuning VRAM exceeds the chart range here for emphasis.
This practical exercise demonstrates the implementation of LoRA and QLoRA for efficient LLM fine-tuning.

When choosing between them, consider the available hardware resources and the required performance fidelity. If memory is the primary constraint, QLoRA is an excellent option. If maximum performance is needed and memory allows, standard LoRA (or even full fine-tuning if resources permit) might be preferred. Hyperparameter tuning, particularly for `r`, `lora_alpha`, and the learning rate, remains important for optimizing results with both techniques.