In this hands-on section, we fine-tune a large language model with Low-Rank Adaptation (LoRA), applying Parameter-Efficient Fine-Tuning (PEFT) in a practical setting. We use the Hugging Face ecosystem, including the transformers, peft, and accelerate libraries, to carry out the task efficiently. The objective is to adapt a pre-trained model to follow instructions more effectively, all while operating within the memory constraints of a typical consumer-grade GPU.
First, ensure the necessary libraries are installed. We will need transformers for model handling, peft for LoRA implementation, accelerate to simplify running PyTorch on any infrastructure, datasets for data loading, and bitsandbytes to enable model quantization.
pip install transformers peft accelerate datasets bitsandbytes
These libraries form the foundation of our fine-tuning workflow, providing the tools to load models, apply adapters, and manage the training process.
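Before going further, a quick optional sanity check can save time: the snippet below only prints library versions and GPU availability, and the exact versions you see will depend on your installation.

import torch
import transformers, peft, datasets, bitsandbytes

# Print versions to confirm the installation; any recent releases should work
print(f"transformers {transformers.__version__}, peft {peft.__version__}")
print(f"datasets {datasets.__version__}, bitsandbytes {bitsandbytes.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")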
The starting point for PEFT is a capable base model. For this exercise, we will use a moderately-sized model and load it in 4-bit precision using the bitsandbytes library. This quantization step dramatically reduces the GPU memory required to hold the model, making it possible to fine-tune billion-parameter models on hardware with limited VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
# Configure quantization to load the model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # Place the quantized weights on the available GPU
    trust_remote_code=True,
)
model.config.use_cache = False # Disable cache for training
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Set padding token
By setting load_in_4bit=True, we instruct transformers to apply quantization as the model is loaded. The base model's weights are now frozen in a memory-efficient 4-bit format, ready for LoRA adapters to be attached.
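As a quick check, you can ask transformers how much memory the quantized weights occupy. The exact figure depends on your environment, but for a 7B model in 4-bit it is typically only a few gigabytes.

# Report the memory used by the model's parameters and buffers, in gigabytes
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")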
For instruction fine-tuning, the model learns to generate a response given an instruction. We will use a subset of the databricks/databricks-dolly-15k dataset, which contains instruction-response pairs. We need to format each data point into a single text string, often using a prompt template. This template structures the input, clearly separating the instruction from the space for the model's response.
Let's define a simple prompt template and a formatting function.
from datasets import load_dataset
# Load a sample of the dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")
# Function to format the prompts
def format_prompt(sample):
    instruction = sample["instruction"]
    context = sample["context"]
    response = sample["response"]
    if context:
        prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{context}
### Response:
{response}"""
    else:
        prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{response}"""
    return {"text": prompt}
# Apply the formatting
formatted_dataset = dataset.map(format_prompt)
This formatted text will be tokenized and fed into the model during training. The model's task is to learn to generate the text in the ### Response: section given the preceding instruction and context.
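It is worth inspecting one formatted example before training to confirm the template looks right and the sequences are a manageable length. The snippet below simply prints the first example and its token count.

# Inspect a single formatted prompt and its length in tokens
sample_text = formatted_dataset[0]["text"]
print(sample_text)
print(f"Token count: {len(tokenizer(sample_text)['input_ids'])}")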
Here we define the LoRA-specific configuration using the LoraConfig class from the peft library. This is where we specify which parts of the model to adapt and how.
- r: The rank of the low-rank matrices. A smaller r means fewer trainable parameters and faster training, but potentially less expressive power. A common range is 8 to 64.
- lora_alpha: The scaling factor for the LoRA updates; the adapter output is scaled by lora_alpha / r, so it controls how strongly the adapters influence the frozen weights. A common practice is to set lora_alpha to twice the value of r.
- target_modules: A list of the names of the modules to which we want to apply LoRA. For transformer models, these are often the attention block projections (q_proj, k_proj, v_proj, o_proj).
- lora_dropout: A dropout probability for the LoRA layers to prevent overfitting.
- task_type: Specifies the task type, which is CAUSAL_LM for our generative text model.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Target attention blocks
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# Prepare the quantized model for k-bit training (recommended before attaching adapters)
model = prepare_model_for_kbit_training(model)

# Wrap the base model with the LoRA config
peft_model = get_peft_model(model, lora_config)
After creating the configuration, we use get_peft_model to wrap our quantized base model. This function finds the modules specified in target_modules and injects the LoRA adapters.
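If you want to confirm where the adapters landed, you can list the injected submodules. This is purely diagnostic; the names printed are whatever peft generates for the targeted projections.

# List a few of the LoRA submodules that peft injected into the model
lora_module_names = [name for name, _ in peft_model.named_modules() if "lora_" in name]
print(f"Injected LoRA submodules: {len(lora_module_names)}")
print(lora_module_names[:4])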
Let's verify the dramatic reduction in trainable parameters.
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"Trainable params: {trainable_params} || All params: {all_param} || "
        f"Trainable %: {100 * trainable_params / all_param:.2f}"
    )
print("Trainable parameters with LoRA:")
print_trainable_parameters(peft_model)
You should see an output indicating that the trainable parameters are only a tiny fraction, often less than 1%, of the total model parameters. This is the core efficiency of LoRA.
The number of parameters updated during LoRA fine-tuning is orders of magnitude smaller than in full fine-tuning, while the total parameter count of the model remains the same.
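To see roughly where that number comes from, you can estimate the adapter size by hand. The figures below assume Mistral-7B's published architecture (hidden size 4096, 32 layers, and a grouped-query key/value projection width of 1024); treat them as an illustration rather than an exact count.

# Rough estimate of LoRA parameters for r=16 on q_proj and v_proj in every layer
r, hidden, kv_dim, n_layers = 16, 4096, 1024, 32
q_proj_lora = r * hidden + hidden * r   # lora_A (r x 4096) + lora_B (4096 x r)
v_proj_lora = r * hidden + kv_dim * r   # lora_A (r x 4096) + lora_B (1024 x r)
print(f"Estimated LoRA parameters: {(q_proj_lora + v_proj_lora) * n_layers:,}")
# Roughly 6.8 million adapter parameters, well under 1% of the ~7 billion total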
With our PEFT model ready, we can proceed with training using the transformers.Trainer. We first define TrainingArguments to specify hyperparameters like the learning rate, number of epochs, and batch size.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Tokenize the formatted prompts so the Trainer receives input_ids rather than raw text
def tokenize(sample):
    # max_length=512 is a modest choice to limit memory; adjust it to your data
    return tokenizer(sample["text"], truncation=True, max_length=512)

tokenized_dataset = formatted_dataset.map(
    tokenize, remove_columns=formatted_dataset.column_names
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./mistral-7b-lora-dolly",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    fp16=True,  # Use mixed precision
    save_total_limit=2,
    overwrite_output_dir=True,
)

# Create the Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Start training
trainer.train()
The Trainer handles the entire training loop, including gradient updates, logging, and saving checkpoints. Because we are only updating the small LoRA adapters and not the massive base model, the memory required for the optimizer state and gradients is minimal, allowing this process to run on a single GPU.
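Once training finishes, only the adapter weights need to be persisted; the frozen base model is unchanged. The directory name below is just an example, and the commented lines sketch how the adapter could later be re-attached to a freshly loaded base model.

# Save only the LoRA adapter weights (typically tens of megabytes, not gigabytes)
peft_model.save_pretrained("./mistral-7b-lora-dolly/adapter")

# To reuse the adapter later, reload the quantized base model and attach it, e.g.:
# from peft import PeftModel
# base = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
# peft_model = PeftModel.from_pretrained(base, "./mistral-7b-lora-dolly/adapter")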
After training completes, the peft_model object now contains the base model plus the trained LoRA adapters. To generate text, you can use the standard generate method. The PEFT library ensures that the adapter weights are automatically applied during the forward pass.
Let's test our fine-tuned model with a new instruction.
# Create a prompt for inference
prompt_text = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Explain the difference between supervised and unsupervised machine learning in simple terms.
### Response:
"""
# Tokenize the input
inputs = tokenizer(prompt_text, return_tensors="pt").to("cuda")
# Generate a response
output = peft_model.generate(
    **inputs,
    max_new_tokens=150,
    eos_token_id=tokenizer.eos_token_id,
)
# Decode and print the response
response_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(response_text)
The model's output should now follow the instruction-response format it learned during fine-tuning. Compare this to the output from the original base model; you should observe a noticeable improvement in its ability to adhere to the instruction-following style.
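A convenient way to make that comparison without reloading anything is peft's disable_adapter context manager, which temporarily bypasses the LoRA weights so the same object behaves like the original base model.

# Generate with the adapters temporarily switched off to see the base model's behavior
with peft_model.disable_adapter():
    base_output = peft_model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(base_output[0], skip_special_tokens=True))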
This hands-on session has demonstrated the end-to-end workflow for fine-tuning a large model with LoRA. You have successfully loaded a quantized model, prepared a dataset, configured and applied LoRA adapters, and executed the training loop. The next chapter will cover how to rigorously evaluate the performance of your newly fine-tuned model and prepare it for deployment.