You've learned about the various strategies to optimize Large Language Models for large-scale RAG systems, from efficient serving to architectural choices. Now, it's time to get your hands dirty. This practical exercise focuses on a common and highly effective optimization: fine-tuning an LLM using Parameter-Efficient Fine-Tuning (PEFT) to enhance its performance on a specific RAG task. The goal is to improve the LLM's ability to synthesize accurate and relevant answers based strictly on the context provided by the retrieval stage, particularly for specialized domains or when a specific response style is required.
We'll walk through the process of preparing data, choosing a model, applying LoRA (Low-Rank Adaptation), and evaluating the results, all with the lens of an expert building production-grade RAG systems.
By the end of this practical, you will be able to prepare a RAG-specific fine-tuning dataset, configure and apply LoRA to a base model, run the training, and evaluate whether the resulting adapter improves contextual grounding.
Before you begin, ensure you have a Python environment with a recent GPU (highly recommended for reasonable training times). You'll need the following core libraries:
- `torch`: for tensor operations and GPU support.
- `transformers`: from Hugging Face, for LLM models and tokenizers.
- `peft`: from Hugging Face, for Parameter-Efficient Fine-Tuning techniques like LoRA.
- `datasets`: from Hugging Face, for easy data handling.
- `accelerate`: to simplify distributed training and mixed precision (useful even on a single GPU).
- `bitsandbytes`: for 8-bit or 4-bit quantization (e.g., QLoRA), if you want to experiment with further memory reduction.
- `trl`: from Hugging Face, for the `SFTTrainer` used in the training script below.

You can typically install these using pip:
pip install torch transformers peft datasets accelerate bitsandbytes trl
Ensure your CUDA drivers and PyTorch installation are compatible with your GPU.
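A quick sanity check of the environment before training can save time. The snippet below is a minimal sketch that only verifies GPU visibility and bfloat16 support (which matters later when choosing between bf16 and fp16 mixed precision):

import torch

# Verify a CUDA-capable GPU is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # bfloat16 support determines whether bf16 mixed precision can be used during training
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())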
The quality and structure of your fine-tuning data are critical for success. For RAG, you aren't just teaching the LLM general knowledge; you're teaching it to reason over provided text. An ideal dataset consists of triplets: (query, retrieved_context, ideal_answer_grounded_in_context).
The input to the LLM during fine-tuning should mimic the prompt structure you'll use in your RAG system. A common format is:
<s>[INST] Context: {retrieved_document_chunk}
Question: {user_query} [/INST]
Answer: {ideal_answer_based_on_context}</s>
- `<s>` and `</s>`: start- and end-of-sequence tokens.
- `[INST]` and `[/INST]`: instruction tags, common in models like Llama and Mistral. Adapt these based on your chosen base model's preferred prompt format.
- `{retrieved_document_chunk}`: the actual text snippet that your retriever would provide.
- `{user_query}`: the user's question.
- `{ideal_answer_based_on_context}`: the desired output. This answer must be derivable solely from the provided `retrieved_document_chunk`. Avoid answers that require external knowledge.

Example Data Point (JSONL format):
{
"text": "<s>[INST] Context: The Llama 2 family of models includes versions with 7B, 13B, and 70B parameters. LoRA fine-tuning is effective for adapting these models to specific tasks while keeping most weights frozen. For the 7B model, a LoRA rank (r) of 8 or 16 is often a good starting point. \nQuestion: What LoRA rank is suggested for Llama 2 7B? [/INST]\nAnswer: For the Llama 2 7B model, a LoRA rank of 8 or 16 is often a good starting point for fine-tuning.</s>"
}
Crafting High-Quality Data: make sure each answer is derivable strictly from its paired context, is written in the style you want the model to adopt, and include some examples where the correct response is to state that the context does not contain the answer. For this exercise, you might create a small dataset of 50-100 examples manually or use a script to generate them from a document you have (a sketch of such a script follows). Save it as a train.jsonl file.
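The snippet below is a minimal sketch of such a generation script. It assumes you already have a list of hypothetical (query, context, answer) triplets; it simply formats them into the prompt template above and writes train.jsonl.

import json

# Hypothetical triplets prepared by hand or extracted from your documents
triplets = [
    {
        "query": "What LoRA rank is suggested for Llama 2 7B?",
        "context": "For the 7B model, a LoRA rank (r) of 8 or 16 is often a good starting point.",
        "answer": "For the Llama 2 7B model, a LoRA rank of 8 or 16 is often a good starting point.",
    },
    # ... more examples
]

with open("train.jsonl", "w") as f:
    for t in triplets:
        # Mirror the prompt format the RAG system will use at inference time
        text = (
            f"<s>[INST] Context: {t['context']}\n"
            f"Question: {t['query']} [/INST]\n"
            f"Answer: {t['answer']}</s>"
        )
        f.write(json.dumps({"text": text}) + "\n")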
The choice of base model depends on your performance requirements, computational budget, and the complexity of your task. Models like Mistral-7B, Llama-2-7B, or Gemma-7B are excellent starting points for PEFT due to their strong foundational capabilities and manageable size for fine-tuning with LoRA.
We'll use LoRA. The key LoRA parameters to configure are:

- `r`: the rank of the update matrices. A smaller `r` means fewer trainable parameters. Common values range from 4 to 64.
- `lora_alpha`: a scaling factor, often set to `r` or `2*r`.
- `target_modules`: specifies which linear layers in the transformer to apply LoRA to (e.g., `q_proj`, `v_proj`, `k_proj`, `o_proj`). Identifying these often requires inspecting the model architecture (see the sketch below).
- `lora_dropout`: dropout probability for the LoRA layers.
- `bias`: whether to make LoRA bias terms trainable ("none", "all", or "lora_only").

Let's assume we're using a model like mistralai/Mistral-7B-Instruct-v0.1.
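One way to find candidate target modules without loading the full weights is to instantiate just the architecture on the meta device and list its linear layers. This sketch relies on accelerate's init_empty_weights helper; the module names shown in the comment are what Mistral-style models typically expose.

import torch.nn as nn
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the architecture only: no weights are downloaded or allocated
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
with init_empty_weights():
    skeleton = AutoModelForCausalLM.from_config(config)

# Collect the distinct suffixes of all linear layers; these are the usual LoRA targets
linear_names = {name.split(".")[-1] for name, module in skeleton.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))
# Expect names like q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head
# (lm_head is normally excluded from LoRA targets).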
Here's a Python script outline using Hugging Face libraries.
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# 1. Configuration
model_name = "mistralai/Mistral-7B-Instruct-v0.1" # Or your chosen model
dataset_path = "path/to/your/train.jsonl" # Your JSONL file
output_dir = "./results_rag_finetune"
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
# For QLoRA (4-bit quantization)
use_4bit = True
bnb_4bit_quant_type = "nf4"
bnb_4bit_compute_dtype = torch.bfloat16 # or torch.float16 if bfloat16 not supported
# 2. Load Tokenizer and Model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Common practice
tokenizer.padding_side = "right"
if use_4bit:
bnb_config = BitsAndBytesConfig(
load_in_4bit=use_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
bnb_4bit_use_double_quant=True, # Optional
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map={"": 0} # Load model on GPU 0
)
model = prepare_model_for_kbit_training(model)
else:
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map={"": 0} # Load model on GPU 0
)
model.config.use_cache = False # Recommended for fine-tuning
model.config.pretraining_tp = 1 # Disable tensor-parallel weight slicing used by some Llama-family checkpoints
# 3. LoRA Configuration
# Find target modules by inspecting model.named_modules() or common sense for your model
# For Mistral, common targets are 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'
peft_config = LoraConfig(
r=lora_r,
lora_alpha=lora_alpha,
target_modules=["q_proj", "v_proj"], # Start small, add more if needed
lora_dropout=lora_dropout,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters() # Check how many parameters are trainable
# 4. Load Dataset
dataset = load_dataset("json", data_files=dataset_path, split="train")
# 5. Training Arguments
training_arguments = TrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=2, # Adjust based on your GPU VRAM
gradient_accumulation_steps=4, # Effective batch size = 2 * 4 = 8
optim="paged_adamw_32bit", # Or "adamw_torch" if not using QLoRA
save_steps=50, # Save checkpoints every 50 steps
logging_steps=10, # Log training progress
learning_rate=2e-4,
fp16=not use_4bit, # Use fp16 if not using 4-bit
bf16=use_4bit and torch.cuda.is_bf16_supported(), # Use bf16 if 4-bit and supported
max_grad_norm=0.3,
num_train_epochs=1, # Start with 1-3 epochs for small datasets
warmup_ratio=0.03,
group_by_length=True, # Speeds up training by grouping similar length sequences
lr_scheduler_type="constant", # Or "cosine"
report_to="tensorboard" # Or "wandb"
)
# 6. Initialize Trainer
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text", # The field in your JSONL containing the full prompt
max_seq_length=1024, # Adjust based on your context length and VRAM
tokenizer=tokenizer,
args=training_arguments,
packing=False, # Set to True if you want to pack multiple short sequences
)
# 7. Start Training
print("Starting training...")
trainer.train()
# 8. Save the fine-tuned adapter
adapter_output_dir = f"{output_dir}/final_adapter"
trainer.model.save_pretrained(adapter_output_dir)
tokenizer.save_pretrained(adapter_output_dir) # Save tokenizer for consistency
print(f"Fine-tuned adapter saved to {adapter_output_dir}")
Main Points during Training:

- GPU memory: adjust `per_device_train_batch_size`, `gradient_accumulation_steps`, `max_seq_length`, and the quantization settings (`use_4bit`) to fit within your GPU's memory.
- `target_modules`: the choice of `target_modules` for LoRA can significantly impact performance. Experimentation is often needed. For many attention-based models, targeting the query, key, value, and output projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) is a good start. Some architectures also benefit from targeting feed-forward network layers.

After training, the LoRA adapter (not the full model) is saved. To use it for inference:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_model_name = "mistralai/Mistral-7B-Instruct-v0.1" # Same base model
adapter_path = "./results_rag_finetune/final_adapter" # Path to your saved adapter
# Load the base model (can be quantized as well for inference)
# For 4-bit inference:
# bnb_config = BitsAndBytesConfig(
# load_in_4bit=True,
# bnb_4bit_quant_type="nf4",
# bnb_4bit_compute_dtype=torch.bfloat16,
# bnb_4bit_use_double_quant=True,
# )
# base_model = AutoModelForCausalLM.from_pretrained(
# base_model_name,
# quantization_config=bnb_config,
# device_map={"": 0}
# )
# Or without quantization for full precision (higher VRAM usage)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.bfloat16, # or torch.float16
device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained(adapter_path) # Load tokenizer from adapter dir
# Load the LoRA adapter on top of the base model (weights are not merged yet)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.eval() # Set to evaluation mode
# Optional: Merge LoRA layers with base model for faster inference
# This creates a new model and might require more VRAM initially.
# model = model.merge_and_unload()
# print("LoRA layers merged.")
# Example RAG-style prompt
retrieved_context = "The LoRA technique adapts large pre-trained models by inserting trainable low-rank matrices into existing layers. This significantly reduces the number of trainable parameters compared to full fine-tuning, making it memory-efficient."
user_query = "How does LoRA achieve memory efficiency?"
prompt = f"<s>[INST] Context: {retrieved_context}\nQuestion: {user_query} [/INST]\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate response
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
pad_token_id=tokenizer.eos_token_id
)
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only the generated answer (everything after the final "Answer:" marker)
answer_part = response_text.split("Answer:")[-1].strip()
print(f"Generated Answer: {answer_part}")
Evaluation is critical. Standard LLM metrics like perplexity are insufficient for RAG. You need to assess how faithfully the model sticks to the provided context, whether it hallucinates when the context is sparse, how relevant its answers are to the query, and whether it adopts the desired style or domain terminology.
Comparative Analysis: The most effective way to demonstrate improvement is to compare the fine-tuned model's outputs against the base model's outputs on the same set of (query, context) pairs.
| Metric Area | Base Model Behavior (Example) | Fine-Tuned Model Behavior (Example) |
|---|---|---|
| Faithfulness | Often pulls in outside knowledge or slightly misinterprets. | Sticks closely to the provided text. |
| Hallucination | May invent details if context is sparse or ambiguous. | More likely to state "cannot answer" or be cautiously factual. |
| Relevance | Might over-summarize or miss the specific detail of the query. | Targets the query more precisely based on the context. |
| Style/Domain | Generic language. | Adopts terminology/style from the fine-tuning data (if present). |
The table above illustrates potential improvements. Actual results depend on data quality, base model, and tuning.
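One lightweight way to run this comparison is to generate answers from both models over the same held-out (query, context) pairs and review them side by side. The sketch below assumes the un-merged PeftModel (model) and tokenizer from the inference snippet above, plus a small hypothetical eval_pairs list; greedy decoding is used so the comparison is deterministic.

# Hypothetical held-out pairs; in practice, load them from an evaluation file
eval_pairs = [
    {"query": "How does LoRA achieve memory efficiency?",
     "context": "LoRA inserts trainable low-rank matrices into existing layers, "
                "which greatly reduces the number of trainable parameters."},
    # ... more pairs
]

def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=100, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True).split("Answer:")[-1].strip()

for pair in eval_pairs:
    prompt = (f"<s>[INST] Context: {pair['context']}\n"
              f"Question: {pair['query']} [/INST]\nAnswer:")
    # Temporarily switch the LoRA adapter off to get the base model's behavior
    with model.disable_adapter():
        base_answer = generate_answer(prompt)
    tuned_answer = generate_answer(prompt)
    print("Query:", pair["query"])
    print("Base model:      ", base_answer)
    print("Fine-tuned model:", tuned_answer)
    print("-" * 80)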
For more systematic evaluation in large-scale systems, consider frameworks like RAGAs, which offer metrics for faithfulness, answer relevance, and context relevance. Building an evaluation suite with a dataset of challenging RAG queries is a best practice.
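As an illustration, a RAGAs run might look like the sketch below. This assumes ragas 0.1.x (the API has changed across releases) and that a judge LLM is configured for its metrics (by default it expects an OpenAI API key); the example values are purely illustrative.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Each row: the user question, the generated answer, and the retrieved context(s).
# Column names follow the schema expected by ragas 0.1.x (assumption).
eval_data = {
    "question": ["How does LoRA achieve memory efficiency?"],
    "answer": ["LoRA trains small low-rank matrices instead of all model weights."],
    "contexts": [["LoRA inserts trainable low-rank matrices into existing layers, "
                  "which greatly reduces the number of trainable parameters."]],
}

results = evaluate(Dataset.from_dict(eval_data),
                   metrics=[faithfulness, answer_relevancy])
print(results)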
The fine-tuned model (either base model + LoRA adapter, or the merged model) needs to be deployed within your LLM serving infrastructure (e.g., vLLM, TGI, SageMaker, etc.).
There are two common deployment patterns:

- Serve the base model plus the LoRA adapter. Adapter files are tiny, and serving stacks that support adapters let you swap fine-tuned behaviors without redeploying the base model.
- Merge the adapter into the base weights with `merge_and_unload()` and deploy the resulting model as a standard LLM (a sketch of this follows below). This can sometimes offer slightly lower inference latency as there's no adapter logic overhead, but you lose the flexibility of easily swapping adapters.

Consider how your MLOps pipeline will handle retraining and deploying new adapter versions. The PEFT approach significantly simplifies this compared to full fine-tuning, as adapter files are small (megabytes vs. gigabytes).
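For the merged route, producing a standalone checkpoint that any standard serving stack can load might look like this sketch (it reuses the model and adapter paths from earlier; note the base model is loaded in bf16 here, since merging is typically done against full- or half-precision weights rather than a 4-bit quantized model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./results_rag_finetune/final_adapter")

# Fold the LoRA weights into the base weights and drop the adapter wrappers
merged_model = model.merge_and_unload()

# Save a standalone checkpoint for standard LLM serving
merged_model.save_pretrained("./merged_rag_model")
tokenizer = AutoTokenizer.from_pretrained("./results_rag_finetune/final_adapter")
tokenizer.save_pretrained("./merged_rag_model")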
This hands-on exercise demonstrated how PEFT, specifically LoRA, can be a powerful tool for tailoring LLMs to the specific demands of RAG systems. By fine-tuning on data that emphasizes contextual grounding, you can significantly improve the faithfulness and relevance of your RAG system's responses.
Further steps for expert practitioners include experimenting with QLoRA settings and alternative target_modules, scaling the fine-tuning dataset beyond the small example used here, and building a systematic evaluation suite (for example with RAGAs) to track faithfulness and relevance over time.
By mastering these techniques, you can build highly performant, reliable, and efficient large-scale distributed RAG systems that truly leverage the power of LLMs combined with vast external knowledge.