Throughout this chapter, we've explored various strategies to enhance the generation component of your RAG system. Now, it's time to put theory into practice. This hands-on exercise will guide you through fine-tuning a smaller Large Language Model (LLM) specifically for a RAG-related task. Fine-tuning a smaller model offers a compelling balance of performance, cost-efficiency, and speed, making it an attractive option for production environments. By tailoring a model to your specific generation needs within the RAG pipeline, you can often achieve better results than with a larger, general-purpose model, especially concerning factual consistency with the retrieved context.
Our goal is to take a pre-trained smaller LLM and adapt it to generate answers based on provided context and a user query, a common task in RAG systems. We will focus on using Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), to make this process computationally feasible and efficient.
Before you begin, ensure you have the following libraries installed:
transformers: For accessing pre-trained models and tokenizers.
datasets: For handling and preparing data.
peft: For applying LoRA or other PEFT methods.
accelerate: To simplify running PyTorch training on various hardware.
torch: The PyTorch library.
evaluate and rouge_score: For evaluation metrics.
You can install these using pip:
pip install transformers datasets peft accelerate torch evaluate rouge_score
You'll also need a base model and a dataset. We'll use t5-small from Hugging Face; it's well-suited for conditional generation tasks. Your fine-tuning data should consist of (context, question, answer) examples, for instance in JSON format:
[
    {
        "context": "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
        "question": "Who is the Eiffel Tower named after?",
        "answer": "The Eiffel Tower is named after the engineer Gustave Eiffel."
    },
    // ... more examples
]
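If your examples live in Python objects, one simple option is to write them to a JSON Lines file (one object per line), which the datasets library reads directly. A minimal sketch, reusing the rag_finetuning_data.jsonl file name that appears later in this section:
import json

examples = [
    {
        "context": "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
        "question": "Who is the Eiffel Tower named after?",
        "answer": "The Eiffel Tower is named after the engineer Gustave Eiffel."
    },
    # ... more examples
]

# Write one JSON object per line so load_dataset("json", ...) can read the file directly.
with open("rag_finetuning_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")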
We've chosen t5-small as our base model. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, making it versatile for various generation tasks. The task is: given a context and a question, generate an answer that is factually grounded in the context. We'll need to format our input to the T5 model appropriately, often by prefixing the input string with a task-specific instruction, like "answer the question based on the context:".
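To make that input format concrete, here is a tiny sketch of the string the model will receive for the Eiffel Tower example; the exact prefix and field ordering are choices we formalize in the preprocessing step below:
prefix = "answer the question based on the context: "
question = "Who is the Eiffel Tower named after?"
context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."
print(prefix + "question: " + question + " context: " + context)
# answer the question based on the context: question: Who is the Eiffel Tower named after? context: The Eiffel Tower is ...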
Dataset preparation is a significant step. The quality of your fine-tuning data directly impacts the performance of your specialized LLM.
Load your dataset: If you have a JSON file as described above, you can load it using the datasets library.
from datasets import load_dataset, Dataset

# If your data lives in JSONL files (e.g., 'rag_finetuning_data.jsonl'), load it with the datasets library:
# data_files = {"train": "path_to_your_train_data.jsonl", "validation": "path_to_your_validation_data.jsonl"}
# raw_datasets = load_dataset("json", data_files=data_files)

# For this example, we use a small dummy dataset.
# In a real scenario, you would load your actual dataset.
dummy_data = {
    "train": Dataset.from_list([
        {"id": "1", "context": "Paris is the capital of France. It is known for the Eiffel Tower.", "question": "What is the capital of France?", "answer": "Paris is the capital of France."},
        {"id": "2", "context": "The Amazon rainforest is the largest tropical rainforest. It spans nine countries.", "question": "How many countries does the Amazon rainforest span?", "answer": "It spans nine countries."}
    ]),
    "validation": Dataset.from_list([
        {"id": "3", "context": "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas.", "question": "Where is Mount Everest located?", "answer": "Mount Everest is located in the Mahalangur Himal sub-range of the Himalayas."}
    ])
}
raw_datasets = dummy_data
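If instead you have a single JSONL file rather than separate train and validation files, you can carve out a validation split yourself. A small sketch, assuming the rag_finetuning_data.jsonl file from earlier:
from datasets import load_dataset

full_dataset = load_dataset("json", data_files="rag_finetuning_data.jsonl", split="train")
split = full_dataset.train_test_split(test_size=0.1, seed=42)  # hold out 10% for validation
raw_datasets = {"train": split["train"], "validation": split["test"]}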
Tokenization and Formatting: We need to tokenize the inputs (context and question) and the targets (answers). For T5, a common practice is to concatenate the context and question, often with a prefix.
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 512  # Adjust based on your context lengths
max_target_length = 64  # Adjust based on your answer lengths

prefix = "answer the question based on the context: "

def preprocess_function(examples):
    inputs = [prefix + "question: " + q + " context: " + c for q, c in zip(examples["question"], examples["context"])]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")

    # Tokenize targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["answer"], max_length=max_target_length, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]

    # Ensure -100 is used for padding tokens in labels so they are ignored by the loss function
    for i, label_ids in enumerate(model_inputs["labels"]):
        model_inputs["labels"][i] = [lid if lid != tokenizer.pad_token_id else -100 for lid in label_ids]
    return model_inputs

tokenized_datasets = {
    "train": raw_datasets["train"].map(preprocess_function, batched=True, remove_columns=raw_datasets["train"].column_names),
    "validation": raw_datasets["validation"].map(preprocess_function, batched=True, remove_columns=raw_datasets["validation"].column_names)
}
The prefix helps the model understand the task, and concatenating "question: " and "context: " explicitly labels these parts for the model. When tokenizing labels, tokenizer.as_target_tokenizer() is used for sequence-to-sequence models (recent transformers versions also accept a text_target argument on the tokenizer directly). Padded label tokens are set to -100 so they are ignored by the loss function.
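Before training, it is worth decoding one processed example to confirm the input and target look as intended; a quick sanity-check sketch:
sample = tokenized_datasets["train"][0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))

# Drop the -100 placeholders before decoding the label
label_ids = [lid for lid in sample["labels"] if lid != -100]
print(tokenizer.decode(label_ids, skip_special_tokens=True))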
Full fine-tuning of even "small" LLMs can be resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA allow us to fine-tune only a small subset of model parameters, significantly reducing computational requirements and storage.
Load the base model:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
Configure LoRA:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,                        # Rank of the update matrices. Higher rank means more parameters.
    lora_alpha=32,               # Alpha scaling factor.
    target_modules=["q", "v"],   # Apply LoRA to query and value weights in attention
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # For sequence-to-sequence models like T5
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Example output: trainable params: XXXX || all params: YYYYYY || trainable%: Z.ZZZZ
This configuration applies LoRA to the query (q) and value (v) projection matrices in the attention layers. The print_trainable_parameters method shows how drastically LoRA reduces the number of parameters that need to be updated.
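If you are unsure which module names are available as target_modules, you can inspect the base model directly. The sketch below relies on T5's attention layers exposing q, k, v, and o projections:
# List the projection names present in the model's attention layers
projection_names = sorted({name.split(".")[-1] for name, _ in model.named_modules()
                           if name.endswith((".q", ".k", ".v", ".o"))})
print(projection_names)  # expected: ['k', 'o', 'q', 'v'] for T5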
Set up Training Arguments and Trainer:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=peft_model,
    label_pad_token_id=-100,  # Important for ignoring padding in loss calculation
    pad_to_multiple_of=8      # Optional: for TPU efficiency
)

output_dir = "t5-small-rag-finetuned-lora"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,   # Adjust based on your GPU memory
    per_device_eval_batch_size=4,    # Adjust based on your GPU memory
    gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
    learning_rate=1e-4,              # Common learning rate for LoRA
    num_train_epochs=3,              # Adjust as needed
    logging_dir=f"{output_dir}/logs",
    logging_steps=10,
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch (renamed to eval_strategy in newer transformers versions)
    save_strategy="epoch",           # Save model at the end of each epoch
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Or a ROUGE score if you implement compute_metrics
    predict_with_generate=True,      # Necessary for generating text during evaluation
    fp16=True,                       # Enable mixed-precision training if GPU supports it
)

trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
Start Training:
trainer.train()
This will initiate the fine-tuning process. Monitor the training and validation loss.
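One simple way to review those losses afterwards is the trainer's log history; a small sketch:
# Each record is a dict such as {"loss": ..., "step": ...} or {"eval_loss": ..., "epoch": ...}
for record in trainer.state.log_history:
    if "loss" in record or "eval_loss" in record:
        print(record)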
Save the PEFT Adapter: After training, only the LoRA adapter weights need to be saved, which are very small.
peft_model.save_pretrained(f"{output_dir}/best_lora_adapter")
# You can also save the tokenizer for convenience
tokenizer.save_pretrained(f"{output_dir}/best_lora_adapter")
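If you would rather deploy a single standalone checkpoint instead of shipping the base model plus an adapter, you can merge the LoRA weights into the base weights. A sketch (the merged_model directory name is just an example):
# Folds the LoRA updates into the base weights; the result no longer needs the peft library at inference time
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained(f"{output_dir}/merged_model")
tokenizer.save_pretrained(f"{output_dir}/merged_model")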
Evaluating the fine-tuned model is critical. We need to assess its ability to generate accurate and relevant answers based on the provided context.
Define a compute_metrics function for the Trainer (optional but recommended):
You can use metrics like ROUGE, which measures the overlap between the generated answer and the reference answer.
import numpy as np
import evaluate

rouge_metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Decode generated answers
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Light cleanup. (For rougeLsum, sentences should be newline-separated,
    # e.g., with nltk.sent_tokenize; we keep it simple here.)
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]

    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Scale scores to percentages
    result = {key: value * 100 for key, value in result.items()}

    # Average generated length (non-padding tokens)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
# Re-initialize trainer with compute_metrics if you define it
# trainer = Seq2SeqTrainer(
# ...
# compute_metrics=compute_metrics,
# ...
# )
# Then call trainer.evaluate() or it will be called during training if evaluation_strategy is set.
If you integrate compute_metrics into the Seq2SeqTrainer, it will automatically calculate these metrics during evaluation phases. For this walkthrough, we are focusing on the eval_loss for selecting the best model.
Qualitative Evaluation: Perform a qualitative review of the generated outputs.
Load the fine-tuned adapter:
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

# Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Load the LoRA adapter on top of it
fine_tuned_model = PeftModel.from_pretrained(base_model, f"{output_dir}/best_lora_adapter")
fine_tuned_model = fine_tuned_model.to("cuda" if torch.cuda.is_available() else "cpu")  # Move to device
fine_tuned_model.eval()  # Set to evaluation mode

# If you saved the tokenizer along with the adapter:
# fine_tuned_tokenizer = AutoTokenizer.from_pretrained(f"{output_dir}/best_lora_adapter")
# Otherwise, use the original tokenizer:
fine_tuned_tokenizer = tokenizer
Generate answers for some test examples:
test_context = "The Atacama Desert is a desert plateau in South America covering a 1,600 km strip of land on the Pacific coast, west of the Andes Mountains. It is the driest nonpolar desert."
test_question = "What is the Atacama Desert?"
input_text = f"{prefix}question: {test_question} context: {test_context}"
input_ids = fine_tuned_tokenizer(input_text, return_tensors="pt", max_length=max_input_length, truncation=True).input_ids.to(fine_tuned_model.device)
outputs = fine_tuned_model.generate(input_ids, max_length=max_target_length, num_beams=4, early_stopping=True)
generated_answer = fine_tuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Context: {test_context}")
print(f"Question: {test_question}")
print(f"Generated Answer: {generated_answer}")
Check whether the answers are factually grounded in the provided context, directly relevant to the question, and free of unsupported (hallucinated) details.
Comparison: Compare these outputs against those from the base t5-small model (without fine-tuning). A simple quantitative comparison could involve ROUGE scores if you have a labeled test set:
(Figure: Simulated ROUGE-L scores showing potential improvement after RAG-specific fine-tuning. Actual results will vary based on data and training.)
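A rough way to produce such a comparison is to score both the original and the fine-tuned model on the same held-out examples. A sketch, reusing the validation split here in place of a dedicated test set:
import evaluate
rouge = evaluate.load("rouge")

def generate_answers(model_to_eval):
    answers = []
    for example in raw_datasets["validation"]:
        text = f"{prefix}question: {example['question']} context: {example['context']}"
        ids = tokenizer(text, return_tensors="pt", max_length=max_input_length,
                        truncation=True).input_ids.to(model_to_eval.device)
        out = model_to_eval.generate(ids, max_length=max_target_length, num_beams=4)
        answers.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return answers

references = [ex["answer"] for ex in raw_datasets["validation"]]
base_only = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(fine_tuned_model.device)
print("base t5-small:", rouge.compute(predictions=generate_answers(base_only), references=references))
print("fine-tuned:   ", rouge.compute(predictions=generate_answers(fine_tuned_model), references=references))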
Once satisfied with your fine-tuned smaller LLM, integrate it as the generator in your RAG pipeline, and keep experimenting with the LoRA hyperparameters (r, lora_alpha), learning rates, and other training parameters.
This hands-on exercise demonstrates a technique for optimizing the generation component of your RAG system. By fine-tuning smaller LLMs like T5-small with PEFT methods, you can create specialized, efficient, and effective generators tailored to your production needs, leading to higher quality outputs and better resource utilization. This approach is a step towards building maintainable RAG solutions.