Throughout this chapter, we've explored various strategies to enhance the generation component of your RAG system. Now, it's time to put theory into practice. This hands-on exercise will guide you through fine-tuning a smaller Large Language Model (LLM) specifically for a RAG-related task. Fine-tuning a smaller model offers a compelling balance of performance, cost-efficiency, and speed, making it an attractive option for production environments. By tailoring a model to your specific generation needs within the RAG pipeline, you can often achieve better results than with a larger, general-purpose model, especially concerning factual consistency with the retrieved context.
Our goal is to take a pre-trained smaller LLM and adapt it to generate answers based on provided context and a user query, a common task in RAG systems. We will focus on using Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), to make this process computationally feasible and efficient.
Before you begin, ensure you have the following libraries installed:
transformers: For accessing pre-trained models and tokenizers.
datasets: For handling and preparing data.
peft: For applying LoRA or other PEFT methods.
accelerate: To simplify running PyTorch training on various hardware.
torch: The PyTorch library.
evaluate and rouge_score: For evaluation metrics.
You can install these using pip:
pip install transformers datasets peft accelerate torch evaluate rouge_score
You'll also need a base model and a dataset. We'll use t5-small from Hugging Face; it's well-suited for conditional generation tasks. Your fine-tuning data should consist of (context, question, answer) examples, for instance in JSON format:
[
    {
        "context": "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
        "question": "Who is the Eiffel Tower named after?",
        "answer": "The Eiffel Tower is named after the engineer Gustave Eiffel."
    },
    // ... more examples
]
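If your examples live in Python objects, one simple option is to write them to a JSON Lines file (one object per line), which the datasets library reads directly. A minimal sketch, reusing the rag_finetuning_data.jsonl file name that appears later in this section:
import json

examples = [
    {
        "context": "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
        "question": "Who is the Eiffel Tower named after?",
        "answer": "The Eiffel Tower is named after the engineer Gustave Eiffel."
    },
    # ... more examples
]

# Write one JSON object per line so load_dataset("json", ...) can read the file directly.
with open("rag_finetuning_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")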
We've chosen t5-small as our base model. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, making it versatile for various generation tasks. The task is: given a context and a question, generate an answer that is factually grounded in the context. We'll need to format our input to the T5 model appropriately, often by prefixing the input string with a task-specific instruction, like "answer the question based on the context:".
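To make that input format concrete, here is a tiny sketch of the string the model will receive for the Eiffel Tower example; the exact prefix and field ordering are choices we formalize in the preprocessing step below:
prefix = "answer the question based on the context: "
question = "Who is the Eiffel Tower named after?"
context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."
print(prefix + "question: " + question + " context: " + context)
# answer the question based on the context: question: Who is the Eiffel Tower named after? context: The Eiffel Tower is ...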
Dataset preparation is a significant step. The quality of your fine-tuning data directly impacts the performance of your specialized LLM.
Load your dataset: If you have a JSON file as described above, you can load it using the datasets library.
from datasets import load_dataset, Dataset

# If your data lives in JSONL files (e.g., 'rag_finetuning_data.jsonl'), load it with the datasets library:
# data_files = {"train": "path_to_your_train_data.jsonl", "validation": "path_to_your_validation_data.jsonl"}
# raw_datasets = load_dataset("json", data_files=data_files)

# For this example, we use a small dummy dataset.
# In a real scenario, you would load your actual dataset.
dummy_data = {
    "train": Dataset.from_list([
        {"id": "1", "context": "Paris is the capital of France. It is known for the Eiffel Tower.", "question": "What is the capital of France?", "answer": "Paris is the capital of France."},
        {"id": "2", "context": "The Amazon rainforest is the largest tropical rainforest. It spans nine countries.", "question": "How many countries does the Amazon rainforest span?", "answer": "It spans nine countries."}
    ]),
    "validation": Dataset.from_list([
        {"id": "3", "context": "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas.", "question": "Where is Mount Everest located?", "answer": "Mount Everest is located in the Mahalangur Himal sub-range of the Himalayas."}
    ])
}
raw_datasets = dummy_data
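If instead you have a single JSONL file rather than separate train and validation files, you can carve out a validation split yourself. A small sketch, assuming the rag_finetuning_data.jsonl file from earlier:
from datasets import load_dataset

full_dataset = load_dataset("json", data_files="rag_finetuning_data.jsonl", split="train")
split = full_dataset.train_test_split(test_size=0.1, seed=42)  # hold out 10% for validation
raw_datasets = {"train": split["train"], "validation": split["test"]}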
Tokenization and Formatting: We need to tokenize the inputs (context and question) and the targets (answers). For T5, a common practice is to concatenate the context and question, often with a prefix.
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 512  # Adjust based on your context lengths
max_target_length = 64  # Adjust based on your answer lengths

prefix = "answer the question based on the context: "

def preprocess_function(examples):
    inputs = [prefix + "question: " + q + " context: " + c for q, c in zip(examples["question"], examples["context"])]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")

    # Tokenize targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["answer"], max_length=max_target_length, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]

    # Ensure -100 is used for padding tokens in labels so they are ignored by the loss function
    for i, label_ids in enumerate(model_inputs["labels"]):
        model_inputs["labels"][i] = [lid if lid != tokenizer.pad_token_id else -100 for lid in label_ids]
    return model_inputs

tokenized_datasets = {
    "train": raw_datasets["train"].map(preprocess_function, batched=True, remove_columns=raw_datasets["train"].column_names),
    "validation": raw_datasets["validation"].map(preprocess_function, batched=True, remove_columns=raw_datasets["validation"].column_names)
}
The prefix helps the model understand the task, and concatenating "question: " and "context: " explicitly labels these parts for the model. When tokenizing labels, tokenizer.as_target_tokenizer() is used for sequence-to-sequence models (recent transformers versions also accept a text_target argument on the tokenizer directly). Padded label tokens are set to -100 so they are ignored by the loss function.
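Before training, it is worth decoding one processed example to confirm the input and target look as intended; a quick sanity-check sketch:
sample = tokenized_datasets["train"][0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))

# Drop the -100 placeholders before decoding the label
label_ids = [lid for lid in sample["labels"] if lid != -100]
print(tokenizer.decode(label_ids, skip_special_tokens=True))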
Full fine-tuning of even "small" LLMs can be resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA allow us to fine-tune only a small subset of model parameters, significantly reducing computational requirements and storage.
Load the base model:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
Configure LoRA:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,                        # Rank of the update matrices. Higher rank means more parameters.
    lora_alpha=32,               # Alpha scaling factor.
    target_modules=["q", "v"],   # Apply LoRA to query and value weights in attention
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # For sequence-to-sequence models like T5
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Example output: trainable params: XXXX || all params: YYYYYY || trainable%: Z.ZZZZ
This configuration applies LoRA to the query (q) and value (v) projection matrices in the attention layers. The print_trainable_parameters method shows how drastically LoRA reduces the number of parameters that need to be updated.
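If you are unsure which module names are available as target_modules, you can inspect the base model directly. The sketch below relies on T5's attention layers exposing q, k, v, and o projections:
# List the projection names present in the model's attention layers
projection_names = sorted({name.split(".")[-1] for name, _ in model.named_modules()
                           if name.endswith((".q", ".k", ".v", ".o"))})
print(projection_names)  # expected: ['k', 'o', 'q', 'v'] for T5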
Set up Training Arguments and Trainer:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=peft_model,
    label_pad_token_id=-100,  # Important for ignoring padding in loss calculation
    pad_to_multiple_of=8      # Optional: for TPU efficiency
)

output_dir = "t5-small-rag-finetuned-lora"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,   # Adjust based on your GPU memory
    per_device_eval_batch_size=4,    # Adjust based on your GPU memory
    gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
    learning_rate=1e-4,              # Common learning rate for LoRA
    num_train_epochs=3,              # Adjust as needed
    logging_dir=f"{output_dir}/logs",
    logging_steps=10,
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch (renamed to eval_strategy in newer transformers versions)
    save_strategy="epoch",           # Save model at the end of each epoch
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Or a ROUGE score if you implement compute_metrics
    predict_with_generate=True,      # Necessary for generating text during evaluation
    fp16=True,                       # Enable mixed-precision training if GPU supports it
)

trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
Start Training:
trainer.train()
This will initiate the fine-tuning process. Monitor the training and validation loss.
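One simple way to review those losses afterwards is the trainer's log history; a small sketch:
# Each record is a dict such as {"loss": ..., "step": ...} or {"eval_loss": ..., "epoch": ...}
for record in trainer.state.log_history:
    if "loss" in record or "eval_loss" in record:
        print(record)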
Save the PEFT Adapter: After training, only the LoRA adapter weights need to be saved, which are very small.
peft_model.save_pretrained(f"{output_dir}/best_lora_adapter")
# You can also save the tokenizer for convenience
tokenizer.save_pretrained(f"{output_dir}/best_lora_adapter")
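If you would rather deploy a single standalone checkpoint instead of shipping the base model plus an adapter, you can merge the LoRA weights into the base weights. A sketch (the merged_model directory name is just an example):
# Folds the LoRA updates into the base weights; the result no longer needs the peft library at inference time
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained(f"{output_dir}/merged_model")
tokenizer.save_pretrained(f"{output_dir}/merged_model")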
Evaluating the fine-tuned model is critical. We need to assess its ability to generate accurate and relevant answers based on the provided context.
Define a compute_metrics function for the Trainer (optional but recommended):
You can use metrics like ROUGE, which measures the overlap between the generated answer and the reference answer.
import numpy as np
import evaluate

rouge_metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Decode generated answers
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Light cleanup. (For rougeLsum, sentences should be newline-separated,
    # e.g., with nltk.sent_tokenize; we keep it simple here.)
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]

    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Scale scores to percentages
    result = {key: value * 100 for key, value in result.items()}

    # Average generated length (non-padding tokens)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
# Re-initialize trainer with compute_metrics if you define it
# trainer = Seq2SeqTrainer(
# ...
# compute_metrics=compute_metrics,
# ...
# )
# Then call trainer.evaluate() or it will be called during training if evaluation_strategy is set.
If you integrate compute_metrics into the Seq2SeqTrainer, it will automatically calculate these metrics during evaluation phases. For this walkthrough, we are focusing on the eval_loss for selecting the best model.
Qualitative Evaluation: Perform a qualitative review of the generated outputs.
Load the fine-tuned adapter:
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

# Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Load the LoRA adapter on top of it
fine_tuned_model = PeftModel.from_pretrained(base_model, f"{output_dir}/best_lora_adapter")
fine_tuned_model = fine_tuned_model.to("cuda" if torch.cuda.is_available() else "cpu")  # Move to device
fine_tuned_model.eval()  # Set to evaluation mode

# If you saved the tokenizer along with the adapter:
# fine_tuned_tokenizer = AutoTokenizer.from_pretrained(f"{output_dir}/best_lora_adapter")
# Otherwise, use the original tokenizer:
fine_tuned_tokenizer = tokenizer
Generate answers for some test examples:
test_context = "The Atacama Desert is a desert plateau in South America covering a 1,600 km strip of land on the Pacific coast, west of the Andes Mountains. It is the driest nonpolar desert."
test_question = "What is the Atacama Desert?"
input_text = f"{prefix}question: {test_question} context: {test_context}"
input_ids = fine_tuned_tokenizer(input_text, return_tensors="pt", max_length=max_input_length, truncation=True).input_ids.to(fine_tuned_model.device)
outputs = fine_tuned_model.generate(input_ids, max_length=max_target_length, num_beams=4, early_stopping=True)
generated_answer = fine_tuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Context: {test_context}")
print(f"Question: {test_question}")
print(f"Generated Answer: {generated_answer}")
Check whether the answers are factually grounded in the provided context, directly relevant to the question, and free of unsupported (hallucinated) details.
Comparison: Compare these outputs against those from the base t5-small model (without fine-tuning). A simple quantitative comparison could involve ROUGE scores if you have a labeled test set:
(Figure: Simulated ROUGE-L scores showing potential improvement after RAG-specific fine-tuning. Actual results will vary based on data and training.)
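A rough way to produce such a comparison is to score both the original and the fine-tuned model on the same held-out examples. A sketch, reusing the validation split here in place of a dedicated test set:
import evaluate
rouge = evaluate.load("rouge")

def generate_answers(model_to_eval):
    answers = []
    for example in raw_datasets["validation"]:
        text = f"{prefix}question: {example['question']} context: {example['context']}"
        ids = tokenizer(text, return_tensors="pt", max_length=max_input_length,
                        truncation=True).input_ids.to(model_to_eval.device)
        out = model_to_eval.generate(ids, max_length=max_target_length, num_beams=4)
        answers.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return answers

references = [ex["answer"] for ex in raw_datasets["validation"]]
base_only = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(fine_tuned_model.device)
print("base t5-small:", rouge.compute(predictions=generate_answers(base_only), references=references))
print("fine-tuned:   ", rouge.compute(predictions=generate_answers(fine_tuned_model), references=references))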
Once satisfied with your fine-tuned smaller LLM, integrate it as the generator in your RAG pipeline, and keep experimenting with the LoRA hyperparameters (r, lora_alpha), learning rates, and other training parameters.
This hands-on exercise demonstrates a technique for optimizing the generation component of your RAG system. By fine-tuning smaller LLMs like T5-small with PEFT methods, you can create specialized, efficient, and effective generators tailored to your production needs, leading to higher quality outputs and better resource utilization. This approach is a step towards building maintainable RAG solutions.