After training a model, its performance must be thoroughly measured to ensure effectiveness. This section guides you through evaluating a fine-tuned model, combining automated, quantitative metrics with manual, qualitative review to build a complete picture of the model's capabilities and limitations.
Before we begin, ensure your environment is prepared. We will need the fine-tuned model adapters, the original base model, and, most importantly, a hold-out test dataset. It is a fundamental principle of machine learning that a model must be evaluated on data it has never seen during training or validation to get an unbiased assessment of its generalization ability.
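If your own dataset does not ship with a dedicated test split, you can hold one out yourself before training begins. The sketch below shows one way to do this with the datasets library; the dataset name, split ratio, and seed are placeholder choices for illustration, not values used elsewhere in this guide.
from datasets import load_dataset
# Hypothetical example: carve a hold-out test split from a dataset that only
# provides a single "train" split. The 10% test size and the seed are arbitrary.
raw_dataset = load_dataset("your-dataset-name", split="train")
splits = raw_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = splits["train"]
test_dataset = splits["test"]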
Our evaluation workflow proceeds from data input to final assessment: we first generate predictions on the hold-out test set, then score them with automated metrics such as ROUGE and perplexity, and finally review a sample of outputs by hand.
First, let's install the evaluate library from Hugging Face, which provides easy access to common metrics.
pip install evaluate rouge_score
We'll start by loading the base model and applying our trained LoRA adapters. For this exercise, assume your LoRA adapter configuration and weights are saved in a directory named ./my-lora-adapter. We will also load the test split from the dataset used in training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset
# Define model and adapter paths
base_model_id = "meta-llama/Llama-2-7b-hf"
adapter_path = "./my-lora-adapter"
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load the LoRA model
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
# Load the test dataset
test_dataset = load_dataset("samsum", split="test")
Note: For inference, it is common to merge the LoRA weights directly into the base model using model.merge_and_unload(). This creates a standard model artifact, simplifying deployment as you no longer need the PEFT library at inference time. We are keeping them separate here to illustrate the components.
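For reference, merging is a one-line call on the PeftModel. The sketch below reuses the model and tokenizer loaded above and saves the merged weights to an example directory name; skip it if you want to keep the adapter separate, as we do for the rest of this section.
# Optional: merge the LoRA weights into the base model for standalone deployment.
# Reuses `model` (a PeftModel) and `tokenizer` from above; the output directory
# name is only an example.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")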
With our model and data ready, the next step is to generate a prediction for each sample in our test dataset. We will iterate through the dataset, format the input prompt just as we did for training, tokenize it, and pass it to the model's generate method. We will store both the model's output (prediction) and the ground-truth summary (reference).
import pandas as pd
from tqdm import tqdm
# Let's generate predictions for a subset for this example
test_samples = test_dataset.select(range(100))
predictions = []
references = []
for sample in tqdm(test_samples):
    prompt = f"""
Summarize the following conversation.
{sample["dialogue"]}
Summary:
"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=50,
            do_sample=True,
            top_p=0.9,
            temperature=0.1
        )

    # Decode the output
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Clean up the output to keep only the generated summary
    summary = prediction.split("Summary:")[1].strip()

    predictions.append(summary)
    references.append(sample["summary"])
Now that we have our predictions and references, we can compute automated metrics.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). ROUGE-L, for instance, measures the longest common subsequence.
BLEU (Bilingual Evaluation Understudy) is another popular metric, primarily for translation, which measures precision by comparing n-grams of the candidate with n-grams of the reference.
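We will focus on ROUGE for this summarization task, but for completeness, here is a minimal sketch of computing a corpus-level BLEU score with the same evaluate library, assuming the predictions and references lists built above.
import evaluate
# Load the BLEU metric. BLEU expects a list of reference lists, because each
# prediction may have several valid references; here each has exactly one.
bleu_metric = evaluate.load("bleu")
bleu_scores = bleu_metric.compute(
    predictions=predictions,
    references=[[ref] for ref in references]
)
print(bleu_scores["bleu"])
Now let's compute ROUGE for our generated summaries.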
import evaluate
# Load the ROUGE metric
rouge_metric = evaluate.load("rouge")
# Compute scores
rouge_scores = rouge_metric.compute(predictions=predictions, references=references)
print(rouge_scores)
The output will be a dictionary containing different ROUGE scores, like rouge1, rouge2, and rougeL. These numbers provide a standardized way to compare model performance. A higher score generally indicates better alignment with the reference texts.
ROUGE scores for a sample evaluation run. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures the longest common subsequence.
Perplexity is a measure of how well a probability distribution or probability model predicts a sample. In the context of language models, it can be interpreted as a measure of the model's "surprise" when encountering the test set. It is calculated from the cross-entropy loss and is often expressed as $\text{PPL} = e^{\mathcal{L}_{\text{CE}}}$, where $\mathcal{L}_{\text{CE}}$ is the average cross-entropy loss per token. A lower perplexity score indicates that the model is more confident in its predictions, which is a desirable trait.
Calculating perplexity requires us to compute the loss on the test set without updating any gradients.
# A simplified function to calculate perplexity
def calculate_perplexity(model, tokenizer, data, device="cuda"):
    total_loss = 0
    total_tokens = 0

    for sample in tqdm(data):
        text = sample["dialogue"] + " " + sample["summary"]
        inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
        input_ids = inputs.input_ids.to(device)
        target_ids = input_ids.clone()

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            loss = outputs.loss

        total_loss += loss.item() * input_ids.size(1)
        total_tokens += input_ids.size(1)

    avg_loss = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_loss))
    return perplexity.item()
# Calculate perplexity on our test samples
perplexity_score = calculate_perplexity(model, tokenizer, test_samples)
print(f"Perplexity: {perplexity_score:.2f}")
A lower perplexity, for example a score of 15.43, would suggest the model has learned the patterns in the target domain well.
Metrics provide a high-level summary, but they do not capture everything. A model might achieve a good ROUGE score but produce summaries that are factually incorrect or stylistically awkward. This is where qualitative analysis becomes necessary.
By manually inspecting the model's outputs, you can check for:
- Factual consistency: does the summary invent details (hallucinations) that are not in the dialogue?
- Completeness: does it capture the key points, or drop important details?
- Fluency and style: is the output grammatical, coherent, and in the expected tone?
Let's examine a few examples side-by-side.
| Input Dialogue | Reference Summary | Model's Generated Summary | Analysis |
|---|---|---|---|
| Amanda: Can you pick up dinner tonight? <br> Ben: Sure, what do you want? <br> Amanda: Pizza would be great. <br> Ben: Ok, I'll grab it on my way home. | Amanda asked Ben to get pizza for dinner on his way home. | Ben will pick up a pizza for dinner on his way home from work. | Good. The model correctly identifies the main points and produces a fluent, accurate summary. |
| Chloe: The project deadline is Friday. <br> David: I know, I'm almost done with the report. I just need to add the final charts. <br> Chloe: Great, send it over when you're ready. | David is finishing a report with charts for a project due on Friday. | David needs to finish his report. The deadline is Friday. | Fair. The summary is correct but misses the specific detail about the charts. It's less informative than the reference. |
| Eve: I'm thinking of booking a flight to Paris. <br> Frank: Nice! When are you going? <br> Eve: Maybe in May. It's lovely there in the spring. | Eve is planning a trip to Paris, possibly in May. | Eve is booking a flight to Paris for her vacation next week and will be staying at the Grand Hotel. | Poor (Hallucination). The model invented details not present in the dialogue (vacation, next week, Grand Hotel). |
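One lightweight way to organize this kind of review at scale is to collect the dialogues, references, and predictions into a single table and sample rows for inspection. Here is a minimal sketch using the pandas import from earlier; the sample size and seed are arbitrary.
# Assemble a side-by-side review table from the lists built during generation.
review_df = pd.DataFrame({
    "dialogue": test_samples["dialogue"],
    "reference": references,
    "prediction": predictions,
})

# Inspect a random handful of rows for hallucinations, omissions, and fluency.
print(review_df.sample(5, random_state=0).to_string())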
This manual review is an indispensable part of the evaluation loop. It provides insights that quantitative scores cannot, helping you identify specific failure modes that can be addressed by curating better training data or adjusting the fine-tuning process. By combining both quantitative and qualitative methods, you gain a well-rounded understanding of your model's true performance.