After training a model, its performance must be thoroughly measured to ensure effectiveness. This section guides you through evaluating a fine-tuned model, combining automated, quantitative metrics with manual, qualitative review to build a complete picture of the model's capabilities and limitations.
Before we begin, ensure your environment is prepared. We will need the fine-tuned model adapters, the original base model, and, most importantly, a hold-out test dataset. It is a fundamental principle of machine learning that a model must be evaluated on data it has never seen during training or validation to get an unbiased assessment of its generalization ability.
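If your own dataset does not ship with a dedicated test split, you can hold one out yourself before training begins. The sketch below shows one way to do this with the datasets library; the dataset name, split ratio, and seed are placeholder choices for illustration, not values used elsewhere in this guide.
from datasets import load_dataset
# Hypothetical example: carve a hold-out test split from a dataset that only
# provides a single "train" split. The 10% test size and the seed are arbitrary.
raw_dataset = load_dataset("your-dataset-name", split="train")
splits = raw_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = splits["train"]
test_dataset = splits["test"]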
Our evaluation workflow proceeds from data input to final assessment: we first generate predictions on the hold-out test set, then score them with automated metrics such as ROUGE and perplexity, and finally review a sample of outputs by hand.
First, let's install the evaluate library from Hugging Face, which provides easy access to common metrics.
pip install evaluate rouge_score
We'll start by loading the base model and applying our trained LoRA adapters. For this exercise, assume your LoRA adapter configuration and weights are saved in a directory named ./my-lora-adapter. We will also load the test split from the dataset used in training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset
# Define model and adapter paths
base_model_id = "meta-llama/Llama-2-7b-hf"
adapter_path = "./my-lora-adapter"
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load the LoRA model
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
# Load the test dataset
test_dataset = load_dataset("samsum", split="test")
Note: For inference, it is common to merge the LoRA weights directly into the base model using model.merge_and_unload(). This creates a standard model artifact, simplifying deployment as you no longer need the PEFT library at inference time. We are keeping them separate here to illustrate the components.
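For reference, merging is a one-line call on the PeftModel. The sketch below reuses the model and tokenizer loaded above and saves the merged weights to an example directory name; skip it if you want to keep the adapter separate, as we do for the rest of this section.
# Optional: merge the LoRA weights into the base model for standalone deployment.
# Reuses `model` (a PeftModel) and `tokenizer` from above; the output directory
# name is only an example.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")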
With our model and data ready, the next step is to generate a prediction for each sample in our test dataset. We will iterate through the dataset, format the input prompt just as we did for training, tokenize it, and pass it to the model's generate method. We will store both the model's output (prediction) and the ground-truth summary (reference).
import pandas as pd
from tqdm import tqdm
# Let's generate predictions for a subset for this example
test_samples = test_dataset.select(range(100))
predictions = []
references = []
for sample in tqdm(test_samples):
    prompt = f"""
Summarize the following conversation.
{sample["dialogue"]}
Summary:
"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=50,
            do_sample=True,
            top_p=0.9,
            temperature=0.1
        )

    # Decode the output
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Clean up the output to keep only the generated summary
    summary = prediction.split("Summary:")[1].strip()

    predictions.append(summary)
    references.append(sample["summary"])
Now that we have our predictions and references, we can compute automated metrics.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). ROUGE-L, for instance, measures the longest common subsequence.
BLEU (Bilingual Evaluation Understudy) is another popular metric, primarily for translation, which measures precision by comparing n-grams of the candidate with n-grams of the reference.
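We will focus on ROUGE for this summarization task, but for completeness, here is a minimal sketch of computing a corpus-level BLEU score with the same evaluate library, assuming the predictions and references lists built above.
import evaluate
# Load the BLEU metric. BLEU expects a list of reference lists, because each
# prediction may have several valid references; here each has exactly one.
bleu_metric = evaluate.load("bleu")
bleu_scores = bleu_metric.compute(
    predictions=predictions,
    references=[[ref] for ref in references]
)
print(bleu_scores["bleu"])
Now let's compute ROUGE for our generated summaries.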
import evaluate
# Load the ROUGE metric
rouge_metric = evaluate.load("rouge")
# Compute scores
rouge_scores = rouge_metric.compute(predictions=predictions, references=references)
print(rouge_scores)
The output will be a dictionary containing different ROUGE scores, like rouge1, rouge2, and rougeL. These numbers provide a standardized way to compare model performance. A higher score generally indicates better alignment with the reference texts.
ROUGE scores for a sample evaluation run. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures the longest common subsequence.
Perplexity is a measure of how well a probability distribution or probability model predicts a sample. In the context of language models, it can be interpreted as a measure of the model's "surprise" when encountering the test set. It is calculated from the cross-entropy loss and is often expressed as $\text{PPL} = e^{\mathcal{L}_{\text{CE}}}$, where $\mathcal{L}_{\text{CE}}$ is the average cross-entropy loss per token. A lower perplexity score indicates that the model is more confident in its predictions, which is a desirable trait.
Calculating perplexity requires us to compute the loss on the test set without updating any gradients.
# A simplified function to calculate perplexity
def calculate_perplexity(model, tokenizer, data, device="cuda"):
    total_loss = 0
    total_tokens = 0

    for sample in tqdm(data):
        text = sample["dialogue"] + " " + sample["summary"]
        inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
        input_ids = inputs.input_ids.to(device)
        target_ids = input_ids.clone()

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            loss = outputs.loss

        total_loss += loss.item() * input_ids.size(1)
        total_tokens += input_ids.size(1)

    avg_loss = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_loss))
    return perplexity.item()
# Calculate perplexity on our test samples
perplexity_score = calculate_perplexity(model, tokenizer, test_samples)
print(f"Perplexity: {perplexity_score:.2f}")
A lower perplexity, for example a score of 15.43, would suggest the model has learned the patterns in the target domain well.
Metrics provide a high-level summary, but they do not capture everything. A model might achieve a good ROUGE score but produce summaries that are factually incorrect or stylistically awkward. This is where qualitative analysis becomes necessary.
By manually inspecting the model's outputs, you can check for:
- Factual consistency: does the summary invent details (hallucinations) that are not in the dialogue?
- Completeness: does it capture the key points, or drop important details?
- Fluency and style: is the output grammatical, coherent, and in the expected tone?
Let's examine a few examples side-by-side.
| Input Dialogue | Reference Summary | Model's Generated Summary | Analysis |
|---|---|---|---|
| Amanda: Can you pick up dinner tonight? <br> Ben: Sure, what do you want? <br> Amanda: Pizza would be great. <br> Ben: Ok, I'll grab it on my way home. | Amanda asked Ben to get pizza for dinner on his way home. | Ben will pick up a pizza for dinner on his way home from work. | Good. The model correctly identifies the main points and produces a fluent, accurate summary. |
| Chloe: The project deadline is Friday. <br> David: I know, I'm almost done with the report. I just need to add the final charts. <br> Chloe: Great, send it over when you're ready. | David is finishing a report with charts for a project due on Friday. | David needs to finish his report. The deadline is Friday. | Fair. The summary is correct but misses the specific detail about the charts. It's less informative than the reference. |
| Eve: I'm thinking of booking a flight to Paris. <br> Frank: Nice! When are you going? <br> Eve: Maybe in May. It's lovely there in the spring. | Eve is planning a trip to Paris, possibly in May. | Eve is booking a flight to Paris for her vacation next week and will be staying at the Grand Hotel. | Poor (Hallucination). The model invented details not present in the dialogue (vacation, next week, Grand Hotel). |
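One lightweight way to organize this kind of review at scale is to collect the dialogues, references, and predictions into a single table and sample rows for inspection. Here is a minimal sketch using the pandas import from earlier; the sample size and seed are arbitrary.
# Assemble a side-by-side review table from the lists built during generation.
review_df = pd.DataFrame({
    "dialogue": test_samples["dialogue"],
    "reference": references,
    "prediction": predictions,
})

# Inspect a random handful of rows for hallucinations, omissions, and fluency.
print(review_df.sample(5, random_state=0).to_string())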
This manual review is an indispensable part of the evaluation loop. It provides insights that quantitative scores cannot, helping you identify specific failure modes that can be addressed by curating better training data or adjusting the fine-tuning process. By combining both quantitative and qualitative methods, you gain a well-rounded understanding of your model's true performance.