Reliable and automated methods are necessary to evaluate a fine-tuned small language model. Manual testing is useful for spot checks, but a formalized evaluation script running against a holdout dataset ensures consistent and objective benchmarking. This script computes automated metrics like ROUGE and perplexity, providing a clear numerical representation of how much the model improved over its baseline.
To build an automated evaluation pipeline, you will use the evaluate library from the Hugging Face ecosystem. This library provides standardized, tested implementations for many natural language processing metrics. You will also need your holdout dataset. A holdout dataset is a portion of your data that was set aside before training, so the model has never seen it. Evaluating against this unseen data is the most reliable way to detect overfitting.
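If you have not yet set aside a holdout file, here is a minimal sketch of a deterministic random split in pure Python. The `holdout_split` helper and the `test_holdout.json` schema are illustrative choices, not part of any library:

```python
import json
import random

def holdout_split(records, holdout_fraction=0.1, seed=42):
    """Shuffle records deterministically and split off a holdout set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy records standing in for real prompt/response pairs
records = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(100)]
train, holdout = holdout_split(records)

# Persist the holdout set for the evaluation script to load later
with open("test_holdout.json", "w") as f:
    json.dump(holdout, f)
```

Fixing the seed makes the split reproducible, so every rerun of the evaluation sees the same unseen examples.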
Automated evaluation pipeline flow processing a holdout dataset through inference and metric calculation.
First, import the required modules and load your fine-tuned model alongside its tokenizer. You will also load the holdout dataset from a local JSON file.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import evaluate
from datasets import load_dataset
# Load test dataset
dataset = load_dataset("json", data_files="test_holdout.json", split="train")
# Load model and tokenizer
model_path = "./fine-tuned-slm-lora"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16
)
# Load evaluation metrics
rouge = evaluate.load("rouge")
perplexity = evaluate.load("perplexity", module_type="metric")
The ROUGE metric measures overlap between the generated text and a reference text. To compute this, your script must iterate through the holdout dataset, pass each prompt to the model, and store the generated output.
Batching prompts together is highly recommended when evaluating large datasets, since it makes much better use of the GPU. For clarity, this script iterates one example at a time. Each iteration captures the output and strips away the prompt tokens so only the generated response remains.
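The batching idea itself is simple to sketch in pure Python: chunk the list of prompts and hand each chunk to the tokenizer at once (with padding enabled). The `batched` helper below is illustrative, not a library function:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

prompts = [f"prompt {i}" for i in range(10)]
batches = list(batched(prompts, batch_size=4))
# Produces chunks of sizes 4, 4, and 2
```

In a batched version of the loop below, each chunk would be tokenized with `padding=True` and passed to `model.generate` as a single tensor.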
predictions = []
references = []
for item in dataset:
    prompt = item["prompt"]
    expected_output = item["response"]

    # Tokenize the input prompt and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode the text, ignoring the original prompt tokens
    input_length = inputs.input_ids.shape[1]
    generated_text = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)

    predictions.append(generated_text.strip())
    references.append(expected_output.strip())
# Compute ROUGE
rouge_results = rouge.compute(predictions=predictions, references=references)
print("ROUGE Results:", rouge_results)
When you run this segment, the script accumulates the generated responses in the predictions list and the true expected outputs in the references list. The rouge.compute function then calculates the overlap.
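To build intuition for what that overlap score means, here is a toy, hand-written ROUGE-1 F1 computation. This is a simplified sketch for illustration only; the evaluate library's implementation additionally handles longer n-grams, ROUGE-L, and optional stemming:

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1: a simplified version of ROUGE-1."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count unigrams present in both strings (respecting multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
# Five of six unigrams overlap, so the score is 5/6
```

A score of 1.0 means the prediction and reference share every unigram; 0.0 means no overlap at all.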
Perplexity requires a different approach. Instead of comparing a generated string to a reference string, perplexity measures how well the model predicts the correct sequence of tokens based on the underlying probability distribution. As a reminder, perplexity is the exponentiated average negative log-likelihood of the token sequence:

$$\text{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$
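A minimal numeric sketch of this definition, given the per-token log-probabilities the model assigned to a reference sequence (the values here are made up for illustration):

```python
import math

def perplexity_from_log_probs(log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
uniform = [math.log(0.25)] * 8
ppl = perplexity_from_log_probs(uniform)
```

This is why lower perplexity means higher confidence: a perfect model assigns probability 1 to every reference token and scores a perplexity of exactly 1.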
To calculate this using the evaluate library, you must provide the full text sequences containing both the prompt and the expected response. The metric loads the model itself from the path given in model_id and passes the sequences through it to compute the log probabilities, so the directory must contain a standalone model (merge LoRA adapters into the base model first if necessary).
# Combine prompt and response into full sequences
full_texts = [f"{p} {r}" for p, r in zip(dataset["prompt"], dataset["response"])]
ppl_results = perplexity.compute(
    model_id=model_path,
    add_start_token=False,
    predictions=full_texts
)
print("Mean Perplexity:", ppl_results["mean_perplexity"])
Once the evaluation script finishes execution, you will receive numerical values for ROUGE and perplexity. A lower perplexity score indicates that the model is more confident in predicting the reference dataset. A higher ROUGE score indicates better overlap between the generated text and the reference text.
Expected improvement in ROUGE metrics after completing the fine-tuning process compared to the base model.
By saving these scripts, you establish a repeatable baseline. If you decide to adjust training hyperparameters, change your LoRA rank configuration, or increase your dataset size, you can rerun this exact script to verify whether those changes resulted in a measurable improvement.
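One way to make that comparison concrete is to append each run's metrics to a history file and report the delta against the previous run. The `log_run` helper, file name, and schema below are illustrative choices, not part of the evaluate API:

```python
import json
import os

def log_run(metrics, path):
    """Append this run's metrics; return the delta vs. the previous run."""
    history = []
    if os.path.exists(path):
        with open(path) as f:
            history = json.load(f)
    delta = None
    if history:
        prev = history[-1]
        delta = {k: metrics[k] - prev[k] for k in metrics if k in prev}
    history.append(metrics)
    with open(path, "w") as f:
        json.dump(history, f, indent=2)
    return delta

path = "eval_history_demo.json"
if os.path.exists(path):
    os.remove(path)  # start fresh for this demonstration

first = log_run({"rougeL": 0.31, "mean_perplexity": 18.2}, path)   # no previous run
second = log_run({"rougeL": 0.38, "mean_perplexity": 12.7}, path)  # delta vs. first
```

After a hyperparameter change, a positive `rougeL` delta and a negative `mean_perplexity` delta together indicate a measurable improvement.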