Reliable and automated methods are necessary to evaluate a fine-tuned small language model. Manual testing is useful for spot checks, but a formalized evaluation script running against a holdout dataset ensures consistent and objective benchmarking. This script computes automated metrics like ROUGE and perplexity, providing a clear numerical representation of how much the model improved over its baseline.
To build an automated evaluation pipeline, you will use the evaluate library from the Hugging Face ecosystem. This library provides standardized, tested implementations for many natural language processing metrics. You will also need your holdout dataset. A holdout dataset is a portion of your data that was set aside before training, so the model has never seen it. Evaluating against this unseen data is the most reliable way to detect overfitting.
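If you have not yet set aside a holdout file, here is a minimal sketch of a deterministic random split in pure Python. The `holdout_split` helper and the `test_holdout.json` schema are illustrative choices, not part of any library:

```python
import json
import random

def holdout_split(records, holdout_fraction=0.1, seed=42):
    """Shuffle records deterministically and split off a holdout set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy records standing in for real prompt/response pairs
records = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(100)]
train, holdout = holdout_split(records)

# Persist the holdout set for the evaluation script to load later
with open("test_holdout.json", "w") as f:
    json.dump(holdout, f)
```

Fixing the seed makes the split reproducible, so every rerun of the evaluation sees the same unseen examples.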
Automated evaluation pipeline flow processing a holdout dataset through inference and metric calculation.
First, import the required modules and load your fine-tuned model alongside its tokenizer. You will also load the holdout dataset from a local JSON file.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import evaluate
from datasets import load_dataset
# Load test dataset
dataset = load_dataset("json", data_files="test_holdout.json", split="train")
# Load model and tokenizer
model_path = "./fine-tuned-slm-lora"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16
)
# Load evaluation metrics
rouge = evaluate.load("rouge")
perplexity = evaluate.load("perplexity", module_type="metric")
The ROUGE metric measures overlap between the generated text and a reference text. To compute this, your script must iterate through the holdout dataset, pass each prompt to the model, and store the generated output.
Batching prompts together is highly recommended when evaluating large datasets, since it makes much better use of the GPU. For clarity, this script iterates one example at a time. Each iteration captures the output and strips away the prompt tokens so only the generated response remains.
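The batching idea itself is simple to sketch in pure Python: chunk the list of prompts and hand each chunk to the tokenizer at once (with padding enabled). The `batched` helper below is illustrative, not a library function:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

prompts = [f"prompt {i}" for i in range(10)]
batches = list(batched(prompts, batch_size=4))
# Produces chunks of sizes 4, 4, and 2
```

In a batched version of the loop below, each chunk would be tokenized with `padding=True` and passed to `model.generate` as a single tensor.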
predictions = []
references = []
for item in dataset:
    prompt = item["prompt"]
    expected_output = item["response"]

    # Tokenize the input prompt and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode the text, ignoring the original prompt tokens
    input_length = inputs.input_ids.shape[1]
    generated_text = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)

    predictions.append(generated_text.strip())
    references.append(expected_output.strip())
# Compute ROUGE
rouge_results = rouge.compute(predictions=predictions, references=references)
print("ROUGE Results:", rouge_results)
When you run this segment, the script accumulates the generated responses in the predictions list and the true expected outputs in the references list. The rouge.compute function then calculates the overlap.
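To build intuition for what that overlap score means, here is a toy, hand-written ROUGE-1 F1 computation. This is a simplified sketch for illustration only; the evaluate library's implementation additionally handles longer n-grams, ROUGE-L, and optional stemming:

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1: a simplified version of ROUGE-1."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count unigrams present in both strings (respecting multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
# Five of six unigrams overlap, so the score is 5/6
```

A score of 1.0 means the prediction and reference share every unigram; 0.0 means no overlap at all.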
Perplexity requires a different approach. Instead of comparing a generated string to a reference string, perplexity measures how well the model predicts the correct sequence of tokens based on the underlying probability distribution. As a reminder, perplexity is the exponentiated average negative log-likelihood of the token sequence:

$$\text{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$
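A minimal numeric sketch of this definition, given the per-token log-probabilities the model assigned to a reference sequence (the values here are made up for illustration):

```python
import math

def perplexity_from_log_probs(log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
uniform = [math.log(0.25)] * 8
ppl = perplexity_from_log_probs(uniform)
```

This is why lower perplexity means higher confidence: a perfect model assigns probability 1 to every reference token and scores a perplexity of exactly 1.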
To calculate this using the evaluate library, you must provide the full text sequences containing both the prompt and the expected response. The metric loads the model itself from the path given in model_id and passes the sequences through it to compute the log probabilities, so the directory must contain a standalone model (merge LoRA adapters into the base model first if necessary).
# Combine prompt and response into full sequences
full_texts = [f"{p} {r}" for p, r in zip(dataset["prompt"], dataset["response"])]
ppl_results = perplexity.compute(
    model_id=model_path,
    add_start_token=False,
    predictions=full_texts
)
print("Mean Perplexity:", ppl_results["mean_perplexity"])
Once the evaluation script finishes execution, you will receive numerical values for ROUGE and perplexity. A lower perplexity score indicates that the model is more confident in predicting the reference dataset. A higher ROUGE score indicates better overlap between the generated text and the reference text.
Expected improvement in ROUGE metrics after completing the fine-tuning process compared to the base model.
By saving these scripts, you establish a repeatable baseline. If you decide to adjust training hyperparameters, change your LoRA rank configuration, or increase your dataset size, you can rerun this exact script to verify whether those changes resulted in a measurable improvement.
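One way to make that comparison concrete is to append each run's metrics to a history file and report the delta against the previous run. The `log_run` helper, file name, and schema below are illustrative choices, not part of the evaluate API:

```python
import json
import os

def log_run(metrics, path):
    """Append this run's metrics; return the delta vs. the previous run."""
    history = []
    if os.path.exists(path):
        with open(path) as f:
            history = json.load(f)
    delta = None
    if history:
        prev = history[-1]
        delta = {k: metrics[k] - prev[k] for k in metrics if k in prev}
    history.append(metrics)
    with open(path, "w") as f:
        json.dump(history, f, indent=2)
    return delta

path = "eval_history_demo.json"
if os.path.exists(path):
    os.remove(path)  # start fresh for this demonstration

first = log_run({"rougeL": 0.31, "mean_perplexity": 18.2}, path)   # no previous run
second = log_run({"rougeL": 0.38, "mean_perplexity": 12.7}, path)  # delta vs. first
```

After a hyperparameter change, a positive `rougeL` delta and a negative `mean_perplexity` delta together indicate a measurable improvement.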