Human evaluation of generated text is often slow and subjective. To measure progress across multiple epochs or compare different model checkpoints objectively, automated quantitative metrics are necessary. These mathematical scoring methods evaluate how confidently a model predicts tokens and how closely its outputs match expected reference texts.
Perplexity is the most fundamental metric for language modeling. It measures how surprised a model is by a sequence of words. A lower perplexity indicates the model assigns a higher probability to the true text, meaning it predicts the sequence more accurately.
Mathematically, it is the exponentiated average negative log-likelihood of a sequence. If \(N\) represents the number of tokens in a sequence \(X = (x_1, x_2, \dots, x_N)\), perplexity measures how well the model predicts that sample using the following formula:

\[ \text{PPL}(X) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right) \]
In this equation, \(p(x_i \mid x_{<i})\) is the probability the model assigns to the \(i\)-th token given all preceding tokens. If the model is highly confident and correct, this probability approaches 1, the log approaches 0, and the overall perplexity approaches 1. If the model struggles to guess the next token, the probability drops, making the log term more negative and driving the perplexity higher. During fine-tuning, you will typically compute perplexity on your validation dataset to ensure the model is learning the structure of your specific task.
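As a minimal sketch of this formula, the following function computes perplexity from hypothetical per-token probabilities (in practice these would come from a model's softmax outputs rather than hard-coded values):

```python
import math

def perplexity(token_probs):
    """Compute perplexity from the probabilities a model assigned
    to each true token in a sequence."""
    n = len(token_probs)
    # Average negative log-likelihood over the sequence
    nll = -sum(math.log(p) for p in token_probs) / n
    # Exponentiate to obtain perplexity
    return math.exp(nll)

# A confident model: probabilities near 1 yield perplexity near 1
print(perplexity([0.9, 0.8, 0.95]))

# An uncertain model: low probabilities drive perplexity up
print(perplexity([0.1, 0.2, 0.05]))
```

Note that a model assigning probability 1 to every token would reach the minimum perplexity of exactly 1, matching the behavior described above.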
Figure: Validation perplexity decreasing across five training epochs, indicating the model is becoming more confident in predicting the evaluation dataset.
While perplexity tells you about the internal probability distribution of the model, it does not directly measure the quality of the final generated text. For tasks like instruction following, summarization, or question answering, you need to compare the generated output against a human-written reference.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics originally designed to evaluate automatic summarization and machine translation systems. It calculates the overlap of n-grams between the generated text and the reference text.
There are several variants of ROUGE that you will use to evaluate your small language model. ROUGE-1 measures the overlap of unigrams (individual words) between the generated and reference texts. ROUGE-2 measures the overlap of bigrams (pairs of consecutive words), which captures some local word order. ROUGE-L measures the longest common subsequence between the two texts, rewarding longer stretches of in-order matches without requiring them to be contiguous.
Figure: Extraction and matching of unigrams between a reference string and a generated string for ROUGE-1 calculation.
For each ROUGE variant, the score is typically broken down into three components. Recall measures the proportion of words in the reference text that the model managed to generate. Precision measures the proportion of words in the generated text that were actually relevant and present in the reference. The F1-score is the harmonic mean of precision and recall, providing a single balanced metric.
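To make these three components concrete, here is a simplified sketch of ROUGE-1 computed by hand with clipped unigram counts (real implementations add stemming and other normalization, which this sketch omits):

```python
from collections import Counter

def rouge1(prediction, reference):
    """Simplified ROUGE-1: unigram overlap with clipped counts."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    pred_counts = Counter(pred_tokens)
    ref_counts = Counter(ref_tokens)
    # Each unigram counts at most as often as it appears in both texts
    overlap = sum(min(pred_counts[w], ref_counts[w]) for w in pred_counts)
    precision = overlap / len(pred_tokens)   # fraction of generated words that match
    recall = overlap / len(ref_tokens)       # fraction of reference words recovered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    "the small language model generates accurate text",
    "the fine-tuned language model produces accurate text",
)
print(scores)
```

Here 5 of the 7 generated words appear in the 7-word reference, so precision and recall are both 5/7 and the F1-score equals them.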
In Python, you can compute these metrics using the evaluate library from the Hugging Face ecosystem. This approach standardizes the calculation, ensuring your results are directly comparable to other machine learning projects.
import evaluate
# Load the ROUGE evaluation module
rouge = evaluate.load("rouge")
# Define your model's output and the expected reference
predictions = ["the small language model generates accurate text"]
references = ["the fine-tuned language model produces accurate text"]
# Compute the scores
results = rouge.compute(predictions=predictions, references=references)
print(results)
Running this script outputs a dictionary containing the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores. By computing these numbers on a dedicated holdout dataset, you establish a concrete baseline for your model. If you modify your hyperparameters and train a second adapter, comparing the ROUGE and perplexity scores will tell you objectively whether the new version is an improvement over the original.