Human evaluation of generated text is often slow and subjective. To measure progress across multiple epochs or compare different model checkpoints objectively, automated quantitative metrics are necessary. These mathematical scoring methods evaluate how confidently a model predicts tokens and how closely its outputs match expected reference texts.
Perplexity is the most fundamental metric for language modeling. It measures how surprised a model is by a sequence of words. A lower perplexity indicates the model assigns a higher probability to the true text, meaning it predicts the sequence more accurately.
Mathematically, it is the exponentiated average negative log-likelihood of a sequence. If \(N\) represents the number of tokens in a sequence \(X = (x_1, x_2, \dots, x_N)\), perplexity measures how well the model predicts that sample using the following formula:

\[ \text{PPL}(X) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right) \]
In this equation, \(p(x_i \mid x_{<i})\) is the probability the model assigns to the \(i\)-th token given all preceding tokens. If the model is highly confident and correct, this probability approaches 1, the log approaches 0, and the overall perplexity approaches 1. If the model struggles to guess the next token, the probability drops, making the log term more negative and driving the perplexity higher. During fine-tuning, you will typically compute perplexity on your validation dataset to ensure the model is learning the structure of your specific task.
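As a minimal sketch of this formula, the following function computes perplexity from hypothetical per-token probabilities (in practice these would come from a model's softmax outputs rather than hard-coded values):

```python
import math

def perplexity(token_probs):
    """Compute perplexity from the probabilities a model assigned
    to each true token in a sequence."""
    n = len(token_probs)
    # Average negative log-likelihood over the sequence
    nll = -sum(math.log(p) for p in token_probs) / n
    # Exponentiate to obtain perplexity
    return math.exp(nll)

# A confident model: probabilities near 1 yield perplexity near 1
print(perplexity([0.9, 0.8, 0.95]))

# An uncertain model: low probabilities drive perplexity up
print(perplexity([0.1, 0.2, 0.05]))
```

Note that a model assigning probability 1 to every token would reach the minimum perplexity of exactly 1, matching the behavior described above.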
Figure: Validation perplexity decreasing across five training epochs, indicating the model is becoming more confident in predicting the evaluation dataset.
While perplexity tells you about the internal probability distribution of the model, it does not directly measure the quality of the final generated text. For tasks like instruction following, summarization, or question answering, you need to compare the generated output against a human-written reference.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics originally designed to evaluate automatic summarization and machine translation systems. It calculates the overlap of n-grams between the generated text and the reference text.
There are several variants of ROUGE that you will use to evaluate your small language model. ROUGE-1 measures the overlap of unigrams (individual words) between the generated and reference texts. ROUGE-2 measures the overlap of bigrams (pairs of consecutive words), which captures some local word order. ROUGE-L measures the longest common subsequence between the two texts, rewarding longer stretches of in-order matches without requiring them to be contiguous.
Figure: Extraction and matching of unigrams between a reference string and a generated string for ROUGE-1 calculation.
For each ROUGE variant, the score is typically broken down into three components. Recall measures the proportion of words in the reference text that the model managed to generate. Precision measures the proportion of words in the generated text that were actually relevant and present in the reference. The F1-score is the harmonic mean of precision and recall, providing a single balanced metric.
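To make these three components concrete, here is a simplified sketch of ROUGE-1 computed by hand with clipped unigram counts (real implementations add stemming and other normalization, which this sketch omits):

```python
from collections import Counter

def rouge1(prediction, reference):
    """Simplified ROUGE-1: unigram overlap with clipped counts."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    pred_counts = Counter(pred_tokens)
    ref_counts = Counter(ref_tokens)
    # Each unigram counts at most as often as it appears in both texts
    overlap = sum(min(pred_counts[w], ref_counts[w]) for w in pred_counts)
    precision = overlap / len(pred_tokens)   # fraction of generated words that match
    recall = overlap / len(ref_tokens)       # fraction of reference words recovered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    "the small language model generates accurate text",
    "the fine-tuned language model produces accurate text",
)
print(scores)
```

Here 5 of the 7 generated words appear in the 7-word reference, so precision and recall are both 5/7 and the F1-score equals them.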
In Python, you can compute these metrics using the evaluate library from the Hugging Face ecosystem. This approach standardizes the calculation, ensuring your results are directly comparable to other machine learning projects.
import evaluate
# Load the ROUGE evaluation module
rouge = evaluate.load("rouge")
# Define your model's output and the expected reference
predictions = ["the small language model generates accurate text"]
references = ["the fine-tuned language model produces accurate text"]
# Compute the scores
results = rouge.compute(predictions=predictions, references=references)
print(results)
Running this script outputs a dictionary containing the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores. By computing these numbers on a dedicated holdout dataset, you establish a concrete baseline for your model. If you modify your hyperparameters and train a second adapter, comparing the ROUGE and perplexity scores will tell you objectively whether the new version is an improvement over the original.