Monitoring the loss curve and saving checkpoints confirms that a model is learning to predict the next token, but it does not guarantee high-quality text for a specific task. While a decreasing loss is a positive signal, it cannot measure coherence, tone, or adherence to formatting. Before calculating automated metrics, qualitative evaluation is necessary: inspect the model outputs directly.
Mathematical metrics alone can be misleading. A model might achieve a low loss by memorizing common token sequences without learning the underlying structure of your instructions. By reading the generated text, you can identify logical inconsistencies, repetitive loops, and structural failures that formulas often miss. To do this effectively, you need a structured approach rather than typing ad hoc queries at the model.
Create a targeted validation set specifically for human review. This set should contain around twenty to fifty prompts divided into three categories. First, include standard cases that match the exact structure and topic of your training data. Second, include edge cases that use unexpected phrasing or ask for unusually complex responses. Third, include out-of-distribution cases, which are completely unrelated to your training data. Testing out-of-distribution prompts helps you determine if the fine-tuning process destroyed the general knowledge of the model.
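The three-category validation set can be sketched as a simple list of labeled prompts. The prompts and category names below are illustrative placeholders, not drawn from any real dataset:

```python
# A minimal human-review validation set. In practice you would write
# twenty to fifty prompts spread across the three categories.
validation_set = [
    # Standard cases: mirror the structure and topic of the training data.
    {"category": "standard", "prompt": "Summarize this support ticket: ..."},
    {"category": "standard", "prompt": "Extract the order ID from: ..."},
    # Edge cases: unexpected phrasing or unusually complex requests.
    {"category": "edge", "prompt": "summarise pls, but as a haiku: ..."},
    # Out-of-distribution: unrelated to the training data, to check
    # whether fine-tuning destroyed the model's general knowledge.
    {"category": "ood", "prompt": "Explain why the sky is blue."},
]

def by_category(cases, category):
    """Return only the prompts belonging to one category."""
    return [c["prompt"] for c in cases if c["category"] == category]
```

Keeping the category label attached to each prompt makes it easy to tally failures per category later, which tells you whether problems are confined to edge cases or also affect standard behavior.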
Figure: Workflow for side-by-side qualitative evaluation comparing base and fine-tuned model outputs.
When you evaluate generation quality, the parameters you pass to the inference engine heavily influence the output. If your outputs look terrible, the issue might be your generation settings rather than the model weights. The temperature parameter scales the logits before the softmax function is applied. A lower temperature, such as 0.2, makes the model more deterministic and confident in its top choices. This is highly recommended for structured tasks like JSON generation or data extraction. A higher temperature, such as 0.8, encourages diverse vocabulary for conversational tasks. You should also configure top-p (nucleus) sampling, which restricts the model to a dynamic pool of the most probable tokens whose cumulative probability just exceeds the chosen threshold, commonly set around 0.9.
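The mechanics of temperature scaling and top-p pooling can be shown in a few lines of plain Python. This is a pedagogical sketch of the math, not any inference engine's actual implementation:

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_pool(probs, p=0.9):
    """Return indices of the smallest set of tokens, taken in order of
    descending probability, whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    pool, cumulative = [], 0.0
    for i in order:
        pool.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return pool
```

Running `softmax` on the same logits at temperatures 0.2 and 1.5 shows the low-temperature distribution concentrating nearly all probability mass on the top token, which is why low temperatures behave deterministically.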
During your inspection, you should actively look for specific failure modes common in fine-tuned small language models. The first issue is formatting failure. Check if the model stops generating when it should. If the model continues rambling after answering the prompt, it likely failed to learn the end-of-sequence token during training. Small language models are highly sensitive to prompt templates, and missing padding or sequence tokens in your dataset often cause this behavior.
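A rambling model that never learned the end-of-sequence token leaves a simple fingerprint: it exhausts the generation budget without ever emitting EOS. A small helper (hypothetical name, assuming you have the generated token IDs) can flag this during batch inspection:

```python
def flag_runaway_generation(output_ids, eos_token_id, max_new_tokens):
    """Flag an output that likely failed to learn the EOS token:
    it used every available new token and never emitted EOS."""
    hit_eos = eos_token_id in output_ids
    hit_limit = len(output_ids) >= max_new_tokens
    return hit_limit and not hit_eos
```

If most outputs in your validation set trip this flag, revisit how padding and end-of-sequence tokens were inserted into your training dataset before blaming the model.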
The second issue is hallucination. Fine-tuning on a highly structured dataset can sometimes teach the model to prioritize formatting over factual accuracy. It learns the shape of a correct answer and will confidently insert fabricated information to fill that shape. The third issue is mode collapse. If your model returns the exact same template or answer for drastically different prompts, the learning rate was likely too high or the model trained for too many epochs, causing it to overfit to a single pattern.
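Mode collapse is easy to quantify with a rough duplicate-rate check over the responses to your distinct validation prompts. This exact-match heuristic is a sketch (near-duplicate detection would need fuzzier matching):

```python
def mode_collapse_score(outputs):
    """Fraction of responses that are exact duplicates of an earlier one.
    A score near 1.0 suggests the model collapsed to a single template."""
    if not outputs:
        return 0.0
    unique = len(set(outputs))
    return 1.0 - unique / len(outputs)
```

A healthy model answering twenty varied prompts should score near 0.0; a score above roughly 0.5 on diverse prompts is a strong sign of overfitting from too high a learning rate or too many epochs.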
To perform this evaluation efficiently, write a Python script that generates responses from both the original base model and your newly fine-tuned adapter side-by-side. By processing the exact same prompt through both models using identical generation parameters, you isolate the effect of your training data. You can observe exactly how the fine-tuning process altered the behavior of the model, ensuring the changes align with your project goals before you move on to automated benchmarking algorithms.
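The side-by-side script reduces to a small harness that feeds identical prompts and generation parameters to two generate functions. The harness below is a sketch; the model names in the comment are hypothetical, and the loading code assumes the `transformers` and `peft` libraries:

```python
def compare_models(prompts, generate_base, generate_tuned, **gen_kwargs):
    """Run the same prompts through both models with identical
    generation parameters and pair the outputs for human review."""
    rows = []
    for prompt in prompts:
        rows.append({
            "prompt": prompt,
            "base": generate_base(prompt, **gen_kwargs),
            "tuned": generate_tuned(prompt, **gen_kwargs),
        })
    return rows

# In practice, generate_base and generate_tuned would wrap real pipelines,
# e.g. (hypothetical names, transformers + peft assumed installed):
#   base = AutoModelForCausalLM.from_pretrained("your-base-model")
#   tuned = PeftModel.from_pretrained(base, "path/to/your/adapter")
```

Injecting the two generate functions keeps the comparison logic independent of any particular inference stack, so the same harness works whether you load models locally or call a hosted endpoint.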