After investing significant compute and data resources into training a large language model, the next step is to measure how well it actually performs its fundamental task: modeling language. Evaluating these complex systems requires specific techniques. One category of evaluation focuses directly on the model's ability to predict sequences of text, without necessarily testing it on downstream applications.
This chapter concentrates on these intrinsic evaluation methods. We will look at metrics derived directly from the probabilities the model assigns to text sequences. The most common intrinsic metric is perplexity, which quantifies how well a probability model predicts a sample. It is closely related to the cross-entropy loss used during training. A lower perplexity score generally indicates that the model is better at predicting the test data, meaning it assigns higher probabilities to the observed sequences. For a sequence $W = w_1, w_2, \dots, w_N$, perplexity can be expressed in terms of the model's assigned probabilities $p(w_i \mid w_{<i}; \theta)$ as:

$$\text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i}; \theta)\right)$$
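To make the link between this formula and the cross-entropy loss concrete, here is a minimal sketch in Python. The function name `perplexity` and the toy per-token log-probabilities are illustrative assumptions, not part of any particular library; the point is simply that perplexity is the exponential of the average negative log-likelihood per token.

```python
import math

def perplexity(token_log_probs):
    """Compute perplexity from per-token log-probabilities (natural log).

    token_log_probs: list of log p(w_i | w_<i; theta), one value per token.
    """
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n  # average negative log-likelihood (cross-entropy in nats)
    return math.exp(avg_nll)

# Toy example: a 4-token sequence where the model assigns each token probability 0.25.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))  # 4.0 -- equivalent to guessing uniformly among 4 options per token
```

In the toy example, the model is exactly as "perplexed" as if it were choosing uniformly among four candidates at every step, which is the intuition behind reading perplexity as an effective branching factor.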
Understanding perplexity provides a baseline assessment of the model's language modeling quality.
In this chapter, you will learn:
21.1 Concept of Language Model Evaluation
21.2 Perplexity: Definition and Calculation
21.3 Interpreting Perplexity Scores
21.4 Bits Per Character/Word
21.5 Effect of Tokenization on Perplexity