Language models assign probabilities to sequences of words. A fundamental question in their application is how to measure their effectiveness. For instance, when comparing different models, such as a bigram model and a trigram model, a quantitative method is needed to determine which one is better at predicting text. This evaluation is performed using a metric called perplexity.
At its core, perplexity is a measure of how well a probability model predicts a sample. For a language model, it measures how uncertain the model is when predicting the next word in a sequence. You can think of it as a measure of "surprise." A language model that is good at its job will be less "surprised" by a typical sentence from a test set. This lack of surprise translates to assigning a higher probability to that sentence.
A lower perplexity score indicates that the language model is better at predicting the text. A higher perplexity score indicates that the model is more "perplexed" by the text and, therefore, is not a good fit for it.
The most intuitive way to interpret a perplexity score is to think of it as the effective number of choices the model has for the next word.
For example, if a language model has a perplexity of 20 on a given test set, it means that on average, the model is as confused about predicting the next word as if it had to choose uniformly from 20 different words. A better model would have a lower perplexity, say 10, which would mean its uncertainty is equivalent to choosing between only 10 words.
A perfect model that always knows the next word would have a perplexity of 1 (it has only one choice). Of course, this is not possible in practice with natural language.
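As a quick sanity check of this interpretation, here is a minimal Python sketch (the vocabulary size and sequence length are arbitrary illustrative values): a model that spreads probability uniformly over a vocabulary of V words has a perplexity of exactly V, no matter how long the test sequence is.

```python
# A uniform model assigns probability 1/V to every word,
# so its perplexity equals V regardless of the sequence length N.
V = 20    # illustrative vocabulary size
N = 100   # illustrative number of words in the test set

sequence_probability = (1.0 / V) ** N            # P(w1, ..., wN) under the uniform model
perplexity = sequence_probability ** (-1.0 / N)  # inverse N-th root
print(perplexity)                                # 20.0: a 20-way uniform choice at each step
```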
The following diagram illustrates this idea. A model with low perplexity has a smaller effective branching factor, meaning it has narrowed down the likely options more effectively than a model with high perplexity.
The low perplexity model on the left is more certain about the next word, while the high perplexity model on the right has a much larger set of roughly equally probable choices.
Perplexity is derived directly from the probability assigned to a test set by the language model. If a test set W consists of a sequence of words w1,w2,…,wN, the perplexity is calculated as the inverse of the geometric mean of the probabilities of the words in the sequence.
The formula is:
$$\text{Perplexity}(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}$$

Let's break this down:

- $W$ is the test set, the word sequence $w_1, w_2, \ldots, w_N$.
- $P(w_1, w_2, \ldots, w_N)$ is the probability the model assigns to the entire sequence, computed with the chain rule as a product of conditional word probabilities.
- $N$ is the number of words in the test set. The exponent $-\frac{1}{N}$ takes the inverse $N$-th root, which normalizes for sequence length, so a higher sequence probability yields a lower perplexity.
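The short Python sketch below mirrors the formula directly. The function name `perplexity` and the per-word probabilities are hypothetical; in practice, the probabilities would come from your trained model.

```python
import math

def perplexity(word_probs):
    """Perplexity = P(w1, ..., wN) ** (-1/N), from per-word probabilities."""
    n = len(word_probs)
    joint = math.prod(word_probs)   # chain rule: product of conditional probabilities
    return joint ** (-1.0 / n)

# Illustrative (made-up) probabilities a model might assign to a 5-word sentence:
print(perplexity([0.2, 0.1, 0.05, 0.3, 0.15]))   # ≈ 7.4
```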
In practice, because multiplying many small probabilities can lead to numerical underflow (the computer rounds the number to zero), calculations are often done using log probabilities. The formula using logs is equivalent but more stable for computers to handle.
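Here is the same calculation in log space, a minimal sketch assuming the same list-of-probabilities input as above. For a long sequence, the raw product underflows to zero, while the log-space version stays stable.

```python
import math

def perplexity_log(word_probs):
    """Equivalent log-space form: exp(-(1/N) * sum(log P(w_i)))."""
    n = len(word_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log_prob)

probs = [0.01] * 5000            # 5,000 words, each with probability 0.01
print(math.prod(probs))          # 0.0 -- the raw product underflows
print(perplexity_log(probs))     # 100.0 -- the log-space result is stable
```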
Imagine you have two language models, Model A and Model B, that you want to compare for an ASR (automatic speech recognition) system designed to transcribe financial news.
Now, you evaluate both models on a test set of unseen financial news headlines. Let's take the sentence, "The Federal Reserve raised interest rates."
By comparing the perplexity scores, you can conclude that Model A, the model with the lower score, is the better choice for your ASR application. This metric gives you a formal way to confirm that intuition, making it an essential tool for developing and selecting language models.
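The per-word probabilities below are purely hypothetical, chosen only to illustrate how such a comparison might look in code; the helper reuses the log-space calculation from the previous sketch.

```python
import math

def perplexity(word_probs):
    """Log-space perplexity from per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# Hypothetical probabilities each model assigns to the six words of
# "The Federal Reserve raised interest rates." (illustrative values only).
model_a_probs = [0.20, 0.15, 0.30, 0.10, 0.25, 0.20]
model_b_probs = [0.05, 0.01, 0.02, 0.03, 0.04, 0.02]

print("Model A:", round(perplexity(model_a_probs), 1))   # ≈ 5.3  (less surprised)
print("Model B:", round(perplexity(model_b_probs), 1))   # ≈ 40.1 (more surprised)
```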