N-gram models operate by predicting the next word based on its predecessors. To make these predictions useful, each candidate next word must be assigned a concrete probability. These probabilities are not arbitrary; they are learned directly from large quantities of text data. This process turns the linguistic patterns observed in real text into a mathematical model that can score the likelihood of a given phrase.
The foundation of a traditional language model is a corpus, which is simply a large collection of text. This text can come from anywhere: books, news articles, websites, or transcribed speeches. The core assumption is that the patterns of language in the corpus are representative of the language we want our ASR system to understand.
The method for calculating probabilities is based on a straightforward idea: we estimate the probability of an event by counting how often it occurs in our data. This is known as Maximum Likelihood Estimation (MLE). For a language model, this means counting words and word sequences.
Let's see how this works with a bigram model, where the probability of a word depends only on the single word that comes before it. The formula is quite intuitive:
$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}$$

In plain English, the probability of seeing a word $w_n$ after another word $w_{n-1}$ is the number of times we saw that specific pair together, divided by the total number of times we saw the first word.
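As a minimal sketch, the estimator can be written directly in code. The function name and the counts passed in below are illustrative only, not taken from any particular corpus.

```python
def bigram_mle(count_pair: int, count_prev: int) -> float:
    """Maximum Likelihood Estimate of P(w_n | w_{n-1}) from raw counts."""
    if count_prev == 0:
        return 0.0  # the history word never occurred, so there is nothing to estimate
    return count_pair / count_prev

# Illustrative counts: the pair was seen once, the first word was seen twice.
print(bigram_mle(count_pair=1, count_prev=2))  # 0.5
```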
Imagine our entire corpus is just these three simple sentences:
Let's calculate the probability of the word "sat" appearing after "cat", or P(sat∣cat).
So, based on our tiny corpus, there is a 50% chance the word "sat" will follow the word "cat".
Now let's calculate P(on∣sat):
In our limited corpus, the word "on" always follows the word "sat", so the probability is 1.0. The diagram below illustrates this process of deriving probabilities from a text corpus.
From a text corpus to a table of bigram probabilities.
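The same derivation can be sketched in code: count bigrams, then normalize each count by how often its first word occurs. The three sentences below are placeholders chosen to resemble the toy corpus, not an exact copy of it, so treat the printed values as illustrative.

```python
from collections import defaultdict

# Placeholder toy corpus of three short sentences (assumed for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the mouse",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

for sentence in corpus:
    tokens = ["<s>"] + sentence.split()  # prepend the start-of-sentence token
    for prev, curr in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, curr)] += 1

# MLE bigram probabilities: Count(prev, curr) / Count(prev)
bigram_probs = {
    (prev, curr): count / unigram_counts[prev]
    for (prev, curr), count in bigram_counts.items()
}

print(bigram_probs[("cat", "sat")])  # 0.5 with this placeholder corpus
print(bigram_probs[("sat", "on")])   # 1.0 with this placeholder corpus
```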
Once we have the probabilities for individual N-grams, we can calculate the probability of an entire sequence of words. We do this by multiplying the probabilities of each part of the sequence together. This is an application of the chain rule of probability.
For a bigram model, the probability of a sentence $W = (w_1, w_2, \dots, w_k)$ is calculated as:

$$P(W) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \cdots \times P(w_k \mid w_{k-1})$$

To handle the first word, which has no predecessor, it's common practice to add a special start-of-sentence token (often written as <s>) to the beginning of every sentence in the corpus. The calculation then becomes:

$$P(W) = P(w_1 \mid \langle s \rangle) \times P(w_2 \mid w_1) \times \cdots \times P(w_k \mid w_{k-1})$$
Using our example, let's find the probability of the sentence "the cat sat". Assuming <s> appears 3 times (once for each sentence):
The total probability is $P(\text{the cat sat}) = P(\text{the} \mid \langle s \rangle) \times P(\text{cat} \mid \text{the}) \times P(\text{sat} \mid \text{cat}) = 1.0 \times 0.67 \times 0.5 = 0.335$. A higher final probability suggests a more likely sentence.
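A small sketch of this chain of multiplications is shown below. The function name is illustrative, and the probability values are simply the ones from the worked example above.

```python
def sentence_probability(sentence: str, bigram_probs: dict) -> float:
    """Multiply bigram probabilities along the sentence, starting from <s>."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_probs.get((prev, curr), 0.0)  # an unseen bigram contributes 0.0 here
    return prob

# Bigram probabilities taken from the worked example above.
probs = {
    ("<s>", "the"): 1.0,
    ("the", "cat"): 0.67,
    ("cat", "sat"): 0.5,
}

print(sentence_probability("the cat sat", probs))  # 1.0 * 0.67 * 0.5 = 0.335
```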
There is a significant problem with this simple counting approach. What happens if we encounter a word sequence that never appeared in our training corpus?
For instance, using our corpus, what is P(chased∣dog)?
The pair "dog chased" never occurs, so its count is 0. This makes the probability 0/2=0. When we multiply this zero into our chain of probabilities for a sentence, the entire sentence probability becomes zero. This implies the sentence is impossible, which is too extreme. An unseen event is unlikely, but not impossible.
This is known as the zero-frequency problem. The solution is smoothing. The simplest form is Laplace smoothing, or add-one smoothing. We pretend we have seen every possible bigram one extra time. To keep the probabilities valid, we also adjust the denominator.
The formula for add-one smoothing is:
$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n) + 1}{\text{Count}(w_{n-1}) + V}$$

Here, $V$ is the size of the vocabulary, which is the number of unique words in our corpus. In our example, the vocabulary is {the, cat, sat, on, mat, dog, rug, chased, mouse}. So, $V = 9$.
Let's recalculate P(chased∣dog) with add-one smoothing:

$$P(\text{chased} \mid \text{dog}) = \frac{0 + 1}{2 + 9} = \frac{1}{11} \approx 0.09$$
Now, the unseen pair has a small but non-zero probability. This makes our model more resilient to new combinations of words, which is essential for handling the variability of natural language. While more advanced smoothing techniques exist (like Kneser-Ney), add-one smoothing illustrates the fundamental solution to the problem of zero counts.
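A minimal sketch of this adjustment, using the counts from the example above (the function name is illustrative):

```python
def bigram_add_one(count_pair: int, count_prev: int, vocab_size: int) -> float:
    """Add-one (Laplace) smoothed estimate of P(w_n | w_{n-1})."""
    return (count_pair + 1) / (count_prev + vocab_size)

# P(chased | dog): the pair was never seen, "dog" was seen twice, and V = 9.
print(bigram_add_one(count_pair=0, count_prev=2, vocab_size=9))  # (0 + 1) / (2 + 9) = 1/11 ≈ 0.09

# A bigram seen once after a word seen twice drops from an MLE of 0.5 to 2/11 ≈ 0.18,
# which shows how much probability mass add-one smoothing redistributes.
print(bigram_add_one(count_pair=1, count_prev=2, vocab_size=9))
```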