To resolve the ambiguity between phrases like "recognize speech" and "wreck a nice beach," a language model must determine which word sequence is more probable. Calculating the probability of an entire sentence from scratch, however, is a complex task. For a sentence with $k$ words, $w_1, w_2, \ldots, w_k$, the chain rule of probability states:
$$
P(w_1, w_2, \ldots, w_k) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \cdots \times P(w_k \mid w_1, \ldots, w_{k-1})
$$

As you can see, the context required to predict each subsequent word grows longer and longer. This approach is not only computationally intensive but also requires an unrealistically large amount of data to accurately estimate the probability of every conceivable word history.
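For example, even the four-word hypothesis "wreck a nice beach" already requires a term conditioned on its full three-word history:

$$
P(\text{wreck a nice beach}) = P(\text{wreck}) \times P(\text{a} \mid \text{wreck}) \times P(\text{nice} \mid \text{wreck a}) \times P(\text{beach} \mid \text{wreck a nice})
$$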
N-gram models provide a practical solution by making a simplifying assumption: the probability of a word depends only on a small, fixed number of preceding words, not the entire sequence. This is known as the Markov assumption.
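Formally, an N-gram model approximates the full history with only the most recent $N-1$ words:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
$$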
The most straightforward N-gram model is the bigram model, where we assume the probability of a word depends only on the single word that came before it. A sequence of two adjacent words is called a bigram.
The probability of a word $w_i$ given the previous word $w_{i-1}$ is written as $P(w_i \mid w_{i-1})$. To find the probability of a whole sentence, we multiply the probabilities of its constituent bigrams.
Let's apply this to our example, "recognize speech":
$$
P(\text{recognize speech}) \approx P(\text{recognize}) \times P(\text{speech} \mid \text{recognize})
$$

We estimate these probabilities by counting word occurrences in a massive dataset of text, known as a corpus. The formula for the conditional probability is:
$$
P(\text{speech} \mid \text{recognize}) = \frac{\text{Count}(\text{"recognize speech"})}{\text{Count}(\text{"recognize"})}
$$

If "recognize" appears 1,000 times in our corpus and is followed by "speech" 400 times, then $P(\text{speech} \mid \text{recognize}) = 400/1000 = 0.4$. If it's followed by "a" only 5 times, then $P(\text{a} \mid \text{recognize}) = 5/1000 = 0.005$. A bigram model learns that "speech" is a much more likely word to follow "recognize" than "a".
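As a minimal sketch of this counting estimate in Python (the helper function and toy corpus below are invented for illustration; a real model is trained on millions of sentences):

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) from raw counts in a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[prev_word] == 0:
        return 0.0  # prev_word never seen; without smoothing the estimate is undefined
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

# Tiny illustrative corpus; a real language model uses a far larger one.
tokens = ("we recognize speech here and we recognize speech every day "
          "while we recognize a pattern").split()

print(bigram_probability(tokens, "recognize", "speech"))  # 2/3, about 0.667
print(bigram_probability(tokens, "recognize", "a"))       # 1/3, about 0.333
```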
Probabilities derived from a text corpus. The phrase "recognize speech" is a more common pairing than "nice beach," making it more probable in a bigram model.
A bigram model helps, but sometimes one word of context isn't enough. A trigram model extends the context by assuming the probability of a word depends on the two preceding words. A sequence of three adjacent words is called a trigram.
The probability is written as $P(w_i \mid w_{i-2}, w_{i-1})$.
Let's look at the phrase "a nice beach". A trigram model would calculate the probability of "beach" based on the two words before it:
$$
P(\text{beach} \mid \text{a nice}) = \frac{\text{Count}(\text{"a nice beach"})}{\text{Count}(\text{"a nice"})}
$$

This additional context is powerful. The word "nice" can be followed by many things ("nice day", "nice car", "nice person"). The sequence "a nice", however, provides a stronger signal that "beach" might be coming next than the single word "nice" that a bigram model sees. This helps the ASR system make a more informed choice.
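Extending the same counting approach to trigrams, again as a sketch with an invented toy corpus chosen so that "a nice" points more strongly to "beach" than "nice" alone does:

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | w_{i-1}) from raw counts, for comparison."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

def trigram_probability(tokens, context, word):
    """Estimate P(word | w_{i-2}, w_{i-1}); context is a 2-tuple of words."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    if bigram_counts[context] == 0:
        return 0.0  # unseen context; a real model would apply smoothing here
    return trigram_counts[context + (word,)] / bigram_counts[context]

# Toy corpus: "nice" appears in several contexts, but "a nice" usually precedes "beach".
tokens = ("how nice of you that is nice work we walked on a nice beach "
          "they found a nice beach what a nice beach such a nice day").split()

print(bigram_probability(tokens, "nice", "beach"))          # 3/6 = 0.5
print(trigram_probability(tokens, ("a", "nice"), "beach"))  # 3/4 = 0.75
```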
Dependency diagrams for a bigram and a trigram model. The trigram model uses a wider two-word context to predict the next word, which can capture more detailed language patterns.
You can extend this pattern to 4-grams (using 3 words of context), 5-grams (using 4 words), and so on. However, as N increases, a significant problem emerges: data sparsity.
For a bigram like "recognize speech", you might find thousands of examples in a text corpus. For a trigram like "to recognize speech", you'll find fewer examples. For a 5-gram like "I am trying to recognize speech", you might find very few, or even zero, instances. When a specific N-gram never appears in the training data, its count is zero, so the model assigns it a probability of zero and, by extension, assigns zero probability to any sentence containing it, even if that sentence is perfectly valid. Because of this trade-off, bigram and trigram models have historically been a sweet spot between context and reliability.
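One way to see sparsity directly is to measure, for each N, what fraction of distinct N-grams in a corpus occur exactly once. The sketch below assumes a plain-text file named corpus.txt; the filename and helper function are placeholders:

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once: a rough sparsity signal."""
    ngram_counts = Counter(zip(*(tokens[i:] for i in range(n))))
    singletons = sum(1 for count in ngram_counts.values() if count == 1)
    return singletons / len(ngram_counts)

# "corpus.txt" is a placeholder; any sizable plain-text file works.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().split()

for n in range(1, 6):
    print(n, round(singleton_fraction(tokens, n), 3))
# The fraction typically climbs toward 1.0 as n grows: most long n-grams
# are seen only once (or never), which is the data sparsity problem.
```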
How do you calculate the probability of the very first word in a sentence? It has no preceding words. To solve this, a special "start-of-sentence" token, often written as <s>, is added to the beginning of every sentence in the training data. The probability of the first word "wreck" is then calculated as a bigram $P(\text{wreck} \mid \text{<s>})$. Similarly, the probability of the second word "a" would be calculated as a trigram $P(\text{a} \mid \text{<s>}, \text{wreck})$. This ensures every word has the required amount of context.
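A small sketch of this padding step (the sentences and helper below are invented for illustration):

```python
from collections import Counter

START = "<s>"

def pad_and_tokenize(sentences):
    """Prepend a start-of-sentence token to each sentence, then flatten into one token list."""
    tokens = []
    for sentence in sentences:
        tokens.append(START)
        tokens.extend(sentence.split())
    return tokens

sentences = [
    "wreck a nice beach",
    "recognize speech",
    "wreck the old car",
]

tokens = pad_and_tokenize(sentences)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# P(wreck | <s>): how often a sentence starts with "wreck".
print(bigrams[(START, "wreck")] / unigrams[START])  # 2/3, about 0.667
```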