N-gram models operate by predicting the next word based on its predecessors. To make these predictions useful, each candidate next word must be assigned a concrete probability. These probabilities are not arbitrary; they are learned directly from large quantities of text data. This process turns the linguistic patterns observed in real text into a mathematical model that can score the likelihood of a given phrase.
The foundation of a traditional language model is a corpus, which is simply a large collection of text. This text can come from anywhere: books, news articles, websites, or transcribed speeches. The core assumption is that the patterns of language in the corpus are representative of the language we want our ASR system to understand.
The method for calculating probabilities is based on a straightforward idea: we estimate the probability of an event by counting how often it occurs in our data. This is known as Maximum Likelihood Estimation (MLE). For a language model, this means counting words and word sequences.
Let's see how this works with a bigram model, where the probability of a word depends only on the single word that comes before it. The formula is quite intuitive:
$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}$$

In plain English, the probability of seeing a word $w_n$ after another word $w_{n-1}$ is the number of times we saw that specific pair together, divided by the total number of times we saw the first word.
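As a minimal sketch, the estimator can be written directly in code. The function name and the counts passed in below are illustrative only, not taken from any particular corpus.

```python
def bigram_mle(count_pair: int, count_prev: int) -> float:
    """Maximum Likelihood Estimate of P(w_n | w_{n-1}) from raw counts."""
    if count_prev == 0:
        return 0.0  # the history word never occurred, so there is nothing to estimate
    return count_pair / count_prev

# Illustrative counts: the pair was seen once, the first word was seen twice.
print(bigram_mle(count_pair=1, count_prev=2))  # 0.5
```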
Imagine our entire corpus is just these three simple sentences:
Let's calculate the probability of the word "sat" appearing after "cat", or P(sat∣cat).
So, based on our tiny corpus, there is a 50% chance the word "sat" will follow the word "cat".
Now let's calculate P(on∣sat):
In our limited corpus, the word "on" always follows the word "sat", so the probability is 1.0. The diagram below illustrates this process of deriving probabilities from a text corpus.
From a text corpus to a table of bigram probabilities.
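The same derivation can be sketched in code: count bigrams, then normalize each count by how often its first word occurs. The three sentences below are placeholders chosen to resemble the toy corpus, not an exact copy of it, so treat the printed values as illustrative.

```python
from collections import defaultdict

# Placeholder toy corpus of three short sentences (assumed for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the mouse",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

for sentence in corpus:
    tokens = ["<s>"] + sentence.split()  # prepend the start-of-sentence token
    for prev, curr in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, curr)] += 1

# MLE bigram probabilities: Count(prev, curr) / Count(prev)
bigram_probs = {
    (prev, curr): count / unigram_counts[prev]
    for (prev, curr), count in bigram_counts.items()
}

print(bigram_probs[("cat", "sat")])  # 0.5 with this placeholder corpus
print(bigram_probs[("sat", "on")])   # 1.0 with this placeholder corpus
```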
Once we have the probabilities for individual N-grams, we can calculate the probability of an entire sequence of words. We do this by multiplying the probabilities of each part of the sequence together. This is an application of the chain rule of probability.
For a bigram model, the probability of a sentence $W = (w_1, w_2, \dots, w_k)$ is calculated as:

$$P(W) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \cdots \times P(w_k \mid w_{k-1})$$

To handle the first word, which has no predecessor, it's common practice to add a special start-of-sentence token (often written as <s>) to the beginning of every sentence in the corpus. The calculation then becomes:

$$P(W) = P(w_1 \mid \langle s \rangle) \times P(w_2 \mid w_1) \times \cdots \times P(w_k \mid w_{k-1})$$
Using our example, let's find the probability of the sentence "the cat sat". Assuming <s> appears 3 times (once for each sentence):
The total probability is $P(\text{the cat sat}) = P(\text{the} \mid \langle s \rangle) \times P(\text{cat} \mid \text{the}) \times P(\text{sat} \mid \text{cat}) = 1.0 \times 0.67 \times 0.5 = 0.335$. A higher final probability suggests a more likely sentence.
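A small sketch of this chain of multiplications is shown below. The function name is illustrative, and the probability values are simply the ones from the worked example above.

```python
def sentence_probability(sentence: str, bigram_probs: dict) -> float:
    """Multiply bigram probabilities along the sentence, starting from <s>."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_probs.get((prev, curr), 0.0)  # an unseen bigram contributes 0.0 here
    return prob

# Bigram probabilities taken from the worked example above.
probs = {
    ("<s>", "the"): 1.0,
    ("the", "cat"): 0.67,
    ("cat", "sat"): 0.5,
}

print(sentence_probability("the cat sat", probs))  # 1.0 * 0.67 * 0.5 = 0.335
```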
There is a significant problem with this simple counting approach. What happens if we encounter a word sequence that never appeared in our training corpus?
For instance, using our corpus, what is P(chased∣dog)?
The pair "dog chased" never occurs, so its count is 0. This makes the probability 0/2=0. When we multiply this zero into our chain of probabilities for a sentence, the entire sentence probability becomes zero. This implies the sentence is impossible, which is too extreme. An unseen event is unlikely, but not impossible.
This is known as the zero-frequency problem. The solution is smoothing. The simplest form is Laplace smoothing, or add-one smoothing. We pretend we have seen every possible bigram one extra time. To keep the probabilities valid, we also adjust the denominator.
The formula for add-one smoothing is:
$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n) + 1}{\text{Count}(w_{n-1}) + V}$$

Here, $V$ is the size of the vocabulary, which is the number of unique words in our corpus. In our example, the vocabulary is {the, cat, sat, on, mat, dog, rug, chased, mouse}. So, $V = 9$.
Let's recalculate P(chased∣dog) with add-one smoothing:

$$P(\text{chased} \mid \text{dog}) = \frac{0 + 1}{2 + 9} = \frac{1}{11} \approx 0.09$$
Now, the unseen pair has a small but non-zero probability. This makes our model more resilient to new combinations of words, which is essential for handling the variability of natural language. While more advanced smoothing techniques exist (like Kneser-Ney), add-one smoothing illustrates the fundamental solution to the problem of zero counts.
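A minimal sketch of this adjustment, using the counts from the example above (the function name is illustrative):

```python
def bigram_add_one(count_pair: int, count_prev: int, vocab_size: int) -> float:
    """Add-one (Laplace) smoothed estimate of P(w_n | w_{n-1})."""
    return (count_pair + 1) / (count_prev + vocab_size)

# P(chased | dog): the pair was never seen, "dog" was seen twice, and V = 9.
print(bigram_add_one(count_pair=0, count_prev=2, vocab_size=9))  # (0 + 1) / (2 + 9) = 1/11 ≈ 0.09

# A bigram seen once after a word seen twice drops from an MLE of 0.5 to 2/11 ≈ 0.18,
# which shows how much probability mass add-one smoothing redistributes.
print(bigram_add_one(count_pair=1, count_prev=2, vocab_size=9))
```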