While deep learning models excel at learning from raw features, they operate primarily on acoustic information. To help them make linguistically sound decisions, we introduce a statistical language model. The most fundamental type is the n-gram model, which calculates the probability of a sequence of words, W, by making a simplifying assumption about how words relate to each other.
Instead of trying to calculate the probability of a word given the entire history of all preceding words, a computationally intensive and data-demanding task, an n-gram model assumes that the probability of a given word only depends on the previous n−1 words. This is an application of the Markov assumption.
The full probability of a word sequence $W = (w_1, w_2, \ldots, w_k)$ can be written using the chain rule of probability:

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_k \mid w_1, \ldots, w_{k-1})$$

An n-gram model approximates this by limiting the context window. For a given n, the probability of a word $w_i$ is approximated as:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

Let's look at how this works for common values of n.
The simplest model, a unigram model, assumes each word is independent of all others. The probability of a word is just its frequency in the training text.
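Written out, the unigram approximation treats the sentence probability as a product of individual word frequencies (a standard formulation, stated here for completeness; $N$ denotes the total number of word tokens in the corpus):

$$P(W) \approx \prod_{i=1}^{k} P(w_i), \qquad P(w_i) = \frac{\text{count}(w_i)}{N}$$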
This is a "bag-of-words" approach. It completely ignores word order and context, making it too simplistic for generating coherent sentences. It cannot distinguish between "recognize speech" and "speech recognize".
A bigram model is a significant step up. It assumes the probability of a word depends only on the single preceding word.
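Under this assumption, the chain-rule product collapses to the standard bigram form:

$$P(W) \approx P(w_1)\prod_{i=2}^{k} P(w_i \mid w_{i-1})$$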
Using our running example, a bigram model would evaluate:
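Assuming the running example pairs the candidate transcriptions "recognize speech" and "wreck a nice beach," as the surrounding discussion suggests, the bigram model scores each candidate as a product of two-word factors:

$$P(\text{recognize speech}) \approx P(\text{recognize})\,P(\text{speech} \mid \text{recognize})$$

$$P(\text{wreck a nice beach}) \approx P(\text{wreck})\,P(\text{a} \mid \text{wreck})\,P(\text{nice} \mid \text{a})\,P(\text{beach} \mid \text{nice})$$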
In a large corpus of English text, the phrase "recognize speech" is far more common than "wreck a nice", so the model correctly assigns a higher probability to the first transcription.
A trigram model considers the two previous words, capturing more context.
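In the same notation, each word is conditioned on its two predecessors, so a factor such as $P(\text{beach} \mid \text{a}, \text{nice})$ enters the product:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$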
This allows the model to learn longer-range dependencies, like the fact that the phrase "a nice beach" is more probable than "a nice speech". As n increases, the model can capture more context, but this comes at a cost. Higher-order n-gram models require much more training data and are more susceptible to issues we will discuss shortly. In practice, 3-gram and 4-gram models have historically provided a good balance for ASR systems.
N-gram probabilities are estimated directly from a large body of text called a corpus. The calculation is a straightforward application of Maximum Likelihood Estimation (MLE), which involves counting occurrences. For a bigram model, the conditional probability $P(w_i \mid w_{i-1})$ is estimated as:
$$P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$

For example, to calculate $P(\text{speech} \mid \text{recognize})$, you would count every occurrence of the pair "recognize speech" in your corpus and divide it by the total number of times "recognize" appears.
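As a concrete illustration, here is a minimal Python sketch that estimates bigram probabilities by MLE from a toy corpus. The toy sentences and the `bigram_prob` helper are made up for this example; a real system would train on a far larger corpus with a toolkit such as KenLM.

```python
from collections import Counter

# A toy corpus; real language models are trained on much larger text collections.
corpus = [
    "we recognize speech with a language model",
    "they recognize speech and recognize speakers",
    "it is nice to wreck a sandcastle",
]

# Count unigrams and bigrams across all sentences.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev_word, word):
    """Maximum likelihood estimate: count(prev, word) / count(prev)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("recognize", "speech"))  # high: "recognize speech" occurs repeatedly
print(bigram_prob("recognize", "beach"))   # 0.0: never seen in the toy corpus
```

Running the sketch shows the pattern the chart below describes: the seen bigram gets a large estimate, while the unseen one falls to exactly zero.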
Chart: the conditional probability of different words that might follow "recognize," estimated from a text corpus. A well-trained n-gram model assigns a significantly higher score to the linguistically plausible bigram "recognize speech" than to the improbable "recognize beach."
A major challenge with n-gram models is data sparsity. What happens if a perfectly valid bigram, like "transcribe audio," never appeared in your training corpus? According to the formula, its count is zero, so its probability is zero.
$$P(\text{audio} \mid \text{transcribe}) = \frac{\text{count}(\text{"transcribe audio"})}{\text{count}(\text{"transcribe"})} = \frac{0}{\text{count}(\text{"transcribe"})} = 0$$

This is problematic. A single unseen n-gram would cause the probability of the entire sentence to become zero, forcing the decoder to discard it, even if the rest of the sentence is highly probable.
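Written as part of the chain-rule product, that single zero factor wipes out the entire score, no matter how probable the other factors are (the surrounding factors here are schematic):

$$P(W) \approx P(w_1) \cdots P(\text{audio} \mid \text{transcribe}) \cdots P(w_k \mid w_{k-1}) = \cdots \times 0 \times \cdots = 0$$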
The solution is smoothing. Smoothing techniques take a small amount of probability mass from the n-grams we have seen and redistribute it to the n-grams we have not. This ensures that no n-gram has a probability of exactly zero.
A simple method is Laplace (or Add-1) smoothing, where we add one to all n-gram counts. However, more sophisticated techniques are used in practice. Modern toolkits like KenLM, which we will use in the next section, implement advanced smoothing algorithms like Kneser-Ney smoothing, which provides much better estimates for the probabilities of low-frequency and unseen n-grams.
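To show the idea, here is a minimal sketch of Laplace (Add-1) smoothing applied to the same kind of toy counts as the earlier example. The vocabulary size `V` and the `smoothed_bigram_prob` helper are illustrative names, not part of any toolkit, and production systems rely on Kneser-Ney smoothing as implemented in KenLM rather than hand-rolled Add-1 smoothing.

```python
from collections import Counter

# Rebuild the toy counts so this snippet runs on its own.
corpus = [
    "we recognize speech with a language model",
    "they recognize speech and recognize speakers",
    "it is nice to wreck a sandcastle",
]
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

V = len(unigram_counts)  # vocabulary size

def smoothed_bigram_prob(prev_word, word):
    """Laplace (Add-1) estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(smoothed_bigram_prob("recognize", "speech"))  # still the most probable continuation
print(smoothed_bigram_prob("recognize", "beach"))   # small but nonzero, no longer exactly 0
```

Adding one to every count and the vocabulary size to every denominator keeps the distribution normalized while guaranteeing that no bigram, seen or unseen, receives a probability of exactly zero.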
By understanding these statistical foundations, you can appreciate how an n-gram LM provides the linguistic constraints needed to guide the decoder. In the next section, we will move from theory to practice and build our own n-gram model using a standard toolkit.