While acoustic models are proficient at deciphering sounds, they lack an understanding of language. To prevent nonsensical transcriptions like "wreck a nice beach," we introduce a language model that scores word sequences based on their statistical likelihood. An effective and widely used approach for this is the n-gram model, which we can build efficiently using the KenLM toolkit.
An n-gram model operates on a simple but powerful premise known as the Markov assumption. Instead of calculating the probability of a word based on the entire history of words that came before it, which is computationally intractable, we approximate this probability by looking at only the last $n-1$ words.
For a sequence of words $w_1, w_2, \ldots, w_m$, the probability by the chain rule is:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

An n-gram model simplifies this calculation by conditioning each word on only the previous $n-1$ words. For example, a bigram ($n=2$) model uses:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$
These probabilities are learned by counting occurrences in a large body of text, called a text corpus. For a bigram model, the probability is estimated by counting how many times the phrase "recognize speech" appears in the corpus and dividing it by the total number of times "recognize" appears.
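This counting estimate can be sketched in a few lines of Python. The toy corpus and the helper name `bigram_probability` below are assumptions for illustration, not part of KenLM:

```python
from collections import Counter

def bigram_probability(corpus_tokens, prev_word, word):
    """Estimate P(word | prev_word) by counting bigrams in a token list."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

# Toy corpus: "recognize" appears three times, followed by "speech" twice
tokens = "it was recognize speech and recognize speech not recognize beach".split()
p = bigram_probability(tokens, "recognize", "speech")
print(p)  # 2 of the 3 occurrences of "recognize" are followed by "speech"
```

Real toolkits like KenLM refine this raw ratio with smoothing (modified Kneser-Ney, in KenLM's case) so that unseen n-grams do not receive zero probability.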
KenLM is a highly optimized library specifically designed for creating, storing, and querying n-gram language models. It is favored in many speech recognition systems for two primary reasons: speed and memory efficiency. It can process gigabytes of text to produce a language model in a compact binary format that can be queried with very low latency, which is essential for building a responsive ASR system.
Creating a language model with KenLM involves a clear, three-step process: acquiring and cleaning a text corpus, building the model, and converting it to an efficient binary format.
The process of creating a KenLM language model, from raw text to a final, optimized binary file.
The quality of your language model depends entirely on the quality and relevance of your text corpus. For general-purpose ASR, a large corpus from sources like Wikipedia or public domain books is a good starting point. For domain-specific applications, like transcribing medical notes, you would use a corpus of medical documents.
Before feeding the text to KenLM, it must be cleaned and normalized. This typically involves:

- Converting all text to lowercase.
- Removing punctuation and other non-alphabetic characters.
- Filtering out words that are not in the system's vocabulary.
Here is a simple Python script to perform basic cleaning on a file raw_text.txt and save the result to corpus.txt.
import re

# A simple set of words to create the vocabulary.
# In a real system, you would generate this from your training data transcripts.
vocab = {"the", "a", "an", "is", "of", "and", "recognize", "speech", "wreck", "nice", "beach", "it", "was"}

def normalize_text(text):
    text = text.lower()
    # Keep only alphabetic characters and spaces
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize and filter based on vocabulary
    words = [word for word in text.split() if word in vocab]
    return " ".join(words)

with open('raw_text.txt', 'r') as infile, open('corpus.txt', 'w') as outfile:
    for line in infile:
        cleaned_line = normalize_text(line)
        if cleaned_line:  # Avoid writing empty lines
            outfile.write(cleaned_line + '\n')

print("Corpus preprocessing complete. Output saved to corpus.txt")
This script ensures that the text fed into KenLM is consistent and free of characters that would be misinterpreted as words. The vocabulary filtering step is important for ensuring the LM only contains words the acoustic model can produce.
Once the corpus is prepared, you can use KenLM's lmplz command to build the n-gram model. The tool reads your processed text and outputs the model in a standard text-based format called ARPA.
Let's build a trigram ($n=3$) model. The -o flag specifies the order of the n-gram.
# Assuming KenLM is installed and in your PATH
lmplz -o 3 < corpus.txt > model.arpa
This command will stream the corpus.txt file into lmplz, which will count the n-grams and compute their probabilities. The output is saved to model.arpa. An ARPA file is human-readable and lists the log-probabilities and backoff weights for all learned unigrams, bigrams, and trigrams.
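To make the format concrete, an ARPA file has the following overall shape. The counts and log-probabilities below are illustrative placeholders, not output from an actual run:

```text
\data\
ngram 1=13
ngram 2=42
ngram 3=61

\1-grams:
-1.52	recognize	-0.30
-1.78	speech	-0.25

\2-grams:
-0.41	recognize speech	-0.11

\end\
```

Each entry lists the log10 probability, the n-gram itself, and (where applicable) a backoff weight used when a higher-order n-gram is missing.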
While the ARPA format is useful for inspection, it is not efficient for fast lookups during decoding. KenLM provides another tool, build_binary, to convert the .arpa file into a compressed binary format that significantly reduces memory usage and increases query speed.
build_binary model.arpa model.binary
The resulting model.binary file is the one you will use within your ASR decoder.
To see the model in action, you can use KenLM's query tool. This allows you to score a sentence and see the log10 probability the model assigns to it. A higher score (i.e., less negative) indicates a more probable sentence according to the corpus.
Let's test our running example.
$ echo "recognize speech" | query model.binary
-1.875333 recognize speech p: -0.983 -0.892 </s> -0.521
$ echo "wreck a nice beach" | query model.binary
-4.129871 wreck a nice beach p: -1.341 -1.112 -0.998 -0.678 </s> -0.521
The output shows a total log10 probability for each sentence. As expected, "recognize speech" receives a much better score (about -1.88) than "wreck a nice beach" (about -4.13). During decoding, this difference is exactly what helps the system choose the correct transcription when the acoustic evidence is ambiguous. With this model.binary file in hand, we are now ready to integrate it into a decoding algorithm to improve our ASR system's accuracy.
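A minimal sketch of how these LM scores break a tie during decoding. The acoustic log-probabilities and the interpolation weight below are assumed values for illustration; only the LM scores come from the query output above:

```python
# Hypothetical decoding step: two candidate transcriptions that the
# acoustic model finds nearly indistinguishable.
candidates = {
    "recognize speech": -10.1,    # assumed acoustic log-probability
    "wreck a nice beach": -10.0,  # assumed acoustic log-probability
}

# Language-model log10 scores, taken from the query examples above.
lm_scores = {
    "recognize speech": -1.875333,
    "wreck a nice beach": -4.129871,
}

lm_weight = 2.0  # assumed language-model weight

def combined_score(sentence):
    """Acoustic score plus weighted LM score, as in a typical decoder."""
    return candidates[sentence] + lm_weight * lm_scores[sentence]

best = max(candidates, key=combined_score)
print(best)  # the LM tips the balance toward "recognize speech"
```

Even though the acoustic model very slightly prefers the wrong hypothesis here, the weighted LM score overwhelms that small difference, which is precisely the behavior described above.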