While acoustic models are proficient at deciphering sounds, they lack an understanding of language. To prevent nonsensical transcriptions like "wreck a nice beach," we introduce a language model that scores word sequences based on their statistical likelihood. An effective and widely used approach for this is the n-gram model, which we can build efficiently using the KenLM toolkit.
An n-gram model operates on a simple but powerful premise known as the Markov assumption. Instead of calculating the probability of a word based on the entire history of words that came before it, which is computationally intractable, we approximate this probability by looking at only the last $n-1$ words.
For a sequence of words $w_1, w_2, \ldots, w_m$, the probability by the chain rule is:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

An n-gram model simplifies this calculation by conditioning each word on only the previous $n-1$ words. For example, a bigram ($n=2$) model uses:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$
These probabilities are learned by counting occurrences in a large body of text, called a text corpus. For a bigram model, the probability is estimated by counting how many times the phrase "recognize speech" appears in the corpus and dividing it by the total number of times "recognize" appears.
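This counting estimate can be sketched in a few lines of Python. The toy corpus and the helper name `bigram_probability` below are assumptions for illustration, not part of KenLM:

```python
from collections import Counter

def bigram_probability(corpus_tokens, prev_word, word):
    """Estimate P(word | prev_word) by counting bigrams in a token list."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

# Toy corpus: "recognize" appears three times, followed by "speech" twice
tokens = "it was recognize speech and recognize speech not recognize beach".split()
p = bigram_probability(tokens, "recognize", "speech")
print(p)  # 2 of the 3 occurrences of "recognize" are followed by "speech"
```

Real toolkits like KenLM refine this raw ratio with smoothing (modified Kneser-Ney, in KenLM's case) so that unseen n-grams do not receive zero probability.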
KenLM is a highly optimized library specifically designed for creating, storing, and querying n-gram language models. It is favored in many speech recognition systems for two primary reasons: speed and memory efficiency. It can process gigabytes of text to produce a language model in a compact binary format that can be queried with very low latency, which is essential for building a responsive ASR system.
Creating a language model with KenLM involves a clear, three-step process: acquiring and cleaning a text corpus, building the model, and converting it to an efficient binary format.
The process of creating a KenLM language model, from raw text to a final, optimized binary file.
The quality of your language model depends entirely on the quality and relevance of your text corpus. For general-purpose ASR, a large corpus from sources like Wikipedia or public domain books is a good starting point. For domain-specific applications, like transcribing medical notes, you would use a corpus of medical documents.
Before feeding the text to KenLM, it must be cleaned and normalized. This typically involves:

- Converting all text to lowercase.
- Removing punctuation and other non-alphabetic characters.
- Filtering out words that are not in the system's vocabulary.
Here is a simple Python script to perform basic cleaning on a file raw_text.txt and save the result to corpus.txt.
import re

# A simple set of words to create the vocabulary.
# In a real system, you would generate this from your training data transcripts.
vocab = {"the", "a", "an", "is", "of", "and", "recognize", "speech", "wreck", "nice", "beach", "it", "was"}

def normalize_text(text):
    text = text.lower()
    # Keep only alphabetic characters and spaces
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize and filter based on vocabulary
    words = [word for word in text.split() if word in vocab]
    return " ".join(words)

with open('raw_text.txt', 'r') as infile, open('corpus.txt', 'w') as outfile:
    for line in infile:
        cleaned_line = normalize_text(line)
        if cleaned_line:  # Avoid writing empty lines
            outfile.write(cleaned_line + '\n')

print("Corpus preprocessing complete. Output saved to corpus.txt")
This script ensures that the text fed into KenLM is consistent and free of characters that would be misinterpreted as words. The vocabulary filtering step is important for ensuring the LM only contains words the acoustic model can produce.
Once the corpus is prepared, you can use KenLM's lmplz command to build the n-gram model. The tool reads your processed text and outputs the model in a standard text-based format called ARPA.
Let's build a trigram ($n=3$) model. The -o flag specifies the order of the n-gram.
# Assuming KenLM is installed and in your PATH
lmplz -o 3 < corpus.txt > model.arpa
This command will stream the corpus.txt file into lmplz, which will count the n-grams and compute their probabilities. The output is saved to model.arpa. An ARPA file is human-readable and lists the log-probabilities and backoff weights for all learned unigrams, bigrams, and trigrams.
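To make the format concrete, an ARPA file has the following overall shape. The counts and log-probabilities below are illustrative placeholders, not output from an actual run:

```text
\data\
ngram 1=13
ngram 2=42
ngram 3=61

\1-grams:
-1.52	recognize	-0.30
-1.78	speech	-0.25

\2-grams:
-0.41	recognize speech	-0.11

\end\
```

Each entry lists the log10 probability, the n-gram itself, and (where applicable) a backoff weight used when a higher-order n-gram is missing.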
While the ARPA format is useful for inspection, it is not efficient for fast lookups during decoding. KenLM provides another tool, build_binary, to convert the .arpa file into a compressed binary format that significantly reduces memory usage and increases query speed.
build_binary model.arpa model.binary
The resulting model.binary file is the one you will use within your ASR decoder.
To see the model in action, you can use KenLM's query tool. This allows you to score a sentence and see the log10 probability the model assigns to it. A higher score (i.e., less negative) indicates a more probable sentence according to the corpus.
Let's test our running example.
$ echo "recognize speech" | query model.binary
-1.875333 recognize speech p: -0.983 -0.892 </s> -0.521
$ echo "wreck a nice beach" | query model.binary
-4.129871 wreck a nice beach p: -1.341 -1.112 -0.998 -0.678 </s> -0.521
The output shows a total log10 probability for each sentence. As expected, "recognize speech" receives a much better score (about -1.88) than "wreck a nice beach" (about -4.13). During decoding, this difference is exactly what helps the system choose the correct transcription when the acoustic evidence is ambiguous. With this model.binary file in hand, we are now ready to integrate it into a decoding algorithm to improve our ASR system's accuracy.
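A minimal sketch of how these LM scores break a tie during decoding. The acoustic log-probabilities and the interpolation weight below are assumed values for illustration; only the LM scores come from the query output above:

```python
# Hypothetical decoding step: two candidate transcriptions that the
# acoustic model finds nearly indistinguishable.
candidates = {
    "recognize speech": -10.1,    # assumed acoustic log-probability
    "wreck a nice beach": -10.0,  # assumed acoustic log-probability
}

# Language-model log10 scores, taken from the query examples above.
lm_scores = {
    "recognize speech": -1.875333,
    "wreck a nice beach": -4.129871,
}

lm_weight = 2.0  # assumed language-model weight

def combined_score(sentence):
    """Acoustic score plus weighted LM score, as in a typical decoder."""
    return candidates[sentence] + lm_weight * lm_scores[sentence]

best = max(candidates, key=combined_score)
print(best)  # the LM tips the balance toward "recognize speech"
```

Even though the acoustic model very slightly prefers the wrong hypothesis here, the weighted LM score overwhelms that small difference, which is precisely the behavior described above.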