In the preceding chapters, we constructed acoustic models that map audio features to sequences of character probabilities. While these models are effective at identifying phonetic content, their output can be acoustically plausible yet linguistically incorrect. For instance, a model might transcribe "recognize speech" as the phonetically similar "wreck a nice beach".
This is where a language model (LM) comes in. An LM scores a sequence of words based on its grammatical structure and the likelihood of its occurrence, helping the system distinguish between sensible and nonsensical transcriptions. The process of using an LM to guide the selection of the final text from the acoustic model's predictions is called decoding. A decoder's objective is to find the word sequence $W$ that maximizes a combined score, often a weighted sum of the acoustic and language model log-probabilities:
$$\text{score}(W) = \log P_{\text{Acoustic}}(X \mid W) + \alpha \log P_{\text{LM}}(W)$$

Here, $P_{\text{Acoustic}}(X \mid W)$ is the probability the acoustic model assigns to the audio features $X$ given the word sequence $W$, $P_{\text{LM}}(W)$ is the probability of the word sequence itself, and $\alpha$ is a weight that balances the influence of the two models.
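To make the effect of this score concrete, here is a minimal Python sketch that compares two competing hypotheses from the chapter's opening example. The log-probability values and the weight $\alpha = 0.5$ are made-up numbers chosen for illustration, not outputs of any real acoustic or language model.

```python
import math  # kept for clarity: scores are natural-log probabilities

def combined_score(log_p_acoustic: float, log_p_lm: float, alpha: float = 0.5) -> float:
    """Weighted sum of acoustic and language model log-probabilities."""
    return log_p_acoustic + alpha * log_p_lm

# Two acoustically similar hypotheses for the same audio (illustrative values):
# the LM assigns far higher probability to the grammatical word sequence.
hypotheses = {
    "recognize speech":   {"log_p_ac": -4.1, "log_p_lm": -6.0},
    "wreck a nice beach": {"log_p_ac": -3.9, "log_p_lm": -14.0},
}

for words, p in hypotheses.items():
    score = combined_score(p["log_p_ac"], p["log_p_lm"], alpha=0.5)
    print(f"{words!r}: score = {score:.2f}")
```

With these numbers, "recognize speech" scores $-4.1 + 0.5 \times (-6.0) = -7.10$ while "wreck a nice beach" scores $-3.9 + 0.5 \times (-14.0) = -10.90$, so the linguistically sensible transcription wins even though its acoustic score is slightly worse. Tuning $\alpha$ shifts how much the decoder trusts the LM over the acoustics.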
This chapter covers the theory and practice of integrating language models into an ASR system, working through the following sections:
5.1 The Function of Language Models in ASR
5.2 N-gram Language Models
5.3 Building an N-gram Model with KenLM
5.4 Decoding Graphs for Model Integration
5.5 Decoding Algorithms: Greedy Search vs Beam Search
5.6 Implementing Beam Search with a Language Model
5.7 Hands-on Practical: Integrating a Language Model into a CTC Decoder