An ASR system needs to balance acoustic evidence with linguistic plausibility. The acoustic model provides a stream of probabilities for characters or phonemes, but searching through every possible combination of those characters to form a sentence is computationally intractable: the search space is simply too large. To make the search manageable, linguistic knowledge, including the lexicon and the language model, is pre-compiled into a highly optimized structure known as a decoding graph.
A decoding graph is a finite-state network that represents the entire valid search space for our decoder. Think of it as a map where every possible path from a start node to an end node corresponds to a sentence that is permissible according to our language rules. By restricting the search to the paths within this graph, we drastically reduce the number of hypotheses we need to consider.
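To make this concrete, here is a minimal sketch of such a network in Python, assuming a toy representation in which each state maps to a list of (next state, word label, weight) arcs and weights are negative log probabilities. The state names, words, and probabilities are invented purely for illustration; a real decoding graph is far larger and operates on character or phoneme arcs.

```python
from math import log

# Toy decoding graph: state -> list of (next_state, word_label, weight) arcs.
# Weights are negative log probabilities, so a lower cost means a more likely arc.
graph = {
    "<start>": [("s1", "recognize", -log(0.6)), ("s2", "wreck", -log(0.1))],
    "s1":      [("<end>", "speech", -log(0.7)), ("<end>", "beach", -log(0.05))],
    "s2":      [("s3", "a", -log(0.5))],
    "s3":      [("s4", "nice", -log(0.4))],
    "s4":      [("<end>", "beach", -log(0.6))],
}

def enumerate_paths(state, words=(), cost=0.0):
    """Recursively list every word sequence the graph permits, with its total cost."""
    if state == "<end>":
        yield words, cost
        return
    for next_state, word, weight in graph.get(state, []):
        yield from enumerate_paths(next_state, words + (word,), cost + weight)

# Every path from <start> to <end> is a sentence the graph considers permissible.
for words, cost in sorted(enumerate_paths("<start>"), key=lambda p: p[1]):
    print(" ".join(words), f"(cost={cost:.2f})")
```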
The graph is constructed by combining several sources of information. While the formal process involves composing Weighted Finite-State Transducers (WFSTs), we can understand it as layering different constraints.
The Lexicon (L): This is the system's dictionary. It maps words to their constituent units, which for modern end-to-end models are typically characters. For example, the lexicon defines that the word "speech" is composed of the sequence s-p-e-e-c-h. It acts as a fundamental constraint, ensuring that the decoder can only form valid words.
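The snippet below sketches this constraint with a hypothetical character-level lexicon. The word list, the `spell` helper, and the `<space>` boundary marker are all assumptions made for illustration, not part of any particular toolkit.

```python
# A hypothetical character-level lexicon: each word maps to the unit
# sequence the acoustic model must produce for that word to be accepted.
lexicon = {
    "speech":    list("speech"),     # ['s', 'p', 'e', 'e', 'c', 'h']
    "beach":     list("beach"),
    "recognize": list("recognize"),
}

def spell(words):
    """Expand a word sequence into the character path the decoder must follow."""
    units = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"'{word}' is not in the lexicon, so no valid path exists for it")
        units.extend(lexicon[word])
        units.append("<space>")  # word-boundary marker (an assumption of this sketch)
    return units

print(spell(["recognize", "speech"]))
```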
The Language Model (G): This is the grammar or statistical model of word sequences we discussed in the chapter introduction. An n-gram language model, for instance, provides the probability of a word given its preceding words. This information is encoded as weighted transitions between word states in the graph. A transition from "recognize" to "speech" would have a higher probability (and thus a more favorable weight) than a transition from "recognize" to "beach".
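As a rough illustration, the sketch below turns toy bigram probabilities into the kind of weights that would sit on the graph's word-to-word transitions. The probabilities and the small floor value for unseen pairs are invented for this example.

```python
from math import log

# A toy bigram language model: P(next_word | previous_word).
# The probabilities are invented purely for illustration.
bigram = {
    ("recognize", "speech"): 0.60,
    ("recognize", "beach"):  0.02,
}

def transition_weight(prev_word, next_word):
    """Convert an LM probability into a graph edge weight (negative log probability)."""
    prob = bigram.get((prev_word, next_word), 1e-6)  # small floor for unseen pairs
    return -log(prob)

print(transition_weight("recognize", "speech"))  # low cost: a favored transition
print(transition_weight("recognize", "beach"))   # high cost: a penalized transition
```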
By combining the lexicon (L) and the language model (G), we create a new, more complex graph. In this combined graph, a path doesn't just represent a sequence of words; it represents the underlying sequence of characters that form those valid word sequences.
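The following sketch mimics that combination by brute-force expansion: each word transition allowed by the toy bigram model is rewritten as the character sequence of the next word, with the LM weight carried on its first character arc. Real systems perform this step with WFST composition and weight pushing; this code only illustrates the idea, and all values are invented.

```python
from math import log

# Toy lexicon and bigram LM, reused for the expansion sketch.
lexicon = {"recognize": "recognize", "speech": "speech", "beach": "beach"}
bigram = {("recognize", "speech"): 0.60, ("recognize", "beach"): 0.02}

def expand(prev_word, next_word):
    """Return (character, weight) arcs for taking next_word after prev_word."""
    lm_cost = -log(bigram.get((prev_word, next_word), 1e-6))
    chars = lexicon[next_word]
    # Place the whole LM cost on the first character arc; remaining arcs are free.
    return [(chars[0], lm_cost)] + [(c, 0.0) for c in chars[1:]]

for word in ("speech", "beach"):
    arcs = expand("recognize", word)
    total = sum(weight for _, weight in arcs)
    print(word, arcs[:2], f"... total cost={total:.2f}")
```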
The final decoding graph contains a massive number of paths, each representing a potential transcription. The decoder's job is to find the single best path through this graph, guided by the output of the acoustic model.
Here is how the process works at a high level:

1. The acoustic model processes the audio and emits, for each time step, a probability distribution over characters.
2. The decoder advances its hypotheses only along the arcs of the decoding graph, so it can only spell valid words in orders the language model permits.
3. Each hypothesis accumulates a combined score: the acoustic probabilities of its characters plus the lexicon and language model weights stored on the graph's arcs.
4. When the audio ends, the complete path with the best combined score is read off as the transcription.
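The sketch below puts these steps together on the toy word graph from earlier, assuming each arc already carries a combined lexicon/LM weight and that we have a single made-up acoustic cost per word. Real decoders score frame by frame rather than per word, so treat this only as an illustration of balancing the two knowledge sources.

```python
# Toy graph reused from the earlier sketch; weights are rounded negative log probabilities.
graph = {
    "<start>": [("s1", "recognize", 0.5), ("s2", "wreck", 2.3)],
    "s1":      [("<end>", "speech", 0.4), ("<end>", "beach", 3.0)],
    "s2":      [("s3", "a", 0.7)],
    "s3":      [("s4", "nice", 0.9)],
    "s4":      [("<end>", "beach", 0.5)],
}
# Invented per-word acoustic costs (how well the audio matches each word's characters).
acoustic_cost = {"recognize": 1.0, "wreck": 1.1, "a": 0.2,
                 "nice": 0.8, "speech": 0.9, "beach": 0.9}

def best_path(state):
    """Return (total_cost, words) for the cheapest path from `state` to <end>."""
    if state == "<end>":
        return 0.0, []
    best = (float("inf"), [])
    for next_state, word, graph_cost in graph.get(state, []):
        tail_cost, tail_words = best_path(next_state)
        total = graph_cost + acoustic_cost[word] + tail_cost
        if total < best[0]:
            best = (total, [word] + tail_words)
    return best

cost, words = best_path("<start>")
print(" ".join(words), f"(total cost={cost:.2f})")  # -> "recognize speech"
```

Exhaustive recursion is fine for a graph this small; production decoders use dynamic programming and pruning, which the following sections cover.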
This structure elegantly integrates all our knowledge. The graph itself enforces the lexicon and language model rules, while the acoustic model scores guide the search in real time. This prevents the decoder from ever exploring nonsensical character sequences or grammatically bizarre sentences, making the search for the optimal transcription efficient and effective.
The diagram below illustrates a simplified portion of a decoding graph. The nodes represent words, and the edges represent the transitions permitted by the language model. A decoder would traverse this graph from left to right, finding a path like "recognize speech" that is both acoustically and linguistically probable. The alternative path "wreck a nice beach," while acoustically similar, would be penalized by the low probabilities of its word-to-word transitions in a well-trained language model.
A simplified view of a decoding search graph. Thicker, green lines indicate a high-probability path favored by the language model. Thinner, red lines show a less likely alternative. The decoder searches for the path that best balances these LM probabilities with the acoustic model's output.
This pre-compiled graph is the foundation upon which decoding algorithms operate. In the next sections, we will examine two such algorithms, Greedy Search and Beam Search, to see how they traverse this structure to produce the final transcription.