An ASR system needs to balance acoustic evidence with linguistic plausibility. The acoustic model provides a stream of probabilities for characters or phonemes, but searching through every possible combination of those characters to form a sentence is computationally intractable: the search space is simply too large. To make the search manageable, linguistic knowledge, including the lexicon and the language model, is pre-compiled into a highly optimized structure known as a decoding graph.
A decoding graph is a finite-state network that represents the entire valid search space for our decoder. Think of it as a map where every possible path from a start node to an end node corresponds to a sentence that is permissible according to our language rules. By restricting the search to the paths within this graph, we drastically reduce the number of hypotheses we need to consider.
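To make this concrete, here is a minimal sketch of such a network in Python, assuming a toy representation in which each state maps to a list of (next state, word label, weight) arcs and weights are negative log probabilities. The state names, words, and probabilities are invented purely for illustration; a real decoding graph is far larger and operates on character or phoneme arcs.

```python
from math import log

# Toy decoding graph: state -> list of (next_state, word_label, weight) arcs.
# Weights are negative log probabilities, so a lower cost means a more likely arc.
graph = {
    "<start>": [("s1", "recognize", -log(0.6)), ("s2", "wreck", -log(0.1))],
    "s1":      [("<end>", "speech", -log(0.7)), ("<end>", "beach", -log(0.05))],
    "s2":      [("s3", "a", -log(0.5))],
    "s3":      [("s4", "nice", -log(0.4))],
    "s4":      [("<end>", "beach", -log(0.6))],
}

def enumerate_paths(state, words=(), cost=0.0):
    """Recursively list every word sequence the graph permits, with its total cost."""
    if state == "<end>":
        yield words, cost
        return
    for next_state, word, weight in graph.get(state, []):
        yield from enumerate_paths(next_state, words + (word,), cost + weight)

# Every path from <start> to <end> is a sentence the graph considers permissible.
for words, cost in sorted(enumerate_paths("<start>"), key=lambda p: p[1]):
    print(" ".join(words), f"(cost={cost:.2f})")
```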
The graph is constructed by combining several sources of information. While the formal process involves composing Weighted Finite-State Transducers (WFSTs), we can understand it as layering different constraints.
The Lexicon (L): This is the system's dictionary. It maps words to their constituent units, which for modern end-to-end models are typically characters. For example, the lexicon defines that the word "speech" is composed of the sequence s-p-e-e-c-h. It acts as a fundamental constraint, ensuring that the decoder can only form valid words.
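The snippet below sketches this constraint with a hypothetical character-level lexicon. The word list, the `spell` helper, and the `<space>` boundary marker are all assumptions made for illustration, not part of any particular toolkit.

```python
# A hypothetical character-level lexicon: each word maps to the unit
# sequence the acoustic model must produce for that word to be accepted.
lexicon = {
    "speech":    list("speech"),     # ['s', 'p', 'e', 'e', 'c', 'h']
    "beach":     list("beach"),
    "recognize": list("recognize"),
}

def spell(words):
    """Expand a word sequence into the character path the decoder must follow."""
    units = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"'{word}' is not in the lexicon, so no valid path exists for it")
        units.extend(lexicon[word])
        units.append("<space>")  # word-boundary marker (an assumption of this sketch)
    return units

print(spell(["recognize", "speech"]))
```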
The Language Model (G): This is the grammar or statistical model of word sequences we discussed in the chapter introduction. An n-gram language model, for instance, provides the probability of a word given its preceding words. This information is encoded as weighted transitions between word states in the graph. A transition from "recognize" to "speech" would have a higher probability (and thus a more favorable weight) than a transition from "recognize" to "beach".
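As a rough illustration, the sketch below turns toy bigram probabilities into the kind of weights that would sit on the graph's word-to-word transitions. The probabilities and the small floor value for unseen pairs are invented for this example.

```python
from math import log

# A toy bigram language model: P(next_word | previous_word).
# The probabilities are invented purely for illustration.
bigram = {
    ("recognize", "speech"): 0.60,
    ("recognize", "beach"):  0.02,
}

def transition_weight(prev_word, next_word):
    """Convert an LM probability into a graph edge weight (negative log probability)."""
    prob = bigram.get((prev_word, next_word), 1e-6)  # small floor for unseen pairs
    return -log(prob)

print(transition_weight("recognize", "speech"))  # low cost: a favored transition
print(transition_weight("recognize", "beach"))   # high cost: a penalized transition
```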
By combining the lexicon (L) and the language model (G), we create a new, more complex graph. In this combined graph, a path doesn't just represent a sequence of words; it represents the underlying sequence of characters that form those valid word sequences.
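The following sketch mimics that combination by brute-force expansion: each word transition allowed by the toy bigram model is rewritten as the character sequence of the next word, with the LM weight carried on its first character arc. Real systems perform this step with WFST composition and weight pushing; this code only illustrates the idea, and all values are invented.

```python
from math import log

# Toy lexicon and bigram LM, reused for the expansion sketch.
lexicon = {"recognize": "recognize", "speech": "speech", "beach": "beach"}
bigram = {("recognize", "speech"): 0.60, ("recognize", "beach"): 0.02}

def expand(prev_word, next_word):
    """Return (character, weight) arcs for taking next_word after prev_word."""
    lm_cost = -log(bigram.get((prev_word, next_word), 1e-6))
    chars = lexicon[next_word]
    # Place the whole LM cost on the first character arc; remaining arcs are free.
    return [(chars[0], lm_cost)] + [(c, 0.0) for c in chars[1:]]

for word in ("speech", "beach"):
    arcs = expand("recognize", word)
    total = sum(weight for _, weight in arcs)
    print(word, arcs[:2], f"... total cost={total:.2f}")
```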
The final decoding graph contains a massive number of paths, each representing a potential transcription. The decoder's job is to find the single best path through this graph, guided by the output of the acoustic model.
Here is how the process works at a high level:

1. The acoustic model processes the audio and emits, for each time step, a probability distribution over characters.
2. The decoder advances its hypotheses only along the arcs of the decoding graph, so it can only spell valid words in orders the language model permits.
3. Each hypothesis accumulates a combined score: the acoustic probabilities of its characters plus the lexicon and language model weights stored on the graph's arcs.
4. When the audio ends, the complete path with the best combined score is read off as the transcription.
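The sketch below puts these steps together on the toy word graph from earlier, assuming each arc already carries a combined lexicon/LM weight and that we have a single made-up acoustic cost per word. Real decoders score frame by frame rather than per word, so treat this only as an illustration of balancing the two knowledge sources.

```python
# Toy graph reused from the earlier sketch; weights are rounded negative log probabilities.
graph = {
    "<start>": [("s1", "recognize", 0.5), ("s2", "wreck", 2.3)],
    "s1":      [("<end>", "speech", 0.4), ("<end>", "beach", 3.0)],
    "s2":      [("s3", "a", 0.7)],
    "s3":      [("s4", "nice", 0.9)],
    "s4":      [("<end>", "beach", 0.5)],
}
# Invented per-word acoustic costs (how well the audio matches each word's characters).
acoustic_cost = {"recognize": 1.0, "wreck": 1.1, "a": 0.2,
                 "nice": 0.8, "speech": 0.9, "beach": 0.9}

def best_path(state):
    """Return (total_cost, words) for the cheapest path from `state` to <end>."""
    if state == "<end>":
        return 0.0, []
    best = (float("inf"), [])
    for next_state, word, graph_cost in graph.get(state, []):
        tail_cost, tail_words = best_path(next_state)
        total = graph_cost + acoustic_cost[word] + tail_cost
        if total < best[0]:
            best = (total, [word] + tail_words)
    return best

cost, words = best_path("<start>")
print(" ".join(words), f"(total cost={cost:.2f})")  # -> "recognize speech"
```

Exhaustive recursion is fine for a graph this small; production decoders use dynamic programming and pruning, which the following sections cover.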
This structure elegantly integrates all our knowledge. The graph itself enforces the lexicon and language model rules, while the acoustic model scores guide the search in real time. This prevents the decoder from ever exploring nonsensical character sequences or grammatically bizarre sentences, making the search for the optimal transcription efficient and effective.
The diagram below illustrates a simplified portion of a decoding graph. The nodes represent words, and the edges represent the transitions permitted by the language model. A decoder would traverse this graph from left to right, finding a path like "recognize speech" that is both acoustically and linguistically probable. The alternative path "wreck a nice beach," while acoustically similar, would be penalized by the low probabilities of its word-to-word transitions in a well-trained language model.
A simplified view of a decoding search graph. Thicker, green lines indicate a high-probability path favored by the language model. Thinner, red lines show a less likely alternative. The decoder searches for the path that best balances these LM probabilities with the acoustic model's output.
This pre-compiled graph is the foundation upon which decoding algorithms operate. In the next sections, we will examine two such algorithms, Greedy Search and Beam Search, to see how they traverse this structure to produce the final transcription.