Once your acoustic model processes an audio signal, it produces a matrix of probabilities. For each time step, it assigns a probability to every possible character in your vocabulary (including a special blank token for CTC). The challenge now is to navigate this sea of probabilities to find the most likely sequence of words. An exhaustive search, trying every single possible path, is computationally infeasible because the number of paths grows exponentially with the sequence length.
This is where decoding algorithms come in. They are intelligent search strategies designed to find a high-quality transcription without evaluating every possibility. We will look at two main approaches: a simple, fast method called greedy search, and a more effective but computationally intensive method called beam search.
Greedy search, also known as best path decoding, is the most straightforward decoding strategy. At each time step of the acoustic model's output, it simply selects the single character with the highest probability. It's "greedy" because it makes the locally optimal choice at each step, hoping it will lead to a globally optimal result.
Let's illustrate with an example. Imagine our model has processed a short audio clip and produced the following probability matrix for the first few time steps. For simplicity, we only show the top few character probabilities.
Time step      1      2      3      4      5
---------------------------------------------
'c'           0.1    0.8    0.1    0.1    0.1
'a'           0.2    0.1    0.7    0.1    0.0
't'           0.6    0.1    0.1    0.2    0.8
'_' (blank)   0.1    0.0    0.1    0.6    0.1
A greedy decoder would perform the following steps:
1. Time step 1: the highest probability is 0.6, for 't'. Path: t
2. Time step 2: the highest probability is 0.8, for 'c'. Path: tc
3. Time step 3: the highest probability is 0.7, for 'a'. Path: tca
4. Time step 4: the highest probability is 0.6, for the blank token. Path: tca_
5. Time step 5: the highest probability is 0.8, for 't'. Path: tca_t

The raw output path is tca_t. To get the final transcription, we apply two CTC post-processing rules:

1. Collapse repeated characters: tca_t contains no consecutive repeats, so it is unchanged.
2. Remove blank tokens: tca_t becomes tcat.
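These steps fit in a few lines of Python. The sketch below is a minimal illustration using the probability matrix from the table above; the function name and layout are just one possible way to write it.

```python
import numpy as np

# Example probability matrix from the table above (rows: 'c', 'a', 't', blank).
# Shape: (vocabulary_size, time_steps) = (4, 5).
probs = np.array([
    [0.1, 0.8, 0.1, 0.1, 0.1],   # 'c'
    [0.2, 0.1, 0.7, 0.1, 0.0],   # 'a'
    [0.6, 0.1, 0.1, 0.2, 0.8],   # 't'
    [0.1, 0.0, 0.1, 0.6, 0.1],   # '_' (blank)
])
vocab = ["c", "a", "t", "_"]

def greedy_ctc_decode(probs, vocab, blank="_"):
    # Step 1: pick the most probable character at every time step.
    best_path = [vocab[i] for i in probs.argmax(axis=0)]   # ['t', 'c', 'a', '_', 't']

    # Step 2: collapse consecutive repeated characters.
    collapsed = [ch for i, ch in enumerate(best_path)
                 if i == 0 or ch != best_path[i - 1]]

    # Step 3: remove blank tokens.
    return "".join(ch for ch in collapsed if ch != blank)

print(greedy_ctc_decode(probs, vocab))   # -> "tcat"
```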
This seems reasonable, but the greedy approach has a significant flaw: it is short-sighted. A choice that looks best at one time step can lead to a dead end later on. For example, the model might be slightly more confident about the character 'w' than 'r' at the beginning of "recognize". A greedy decoder would lock in 'w' and might be forced to produce "wreck a nice beach" because recovering from the initial mistake is impossible. It has no mechanism for backtracking or for considering slightly less likely but ultimately more promising paths.
Beam search provides a more robust solution by exploring multiple potential paths, or hypotheses, simultaneously. Instead of committing to a single best character at each step, it maintains a list of the k most probable hypotheses. This list is called the "beam," and k is the "beam width."
The process works as follows:
1. Initialize: Select the top k characters from the first time step's probability distribution. These k characters form our initial beam of hypotheses.
2. Expand: For each of the k hypotheses currently in the beam, extend it by appending every possible character from the vocabulary. If the beam width is k and the vocabulary size is N, this creates k * N new candidate hypotheses.
3. Score and prune: Compute a score for each of the k * N new candidates, sort all candidates by their score, and keep only the top k hypotheses. This new set of k hypotheses becomes the beam for the next time step.

Steps 2 and 3 repeat for every remaining time step; once the last step has been processed, the highest-scoring hypothesis in the beam is returned as the transcription.

The diagram below illustrates this process with a beam width of k=2. At each step, all possible extensions are considered, but only the two most promising paths are kept for the next step, while the others (shown in gray) are discarded.
A visualization of beam search decoding with a beam width of 2. At each time step, hypotheses are expanded, but only the top two (blue) are retained for the next step. Less likely paths (gray) are pruned. The numbers represent the log-probability scores.
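The loop described above can be sketched compactly in Python. This is a simplified character-level version that scores whole paths with summed log-probabilities and reuses the probs and vocab variables from the greedy example; a full CTC decoder would additionally merge paths that collapse to the same text, which is omitted here for brevity.

```python
import math

def beam_search_decode(probs, vocab, beam_width=2):
    beam = [("", 0.0)]                      # (path, cumulative log-probability)
    num_steps = len(probs[0])

    for t in range(num_steps):
        candidates = []
        # Expand: extend every hypothesis in the beam with every vocabulary character.
        for path, score in beam:
            for i, ch in enumerate(vocab):
                p = probs[i][t]
                if p > 0.0:                 # skip zero-probability extensions
                    candidates.append((path + ch, score + math.log(p)))
        # Score and prune: keep only the top-k candidates for the next time step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]

    return beam                             # beam[0] is the highest-scoring hypothesis

# Reusing the matrix and vocabulary from the greedy example:
print(beam_search_decode(probs, vocab, beam_width=2))
```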
The real power of beam search is unlocked when we integrate the language model. Instead of scoring hypotheses based only on the acoustic model's output, we can use the combined score formula introduced at the beginning of the chapter:
score(hypothesis) = acoustic_score + α ⋅ lm_score

During the beam search process, specifically in the scoring and pruning step, we adjust the score of each candidate hypothesis. The acoustic_score is the cumulative probability from the acoustic model. The lm_score is the probability assigned by the language model to the sequence of words formed by the hypothesis so far. The weight α is a hyperparameter that you can tune to control how much influence the language model has on the final decision. A higher α makes the decoder favor grammatically correct or common phrases, while a lower α makes it trust the acoustic model more.
For example, if the decoder is considering two hypotheses, "recognize speech" and "wreck a nice beach", the acoustic scores might be very similar. However, a well-trained language model will assign a much higher probability to "recognize speech", boosting its total score and ensuring it remains in the beam.
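To make this concrete, here is a small numeric sketch of the combined score for those two hypotheses. The log-probability values and the weight α = 0.8 are purely illustrative, not outputs of any real model.

```python
# Hypothetical log-probability scores for two competing hypotheses.
hypotheses = {
    "recognize speech":   {"acoustic": -4.1, "lm": -2.0},
    "wreck a nice beach": {"acoustic": -4.0, "lm": -9.5},
}

alpha = 0.8  # language model weight, typically tuned on a validation set

for text, scores in hypotheses.items():
    total = scores["acoustic"] + alpha * scores["lm"]
    print(f"{text}: {total:.2f}")

# recognize speech: -5.70
# wreck a nice beach: -11.60
```

Even though the acoustic scores are nearly identical, the language model term pushes "recognize speech" far ahead, so it stays in the beam while the competing hypothesis is pruned.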
A larger beam width k allows the decoder to explore more possibilities, increasing the chance of finding the correct transcription. However, this comes at the cost of increased computation and memory, as the number of hypotheses to evaluate at each step grows linearly with k. In practice, choosing a beam width between 5 and 10 often provides a good balance between accuracy and performance.
By keeping multiple hypotheses active and incorporating linguistic knowledge, beam search significantly outperforms greedy search and is the standard decoding algorithm used in most high-performance speech recognition systems today.