Once your acoustic model processes an audio signal, it produces a matrix of probabilities. For each time step, it assigns a probability to every possible character in your vocabulary (including a special blank token for CTC). The challenge now is to navigate this sea of probabilities to find the most likely sequence of words. An exhaustive search, trying every single possible path, is computationally infeasible because the number of paths grows exponentially with the sequence length.
This is where decoding algorithms come in. They are intelligent search strategies designed to find a high-quality transcription without evaluating every possibility. We will look at two main approaches: a simple, fast method called greedy search, and a more effective but computationally intensive method called beam search.
Greedy search, also known as best path decoding, is the most straightforward decoding strategy. At each time step of the acoustic model's output, it simply selects the single character with the highest probability. It's "greedy" because it makes the locally optimal choice at each step, hoping it will lead to a globally optimal result.
Let's illustrate with an example. Imagine our model has processed a short audio clip and produced the following probability matrix for the first few time steps. For simplicity, we only show the top few character probabilities.
Time step      1      2      3      4      5
---------------------------------------------
'c'           0.1    0.8    0.1    0.1    0.1
'a'           0.2    0.1    0.7    0.1    0.0
't'           0.6    0.1    0.1    0.2    0.8
'_' (blank)   0.1    0.0    0.1    0.6    0.1
A greedy decoder would perform the following steps:
1. Time step 1: the highest probability is 0.6, for 't'. Path: t
2. Time step 2: the highest probability is 0.8, for 'c'. Path: tc
3. Time step 3: the highest probability is 0.7, for 'a'. Path: tca
4. Time step 4: the highest probability is 0.6, for the blank token. Path: tca_
5. Time step 5: the highest probability is 0.8, for 't'. Path: tca_t

The raw output path is tca_t. To get the final transcription, we apply two CTC post-processing rules:

1. Collapse repeated characters: tca_t contains no consecutive repeats, so it is unchanged.
2. Remove blank tokens: tca_t becomes tcat.
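These steps fit in a few lines of Python. The sketch below is a minimal illustration using the probability matrix from the table above; the function name and layout are just one possible way to write it.

```python
import numpy as np

# Example probability matrix from the table above (rows: 'c', 'a', 't', blank).
# Shape: (vocabulary_size, time_steps) = (4, 5).
probs = np.array([
    [0.1, 0.8, 0.1, 0.1, 0.1],   # 'c'
    [0.2, 0.1, 0.7, 0.1, 0.0],   # 'a'
    [0.6, 0.1, 0.1, 0.2, 0.8],   # 't'
    [0.1, 0.0, 0.1, 0.6, 0.1],   # '_' (blank)
])
vocab = ["c", "a", "t", "_"]

def greedy_ctc_decode(probs, vocab, blank="_"):
    # Step 1: pick the most probable character at every time step.
    best_path = [vocab[i] for i in probs.argmax(axis=0)]   # ['t', 'c', 'a', '_', 't']

    # Step 2: collapse consecutive repeated characters.
    collapsed = [ch for i, ch in enumerate(best_path)
                 if i == 0 or ch != best_path[i - 1]]

    # Step 3: remove blank tokens.
    return "".join(ch for ch in collapsed if ch != blank)

print(greedy_ctc_decode(probs, vocab))   # -> "tcat"
```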
This seems reasonable, but the greedy approach has a significant flaw: it is short-sighted. A choice that looks best at one time step can lead to a dead end later on. For example, the model might be slightly more confident about the character 'w' than 'r' at the beginning of "recognize". A greedy decoder would lock in 'w' and might be forced to produce "wreck a nice beach" because recovering from the initial mistake is impossible. It has no mechanism for backtracking or for considering slightly less likely but ultimately more promising paths.
Beam search provides a more robust solution by exploring multiple potential paths, or hypotheses, simultaneously. Instead of committing to a single best character at each step, it maintains a list of the k most probable hypotheses. This list is called the "beam," and k is the "beam width."
The process works as follows:
1. Initialize: Select the top k characters from the first time step's probability distribution. These k characters form our initial beam of hypotheses.
2. Expand: For each of the k hypotheses currently in the beam, extend it by appending every possible character from the vocabulary. If the beam width is k and the vocabulary size is N, this creates k * N new candidate hypotheses.
3. Score and prune: Compute a score for each of the k * N new candidates, sort all candidates by their score, and keep only the top k hypotheses. This new set of k hypotheses becomes the beam for the next time step.

Steps 2 and 3 repeat for every remaining time step; once the last step has been processed, the highest-scoring hypothesis in the beam is returned as the transcription.

The diagram below illustrates this process with a beam width of k=2. At each step, all possible extensions are considered, but only the two most promising paths are kept for the next step, while the others (shown in gray) are discarded.
A visualization of beam search decoding with a beam width of 2. At each time step, hypotheses are expanded, but only the top two (blue) are retained for the next step. Less likely paths (gray) are pruned. The numbers represent the log-probability scores.
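The loop described above can be sketched compactly in Python. This is a simplified character-level version that scores whole paths with summed log-probabilities and reuses the probs and vocab variables from the greedy example; a full CTC decoder would additionally merge paths that collapse to the same text, which is omitted here for brevity.

```python
import math

def beam_search_decode(probs, vocab, beam_width=2):
    beam = [("", 0.0)]                      # (path, cumulative log-probability)
    num_steps = len(probs[0])

    for t in range(num_steps):
        candidates = []
        # Expand: extend every hypothesis in the beam with every vocabulary character.
        for path, score in beam:
            for i, ch in enumerate(vocab):
                p = probs[i][t]
                if p > 0.0:                 # skip zero-probability extensions
                    candidates.append((path + ch, score + math.log(p)))
        # Score and prune: keep only the top-k candidates for the next time step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]

    return beam                             # beam[0] is the highest-scoring hypothesis

# Reusing the matrix and vocabulary from the greedy example:
print(beam_search_decode(probs, vocab, beam_width=2))
```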
The real power of beam search is unlocked when we integrate the language model. Instead of scoring hypotheses based only on the acoustic model's output, we can use the combined score formula introduced at the beginning of the chapter:
score(hypothesis) = acoustic_score + α ⋅ lm_score

During the beam search process, specifically in the scoring and pruning step, we adjust the score of each candidate hypothesis. The acoustic_score is the cumulative probability from the acoustic model. The lm_score is the probability assigned by the language model to the sequence of words formed by the hypothesis so far. The weight α is a hyperparameter that you can tune to control how much influence the language model has on the final decision. A higher α makes the decoder favor grammatically correct or common phrases, while a lower α makes it trust the acoustic model more.
For example, if the decoder is considering two hypotheses, "recognize speech" and "wreck a nice beach", the acoustic scores might be very similar. However, a well-trained language model will assign a much higher probability to "recognize speech", boosting its total score and ensuring it remains in the beam.
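To make this concrete, here is a small numeric sketch of the combined score for those two hypotheses. The log-probability values and the weight α = 0.8 are purely illustrative, not outputs of any real model.

```python
# Hypothetical log-probability scores for two competing hypotheses.
hypotheses = {
    "recognize speech":   {"acoustic": -4.1, "lm": -2.0},
    "wreck a nice beach": {"acoustic": -4.0, "lm": -9.5},
}

alpha = 0.8  # language model weight, typically tuned on a validation set

for text, scores in hypotheses.items():
    total = scores["acoustic"] + alpha * scores["lm"]
    print(f"{text}: {total:.2f}")

# recognize speech: -5.70
# wreck a nice beach: -11.60
```

Even though the acoustic scores are nearly identical, the language model term pushes "recognize speech" far ahead, so it stays in the beam while the competing hypothesis is pruned.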
A larger beam width k allows the decoder to explore more possibilities, increasing the chance of finding the correct transcription. However, this comes at the cost of increased computation and memory, as the number of hypotheses to evaluate at each step grows linearly with k. In practice, choosing a beam width between 5 and 10 often provides a good balance between accuracy and performance.
By keeping multiple hypotheses active and incorporating linguistic knowledge, beam search significantly outperforms greedy search and is the standard decoding algorithm used in most high-performance speech recognition systems today.