The decoder's primary function is to find the single sequence of words ($W$) that best matches the incoming audio features ($O$). This isn't just a matter of finding what sounds right; it's about finding what sounds right and makes sense linguistically. The decoder achieves this by combining the strengths of the acoustic model and the language model.

This process is elegantly captured by the fundamental equation of speech recognition:

$$ \hat{W} = \underset{W}{\mathrm{argmax}} \, P(O|W) \times P(W) $$

Let's break down what this means in practical terms.

### Balancing Two Sources of Evidence

Imagine the ASR system is listening to someone say something that sounds like "wreck a nice beach."

**The Acoustic Model's Contribution ($P(O|W)$):** The acoustic model listens to the audio features ($O$) and calculates the probability that those sounds were produced by the words in a candidate sentence ($W$). For the phrase "wreck a nice beach," the acoustic model would likely return a high probability: the sounds match the words well. However, for the phrase "recognize speech," the acoustic model might also return a high probability, because the two phrases are acoustically very similar (a classic near-homophone pair). The acoustic model, on its own, can be easily confused.

**The Language Model's Contribution ($P(W)$):** The language model has no knowledge of the audio. Its job is to determine the probability of a sequence of words ($W$) appearing in the language. It has been trained on massive amounts of text and knows that the phrase "recognize speech" is common, while "wreck a nice beach" is nonsensical and extremely rare. Therefore, it will assign a high probability to "recognize speech" and a near-zero probability to "wreck a nice beach."

The decoder's job is to generate these possible sentences, called hypotheses, and then act as a judge, weighing the evidence from both models to make a final decision.

### Combining Scores for a Final Verdict

The decoder multiplies the acoustic model score by the language model score for every hypothesis. The hypothesis with the highest combined score wins.

**Hypothesis 1: "wreck a nice beach"**
- Acoustic Score: High (e.g., 0.9)
- Language Model Score: Very Low (e.g., 0.0001)
- Combined Score: $0.9 \times 0.0001 = 0.00009$

**Hypothesis 2: "recognize speech"**
- Acoustic Score: High (e.g., 0.88)
- Language Model Score: High (e.g., 0.1)
- Combined Score: $0.88 \times 0.1 = 0.088$

Even though the acoustic scores were very close, the language model acted as a powerful tie-breaker, making "recognize speech" the clear winner.

### Working with Log Probabilities

In a real system, multiplying many small probabilities together can lead to a problem called numerical underflow, where the result is so small that the computer treats it as zero. To avoid this, systems work with log probabilities instead. By taking the logarithm of the probabilities, multiplication becomes addition, which is computationally faster and more stable.

The formula then becomes:

$$ \hat{W} = \underset{W}{\mathrm{argmax}} \, (\log P(O|W) + \log P(W)) $$

The goal is the same: find the hypothesis with the highest score. Since the logarithm of a probability (a number between 0 and 1) is always negative, this is equivalent to finding the score closest to zero.
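The decision rule can be made concrete with a few lines of Python. The sketch below is only illustrative: it assumes the per-hypothesis acoustic and language model probabilities have already been computed, and it reuses the example numbers from above. It scores each hypothesis both ways, by multiplying raw probabilities and by summing log probabilities, and then picks the argmax.

```python
import math

# Hypothetical example scores: these are the illustrative numbers from the
# text, not outputs of a real acoustic or language model.
hypotheses = {
    "wreck a nice beach": {"acoustic": 0.90, "language": 0.0001},
    "recognize speech":   {"acoustic": 0.88, "language": 0.1},
}

def combined_score(scores):
    # P(O|W) * P(W): prone to numerical underflow for long sentences.
    return scores["acoustic"] * scores["language"]

def combined_log_score(scores):
    # log P(O|W) + log P(W): the numerically stable equivalent.
    return math.log(scores["acoustic"]) + math.log(scores["language"])

# The decoder's decision rule: argmax over all candidate hypotheses.
best = max(hypotheses, key=lambda words: combined_log_score(hypotheses[words]))

for words, scores in hypotheses.items():
    print(f"{words!r}: product = {combined_score(scores):.5f}, "
          f"log-sum = {combined_log_score(scores):.2f}")
print("Chosen transcription:", best)
```

Both scoring functions pick the same winner; the log-sum version simply trades multiplication for addition, so it remains stable even when a hypothesis contains many words and the product of probabilities would otherwise underflow to zero.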
The following diagram illustrates how the decoder uses scores from both models to resolve ambiguity.

```dot
digraph G {
    rankdir=TB;
    splines=ortho;
    graph [fontname="Arial", fontsize=12];
    node [shape=box, style="rounded,filled", fontname="Arial", fontsize=10];
    edge [fontname="Arial", fontsize=9];

    subgraph cluster_input {
        label="Input Audio";
        style="rounded,filled";
        bgcolor="#f8f9fa";
        audio [label="Audio sounding like\n\"recognize speech\"", shape=note, fillcolor="#a5d8ff"];
    }

    subgraph cluster_models {
        label="Probabilistic Models";
        style="rounded,filled";
        bgcolor="#f8f9fa";
        am [label="Acoustic Model", fillcolor="#b2f2bb", shape=cylinder];
        lm [label="Language Model", fillcolor="#d8f5a2", shape=cylinder];
    }

    subgraph cluster_hypotheses {
        label="Candidate Hypotheses & Scores";
        style="rounded,filled";
        bgcolor="#f8f9fa";
        hyp1 [label="Hypothesis 1:\n\"wreck a nice beach\"", fillcolor="#ffc9c9"];
        hyp2 [label="Hypothesis 2:\n\"recognize speech\"", fillcolor="#c0eb75"];
    }

    subgraph cluster_decoder {
        label="Decoder";
        style="rounded,filled";
        bgcolor="#f8f9fa";
        decoder [label="Combine Scores\n&\nSelect Best Hypothesis", shape=cds, fillcolor="#bac8ff"];
    }

    final [label="Chosen Transcription:\n\"recognize speech\"", shape=box, style="rounded,filled", fillcolor="#74c0fc", penwidth=2, color="#1c7ed6"];

    audio -> am [style=dashed];
    am -> hyp1 [label=" Acoustic Log Score = -0.1\n(Sounds very likely)"];
    am -> hyp2 [label=" Acoustic Log Score = -0.13\n(Sounds very likely)"];
    lm -> hyp1 [label=" LM Log Score = -9.2\n(Makes no sense)"];
    lm -> hyp2 [label=" LM Log Score = -2.3\n(Makes perfect sense)"];
    hyp1 -> decoder [label=" Combined Score:\n-0.1 + (-9.2) = -9.3"];
    hyp2 -> decoder [label=" Combined Score:\n-0.13 + (-2.3) = -2.43"];
    decoder -> final [penwidth=2, color="#1c7ed6"];
}
```

A diagram of the decoding process. The decoder receives two acoustically plausible hypotheses. The language model assigns a very poor score to the nonsensical phrase, allowing the decoder to select the correct transcription with a much higher combined score.

Ultimately, the challenge for the decoder is that the number of possible word sequences can be astronomically large (the sketch below gives a rough sense of scale). It cannot simply test every possible sentence in the English language. Instead, it must use efficient search algorithms to navigate this space and find the optimal path. We will look at how it accomplishes this in the next section.
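As a rough, back-of-the-envelope illustration of that search space, the numbers below are assumptions chosen purely for illustration: a 50,000-word vocabulary and a ten-word utterance.

```python
# Assumed numbers, chosen only to illustrate the scale of the problem.
vocab_size = 50_000        # assumed size of the recognizer's vocabulary
sentence_length = 10       # assumed number of words in the utterance

num_hypotheses = vocab_size ** sentence_length
print(f"{num_hypotheses:.2e} candidate word sequences")  # about 9.77e+46
```

Roughly $10^{47}$ candidate sequences is far too many to score one by one, which is why the decoder relies on the efficient search strategies introduced in the next section.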