In the previous chapters, you learned how to build an acoustic model, which maps audio features to phonemes, and a language model, which determines the likelihood of a sequence of words. This chapter addresses the final step: combining these two sources of information to produce the most probable text transcription.
The core task of speech recognition is to find the word sequence (W) that is most probable given the observed audio features (O). This is often expressed as finding the sequence that maximizes the product of two probabilities:
$$\hat{W} = \underset{W}{\arg\max}\; P(O \mid W) \times P(W)$$

Here, $P(O \mid W)$ is the probability assigned by the acoustic model, and $P(W)$ is the probability from the language model. The component responsible for this calculation and search is the decoder.
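To make the decoding objective concrete, the sketch below scores a handful of candidate transcriptions and picks the one maximizing the combined probability. The candidate sentences and their scores are invented for illustration; in practice, working in log space turns the product into a sum and avoids numerical underflow on long utterances.

```python
# Hypothetical scores for three candidate transcriptions of one utterance.
# log_acoustic stands in for log P(O|W); log_lm stands in for log P(W).
candidates = {
    "recognize speech":   {"log_acoustic": -12.1, "log_lm": -4.2},
    "wreck a nice beach": {"log_acoustic": -11.8, "log_lm": -9.5},
    "recognise peach":    {"log_acoustic": -13.0, "log_lm": -7.1},
}

def combined_score(scores):
    # In log space the product P(O|W) * P(W) becomes a sum.
    return scores["log_acoustic"] + scores["log_lm"]

# The decoder's job, in miniature: argmax over candidate word sequences.
best = max(candidates, key=lambda w: combined_score(candidates[w]))
print(best)  # prints "recognize speech"
```

A real decoder cannot enumerate candidates this way, since the space of possible sentences grows exponentially with utterance length; that is why the search algorithms covered in this chapter are needed.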
You will learn how the decoder functions as a search algorithm, navigating through a massive space of possible sentences to find the best one. We will cover the logic behind search strategies like the Viterbi algorithm. Following that, we will review the complete ASR pipeline from start to finish. To conclude the chapter, you will learn to evaluate system performance using Word Error Rate (WER) and identify common challenges that affect transcription accuracy.
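As a preview of the evaluation step, Word Error Rate can be computed with a standard Levenshtein (edit distance) dynamic program over words. The sketch below is a minimal illustration; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    found via edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# Two words of the reference are missing: 2 deletions over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 0.333...
```

Section 5.6 discusses WER and its interpretation in detail.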
5.1 The Role of the Decoder
5.2 Finding the Most Likely Sequence of Words
5.3 Introduction to Search Algorithms
5.4 Understanding the Viterbi Algorithm
5.5 The Complete ASR Pipeline: A Review
5.6 Evaluating Performance: Word Error Rate (WER)
5.7 Common Challenges in Speech Recognition
© 2026 ApX Machine Learning