After transforming raw audio into structured feature vectors, the next step is to build a model that maps these features to their corresponding text transcriptions. This is the primary function of the acoustic model, the component that learns the relationship between speech sounds and linguistic units like characters.
This chapter introduces the use of deep neural networks for acoustic modeling. We will begin by examining how Recurrent Neural Networks (RNNs) and their more capable variants, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, are structured to process sequential data.
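To make the structure concrete, here is a minimal sketch of an RNN-style acoustic model: a bidirectional LSTM that consumes a sequence of feature frames and emits per-frame character scores. This uses PyTorch, and all dimensions (80 mel features, 29 output characters, and so on) are illustrative assumptions, not values fixed by the chapter.

```python
import torch
import torch.nn as nn

# Illustrative sketch: a 2-layer bidirectional LSTM mapping acoustic
# feature frames to per-frame character logits. All sizes are assumptions.
num_mel_features = 80   # e.g. log-mel filterbank size (assumed)
hidden_size = 128
num_chars = 29          # e.g. 26 letters + space + apostrophe + CTC blank

lstm = nn.LSTM(input_size=num_mel_features, hidden_size=hidden_size,
               num_layers=2, bidirectional=True, batch_first=True)
projection = nn.Linear(2 * hidden_size, num_chars)  # 2x for bidirectional

features = torch.randn(4, 200, num_mel_features)  # (batch, frames, features)
outputs, _ = lstm(features)                       # (batch, frames, 2*hidden)
logits = projection(outputs)                      # (batch, frames, num_chars)
print(logits.shape)  # torch.Size([4, 200, 29])
```

Note that the model produces one score vector per input frame; turning these frame-level scores into a character sequence is exactly the alignment problem that CTC, introduced next, is designed to solve.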
A significant challenge in this process is that the input feature sequence and the output text sequence have different lengths and no explicit alignment. We will address this using the Connectionist Temporal Classification (CTC) loss function. CTC is a mechanism that allows the network to learn alignments automatically, simplifying the training process. The goal is to train a model that can map an input feature sequence X = (x₁, …, x_T) to a target character sequence Y = (y₁, …, y_U), where the input length T is typically not equal to the output length U.
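The following short sketch shows how this unaligned training objective looks in practice, using PyTorch's `nn.CTCLoss`. The input has T = 100 frames while the target has only U = 20 characters; CTC sums over all valid alignments internally. The batch size, vocabulary size, and lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of CTC loss over unaligned sequences. Sizes are illustrative.
T, N, C = 100, 2, 29  # input frames, batch size, characters (blank at index 0)

# CTCLoss expects log-probabilities of shape (T, N, C).
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Random target character indices; index 0 is reserved for the CTC blank.
targets = torch.randint(1, C, (N, 20))
input_lengths = torch.full((N,), T, dtype=torch.long)    # T = 100 frames
target_lengths = torch.full((N,), 20, dtype=torch.long)  # U = 20 characters

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # a non-negative scalar; no frame-level alignment needed
```

Because the loss takes only the sequences and their lengths, no frame-by-frame labeling of the audio is ever required, which is what makes CTC training so convenient.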
By the end of this chapter, you will be able to:

- Explain the role of the acoustic model within an ASR system
- Build acoustic models using RNNs and their LSTM and GRU variants
- Describe how the CTC loss function resolves the alignment problem between feature frames and characters
- Implement and train a CTC-based ASR model
The chapter concludes with a hands-on practical session where you will build and train a basic LSTM-CTC acoustic model.
3.1 Overview of Acoustic Models in ASR
3.2 Building Acoustic Models with Recurrent Neural Networks
3.3 Addressing Sequential Challenges with LSTMs and GRUs
3.4 Connectionist Temporal Classification (CTC) Loss
3.5 Implementing a CTC-based ASR Model
3.6 Hands-on Practical: Training a Simple LSTM Acoustic Model with CTC