In the previous chapter, we constructed acoustic models using LSTMs and the Connectionist Temporal Classification (CTC) loss. While CTC works well in practice, it relies on a strong assumption: the output predictions at each timestep are conditionally independent of one another given the input audio. This chapter moves to modern architectures that model the dependencies between output characters or words directly, leading to more accurate and context-aware transcriptions.
You will begin with attention mechanisms, which allow a model to selectively focus on relevant segments of the input audio when generating each part of the output transcript. Building on this, you will learn about sequence-to-sequence (Seq2Seq) architectures, such as the Listen, Attend, and Spell (LAS) model, that directly map an input audio sequence to an output text sequence.
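To make the attention step concrete, here is a minimal sketch of dot-product attention in PyTorch: given the current decoder state, it scores every encoded audio frame, turns the scores into a softmax distribution, and returns a weighted summary of the input. The function name, tensor shapes, and hidden size are illustrative choices, not code from the models built later in this chapter.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, encoder_states):
    """Minimal dot-product attention.

    query:          (batch, hidden)        current decoder state
    encoder_states: (batch, time, hidden)  encoded audio frames
    Returns a context vector (batch, hidden) and weights (batch, time).
    """
    # Score each encoder frame against the decoder query.
    scores = torch.bmm(encoder_states, query.unsqueeze(-1)).squeeze(-1)   # (batch, time)
    # Normalize the scores into a distribution over input frames.
    weights = F.softmax(scores, dim=-1)                                   # (batch, time)
    # Weighted sum of encoder states: the "focused" summary of the audio.
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # (batch, hidden)
    return context, weights

# Illustrative shapes: 2 utterances, 50 encoded frames, hidden size 256.
enc = torch.randn(2, 50, 256)
q = torch.randn(2, 256)
ctx, w = dot_product_attention(q, enc)
print(ctx.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 50])
```

At each decoding step the decoder recomputes these weights, so different output characters can attend to different stretches of the audio.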
We will then cover the Transformer architecture and its use of self-attention for processing audio. You will see how this design captures dependencies across the entire audio input. We will also examine the Conformer model, a hybrid architecture that combines convolutions for local feature extraction with the global context modeling of Transformers.
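As a quick preview of self-attention over audio, the sketch below uses PyTorch's built-in multi-head attention module, where the queries, keys, and values all come from the same sequence of frames. The batch size, sequence length, and feature dimension are placeholder values.

```python
import torch
import torch.nn as nn

# Self-attention over a sequence of encoded audio frames: every frame can
# attend to every other frame, so each timestep's representation can draw
# on context from the whole utterance.
frames = torch.randn(2, 50, 256)  # (batch, time, features) - illustrative sizes
self_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

# Queries, keys, and values are all the input sequence itself.
out, attn_weights = self_attn(frames, frames, frames)
print(out.shape, attn_weights.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 50, 50])
```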
The chapter concludes with an overview of large, pre-trained ASR models like Wav2Vec 2.0. You will learn how to fine-tune these models on specific datasets, a common and effective technique for achieving high performance. The practical section will guide you through fine-tuning a pre-trained model with the Hugging Face Transformers library, giving you direct experience with a state-of-the-art ASR workflow.
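As a small preview of that workflow, the sketch below loads a pre-trained Wav2Vec 2.0 checkpoint with the Hugging Face Transformers library and runs a single CTC training step on dummy audio. The checkpoint name, the dummy waveform, and the target text are placeholders; the practice section walks through a full fine-tuning run on a real dataset.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Example checkpoint; choose one whose vocabulary matches your target language.
checkpoint = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Common fine-tuning choice: keep the convolutional feature encoder frozen
# and update only the Transformer layers and the output head.
model.freeze_feature_encoder()

# One training step on a dummy 1-second utterance at 16 kHz.
audio = torch.randn(16000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()  # gradients ready for an optimizer step
```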
4.1 Attention Mechanisms for Speech Recognition
4.2 Sequence-to-Sequence (Seq2Seq) Models for ASR
4.3 Listen, Attend, and Spell (LAS) Architecture
4.4 Introduction to Transformer Models for ASR
4.5 Conformer: Combining CNNs and Transformers
4.6 An Overview of Pre-trained ASR Models
4.7 Practice: Fine-tuning a Pre-trained ASR Model