Beam search decoders improve upon greedy search by keeping multiple candidate transcriptions. However, their decisions are often based solely on acoustic evidence. To produce linguistically coherent output, the language model's score must be integrated directly into the beam search process. This integration modifies the scoring function used to rank hypotheses at each step.

The decoder's goal shifts from finding the most acoustically likely path to finding the path that best satisfies both the acoustic model and the language model. The score for each hypothesis in the beam is updated using the combined formula introduced earlier:

$$ \text{score}(W) = \text{score}_{\text{Acoustic}} + \alpha \cdot \text{score}_{\text{LM}} + \beta \cdot \text{word\_count} $$

Let's break down how each part is incorporated:

- $\text{score}_{\text{Acoustic}}$: The running log probability of the character sequence, calculated from the CTC output. This is the score a standard beam search decoder would use.
- $\text{score}_{\text{LM}}$: The log probability of the full word sequence, as calculated by the external n-gram language model. This score is only applied when a complete word is formed.
- $\alpha$ (alpha): The language model weight. This hyperparameter controls the influence of the LM. A higher α prioritizes grammatical correctness, sometimes at the expense of acoustic accuracy, while a lower α makes the decoder trust the acoustic model more.
- $\beta$ (beta): The word insertion bonus. This term adds a small bonus for each word in the hypothesis. It counteracts the natural tendency of probability models to favor shorter sequences (since multiplying more probabilities, each less than 1, yields an ever-smaller number). It ensures that the decoder does not unfairly penalize longer, correct sentences.

## The Scoring Mechanism in Action

The integration of the language model occurs at specific moments during decoding.
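As a quick sanity check on the formula, here is a minimal sketch with made-up log probabilities (the numbers are illustrative, not from a real model) showing how the weighted LM score and word bonus can shift the ranking of two hypotheses:

```python
def combined_score(acoustic_score, lm_score, word_count, alpha, beta):
    """Combined hypothesis score: acoustic + weighted LM + word insertion bonus."""
    return acoustic_score + alpha * lm_score + beta * word_count

# Hypothetical log-probabilities: "wreck a nice beach" is slightly more
# acoustically likely, but "recognize speech" is far more likely under the LM.
a = combined_score(acoustic_score=-8.5, lm_score=-2.0, word_count=2, alpha=0.8, beta=0.5)
b = combined_score(acoustic_score=-8.1, lm_score=-7.0, word_count=4, alpha=0.8, beta=0.5)
print(a > b)  # → True: the LM boost outweighs the weaker acoustic score
```

Note that even though the longer hypothesis collects a larger word bonus, the LM term dominates here, which is exactly the balance α and β are meant to control.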
The beam search algorithm proceeds timestep by timestep, extending each hypothesis with new characters. The acoustic score is updated with every new character. However, the language model score is only calculated and added when a word boundary is identified, typically when a space character is appended to a hypothesis.

Consider two competing hypotheses in the beam: "recognize" and "wreck a nice". When the next part of the audio sounds like "speech", the acoustic model might produce high probabilities for the characters s-p-ee-ch.

1. **Acoustic Update**: The decoder extends "recognize" with a space, then s, p, and so on. The acoustic score is updated at each character extension.
2. **LM Query**: Once the hypothesis becomes "recognize speech ", the decoder identifies a complete word. It then queries the language model for the probability of "speech" following "recognize". The LM will return a relatively high probability for this sequence.
3. **Score Combination**: The weighted LM score is added to the total score of this hypothesis, giving it a significant boost.

Simultaneously, the "wreck a nice" hypothesis is extended to "wreck a nice beach". When "wreck a nice beach " is finalized, the LM is queried for the probability of "beach" following "wreck a nice". While phonetically plausible, this phrase is less common, so the LM will assign it a lower score.
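The LM query at a word boundary can be made concrete with a toy bigram model. The probabilities and the backoff value below are invented for illustration; a real decoder would query a trained n-gram model via a toolkit such as KenLM:

```python
import math

# Toy bigram log-probabilities (invented for illustration only).
BIGRAM_LOGPROB = {
    ("recognize", "speech"): math.log(0.10),
    ("nice", "beach"): math.log(0.001),
}
UNSEEN_LOGPROB = math.log(1e-6)  # crude fallback for unseen bigrams

def lm_score(word, history):
    """Log P(word | previous word) under the toy bigram model."""
    return BIGRAM_LOGPROB.get((history[-1], word), UNSEEN_LOGPROB)

print(lm_score("speech", ["recognize"]))          # ≈ -2.30
print(lm_score("beach", ["wreck", "a", "nice"]))  # ≈ -6.91
```

The score gap (roughly 4.6 nats here) is what lets "recognize speech" overtake a phonetically similar but less common phrase once the weighted LM score is added.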
As a result, the "recognize speech" hypothesis will likely achieve a higher total score and remain in the beam, while "wreck a nice beach" may be pruned.

```dot
digraph BeamSearchLM {
    rankdir=TB;
    graph [fontname="Helvetica", size="12,8!", pad="1.0", splines=ortho, nodesep="1.5", ranksep="1.5"];
    node [shape=record, style="rounded,filled", fontname="Helvetica", fillcolor="#e9ecef", fontsize=12];
    edge [fontname="Helvetica", fontsize=10];

    caption [label="At a word boundary, the acoustic score is combined with the weighted LM score.\nA linguistically probable hypothesis like 'recognize speech' receives a significant score boost,\nlikely overcoming a slightly weaker acoustic score.", shape=plaintext, fontsize=11, fontcolor="#495057"];

    // Invisible clusters control horizontal flow and labels
    subgraph cluster_t {
        label = "Beam after '...recognize' and '...wreck a'";
        style = invis;
        t0 [label="{'...recognize' | score: -4.2}"];
        t1 [label="{'...wreck a' | score: -3.9}"];
    }
    subgraph cluster_t_plus_1 {
        label = "Candidates after next word";
        style = invis;
        tp1 [label="{'...recognize speech' | acoustic_score: -8.5}", fillcolor="#d0bfff"];
        tp2 [label="{'...wreck a beach' | acoustic_score: -8.1}", fillcolor="#d0bfff"];
    }
    subgraph cluster_final {
        label = "Final Score Update with LM";
        style = invis;
        f1 [label="{'...recognize speech' | final_score: -8.5 + α*P('speech'|'recognize')}", fillcolor="#96f2d7", shape=oval];
        f2 [label="{'...wreck a beach' | final_score: -8.1 + α*P('beach'|'a')}", fillcolor="#ffc9c9", shape=oval];
    }

    t0 -> tp1 [label=" + ' speech'"];
    t1 -> tp2 [label=" + ' beach'"];
    tp1 -> f1 [label="Query LM", color="#1c7ed6", style=dashed];
    tp2 -> f2 [label="Query LM", color="#f03e3e", style=dashed];

    // Align nodes in the same rank
    {rank=same; t0; t1;}
    {rank=same; tp1; tp2;}
    {rank=same; f1; f2;}
}
```

The diagram above shows how two hypotheses are scored.
Even if "wreck a beach" has a slightly better acoustic score, the high probability assigned by the language model to "recognize speech" can elevate its final score, making it the preferred transcription.

## A Pseudocode Implementation

To make this process more concrete, here is a high-level algorithm for a CTC beam search decoder that incorporates an n-gram language model.

```python
from math import log

# A simplified representation of the algorithm
def ctc_beam_search_decoder(acoustic_probs, beam_width, lm, alpha, beta):
    # beam = [(prefix_text, total_score, lm_state)]
    # lm_state stores the n-gram history for a hypothesis
    initial_lm_state = lm.initial_state()
    beam = [("", 0.0, initial_lm_state)]

    for timestep_probs in acoustic_probs:
        new_beam = []
        for prefix, score, lm_state in beam:
            for char, char_prob in timestep_probs.items():
                # Standard CTC logic for handling blanks and repeated chars
                # (omitted for clarity)
                new_prefix = prefix + char
                new_acoustic_score = score + log(char_prob)

                # Check if a word was just completed
                if char == ' ':
                    # Score the completed word with the LM
                    word = get_last_word(new_prefix)
                    lm_score = lm.score(word, state=lm_state)
                    # Get the new LM state for the next word
                    new_lm_state = lm.update_state(word, lm_state)
                    # Update the total score with the LM score and word bonus
                    total_score = new_acoustic_score + alpha * lm_score + beta
                    new_beam.append((new_prefix, total_score, new_lm_state))
                else:
                    # Not a word boundary: carry the LM state forward unchanged
                    new_beam.append((new_prefix, new_acoustic_score, lm_state))

        # Prune the beam: sort by score and keep the top `beam_width` hypotheses
        beam = sorted(new_beam, key=lambda x: x[1], reverse=True)[:beam_width]

    # Return the text of the best hypothesis in the final beam
    best_hypothesis = max(beam, key=lambda x: x[1])
    return best_hypothesis[0]
```

## Practical Notes

- **State Management**: As the pseudocode implies, each hypothesis in the beam must maintain its own language model state.
For an n-gram model, this state is simply the last n-1 words of the hypothesis. This ensures that the probability of the next word is calculated based on its correct context.
- **Tuning α and β**: The α and β values are not fixed. They are important hyperparameters that must be tuned on a validation set to find the optimal balance for your specific acoustic model and language model. This is typically done by running decoding with different values and selecting the combination that yields the lowest Word Error Rate (WER).
- **Performance**: Querying an external language model for every new word in every hypothesis adds computational overhead. Efficient LM toolkits like KenLM are designed for fast queries, making this process feasible even for large beams.

By integrating a language model into the decoding search, you transform the ASR system from a simple acoustic-to-phonetic transcriber into a more intelligent system that understands and applies linguistic rules, significantly improving the quality and readability of the final transcription.
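The state-management point above is simple enough to sketch directly. For an n-gram LM, advancing a hypothesis's state just means appending the completed word and keeping the last n-1 words of context (the helper name here is illustrative, not from any specific toolkit):

```python
def update_lm_state(state, word, n=3):
    """Append a completed word and keep only the last n-1 words of context."""
    return (state + (word,))[-(n - 1):]

# Each hypothesis carries its own state tuple through the beam.
state = ()
for w in ["wreck", "a", "nice", "beach"]:
    state = update_lm_state(state, w, n=3)
print(state)  # → ('nice', 'beach'): the trigram context for the next word
```

Because the state is a small immutable tuple, competing hypotheses can share prefixes cheaply while each still queries the LM with its own correct context.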