While simple Recurrent Neural Networks (RNNs) are designed for sequential data, they encounter significant limitations when processing the long and variable sequences found in speech. Their main weakness is the difficulty of learning and retaining information over extended time steps, often described as the problem of long-range dependencies. This issue arises primarily from the vanishing gradient problem, where the influence of past inputs on the current output diminishes exponentially as the sequence gets longer.
For an ASR system, this is a serious drawback. The model must be able to connect sounds that occurred several seconds apart to form a coherent word or phrase. For instance, to correctly transcribe the end of a sentence, the model might need to recall information from the very beginning. Simple RNNs are not well-equipped for this task. To overcome these challenges, we turn to more sophisticated recurrent architectures: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
LSTMs were explicitly designed to remember information for long periods. They introduce a cell state that acts as a conveyor belt of information, running straight down the entire sequence with only minor linear interactions. This structure makes it much easier for information to flow unchanged.
The magic of LSTMs lies in their ability to regulate this information flow using three specialized components called gates. These gates are small neural networks that learn which information is important to add, keep, or remove from the cell state.
Forget Gate: This gate decides what information to discard from the previous cell state. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$ and outputs a number between 0 and 1 for each number in the previous cell state $C_{t-1}$. A 1 represents "completely keep this," while a 0 represents "completely get rid of this." In speech, this could mean forgetting the acoustic signature of a brief silence between words.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Input Gate: This gate determines what new information will be stored in the cell state. It has two parts: a sigmoid layer that decides which values we'll update (the input gate $i_t$), and a tanh layer that creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. This is how the model incorporates new sounds, like the beginning of a new phoneme.
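In the same notation as the forget gate, these two parts are:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$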
Output Gate: This gate determines the network's output. The output will be a filtered version of the cell state. First, we run a sigmoid layer to decide which parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
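In equation form, the output gate and the resulting hidden state $h_t$ are:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$

where $C_t$ is the new cell state computed below.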
The new cell state $C_t$ is calculated by combining the old state (multiplied by the forget gate) and the new candidate values (multiplied by the input gate).
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

This gated mechanism allows the LSTM to selectively remember or forget information, making it exceptionally powerful for modeling the complex temporal dynamics of speech.
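To make the gating concrete, here is a minimal NumPy sketch of a single LSTM time step that follows the equations above. The weight shapes, feature dimension, and hidden size are illustrative assumptions, not values from a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations above."""
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)        # input gate: which candidates to admit
    c_tilde = np.tanh(W_C @ z + b_C)    # candidate values
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose

    c_t = f_t * c_prev + i_t * c_tilde  # new cell state
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t

# Illustrative sizes (assumed): 40-dim acoustic features, 64-dim hidden state
input_size, hidden_size = 40, 64
rng = np.random.default_rng(0)
shapes = [(hidden_size, hidden_size + input_size), hidden_size] * 4  # W, b for each gate
params = [rng.standard_normal(s) * 0.1 for s in shapes]

x_t = rng.standard_normal(input_size)
h0 = np.zeros(hidden_size)
c0 = np.zeros(hidden_size)
h1, c1 = lstm_step(x_t, h0, c0, *params)
print(h1.shape, c1.shape)  # (64,) (64,)
```

In a real acoustic model this step would be applied frame by frame across an utterance, with $h_t$ and $C_t$ carried forward at each step.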
An LSTM cell. The top line represents the cell state, which carries information across time steps. The gates (in red) control how information is added to or removed from this state.
The Gated Recurrent Unit, or GRU, is a more recent and slightly simpler alternative to the LSTM. It combines the forget and input gates into a single update gate and merges the cell state and hidden state. This results in a model that is more computationally efficient.
A GRU cell has two main gates:

Update Gate: This gate plays a role similar to the combined forget and input gates of the LSTM. It decides how much of the previous hidden state to keep and how much of the new candidate state to mix in.

Reset Gate: This gate controls how much of the previous hidden state is used when computing the new candidate state. When it is close to 0, the unit effectively ignores the past and starts fresh from the current input.
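For reference, one common formulation of the GRU updates, written in the same notation as the LSTM equations above, is:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$

Here the update gate $z_t$ interpolates between keeping the previous hidden state and adopting the new candidate $\tilde{h}_t$, while the reset gate $r_t$ controls how much past information feeds into that candidate.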
The simplified structure of GRUs means they have fewer parameters than LSTMs, which can make them faster to train and less prone to overfitting on smaller datasets. In many ASR tasks, GRUs have been shown to deliver performance comparable to LSTMs.
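As a quick sanity check of the parameter savings, here is a short PyTorch comparison; the layer sizes are arbitrary choices for illustration.

```python
import torch.nn as nn

# Same input/hidden sizes for both; values chosen only for illustration
input_size, hidden_size = 80, 256

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(f"LSTM parameters: {count_params(lstm):,}")  # 4 gate weight sets per layer
print(f"GRU parameters:  {count_params(gru):,}")   # 3 gate weight sets, about 25% fewer
```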
A GRU cell. It simplifies the LSTM design by combining gates and merging the cell and hidden states, leading to a more streamlined architecture.
To further enhance our model's ability to understand context, we can employ two additional strategies: bidirectionality and stacking.
Bidirectional RNNs

In speech, context is not just historical; it is also forward-looking. The pronunciation or meaning of a word can be influenced by the words that follow it. For example, to distinguish between the two pronunciations of "read" in "I read the book" and "I will read the book," the model benefits from seeing the entire sentence.
A bidirectional RNN (Bi-LSTM or Bi-GRU) processes the input sequence in two directions. One recurrent layer processes the sequence from start to finish (forward pass), while a second, independent layer processes it from finish to start (backward pass). At each time step, the outputs from both layers are concatenated to form the final representation. This allows the model to have a complete picture of the surrounding context for every point in the sequence.
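A minimal PyTorch sketch of this idea, assuming 80-dimensional acoustic features and an arbitrary hidden size:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: 80-dim feature vectors, 128-dim hidden state
feature_dim, hidden_size = 80, 128
bi_lstm = nn.LSTM(feature_dim, hidden_size, batch_first=True, bidirectional=True)

# A batch of 4 utterances, each 200 frames long
frames = torch.randn(4, 200, feature_dim)
outputs, _ = bi_lstm(frames)

# Forward and backward outputs are concatenated at every time step
print(outputs.shape)                       # torch.Size([4, 200, 256])
forward_out = outputs[..., :hidden_size]   # start-to-finish pass
backward_out = outputs[..., hidden_size:]  # finish-to-start pass
```

Note that the output dimension doubles: each time step now carries both past and future context.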
Stacked RNNs

Just like with other types of neural networks, we can increase the depth of our model by stacking recurrent layers on top of one another. In a stacked RNN, the output sequence of the first layer becomes the input sequence for the second layer, and so on.
This approach allows the network to learn hierarchical representations of the data. The first layer might learn to detect basic acoustic features and phonemes. The second layer could then learn to combine these phoneme representations into syllables or word fragments, and subsequent layers could learn even higher-level linguistic structures. A typical ASR acoustic model might use anywhere from 2 to 6 stacked bidirectional LSTM or GRU layers.
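Putting the two ideas together, here is a sketch of a stacked bidirectional acoustic model in PyTorch. The layer count, sizes, and label vocabulary are placeholder assumptions; the real values depend on your dataset and the CTC setup covered next.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stacked Bi-LSTM that maps acoustic frames to per-frame label scores."""
    def __init__(self, feature_dim=80, hidden_size=256, num_layers=4, num_labels=29):
        super().__init__()
        # num_layers stacked bidirectional LSTM layers, dropout between layers
        self.encoder = nn.LSTM(
            feature_dim, hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.1,
        )
        # Project the concatenated forward/backward states to label scores
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, frames):
        # frames: (batch, time, feature_dim)
        encoded, _ = self.encoder(frames)   # (batch, time, 2 * hidden_size)
        return self.classifier(encoded)     # (batch, time, num_labels)

model = AcousticModel()
dummy = torch.randn(2, 300, 80)  # 2 utterances, 300 frames of 80-dim features
print(model(dummy).shape)        # torch.Size([2, 300, 29])
```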
By using LSTMs or GRUs, often in a stacked, bidirectional configuration, we create an acoustic model that is highly effective at capturing the complex, long-range patterns of human speech. This powerful sequence-to-sequence architecture is the foundation upon which we will build our CTC-based training pipeline in the next section.