While building large Transformer or Conformer models from scratch can yield impressive results, this approach demands two scarce resources: massive amounts of labeled training data and immense computational power. For many organizations and developers, creating a high-quality ASR system under these constraints is impractical. This challenge has led to a significant shift in the field towards using pre-trained models.
The core principle behind these models is to first train a large neural network on a general, large-scale task that doesn't require manually created labels, and then adapt this knowledgeable model to a specific, smaller, labeled dataset. This two-stage process, known as pre-training and fine-tuning, has become the standard for achieving state-of-the-art performance in ASR.
The innovation that made large-scale pre-training practical for speech was self-supervised learning (SSL). Unlike supervised learning, which requires paired inputs and outputs (audio and its transcript), self-supervised learning generates its own labels directly from the input data. This allows models to learn rich, meaningful representations from large quantities of unlabeled audio, which is far more abundant than transcribed audio.
The general idea is to present the model with a modified version of an audio sample and train it to predict the original, unmodified version. By solving this manufactured problem, the model is forced to learn the underlying structure of human speech, such as phonetics, co-articulation, and prosody, without ever seeing a single text label.
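To make this idea concrete, here is a toy sketch of span masking over a sequence of frame-level features. It is not Wav2Vec 2.0's exact masking algorithm, and the tensor sizes, span length, and number of spans are made up for illustration:

```python
import torch

# Toy illustration of the masked-prediction objective: hide random spans of
# frame-level features and train the model to predict what was there.
torch.manual_seed(0)

features = torch.randn(1, 100, 768)    # (batch, time steps, feature dim), hypothetical sizes
mask = torch.zeros(1, 100, dtype=torch.bool)

span_length = 10                       # assumed span length for this sketch
for start in torch.randint(0, 100 - span_length, (5,)).tolist():
    mask[0, start:start + span_length] = True

masked_features = features.clone()
masked_features[mask] = 0.0            # replace masked frames with a placeholder

# During pre-training, the model sees `masked_features` and is trained so that
# its outputs at the masked positions match (or identify) the original content.
```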
One of the most influential self-supervised models for speech is Wav2Vec 2.0. It learns powerful speech representations directly from the raw audio waveform. Its architecture consists of three main parts:

1. A convolutional feature encoder that converts the raw waveform into a sequence of latent speech representations.
2. A Transformer context network that takes these representations and builds contextualized representations spanning the entire utterance.
3. A quantization module that discretizes the feature encoder's output into a finite set of speech units.

During pre-training, spans of the feature encoder's output are masked before reaching the Transformer, and the model is trained to identify the correct quantized speech unit for each masked position.
This process is analogous to masked language modeling in text models like BERT. By learning to fill in the blanks in audio, the Transformer context network becomes exceptionally good at understanding the relationships between different parts of a spoken utterance.
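To see these pieces in action, the short sketch below loads a publicly available pre-trained checkpoint with the Hugging Face transformers library and passes raw audio through the feature encoder and Transformer context network. The checkpoint name and the silent one-second waveform are just placeholders for illustration:

```python
import numpy as np
from transformers import AutoFeatureExtractor, Wav2Vec2Model

# "facebook/wav2vec2-base" is one publicly available pre-trained checkpoint.
checkpoint = "facebook/wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2Model.from_pretrained(checkpoint)

# One second of 16 kHz audio; in practice this would be a real waveform.
waveform = np.zeros(16000, dtype=np.float32)

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
outputs = model(**inputs)

# Contextualized representations from the Transformer:
# shape (batch, time steps, hidden size)
print(outputs.last_hidden_state.shape)
```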
Once the model has been pre-trained on thousands of hours of unlabeled audio, it can be adapted for a specific ASR task, such as transcribing English phone calls. This second stage is called fine-tuning.
The process is straightforward:
1. Start from the pre-trained Wav2Vec 2.0 model and its learned weights from the self-supervised phase. The quantization module is discarded.
2. Add a small, randomly initialized output layer on top of the Transformer context network that maps its representations to the target vocabulary, such as characters.
3. Train the model on the smaller labeled dataset, typically with a CTC loss, updating both the pre-trained weights and the new output layer.

The fine-tuning stage requires significantly less data and computation than training from scratch, yet it consistently produces superior results.
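A minimal sketch of this setup with the Hugging Face transformers library is shown below. The checkpoint names are examples, and the processor is assumed to define the character vocabulary for your labeled data; the upcoming hands-on section walks through the full workflow:

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Example processor whose tokenizer defines the output character vocabulary.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Load the self-supervised weights; the CTC output layer on top is new and
# randomly initialized.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    vocab_size=len(processor.tokenizer),
    pad_token_id=processor.tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)

# Optionally freeze the convolutional feature encoder so that only the
# Transformer and the new output layer are updated during fine-tuning.
model.freeze_feature_encoder()

# From here, training proceeds as usual: batches of (audio, transcript) pairs
# go through the processor, and model(input_values, labels=labels).loss gives
# the CTC loss to backpropagate.
```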
The two-phase workflow for building modern ASR systems. First, a model learns general speech features from unlabeled data. Second, this pre-trained model is adapted for a specific transcription task using a smaller set of labeled data.
While Wav2Vec 2.0 is a foundational model, the field has continued to evolve. Here are a couple of other important models to know:
- HuBERT (Hidden-Unit BERT): Like Wav2Vec 2.0, HuBERT uses a masked prediction task. Its main difference lies in how it generates the target labels for the masked steps: an offline clustering step first discovers discrete hidden units, making the learning target more consistent during training.
- Whisper: Whisper models represent a different approach. Instead of pure self-supervision, they are trained in a "weakly supervised" fashion on an enormous and diverse dataset of 680,000 hours of web audio that was already paired with text. Because this data covers many languages, topics, accents, and acoustic environments, Whisper models are exceptionally strong and perform well on a wide range of tasks without any fine-tuning. They are multilingual and multi-task, capable of both transcription and translation.

Using pre-trained models like Wav2Vec 2.0, HuBERT, or Whisper dramatically lowers the barrier to entry for building high-quality ASR systems. In the upcoming hands-on section, you will see just how effective this approach is as we fine-tune a pre-trained model from the Hugging Face Hub for a custom speech recognition task.
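Before moving on, here is a quick illustration of the zero-shot usage mentioned above, using the transformers pipeline API with a pre-trained Whisper checkpoint. The model size and audio file path are placeholders:

```python
from transformers import pipeline

# Zero-shot transcription with a pre-trained Whisper checkpoint; no fine-tuning
# involved. "openai/whisper-small" is one publicly available model size.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "audio.wav" is a placeholder path to a local recording.
result = asr("audio.wav")
print(result["text"])
```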