Applied Speech Recognition
Chapter 1: Foundations of Digital Audio and Speech
Introduction to Automatic Speech Recognition Systems
Properties of Human Speech: Phonemes and Allophones
Digital Audio Signals: Sampling, Quantization, and Encoding
Working with Audio Data in Python using Librosa
Time and Frequency Domain Analysis
Introduction to Spectrograms for Speech Visualization
Hands-on Practical: Loading and Visualizing Audio Waveforms
Chapter 2: Feature Extraction for Speech Recognition
The Role of Feature Extraction in ASR
Mel-Frequency Cepstral Coefficients (MFCCs)
Calculating MFCCs Step-by-Step
Filter Banks and Log-Mel Spectrograms
Feature Normalization Techniques
Comparing MFCCs and Spectrograms as Input Features
Practice: Extracting and Normalizing Features from a Dataset
Chapter 3: Acoustic Modeling with Deep Neural Networks
Overview of Acoustic Models in ASR
Building Acoustic Models with Recurrent Neural Networks
Addressing Sequential Challenges with LSTMs and GRUs
Connectionist Temporal Classification (CTC) Loss
Implementing a CTC-based ASR Model
Hands-on Practical: Training a Simple LSTM Acoustic Model with CTC
Chapter 4: Advanced Acoustic Models and Architectures
Attention Mechanisms for Speech Recognition
Sequence-to-Sequence (Seq2Seq) Models for ASR
Listen, Attend and Spell (LAS) Architecture
Introduction to Transformer Models for ASR
Conformer: Combining CNNs and Transformers
An Overview of Pre-trained ASR Models
Practice: Fine-tuning a Pre-trained ASR Model
Chapter 5: Language Modeling and Decoding
The Function of Language Models in ASR
Building an N-gram Model with KenLM
Decoding Graphs for Model Integration
Decoding Algorithms: Greedy Search vs Beam Search
Implementing Beam Search with a Language Model
Hands-on Practical: Integrating a Language Model into a CTC Decoder
Chapter 6: Evaluating and Deploying ASR Systems
Metrics for ASR Performance: WER and CER
Calculating Word Error Rate
Common Data Augmentation Techniques for Speech
Using Hugging Face Pipelines for ASR
Building a Speech-to-Text Application with Gradio
Considerations for Real-time Streaming ASR
Practice: Evaluating and Building a Demo Application