Advanced Speech Recognition and Synthesis
Chapter 1: Foundations of Modern Speech Processing Pipelines
Advanced Audio Feature Extraction
Statistical Modeling Review for Speech
Deep Learning Architectures for Sequences
Components of ASR Systems
Components of TTS Systems
Evaluation Metrics Revisited
Chapter 2: Advanced Acoustic Modeling for ASR
Connectionist Temporal Classification (CTC)
Attention-Based Encoder-Decoder Models
Transformer Architectures for ASR
Advanced Training Techniques
Decoding Algorithms Comparison
Hands-on Practical: Building an End-to-End ASR Model
Chapter 3: Language Modeling and Adaptation in ASR
Neural Language Models for ASR
Shallow Fusion and Deep Fusion
Speaker Adaptation Techniques
Environment and Channel Adaptation
Unsupervised and Semi-Supervised Learning for ASR
Multi-Lingual and Cross-Lingual ASR
Practice: Fine-tuning ASR with Adaptation Data
Chapter 4: Advanced Text-to-Speech Synthesis
Autoregressive Acoustic Models (Tacotron, Transformer TTS)
Non-Autoregressive Acoustic Models (FastSpeech, ParaNet)
Flow-Based Models for TTS
Generative Adversarial Networks (GANs) in TTS
Prosody Modeling and Control
Expressive Speech Synthesis
Voice Cloning and Conversion
Hands-on Practical: Training an Advanced TTS Model
Chapter 5: Neural Vocoders and Waveform Generation
Limitations of Traditional Vocoders
Autoregressive Waveform Models (WaveNet, WaveRNN)
Flow-Based Vocoders (WaveGlow, FloWaveNet)
GAN-Based Vocoders (MelGAN, HiFi-GAN)
Diffusion Models for Vocoding
Conditioning Neural Vocoders
Evaluation of Synthesized Audio Quality
Hands-on Practical: Using a Neural Vocoder
Chapter 6: Optimization, Deployment, and Toolkits
Model Quantization for Speech Models
Model Pruning and Sparsification
Knowledge Distillation for ASR/TTS
Optimized Inference Engines (ONNX Runtime, TensorRT)
Deployment Considerations for Streaming ASR
Deployment Considerations for Real-Time TTS
Overview of Speech Processing Toolkits (ESPnet, NeMo, Coqui)
Practice: Optimizing a Speech Model