This chapter establishes the groundwork for building and understanding advanced speech recognition and synthesis systems. We begin by examining audio feature extraction techniques beyond standard Mel-frequency cepstral coefficients (MFCCs), considering learned representations and filter banks. We then revisit key statistical modeling concepts and deep learning architectures for sequence data, including RNNs, LSTMs, and Transformers, and analyze how each applies to speech processing. Next, we dissect the components of modern automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, detailing their individual functions and interactions. Finally, we refine our understanding of evaluation metrics, moving beyond basic word error rate (WER) and mean opinion score (MOS) to prepare for assessing the sophisticated models covered later in this course.
1.1 Advanced Audio Feature Extraction
1.2 Statistical Modeling Review for Speech
1.3 Deep Learning Architectures for Sequences
1.4 Components of ASR Systems
1.5 Components of TTS Systems
1.6 Evaluation Metrics Revisited
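As a small preview of the metrics revisited in Section 1.6: WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a hypothesis, normalized by the reference length. A minimal sketch, assuming a simple whitespace tokenization; the `wer` function below is our own illustration, not a library API:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance.

    WER = (substitutions + deletions + insertions) / reference word count.
    Assumes a non-empty reference and whitespace tokenization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions to turn ref[:i] into nothing
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions to turn nothing into hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over 6 reference words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason later chapters also consider complementary metrics.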
© 2025 ApX Machine Learning