Raw audio waveforms, represented as a series of amplitude values over time, are high-dimensional and contain much information that is not directly useful for speech recognition. Feeding this raw data, with its tens of thousands of samples per second, directly into a neural network is computationally inefficient and can obscure the phonetic patterns the model needs to learn. This is why feature extraction is a standard step in an ASR pipeline. The goal is to condense the raw audio into a lower-dimensional, more informative representation.
In this chapter, we will implement the techniques used to create these features. You will learn to build two widely used representations: Mel Frequency Cepstral Coefficients (MFCCs) and log-mel spectrograms. We will go through the step-by-step process for calculating each, compare their characteristics, and discuss why one might be preferred over the other for modern deep learning models. We will also cover normalization techniques, such as Cepstral Mean and Variance Normalization (CMVN), which make features more consistent across recordings. The chapter concludes with a practical exercise where you will write code to process an entire audio dataset into a feature set ready for training.
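As a preview of what the chapter builds up step by step, the sketch below computes all three representations with the librosa library. It is a minimal example, not the chapter's reference implementation: the file path `speech.wav`, the 25 ms / 10 ms window settings, the 80 mel filters, and the 13 MFCC coefficients are illustrative assumptions, chosen because they are common defaults in ASR front ends.

```python
import numpy as np
import librosa

# Assumed input: a mono speech recording at a hypothetical path,
# resampled to 16 kHz on load.
y, sr = librosa.load("speech.wav", sr=16000)

# Log-mel spectrogram: 25 ms windows (400 samples), 10 ms hop
# (160 samples), 80 mel filters -- common choices for deep learning ASR.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)  # shape: (80, num_frames)

# MFCCs: keep the first 13 cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mfcc=13
)  # shape: (13, num_frames)

def cmvn(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance CMVN: zero mean, unit variance along the time axis."""
    mean = feats.mean(axis=1, keepdims=True)
    std = feats.std(axis=1, keepdims=True)
    return (feats - mean) / (std + eps)

log_mel_norm = cmvn(log_mel)
mfcc_norm = cmvn(mfcc)
```

Each output is a matrix of shape (num_features, num_frames), so a one-second utterance at a 10 ms hop yields roughly 100 frames, a far more compact input than 16,000 raw samples. Sections 2.3 through 2.5 unpack what each of these calls does internally.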
2.1 The Role of Feature Extraction in ASR
2.2 Mel Frequency Cepstral Coefficients (MFCCs)
2.3 Calculating MFCCs Step-by-Step
2.4 Filter Banks and Log-Mel Spectrograms
2.5 Feature Normalization Techniques
2.6 Comparing MFCCs and Spectrograms as Input Features
2.7 Practice: Extracting and Normalizing Features from a Dataset