A digital audio signal, created by converting an analog sound wave into a sequence of numbers, raises a natural question: can this data be fed directly to a machine learning model? The short answer is that this is generally not done. Raw digital audio, while a complete representation of the sound, is not in a useful format for a speech recognition system.

## Why Not Use Raw Audio?

Feeding the raw sequence of audio samples to a model presents several significant problems:

- **High Dimensionality:** The sheer amount of data is a major challenge. A single second of audio sampled at 16 kHz is 16,000 numbers; a five-second clip is 80,000. Training a model on such long sequences is computationally intensive and makes it difficult for the model to learn meaningful patterns.
- **Irrelevant Information:** Raw audio contains a great deal of information that is not relevant to identifying the spoken words: steady background noise, the specific pitch of a speaker's voice (which can vary greatly), and other acoustic artifacts. Our goal is to isolate the signal, the speech, from this noise.
- **Lack of Consistency:** The raw waveform of a word can look dramatically different when spoken by two different people, or even by the same person at a different volume or with a different emotion. A model trained on these raw values would struggle to generalize and recognize the same word across these variations.

To solve these problems, we perform feature extraction. The goal is to transform the high-dimensional, noisy audio signal into a more compact, stable, and informative representation. This new representation is a set of features.

Think of it like summarizing a long movie. Instead of describing every single frame, you would describe the main characters, important plot events, and the setting. These are the "features" of the movie.
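The dimensionality problem is easy to see concretely. A minimal sketch (the random array is just a stand-in for real recorded audio):

```python
import numpy as np

# Hypothetical example: a 5-second clip at a 16 kHz sample rate.
sample_rate = 16_000  # samples per second
duration_s = 5

# Stand-in for real recorded audio: random samples in [-1, 1).
audio = np.random.uniform(-1.0, 1.0, size=sample_rate * duration_s)

print(audio.shape)  # (80000,) -- 80,000 numbers for just five seconds
```

A model consuming this directly would face an 80,000-dimensional input for every short utterance.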
In speech recognition, features are numerical values that describe the important acoustic properties of a small slice of audio, making the task for the machine learning model much more manageable.

```dot
digraph G {
  rankdir=TB;
  splines=ortho;
  node [shape=box, style="rounded,filled", fillcolor="#e9ecef", fontname="Arial"];
  edge [fontname="Arial"];
  "Raw Audio" [fillcolor="#a5d8ff"];
  "Features" [fillcolor="#96f2d7"];
  "Raw Audio" -> "Pre-processing" [label="Framing, Windowing"];
  "Pre-processing" -> "FFT" [label="Create Spectrum"];
  "FFT" -> "Spectrogram";
  "Spectrogram" -> "Mel Filterbank" [label="Apply Perceptual Scale"];
  "Mel Filterbank" -> "DCT" [label="Decorrelate & Compress"];
  "DCT" -> "Features" [label="MFCCs"];
}
```

The feature extraction pipeline transforms a raw audio signal into a compact set of features.

## From Spectrograms to Perceptual Features

In the previous section, we saw how a spectrogram visualizes frequency content over time. A spectrogram is a significant improvement over a raw waveform and is itself a type of feature representation: it moves us from analyzing simple amplitude to analyzing a rich frequency spectrum in which speech-related patterns start to become visible.

However, a standard spectrogram uses a linear frequency scale. For example, the distance between 100 Hz and 200 Hz is treated the same as the distance between 4000 Hz and 4100 Hz. Human hearing doesn't work this way: we are much more sensitive to changes in low-frequency sounds than in high-frequency sounds, and for speech, most of the information that distinguishes one phoneme from another is concentrated in these lower frequencies.

To build a better feature set, we need a representation that more closely mimics the properties of human hearing. This brings us to the Mel scale, a perceptual scale of pitches.
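A commonly used formula for the Hz-to-Mel mapping (one of several variants in the literature) is m = 2595 · log10(1 + f/700). A minimal sketch of the conversion:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to Mels (a common formula variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal 100 Hz steps shrink on the Mel scale as frequency grows:
print(hz_to_mel(200) - hz_to_mel(100))    # a large perceptual step at low frequencies
print(hz_to_mel(4100) - hz_to_mel(4000))  # a much smaller step at high frequencies
```

The same 100 Hz difference spans far fewer Mels at 4 kHz than at 100 Hz, which is exactly the compression of high frequencies we want.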
The Mel scale is designed so that sounds separated by an equal distance on the scale are also perceived by humans as being an equal distance apart in pitch.

By transforming our frequency information onto the Mel scale, we can give more weight to the frequency bands that matter most for understanding human speech. This is the central idea behind Mel-Frequency Cepstral Coefficients (MFCCs), one of the most successful and widely used features in speech recognition systems.

In the next section, we will walk through the exact steps for calculating these powerful features from the spectrogram.