Audio signals are typically split into frames, and a windowing function is applied to each, producing a series of short audio segments. The Fourier Transform can then be used to create a spectrogram: a visual representation of how the frequency content of a speech signal changes over time. While a spectrogram offers a rich representation, it includes a great deal of information that is not equally relevant for speech understanding. Moreover, its high dimensionality can make it computationally expensive for a machine learning model to process directly.

We need a way to extract the most significant characteristics from the audio signal while discarding redundant information. This is the goal of feature extraction. The most common and historically significant features for speech recognition are Mel-Frequency Cepstral Coefficients, or MFCCs. They are designed to represent audio in a way that is more closely aligned with how humans perceive sound.

## The Mel Scale: Emulating Human Hearing

A foundational aspect of MFCCs is the Mel scale, a perceptual scale of pitches judged by listeners to be equal in distance from one another. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. For example, the perceived difference between 100 Hz and 200 Hz is much greater than that between 10,000 Hz and 10,100 Hz, even though the absolute difference is the same.

The Mel scale remaps linear frequency (measured in Hertz) to reflect this property of human hearing. The formula to convert from frequency $f$ in Hertz to Mels is:

$$ m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) $$

This remapping gives more resolution to lower frequencies and less to higher ones, effectively focusing on the parts of the spectrum most relevant to speech.

*Figure: The relationship between linear frequency (Hz) and the perceptual Mel scale. Notice how the Mel scale rises steeply at lower frequencies and then flattens out, showing that perceptual pitch changes are more sensitive in that lower range.*
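To make the warping concrete, here is a minimal NumPy sketch of the conversion; the function names `hz_to_mel` and `mel_to_hz` are illustrative, not from any particular library:

```python
import numpy as np

def hz_to_mel(f):
    """Map frequency in Hz onto the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Invert the mapping: Mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 100 Hz step shrinks dramatically on the Mel scale at high frequencies:
print(hz_to_mel(200) - hz_to_mel(100))      # ~133 Mels
print(hz_to_mel(10100) - hz_to_mel(10000))  # ~10 Mels
```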
## Calculating MFCCs: A Step-by-Step Process

Creating MFCCs involves a sequence of transformations applied to each audio frame. Each step is designed to progressively isolate and compress the information that defines the sounds of speech.

*Figure: The pipeline for calculating Mel-Frequency Cepstral Coefficients from a single audio frame: Audio Frame → Fourier Transform (power spectrum) → Mel filter bank → logarithm → Discrete Cosine Transform (DCT) → MFCCs.*

Let's walk through what happens at each stage of this pipeline.

### 1. Compute the Power Spectrum

For each windowed frame of audio, we perform a Fast Fourier Transform (FFT) to get its frequency spectrum. We then compute the power spectrum by squaring the magnitude of the complex numbers from the FFT. This gives us a measure of the energy present at each frequency band for that specific frame, the same information used to generate one time slice of a spectrogram.

### 2. Apply the Mel Filter Bank

This is where the Mel scale comes into play. We create a Mel filter bank, a set of 20 to 40 triangular filters. These filters are narrow and closely spaced at low frequencies, and wider and more spread out at high frequencies, matching the Mel scale.

*Figure: A Mel filter bank with triangular filters. Notice how the filters are narrower and more crowded at the lower frequencies and become wider at higher frequencies.*

We multiply the power spectrum by each of these triangular filters and sum the energy within each band. The result is a list of numbers, one per filter, representing the amount of energy in different regions of the perceptually scaled spectrum. This step effectively reduces the dimensionality of our data from hundreds or thousands of frequency bins to just 20 to 40 filter bank energies, as the sketch below illustrates.
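The first two steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the 16 kHz sample rate, 512-point FFT, and 26 filters are assumed values, and a random frame stands in for a real windowed frame:

```python
import numpy as np

sample_rate = 16000  # assumed sampling rate (Hz)
n_fft = 512          # FFT length; a 25 ms frame at 16 kHz is 400 samples, zero-padded
n_filters = 26       # a typical choice in the 20-40 range

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Step 1: power spectrum of one windowed frame.
frame = np.random.randn(400) * np.hamming(400)  # stand-in for a real 25 ms frame
spectrum = np.fft.rfft(frame, n=n_fft)          # complex spectrum, n_fft//2 + 1 bins
power = np.abs(spectrum) ** 2 / n_fft           # energy in each frequency bin

# Step 2: triangular filters whose centres are spaced evenly in Mels, not Hz.
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

fbank = np.zeros((n_filters, n_fft // 2 + 1))
for i in range(n_filters):
    left, centre, right = bin_points[i], bin_points[i + 1], bin_points[i + 2]
    fbank[i, left:centre] = (np.arange(left, centre) - left) / (centre - left)
    fbank[i, centre:right] = (right - np.arange(centre, right)) / (right - centre)

# Multiply the power spectrum by each filter and sum: one energy value per filter.
filter_energies = fbank @ power  # shape: (n_filters,)
```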
### 3. Take the Logarithm of the Filter Bank Energies

After applying the filter bank, we take the logarithm of each filter bank energy. This step also relates to human perception, as our response to loudness is roughly logarithmic, not linear. It helps to compress the dynamic range of the values, making the features less sensitive to variations in overall signal energy.

### 4. Take the Discrete Cosine Transform (DCT)

The final major step is to compute the Discrete Cosine Transform (DCT) of the log filter bank energies. Because neighboring filters overlap, these log energies are highly correlated with one another. The DCT decorrelates them, separating the values into components that are more nearly independent.

This process is very effective at concentrating the most important information into the first few coefficients. Think of it like image compression, where a complex image is represented by a smaller, more efficient set of values. The resulting coefficients are the Mel-Frequency Cepstral Coefficients.

## The Final Feature Vector

Typically, we keep only the first 12 or 13 DCT coefficients. The higher coefficients represent very rapid changes in the filter bank energies, which are often less informative for speech and can be sensitive to noise. The 0th coefficient, which reflects the overall energy of the frame, is sometimes discarded or treated separately.

After this entire process, each 25 ms audio frame is transformed from hundreds or thousands of raw sample points into a small vector of just 12 or 13 numbers. Stacking these vectors from all the frames in an audio clip gives a feature matrix. This matrix, in which each row corresponds to a frame in time and each column to a coefficient, is the final representation we feed into our acoustic model: a compact and perceptually relevant summary of the original audio signal.
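The last two steps complete the pipeline. The sketch below uses a stand-in vector of filter bank energies so it runs on its own; the 26-filter and 13-coefficient choices are assumptions matching the typical values above:

```python
import numpy as np
from scipy.fft import dct

# Stand-in for the 26 filter bank energies produced in step 2.
rng = np.random.default_rng(0)
filter_energies = rng.random(26) + 1e-8  # strictly positive energies

# Step 3: logarithmic compression of the energies.
log_energies = np.log(filter_energies)

# Step 4: DCT to decorrelate, keeping only the first 13 coefficients.
mfccs = dct(log_energies, type=2, norm="ortho")[:13]
print(mfccs.shape)  # (13,) - the MFCC vector for this single frame
```

In practice, libraries such as librosa bundle the entire pipeline behind a single call, e.g. `librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)`, which returns one such coefficient vector per frame.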