Mel Frequency Cepstral Coefficients (MFCCs) are a foundational feature in speech recognition, engineered to represent audio in a way that accentuates the characteristics of human speech. The calculation process involves several signal processing stages, each designed to transform the raw audio signal into a set of compact and informative coefficients.
Figure: The complete pipeline for generating MFCCs from a raw audio signal.
The first step is to apply a pre-emphasis filter to the audio signal. The spectrum of human speech is naturally biased; it has more energy at lower frequencies than at higher frequencies. This can be problematic for models that are sensitive to the dynamic range of the input. Pre-emphasis is a high-pass filter that boosts the energy in higher frequencies.
This serves two main purposes: it balances the frequency spectrum, making it less tilted, and it can improve the signal-to-noise ratio (SNR) by amplifying important high-frequency formants. The filter is typically implemented as a first-order difference equation:
$$y[n] = x[n] - \alpha \, x[n-1]$$
Here, $x[n]$ is a sample from the input signal, $y[n]$ is the output sample, and $\alpha$ is the filter coefficient. A typical value for $\alpha$ is between 0.95 and 0.97.
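As a concrete sketch, the filter takes only a few lines of NumPy. The array `signal` here is a stand-in for loaded audio, and the coefficient value is just a common choice:

```python
import numpy as np

# Stand-in audio: one second of noise at 16 kHz; in practice this is loaded speech.
signal = np.random.randn(16000).astype(np.float32)

alpha = 0.97  # common pre-emphasis coefficient
# y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged.
emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
```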
Speech is a non-stationary signal, meaning its statistical properties change over time. However, over very short durations (e.g., 20-30 milliseconds), the signal can be considered quasi-stationary. Therefore, we split the pre-emphasized signal into short, overlapping frames.
A standard configuration is to use a frame size of 25 ms with a stride, or hop length, of 10 ms. This means each frame is 25 ms long, and we advance by 10 ms to create the next frame, resulting in a 15 ms overlap. This overlap ensures that we don't lose information at the edges of each frame when we later apply a windowing function.
Figure: The continuous audio signal is segmented into overlapping frames.
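A minimal framing sketch under the configuration above, assuming a 16 kHz sample rate (so 25 ms is 400 samples and 10 ms is 160 samples):

```python
import numpy as np

sample_rate = 16000
frame_len = int(0.025 * sample_rate)  # 25 ms -> 400 samples
hop_len = int(0.010 * sample_rate)    # 10 ms -> 160 samples

emphasized = np.random.randn(sample_rate).astype(np.float32)  # stand-in signal

num_frames = 1 + (len(emphasized) - frame_len) // hop_len
frames = np.stack([
    emphasized[i * hop_len : i * hop_len + frame_len]
    for i in range(num_frames)
])  # shape: (num_frames, frame_len)
```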
After framing, a window function is applied to each frame. Extracting a finite frame from the signal creates sharp discontinuities at its edges. Taking a Fourier Transform of such a truncated block smears energy across neighboring frequency bins, a phenomenon known as spectral leakage.
To minimize this, we multiply each frame by a window function, such as the Hamming window. This function tapers the frame to zero at the beginning and end, smoothing the signal and reducing the edge discontinuities.
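Applying a Hamming window is a single elementwise multiply; `frames` is assumed to come from the framing step above:

```python
import numpy as np

frames = np.random.randn(98, 400)  # stand-in frames from the previous step

window = np.hamming(frames.shape[1])  # one weight per sample in a frame
windowed = frames * window            # tapers each frame toward zero at its edges
```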
With each frame windowed, we can now convert it from the time domain to the frequency domain. We do this by applying a Fast Fourier Transform (FFT) to each frame. The FFT computes the discrete Fourier transform and provides the frequency spectrum of the frame.
The output of the FFT is a set of complex numbers. For ASR, we are primarily interested in the magnitude of these frequencies, so we compute the power spectrum, often calculated as the squared magnitude of the FFT output:
$$P = \frac{\left| \mathrm{FFT}(x_{\text{frame}}) \right|^2}{N}$$
where $N$ is the length of the FFT. The result is an array of energy values for a range of frequency bins.
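A sketch of this computation with NumPy's real FFT; the FFT size of 512 is an assumption (a common choice for 400-sample frames, which are zero-padded up to it):

```python
import numpy as np

windowed = np.random.randn(98, 400)  # stand-in windowed frames

n_fft = 512
# rfft keeps only the non-redundant half of the spectrum: n_fft // 2 + 1 bins.
spectrum = np.fft.rfft(windowed, n=n_fft)
power = (np.abs(spectrum) ** 2) / n_fft  # shape: (num_frames, 257)
```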
The linear frequency scale from the FFT does not align with how humans perceive sound. We are much better at distinguishing between small changes in low frequencies than high frequencies. The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
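One commonly cited formula for converting a frequency $f$ in hertz to mels (several variants exist in the literature) is:

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$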
To map the power spectrum to the Mel scale, we use a set of triangular filters, known as a Mel filter bank. This bank typically consists of 20-40 overlapping filters. Each filter is narrow at low frequencies and wider at high frequencies, reflecting the non-linear nature of human hearing. We multiply the power spectrum by each triangular filter and sum the energy to get the filter bank energy for that specific Mel-scaled band.
Figure: A small set of triangular filters in a Mel filter bank. Note how the filters become wider and more spread out at higher frequencies.
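librosa provides a constructor for exactly this kind of filter bank; the sketch below applies it to the power spectrum from the previous step (the parameter values are illustrative assumptions):

```python
import numpy as np
import librosa

sample_rate = 16000
n_fft = 512
power = np.random.rand(98, n_fft // 2 + 1)  # stand-in power spectrum

# 40 triangular Mel filters; shape: (n_mels, n_fft // 2 + 1).
mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=40)

# Each band's energy is the filter-weighted sum over the power spectrum.
fbank_energies = power @ mel_fb.T  # shape: (num_frames, 40)
```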
The output of the previous step is a set of filter bank energies. Just as with frequency, human perception of loudness is logarithmic. To mimic this, we take the logarithm of all the filter bank energies.
This step has a useful side-effect: it helps to compress the dynamic range of the features, making them less sensitive to variations in signal amplitude. The result at this point is a log-mel spectrogram, which is a powerful feature in its own right and is often used directly as input for modern deep learning models.
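In code this is a single elementwise log; adding a small epsilon is a common guard against taking the log of zero-energy bands (a sketch):

```python
import numpy as np

fbank_energies = np.random.rand(98, 40)  # stand-in filter bank energies

log_fbank = np.log(fbank_energies + 1e-10)  # the log-mel spectrogram
```

librosa's `power_to_db` function performs an equivalent compression on a decibel scale.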
The final step in calculating MFCCs is to apply the Discrete Cosine Transform (DCT) to the log filter bank energies. The filter bank energies are often highly correlated with each other because the triangular filters overlap. The DCT is a mathematical operation that decorrelates these energies, similar to how a Principal Component Analysis (PCA) works.
The result of the DCT is a set of coefficients where most of the signal's information is concentrated in the first few components.
$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left( \frac{\pi i (j - 0.5)}{N} \right)$$
where $c_i$ is the $i$-th MFCC, $m_j$ is the log energy from the $j$-th filter bank, and $N$ is the total number of filter banks.
We typically keep only a small number of these coefficients, for example the first 13. The very first coefficient mainly reflects the overall log energy of the frame rather than its spectral shape, so it is often discarded or replaced with a separate energy feature. This reduction acts as a form of compression, retaining the most useful spectral information in a compact vector. The resulting set of coefficients for each frame are the Mel Frequency Cepstral Coefficients.
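A sketch of the DCT and truncation using SciPy; with `norm='ortho'` the DCT-II scaling matches the formula above for the higher-order coefficients (the zeroth is scaled slightly differently):

```python
import numpy as np
from scipy.fftpack import dct

log_fbank = np.random.rand(98, 40)  # stand-in log filter bank energies

# DCT-II along the filter axis decorrelates the 40 log energies per frame.
coeffs = dct(log_fbank, type=2, axis=1, norm='ortho')

mfccs = coeffs[:, :13]  # keep the first 13 coefficients per frame
```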
In practice, you will rarely implement these steps from scratch. Libraries like librosa in Python provide highly optimized functions to perform this entire calculation with a single line of code. For example:
```python
import librosa

# Assume 'y' is the audio time series and 'sr' is the sample rate,
# e.g. obtained from: y, sr = librosa.load("speech.wav")
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```
Understanding this step-by-step process, however, is important for making informed decisions about feature engineering and for diagnosing issues in an ASR pipeline. Each step from pre-emphasis to the final DCT plays a part in shaping the features that your acoustic model will ultimately learn from.