For many years, Mel Frequency Cepstral Coefficients (MFCCs) were the industry standard for representing audio in speech recognition systems. Their design is directly inspired by how humans perceive sound, making them a powerful and compact way to represent the parts of speech that are most important for distinguishing between different phonemes. Even with the rise of newer feature types, understanding MFCCs is essential for anyone working in speech technology, as they provide a strong baseline and illuminate the core principles of feature engineering for audio.
The primary idea behind MFCCs is to transform a signal's frequency representation to better match the non-linear response of the human ear. Our hearing is more sensitive to changes in lower frequencies than in higher ones. For example, the perceived difference between 100 Hz and 200 Hz is much greater than the difference between 10,000 Hz and 10,100 Hz, even though the absolute difference is the same. The Mel scale is a perceptual scale of pitches that formalizes this observation.
At a high level, generating MFCCs involves a multi-step process that takes a raw audio signal and converts it into a sequence of feature vectors. Each step in this pipeline is designed to extract and refine the phonetically relevant information while discarding noise and other irrelevant details like the speaker's fundamental frequency (pitch).
The following diagram illustrates the standard workflow for computing MFCCs.
The process of converting a raw audio signal into Mel Frequency Cepstral Coefficients. Each step refines the representation, moving from the time domain to a compact set of features.
We will explore the implementation details of each step in the next section, but let's first build an intuition for the two most important transformations in this pipeline: the Mel scale and the cepstrum.
The Mel scale converts physical frequencies, measured in Hertz (Hz), into a scale that aligns with human auditory perception. The conversion formula is:

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

Here, $f$ is the physical frequency in Hz and $m$ is the perceived frequency in Mels. This logarithmic relationship spreads out frequencies below 1000 Hz while compressing frequencies above 1000 Hz.
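To make the formula concrete, here is a minimal Python sketch of the conversion (the helper names hz_to_mel and mel_to_hz are our own); it also reproduces the comparison from earlier: a 100 Hz step at the low end of the spectrum spans far more Mels than the same step near 10 kHz.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m_mel):
    """Invert the Mel-scale conversion back to Hz."""
    return 700.0 * (10.0 ** (m_mel / 2595.0) - 1.0)

# The 100 Hz -> 200 Hz step covers far more Mels than the
# equally sized 10,000 Hz -> 10,100 Hz step.
print(hz_to_mel(200) - hz_to_mel(100))        # ~133 Mels
print(hz_to_mel(10100) - hz_to_mel(10000))    # ~10 Mels
```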
To apply this to our signal, we use a set of triangular filters, known as a Mel filterbank, spread across the frequency spectrum. These filters are narrow and closely spaced at low frequencies but become wider and more spread out at higher frequencies, effectively mimicking the response of the human cochlea.
The relationship between the linear Hertz scale and the perceptual Mel scale. The Mel scale's curve shows how it dedicates more resolution to lower frequencies, mirroring human hearing.
When we pass the power spectrum of a speech frame through this filterbank, we get a vector of energy values, one for each filter. Taking the logarithm of these energies gives the log-mel spectrogram, which is itself a popular feature for ASR and the precursor to MFCCs.
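As a concrete illustration, here is a minimal sketch, assuming NumPy and librosa are available, that builds a 26-filter Mel filterbank and applies it to the power spectrum of a single 25 ms frame. The sample rate, FFT size, filter count, and the synthetic sine frame are illustrative choices, not fixed requirements.

```python
import numpy as np
import librosa

sr = 16000        # sample rate in Hz (illustrative)
n_fft = 512       # FFT size
n_mels = 26       # number of triangular Mel filters

# One 25 ms frame of a synthetic tone stands in for real speech.
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 440 * t)

# Power spectrum of the windowed frame (positive frequencies only).
windowed = frame * np.hamming(len(frame))
power_spec = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2   # shape: (n_fft//2 + 1,)

# Triangular Mel filterbank: narrow filters at low frequencies,
# wider filters at high frequencies.
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft//2 + 1)

# Filterbank energies, then the log to compress dynamic range.
mel_energies = mel_fb @ power_spec          # one value per filter
log_mel = np.log(mel_energies + 1e-10)      # small epsilon avoids log(0)
print(log_mel.shape)                        # (26,)
```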
After applying the Mel filterbank and taking the logarithm, we have a smoothed representation of the spectral envelope. However, the individual filterbank energies are highly correlated with one another because neighboring filters overlap. The final step in the MFCC pipeline, the Discrete Cosine Transform (DCT), decorrelates these values and compresses the information into a few essential coefficients.
Applying the DCT to the log-mel spectrogram is a mathematical trick that brings us into the cepstral domain. The term "cepstrum" was coined by reversing the first four letters of "spectrum," and it is calculated by taking the Fourier Transform (or, in this case, the closely related DCT) of the log-spectrum. The result has a useful property: it separates the characteristics of the source (the vibrating vocal cords, which create pitch) from those of the filter (the shape of the vocal tract, which creates phonemes).
The lower-order DCT coefficients represent the slow-changing shape of the spectral envelope, which corresponds to the identity of the phoneme (e.g., /a/, /t/, /sh/). The higher-order coefficients represent the fast-changing details, which are often related to pitch and excitation. For speech recognition, the phonetic information is what matters most.
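A minimal sketch of this step, assuming NumPy and SciPy are available; the log_mel vector here is random stand-in data with the same shape as the 26 filterbank outputs from the previous sketch, and keeping 13 coefficients is one common convention.

```python
import numpy as np
from scipy.fft import dct

# Stand-in log-mel energies for one frame (26 filters), as produced above.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=26)

# The type-II DCT with orthonormal scaling is the conventional choice here.
cepstrum = dct(log_mel, type=2, norm='ortho')

# Keep the low-order coefficients (spectral envelope, phoneme identity)
# and drop the high-order ones (pitch and other fine detail).
mfcc = cepstrum[:13]
print(mfcc.shape)   # (13,)
```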
Therefore, we typically keep only the first 13 to 40 coefficients and discard the rest. The resulting vector is the MFCC feature vector for one frame of audio. Stacking these vectors across all frames produces the final feature matrix that serves as the input to our acoustic model. In summary, MFCCs are effective because they provide a compact, decorrelated, and perceptually meaningful representation of the phonetic content of speech.
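In practice, most audio libraries wrap the whole pipeline in a single call. The sketch below uses librosa.feature.mfcc on a one-second synthetic tone as stand-in audio; the frame parameters are illustrative, and with real speech you would pass the loaded waveform instead.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s stand-in signal

# Frame the signal, compute the log-mel spectrogram, and apply the DCT,
# keeping the first 13 cepstral coefficients per frame.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)

# One column per frame: this matrix is the input to the acoustic model.
print(mfccs.shape)   # (13, number_of_frames)
```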