While traditional Mel-Frequency Cepstral Coefficients (MFCCs) have served as a workhorse for decades in speech processing, modern deep learning models often benefit from richer, less processed input representations. As we move towards more sophisticated architectures, understanding advanced feature extraction techniques becomes essential for maximizing performance. These methods aim to either retain more information from the original signal or allow the model itself to learn the most effective representations directly from the data.
Before applying the Discrete Cosine Transform (DCT) to produce MFCCs, the process involves calculating the energy within a set of overlapping triangular filters applied to the power spectrum. These filters are spaced according to the Mel scale, which approximates human auditory perception.
The intermediate output, typically the logarithm of these filter bank energies (often called log-Mel spectrograms, Mel-frequency spectral coefficients, or FBank features), has become a standard input for many contemporary deep learning systems for ASR and TTS.
The calculation steps are:

1. Compute the short-time power spectrum $|S(f)|^2$ for each frame.
2. Weight the power spectrum by each triangular Mel filter and sum over frequency: $E_k = \sum_f |S(f)|^2 \, M_k(f)$.
3. Take the logarithm of each filter output: $\mathrm{FBank}_k = \log E_k$.

Here, $S(f)$ is the spectrum magnitude at frequency $f$, and $M_k(f)$ is the response of the $k$-th Mel filter at frequency $f$.
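As a concrete illustration, here is a minimal sketch of these steps using librosa. The file name, the 80-filter bank, and the 25 ms window / 10 ms hop framing are placeholder choices, not requirements.

```python
import numpy as np
import librosa

# Load speech at 16 kHz (a common rate for ASR); "speech.wav" is a placeholder path
y, sr = librosa.load("speech.wav", sr=16000)

# STFT -> power spectrum -> Mel filter bank: 80 filters, 25 ms windows, 10 ms hop
mel_energies = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80, power=2.0
)

# Log compression; the small floor avoids log(0) on silent frames
log_mel = np.log(mel_energies + 1e-6)
print(log_mel.shape)  # (80, num_frames)
```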
Why use log-Mel energies instead of MFCCs?

The final DCT step mainly serves to decorrelate the features, which mattered for GMM-HMM systems that modeled features with diagonal covariance matrices. Neural networks handle correlated inputs well, and truncating the cepstral coefficients discards information, so skipping the DCT preserves more of the spectral structure for the model to exploit (convolutional layers in particular benefit from the local correlations across neighboring filters). Using log-Mel filter bank energies (commonly 40 or 80 filters) provides a good balance between dimensionality reduction and information preservation, serving as a strong baseline for many advanced models.
Overlapping triangular filters spaced on the Mel scale. Higher frequency filters typically have wider bandwidths.
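To make the relationship to MFCCs concrete, the short sketch below applies a DCT across the filter axis of a hypothetical log-Mel array (random values standing in for real features); this is exactly the step that deep models typically omit.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical log-Mel array: 80 filters x 200 frames (stands in for real features)
log_mel = np.random.randn(80, 200)

# MFCCs are a DCT across the filter axis, keeping only the first few coefficients;
# deep models usually skip this and consume log_mel directly
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]
print(mfcc.shape)  # (13, 200)
```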
A significant advancement is the concept of learned features, where the feature extraction process itself is integrated into the neural network and optimized during training. Instead of relying on fixed, predefined filter banks like the Mel scale, the network learns the optimal filters for the specific task and dataset.
One popular approach uses 1D convolutional layers applied directly to the raw audio waveform or a minimally processed version. These initial layers act as a learnable filter bank.
SincNet: This architecture parameterizes the filters in the first convolutional layer using sinc functions. A sinc function, $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$, corresponds to an ideal rectangular filter in the frequency domain. SincNet learns the low and high cutoff frequencies for each filter, effectively learning a bank of band-pass filters. This parameterization is efficient (only two parameters per filter) and encourages the learning of meaningful, interpretable filters.

$$g[n; f_1, f_2] = 2 f_2 \,\mathrm{sinc}(2 f_2 n) \;-\; 2 f_1 \,\mathrm{sinc}(2 f_1 n)$$

Here $g[n]$ contains the filter taps in the time domain, and $f_1$ and $f_2$ are the learned lower and upper cutoff frequencies, expressed as fractions of the sampling rate.
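The following is a minimal PyTorch sketch of this band-pass parameterization, not the reference SincNet implementation: the filter count, kernel size, initialization, and Hamming-window tapering are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincFilterBank(nn.Module):
    """Sketch of a SincNet-style first layer: each band-pass filter is defined
    by just two learnable parameters, a lower cutoff and a bandwidth."""
    def __init__(self, num_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Illustrative initialization: lower cutoffs spread up toward Nyquist, 100 Hz bands
        self.low_hz = nn.Parameter(torch.linspace(30.0, sample_rate / 2 - 200.0, num_filters))
        self.band_hz = nn.Parameter(torch.full((num_filters,), 100.0))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):  # x: (batch, 1, time)
        # Normalized cutoff frequencies f1 < f2 (fractions of the sampling rate)
        f1 = torch.abs(self.low_hz) / self.sample_rate
        f2 = f1 + torch.abs(self.band_hz) / self.sample_rate
        n = self.n.unsqueeze(0)  # (1, kernel_size)
        # g[n] = 2*f2*sinc(2*f2*n) - 2*f1*sinc(2*f1*n), using the normalized sinc
        g = (2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * n)
             - 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * n))
        g = g * self.window  # taper the band edges
        return F.conv1d(x, g.unsqueeze(1), padding=self.kernel_size // 2)

# One second of 16 kHz audio -> 40 learned band-pass channels
wav = torch.randn(2, 1, 16000)
print(SincFilterBank()(wav).shape)  # torch.Size([2, 40, 16000])
```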
LEAF (Learnable Frontend): An evolution of this idea, LEAF uses learnable Gabor band-pass filters, followed by learnable Gaussian low-pass pooling and learnable per-channel energy normalization (PCEN) for compression. It offers more flexibility than the strict rectangular band-pass shapes enforced by SincNet and has shown strong performance.
These learnable front-ends replace the fixed STFT and Mel filter bank stages. The output of this initial layer (after pooling and activation, e.g., log compression) is then fed into the subsequent layers of the main acoustic model (like LSTMs or Transformers).
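A simplified sketch of such a pipeline is shown below, with a plain 1D convolution standing in for the learnable filter bank (a SincNet or LEAF layer would slot into the same position); the pooling sizes and LSTM dimensions are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# Learnable front-end followed by pooling, log compression, and a recurrent encoder
frontend = nn.Conv1d(in_channels=1, out_channels=40, kernel_size=251, padding=125)
pool = nn.AvgPool1d(kernel_size=400, stride=160)   # ~25 ms frames, 10 ms hop at 16 kHz
encoder = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)

wav = torch.randn(2, 1, 16000)                     # placeholder batch of raw waveforms
x = frontend(wav).abs()                            # rectification
x = torch.log(pool(x) + 1e-6)                      # frame-level energies, log-compressed
x = x.transpose(1, 2)                              # (batch, frames, channels)
out, _ = encoder(x)
print(out.shape)                                   # torch.Size([2, 98, 256])
```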
Advantages: the filters are optimized jointly with the rest of the network, so they can adapt to the specific task, language, and acoustic conditions instead of relying on a fixed perceptual scale.

Considerations: learned front-ends generally need more training data and compute to reach their potential, add parameters to the model, and can be harder to interpret or transfer than standard log-Mel features.
While Mel-scale features are dominant, other representations based on human auditory perception exist, such as Gammatone filter bank energies, Bark-scale features, and perceptual linear prediction (PLP) coefficients; they follow the same filter-and-compress pattern but use different frequency warpings and filter shapes.
For many speech tasks, especially Text-to-Speech synthesis, and for ASR involving tonal languages or prosody, the fundamental frequency (F0), or pitch, is a significant feature. Pitch information is largely lost in standard spectral representations like MFCCs or log-Mel energies.
Pitch is typically estimated using specialized algorithms (e.g., YIN, pYIN, CREPE, RAPT) applied to the waveform or spectrum. The resulting F0 contour (one pitch value per frame) is often converted to a log or semitone scale, interpolated across unvoiced regions, paired with a voiced/unvoiced flag, and appended to the spectral features as additional input dimensions.
Including pitch can significantly improve the naturalness of synthesized speech and provide valuable cues for distinguishing words in tonal languages or understanding sentence modality (question vs. statement) in ASR.
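As an illustration, the sketch below estimates F0 with librosa's pYIN implementation and applies the common log-transform, interpolation, and voicing-flag steps; the file name and the 60-500 Hz search range are placeholder choices.

```python
import numpy as np
import librosa

# "speech.wav" is a placeholder; 16 kHz with a 10 ms hop matches typical ASR framing
y, sr = librosa.load("speech.wav", sr=16000)

# pYIN returns one F0 value per frame, with NaN for unvoiced frames
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=60.0, fmax=500.0, sr=sr, frame_length=1024, hop_length=160
)

# Common post-processing: log-F0, interpolated across unvoiced regions
log_f0 = np.log(f0)
idx = np.arange(len(log_f0))
voiced = ~np.isnan(log_f0)
log_f0 = np.interp(idx, idx[voiced], log_f0[voiced])

# Pair with a voicing flag so the model can distinguish real pitch from interpolation
pitch_feats = np.stack([log_f0, voiced_flag.astype(np.float32)], axis=0)
print(pitch_feats.shape)  # (2, num_frames)
```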
The ultimate expression of learned features is to feed the raw audio waveform samples directly into the neural network, bypassing all traditional signal processing steps. Models like wav2vec, wav2vec 2.0, HuBERT, and some end-to-end TTS systems (often incorporating WaveNet-like components) operate directly on sequences of audio samples.
Motivation: no hand-designed processing step can discard useful information before the model sees it; the network is free to learn whatever representation best serves the training objective, and self-supervised pre-training on large unlabeled corpora (as in wav2vec 2.0 and HuBERT) makes this practical.

Challenges: raw waveforms are very long sequences (16,000 samples per second at a 16 kHz sampling rate), so these models demand substantially more data, memory, and compute, and their learned representations can be sensitive to recording conditions and harder to interpret.
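For example, torchaudio ships pretrained wav2vec 2.0 bundles that consume raw 16 kHz waveforms directly; the snippet below, assuming the WAV2VEC2_BASE bundle (weights are downloaded on first use), extracts contextual features from a placeholder waveform.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 bundle; the model operates on raw waveform samples
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Placeholder for one second of real speech at the bundle's 16 kHz sample rate
wav = torch.randn(1, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(wav)

# One tensor of contextual features per transformer layer, roughly one frame per 20 ms
print(len(features), features[-1].shape)  # e.g. 12 torch.Size([1, 49, 768])
```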
While learned front-ends and raw waveform models represent the state-of-the-art in many research benchmarks, log-Mel filter bank energies remain a highly competitive and widely used feature representation in practical ASR and TTS systems. They offer a strong balance of information content, dimensionality, and computational feasibility.
The choice often depends on the amount of task-specific training data available, the computational budget for training and inference, whether a suitable pre-trained model (such as wav2vec 2.0 or HuBERT) can be reused, and how much the task relies on information, such as pitch, that standard spectral features discard.
Understanding these advanced feature options allows you to make informed decisions when designing or adapting speech processing systems, moving beyond default choices to potentially unlock higher performance by providing richer or more tailored information to your models. These features form the input foundation upon which the statistical and deep learning models we discuss next will operate.