Speech recognition systems rely on extracted features, such as MFCCs or log-mel spectrograms. These raw feature values can vary significantly due to factors unrelated to the spoken content, such as microphone type, recording distance, and background noise level. Feature normalization is a preprocessing step designed to address this variability, ensuring that the model learns the phonetic content of speech rather than incidental recording conditions. By scaling features to a consistent range, normalization helps stabilize the training process and improve the model's ability to generalize to new, unseen data.
The most common normalization technique used in speech recognition is Cepstral Mean and Variance Normalization, or CMVN. The objective of CMVN is to transform the feature vectors so that they have a mean of zero and a variance of one. This process, also known as standardization, removes shifts in the mean and variance of the features that can be caused by the recording channel or speaker characteristics.
Applying CMVN involves calculating the mean and standard deviation for each feature coefficient across a set of frames and then using these statistics to normalize each frame. The scope of the statistics calculation defines two common variants: utterance-level CMVN, where statistics are computed independently for each utterance, and global CMVN, where a single set of statistics is computed over the entire training corpus.
The diagram below illustrates the process for utterance-level CMVN, where each utterance's feature matrix is normalized independently.
The process of utterance-level Cepstral Mean and Variance Normalization.
For a given feature coefficient dimension $i$ (for example, the 4th MFCC coefficient), the mean is calculated across all time frames of an utterance or set of utterances:

$$\mu_i = \frac{1}{T} \sum_{t=1}^{T} x_i(t)$$

Here, $x_i(t)$ is the value of the $i$-th coefficient at time frame $t$, and $T$ is the total number of frames. The standard deviation is calculated as the square root of the variance:

$$\sigma_i = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( x_i(t) - \mu_i \right)^2}$$

Finally, each coefficient is normalized to produce the new feature value $\hat{x}_i(t)$:

$$\hat{x}_i(t) = \frac{x_i(t) - \mu_i}{\sigma_i}$$
This calculation is performed independently for each coefficient in the feature vector. For example, if you are using 13 MFCCs, you will compute 13 separate means and 13 separate standard deviations.
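The per-coefficient calculation above can be sketched in a few lines of NumPy. This is a minimal utterance-level implementation; the function name `cmvn` and the synthetic feature matrix are illustrative, not from a particular toolkit.

```python
import numpy as np

def cmvn(features: np.ndarray) -> np.ndarray:
    """Utterance-level CMVN for a (num_frames, num_coeffs) feature matrix.

    Each coefficient dimension is normalized independently: for 13 MFCCs,
    this computes 13 means and 13 standard deviations along the time axis.
    """
    mean = features.mean(axis=0)           # one mean per coefficient
    std = features.std(axis=0)             # one std per coefficient
    std = np.where(std < 1e-8, 1.0, std)   # guard against constant/silent dims
    return (features - mean) / std

# Synthetic example: 200 frames of 13 coefficients centered near 5.0
rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=2.0, size=(200, 13))
normed = cmvn(feats)
```

After normalization, `normed.mean(axis=0)` is approximately zero and `normed.std(axis=0)` is approximately one for every coefficient, matching the distributions shown in the chart above.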
The chart below shows a histogram of values for a single feature coefficient before and after CMVN. Before normalization, the feature distribution is centered around a mean of 5.0. After applying CMVN, the distribution is centered at 0 with a standard deviation of 1.
Effect of CMVN on the distribution of a single feature coefficient. The normalized feature is centered at zero.
When implementing a feature extraction pipeline, the choice of normalization strategy has practical implications.
If you choose global CMVN, it is extremely important to compute the mean and standard deviation statistics only from the training data. Saving these statistics and applying them to the validation and test sets prevents information from the test set from "leaking" into your training process, which would give you an overly optimistic evaluation of your model's performance.
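The train-only statistics workflow can be sketched as follows. The helper names and the synthetic utterance lists are assumptions for illustration; the key point is that the mean and standard deviation are computed once from training utterances and then reused unchanged on held-out data.

```python
import numpy as np

def compute_global_stats(train_utterances):
    """Compute one mean and std per coefficient from training frames only."""
    all_frames = np.concatenate(train_utterances, axis=0)
    mean = all_frames.mean(axis=0)
    std = all_frames.std(axis=0)
    return mean, np.where(std < 1e-8, 1.0, std)

def apply_global_cmvn(features, mean, std):
    """Normalize any utterance with previously saved training statistics."""
    return (features - mean) / std

rng = np.random.default_rng(1)
# Ten training utterances of varying length, plus one held-out test utterance
train = [rng.normal(5.0, 2.0, size=(int(n), 13))
         for n in rng.integers(100, 300, size=10)]
test_utt = rng.normal(5.0, 2.0, size=(150, 13))

mean, std = compute_global_stats(train)               # training data only
test_normed = apply_global_cmvn(test_utt, mean, std)  # stats are reused, not refit
```

In practice, the statistics would be saved to disk alongside the model so that exactly the same transform is applied at inference time.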
For utterance-level CMVN, the process is simpler: each utterance is normalized independently using statistics computed from its own frames. This can be beneficial for applications where the recording conditions are expected to change frequently between utterances.
While modern neural network architectures often include internal normalization layers like BatchNorm or LayerNorm, applying CMVN directly to the input features remains a common and beneficial practice. It provides a clean, standardized input that can reduce the burden on the initial layers of the network, often leading to faster convergence and better overall performance. This step ensures that the acoustic model can focus on learning the mapping from phonetically relevant patterns to text, without being distracted by extrinsic acoustic variability.