After collecting thousands of examples of a specific phoneme, like /t/, and extracting their MFCC feature vectors, you would not find a single, identical vector for all of them. Instead, you would discover a cloud of data points. This variation comes from differences in pitch, speed, accent, and the influence of neighboring sounds. The task for the acoustic model is to create a mathematical description for this cloud of points for each phoneme.
A simple approach would be to model this cloud with a single Gaussian distribution (often called a normal distribution or bell curve). A Gaussian distribution describes data that clusters around an average value, or mean. While useful, a single Gaussian is often too rigid to capture the complex variations in how a phoneme is pronounced. For instance, the /t/ sound at the beginning of "top" might produce slightly different features than the /t/ in "stop," creating multiple clusters within the data cloud.
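To make the single-Gaussian idea concrete, here is a minimal sketch. The feature array and variable names are placeholders (random numbers standing in for real MFCC vectors), not data from the text: a single Gaussian model of a phoneme's feature cloud is simply the sample mean and covariance of that cloud.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative stand-in for real MFCC vectors of one phoneme:
# 1,000 example frames, each a 13-dimensional feature vector.
rng = np.random.default_rng(0)
t_features = rng.normal(size=(1000, 13))

# A single-Gaussian model is just the sample mean and covariance of the cloud.
mean = t_features.mean(axis=0)
cov = np.cov(t_features, rowvar=False)

# Likelihood of a new, unseen frame under this single-Gaussian model.
new_frame = rng.normal(size=13)
likelihood = multivariate_normal(mean=mean, cov=cov).pdf(new_frame)
print(f"p(frame | /t/) = {likelihood:.6f}")
```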
A more powerful and flexible method is to use a Gaussian Mixture Model (GMM). A GMM is not just one Gaussian distribution, but a combination of several. Think of it as using multiple, smaller bell curves to approximate a more complex shape. By combining them, a GMM can effectively model a distribution that has multiple peaks or is not perfectly symmetrical.
Each individual Gaussian in the mixture is called a "component," and it has its own mean and variance. The GMM also includes a "weight" for each component, which represents how much that specific Gaussian contributes to the overall model. This allows the GMM to fit the complex distribution of feature vectors for a single phoneme much more accurately than a single Gaussian could.
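The "weighted combination of bell curves" idea can be written directly as code. The sketch below uses made-up component parameters (weights, means, and covariances chosen only for illustration) and evaluates the GMM density as the weighted sum of its component Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-dimensional GMM with three components.
weights = np.array([0.5, 0.3, 0.2])           # mixture weights, summing to 1
means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [-2.0, 2.5]])                # one mean per component
covs = [np.eye(2),
        0.5 * np.eye(2),
        np.diag([1.0, 0.3])]                   # one covariance per component

def gmm_density(x):
    """GMM density: the weighted sum of each component's Gaussian density."""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([0.5, 0.5])))
```

Because each component has its own mean and shape, the summed density can have several peaks, which is exactly what a multi-cluster phoneme cloud requires.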
The diagram below shows how a GMM with three components can model a scatter plot of feature vectors that are grouped into distinct clusters.
A Gaussian Mixture Model uses multiple distributions (represented by the colored ovals) to capture the complex pattern of feature vectors for a single phoneme.
In a traditional ASR system, this process is repeated for every phoneme in a language. The system trains a unique GMM for /a/, another for /b/, another for /k/, and so on. Each GMM learns the specific statistical distribution of the feature vectors that correspond to its phoneme from a large dataset of transcribed audio.
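One common way to implement this per-phoneme training is with scikit-learn's `GaussianMixture`. The sketch below uses placeholder phoneme labels and randomly generated feature arrays rather than a real transcribed-audio dataset; the number of components is a modeling choice, not a value from the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training data: a dict mapping each phoneme label to an
# (n_frames, 13) array of MFCC vectors extracted from transcribed audio.
rng = np.random.default_rng(0)
training_data = {
    "/a/": rng.normal(loc=0.0, size=(500, 13)),
    "/b/": rng.normal(loc=1.0, size=(500, 13)),
    "/k/": rng.normal(loc=-1.0, size=(500, 13)),
}

# Fit one GMM per phoneme so each model learns its own feature distribution.
phoneme_gmms = {}
for phoneme, features in training_data.items():
    gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
    gmm.fit(features)
    phoneme_gmms[phoneme] = gmm
```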
Once trained, these GMMs are ready to be used for recognition. When the system processes a new, unknown segment of audio and extracts its feature vector, it asks a simple question to each GMM: "What is the probability that you generated this vector?"
The GMM for the phoneme /t/ will calculate a high likelihood score if the incoming vector falls within its learned distribution. In contrast, the GMMs for /d/ or /a/ will calculate very low scores. This gives us the desired probability, P(features∣phoneme), for every phoneme. For a given feature vector, the output might look like the set of scores produced in the sketch below:
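Continuing the training sketch above (it reuses the hypothetical `phoneme_gmms` dictionary), recognition amounts to asking each trained GMM how likely the new feature vector is under its distribution. In scikit-learn, `score_samples` returns this as a log-likelihood per frame.

```python
import numpy as np

# A new, unknown feature vector (placeholder values).
rng = np.random.default_rng(1)
new_frame = rng.normal(size=(1, 13))

# score_samples returns log P(features | phoneme) for each frame.
scores = {phoneme: gmm.score_samples(new_frame)[0]
          for phoneme, gmm in phoneme_gmms.items()}

# Print the phonemes from most to least likely for this frame.
for phoneme, log_likelihood in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"log P(features | {phoneme}) = {log_likelihood:.2f}")
```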
This set of probabilities is exactly what the acoustic model needs to produce. However, GMMs alone have a significant limitation: they analyze each audio frame in isolation. They have no inherent understanding of time or sequence. Speech is fundamentally a sequence of sounds, not a random collection. To model this temporal flow, GMMs were paired with another statistical tool called Hidden Markov Models (HMMs), which we will cover next.