Gaussian Mixture Models (GMMs) can model the distribution of audio features for a single phoneme, while Hidden Markov Models (HMMs) can represent sequences. On their own, each model has a significant limitation. A GMM has no understanding of time or sequence; it can only tell you how well a single audio frame fits a phoneme's sound profile. An HMM, on the other hand, understands sequences but has no native way to connect its states to the continuous, complex data of an audio signal.
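To make the GMM side concrete, here is a minimal sketch using scikit-learn's GaussianMixture. The synthetic 13-dimensional vectors stand in for MFCC feature frames; the point is that the model scores each frame in isolation, with no notion of what came before or after:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dimensional MFCC frames labeled as /t/.
# A real system would extract these from audio aligned to the phoneme.
t_frames = rng.normal(loc=0.5, scale=1.0, size=(500, 13))

# Fit a GMM to the /t/ frames: a sound profile for that single phoneme.
gmm_t = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm_t.fit(t_frames)

# Score one new frame. The GMM answers only "how /t/-like is this frame?"
# It has no concept of time or of which frame preceded this one.
new_frame = rng.normal(loc=0.5, scale=1.0, size=(1, 13))
print(f"log p(frame | /t/) = {gmm_t.score_samples(new_frame)[0]:.2f}")
```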
The solution is to combine them into a single, more powerful architecture: the GMM-HMM. This hybrid model was the standard in speech recognition for many years and provides a solid foundation for understanding how ASR systems work.
Think of the HMM as a state machine where each state represents a phoneme. For the system to work, it needs to answer two questions at every step:

1. Which state should come next? That is, what is the probability of transitioning from the current phoneme to each possible next phoneme?
2. How well do the observed audio features match the current state? That is, what is the probability of the current audio frame given the proposed phoneme?
This second question is where the GMM comes in. In a GMM-HMM, each state of the HMM is associated with its own GMM. The GMM for a particular phoneme, like /t/, is trained only on audio frames that correspond to the /t/ sound.
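A minimal sketch of this training setup follows. It assumes frame-level phoneme labels are already available (in practice these come from a forced alignment step) and uses synthetic Gaussian data in place of real MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-in for aligned training data: audio frames grouped by phoneme label.
# A real pipeline would produce these pairs via forced alignment.
labeled_frames = {
    "/k/": rng.normal(-2.0, 1.0, size=(400, 13)),
    "/æ/": rng.normal(0.0, 1.0, size=(400, 13)),
    "/t/": rng.normal(2.0, 1.0, size=(400, 13)),
}

# One GMM per phoneme, each fit only on frames aligned to that phoneme.
gmms = {
    phoneme: GaussianMixture(n_components=4, covariance_type="diag",
                             random_state=0).fit(frames)
    for phoneme, frames in labeled_frames.items()
}
```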
When the ASR system is evaluating a segment of audio, the HMM proposes a sequence of states (phonemes). For each state in the sequence, it asks the corresponding GMM to calculate the probability of the observed audio features. This probability is called the emission probability.
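Continuing the sketch above (this reuses gmms and rng from the previous snippet), computing emission probabilities is a matter of scoring the same observed frame under every state's GMM:

```python
# A synthetic frame drawn to resemble /k/ in the toy data above.
frame = rng.normal(-2.0, 1.0, size=(1, 13))

# Each state's GMM scores the frame; the score is that state's
# emission log-probability for this observation.
emission_logprobs = {
    phoneme: gmm.score_samples(frame)[0] for phoneme, gmm in gmms.items()
}
best = max(emission_logprobs, key=emission_logprobs.get)
print(emission_logprobs, "-> best match:", best)  # expected: /k/
```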
So, the HMM handles the sequence (P(next_state∣current_state)), and the GMM handles the observation likelihood at each state (P(audio_features∣state)).
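Written out, this is the standard HMM factorization. For a state path s₁…s_T and observed feature frames o₁…o_T:

$$
P(s_{1:T}, o_{1:T}) = P(s_1)\,P(o_1 \mid s_1)\prod_{t=2}^{T} P(s_t \mid s_{t-1})\,P(o_t \mid s_t)
$$

Each transition term P(s_t ∣ s_{t-1}) comes from the HMM, and each emission term P(o_t ∣ s_t) is computed by the GMM attached to state s_t.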
Let's trace how a GMM-HMM would process the audio for the word "cat" (/k/ /æ/ /t/). The system uses an HMM where states correspond to these phonemes.
- /k/: The first few frames of audio (corresponding to the "k" sound) are fed into the system. The HMM is in the /k/ state. The GMM trained specifically for the /k/ phoneme evaluates these frames and calculates a high probability, confirming that the audio features are a good match for the /k/ sound. GMMs for other phonemes, like /æ/ or /s/, would return a very low probability for the same frames.
- /k/ -> /æ/: The HMM knows from its training data that a transition from /k/ to /æ/ is common in English, so it moves to the /æ/ state.
- /æ/: The next set of audio frames (for the "a" sound) is evaluated by the GMM associated with the /æ/ state. This GMM finds a strong match and outputs a high emission probability.
- /t/: The process repeats. The HMM transitions to the /t/ state, and the GMM for /t/ validates the final audio frames of the word.

The total probability of the path /k/ -> /æ/ -> /t/ is calculated by multiplying the transition probabilities and emission probabilities along the way. The decoder, which you will learn about later, is responsible for finding the sequence of states with the highest overall probability.
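Here is a small sketch of that path-scoring calculation. The transition and emission probabilities are made-up illustrative values, not numbers from a trained model, and the sum is done in log space, as real decoders do to avoid numerical underflow:

```python
import math

# Illustrative (made-up) probabilities for the path /k/ -> /æ/ -> /t/.
transition_logprob = {
    ("<start>", "/k/"): math.log(0.10),
    ("/k/", "/æ/"): math.log(0.30),
    ("/æ/", "/t/"): math.log(0.25),
}
emission_logprob = {
    "/k/": math.log(0.80),
    "/æ/": math.log(0.70),
    "/t/": math.log(0.75),
}

path = ["/k/", "/æ/", "/t/"]
score, prev = 0.0, "<start>"
for state in path:
    score += transition_logprob[(prev, state)]  # HMM: sequence term
    score += emission_logprob[state]            # GMM: observation term
    prev = state

print(f"log P(path, audio) = {score:.3f}")
```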
The diagram below illustrates this relationship. Each HMM state, representing a phoneme, contains a GMM responsible for calculating the probability of observing the audio features at that point in time.
The HMM determines the likely sequence of phoneme states, while the GMM within each state calculates the probability that the observed audio features match that specific phoneme.
By combining these two models, the GMM-HMM system effectively models both the statistical properties of individual speech sounds and the sequential, time-dependent nature of language. This architecture proved to be extremely effective and became the workhorse of the speech recognition field for decades before the rise of end-to-end deep learning methods.