Gaussian Mixture Models (GMMs) can model the distribution of audio features for a single phoneme, while Hidden Markov Models (HMMs) can represent sequences. On their own, each model has a significant limitation. A GMM has no understanding of time or sequence; it can only tell you how well a single audio frame fits a phoneme's sound profile. An HMM, on the other hand, understands sequences but has no native way to connect its states to the continuous, complex data of an audio signal.
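To make the GMM side concrete, here is a minimal sketch using scikit-learn's GaussianMixture. The synthetic 13-dimensional vectors stand in for MFCC feature frames; the point is that the model scores each frame in isolation, with no notion of what came before or after:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dimensional MFCC frames labeled as /t/.
# A real system would extract these from audio aligned to the phoneme.
t_frames = rng.normal(loc=0.5, scale=1.0, size=(500, 13))

# Fit a GMM to the /t/ frames: a sound profile for that single phoneme.
gmm_t = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm_t.fit(t_frames)

# Score one new frame. The GMM answers only "how /t/-like is this frame?"
# It has no concept of time or of which frame preceded this one.
new_frame = rng.normal(loc=0.5, scale=1.0, size=(1, 13))
print(f"log p(frame | /t/) = {gmm_t.score_samples(new_frame)[0]:.2f}")
```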
The solution is to combine them into a single, more powerful architecture: the GMM-HMM. This hybrid model was the standard in speech recognition for many years and provides a solid foundation for understanding how ASR systems work.
Think of the HMM as a state machine where each state represents a phoneme. For the system to work, it needs to answer two questions at every step:

1. Which state should come next? That is, what is the probability of transitioning from the current phoneme to each possible next phoneme?
2. How well do the observed audio features match the current state? That is, what is the probability of the current audio frame given the proposed phoneme?
This second question is where the GMM comes in. In a GMM-HMM, each state of the HMM is associated with its own GMM. The GMM for a particular phoneme, like /t/, is trained only on audio frames that correspond to the /t/ sound.
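A minimal sketch of this training setup follows. It assumes frame-level phoneme labels are already available (in practice these come from a forced alignment step) and uses synthetic Gaussian data in place of real MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-in for aligned training data: audio frames grouped by phoneme label.
# A real pipeline would produce these pairs via forced alignment.
labeled_frames = {
    "/k/": rng.normal(-2.0, 1.0, size=(400, 13)),
    "/æ/": rng.normal(0.0, 1.0, size=(400, 13)),
    "/t/": rng.normal(2.0, 1.0, size=(400, 13)),
}

# One GMM per phoneme, each fit only on frames aligned to that phoneme.
gmms = {
    phoneme: GaussianMixture(n_components=4, covariance_type="diag",
                             random_state=0).fit(frames)
    for phoneme, frames in labeled_frames.items()
}
```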
When the ASR system is evaluating a segment of audio, the HMM proposes a sequence of states (phonemes). For each state in the sequence, it asks the corresponding GMM to calculate the probability of the observed audio features. This probability is called the emission probability.
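Continuing the sketch above (this reuses gmms and rng from the previous snippet), computing emission probabilities is a matter of scoring the same observed frame under every state's GMM:

```python
# A synthetic frame drawn to resemble /k/ in the toy data above.
frame = rng.normal(-2.0, 1.0, size=(1, 13))

# Each state's GMM scores the frame; the score is that state's
# emission log-probability for this observation.
emission_logprobs = {
    phoneme: gmm.score_samples(frame)[0] for phoneme, gmm in gmms.items()
}
best = max(emission_logprobs, key=emission_logprobs.get)
print(emission_logprobs, "-> best match:", best)  # expected: /k/
```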
So, the HMM handles the sequence (P(next_state∣current_state)), and the GMM handles the observation likelihood at each state (P(audio_features∣state)).
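Written out, this is the standard HMM factorization. For a state path s₁…s_T and observed feature frames o₁…o_T:

$$
P(s_{1:T}, o_{1:T}) = P(s_1)\,P(o_1 \mid s_1)\prod_{t=2}^{T} P(s_t \mid s_{t-1})\,P(o_t \mid s_t)
$$

Each transition term P(s_t ∣ s_{t-1}) comes from the HMM, and each emission term P(o_t ∣ s_t) is computed by the GMM attached to state s_t.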
Let's trace how a GMM-HMM would process the audio for the word "cat" (/k/ /æ/ /t/). The system uses an HMM where states correspond to these phonemes.
- /k/: The first few frames of audio (corresponding to the "k" sound) are fed into the system. The HMM is in the /k/ state. The GMM trained specifically for the /k/ phoneme evaluates these frames and calculates a high probability, confirming that the audio features are a good match for the /k/ sound. GMMs for other phonemes, like /æ/ or /s/, would return a very low probability for the same frames.
- /k/ -> /æ/: The HMM knows from its training data that a transition from /k/ to /æ/ is common in English, so it moves to the /æ/ state.
- /æ/: The next set of audio frames (for the "a" sound) is evaluated by the GMM associated with the /æ/ state. This GMM finds a strong match and outputs a high emission probability.
- /t/: The process repeats. The HMM transitions to the /t/ state, and the GMM for /t/ validates the final audio frames of the word.

The total probability of the path /k/ -> /æ/ -> /t/ is calculated by multiplying the transition probabilities and emission probabilities along the way. The decoder, which you will learn about later, is responsible for finding the sequence of states with the highest overall probability.
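Here is a small sketch of that path-scoring calculation. The transition and emission probabilities are made-up illustrative values, not numbers from a trained model, and the sum is done in log space, as real decoders do to avoid numerical underflow:

```python
import math

# Illustrative (made-up) probabilities for the path /k/ -> /æ/ -> /t/.
transition_logprob = {
    ("<start>", "/k/"): math.log(0.10),
    ("/k/", "/æ/"): math.log(0.30),
    ("/æ/", "/t/"): math.log(0.25),
}
emission_logprob = {
    "/k/": math.log(0.80),
    "/æ/": math.log(0.70),
    "/t/": math.log(0.75),
}

path = ["/k/", "/æ/", "/t/"]
score, prev = 0.0, "<start>"
for state in path:
    score += transition_logprob[(prev, state)]  # HMM: sequence term
    score += emission_logprob[state]            # GMM: observation term
    prev = state

print(f"log P(path, audio) = {score:.3f}")
```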
The diagram below illustrates this relationship. Each HMM state, representing a phoneme, contains a GMM responsible for calculating the probability of observing the audio features at that point in time.
The HMM determines the likely sequence of phoneme states, while the GMM within each state calculates the probability that the observed audio features match that specific phoneme.
By combining these two models, the GMM-HMM system effectively models both the statistical properties of individual speech sounds and the sequential, time-dependent nature of language. This architecture proved to be extremely effective and became the workhorse of the speech recognition field for decades before the rise of end-to-end deep learning methods.