Having explored methods for extracting numerical features from raw audio signals, we now turn our attention to modeling the inherent sequential nature and variability of speech. Speech, whether we're trying to recognize it (ASR) or generate it (TTS), is fundamentally a sequence of events unfolding over time. Furthermore, it's packed with uncertainty. Pronunciations vary, background noise intrudes, and speaking styles differ. Statistical models provide a mathematical framework for handling this sequential structure and uncertainty. While modern speech processing heavily relies on deep learning, understanding the statistical foundations helps appreciate the problems these advanced models solve and how they function.
At its core, Automatic Speech Recognition often frames the problem using Bayes' Theorem. We want to find the most likely word sequence W given an observed sequence of acoustic features A:
$$P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)}$$

Since we search for the most likely sequence W, and P(A) is constant for a given audio input, the task simplifies to maximizing the product of two probabilities:
$$\hat{W} = \underset{W}{\arg\max}\; P(A \mid W)\,P(W)$$

Here, P(A∣W) represents the acoustic model, estimating the likelihood of observing the acoustic features A if the word sequence W was spoken. P(W) represents the language model, estimating the prior probability of the word sequence W occurring in the target language. Early and even some current hybrid systems explicitly model these probabilities using statistical techniques.
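To make the decision rule concrete, the following Python sketch ranks a few candidate transcriptions by the sum of their acoustic and language model log-probabilities. The candidate strings, the scores, and the lm_weight parameter are all hypothetical placeholders rather than the output of an actual recognizer.

```python
# Hypothetical candidate transcriptions with made-up log-domain scores.
# "log_p_acoustic" stands in for log P(A|W) from an acoustic model and
# "log_p_lm" for log P(W) from a language model; none of these numbers
# come from a real system.
candidates = [
    {"words": "recognize speech",   "log_p_acoustic": -120.4, "log_p_lm": -7.1},
    {"words": "wreck a nice beach", "log_p_acoustic": -118.9, "log_p_lm": -13.6},
    {"words": "recognise peach",    "log_p_acoustic": -125.2, "log_p_lm": -11.0},
]

def total_score(hyp, lm_weight=1.0):
    # Combine the two models in the log domain:
    # log P(A|W) + lm_weight * log P(W).
    # Practical decoders usually apply a language-model weight to balance them.
    return hyp["log_p_acoustic"] + lm_weight * hyp["log_p_lm"]

best = max(candidates, key=total_score)
print(best["words"], total_score(best))
```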
For decades, Hidden Markov Models (HMMs) were the dominant approach for acoustic modeling, P(A∣W), in ASR. HMMs are well-suited for modeling sequences where the underlying generating process is not directly observable. In speech, the underlying (hidden) states might correspond to phonemes or sub-phonetic units, while the observations are the acoustic feature vectors (like MFCCs) extracted from the audio.
An HMM is defined by a set of hidden states, an initial state distribution $\pi$, a matrix of state transition probabilities $A = \{a_{ij}\}$ (the probability of moving from state $i$ to state $j$), and a set of emission probability distributions $B = \{b_j(O_t)\}$ (the probability of observing feature vector $O_t$ while in state $j$).
A simplified visualization of a 3-state Hidden Markov Model. Circles represent hidden states (e.g., phonemes), and rectangles represent observations (acoustic features). Solid arrows show state transitions ($a_{ij}$), while dashed arrows indicate the probability of emitting an observation from a state ($b_j(O_t)$).
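To tie these components to something concrete, the sketch below writes out a toy version of the 3-state model in the figure as plain arrays. It assumes discrete (vector-quantized) observations with only four symbols, whereas real HMM-GMM systems model continuous MFCC vectors with Gaussian mixtures; every number is illustrative rather than a trained value.

```python
import numpy as np

n_states = 3      # hidden states, e.g., sub-phonetic units of one phoneme
n_symbols = 4     # toy discrete observation alphabet

# Initial state distribution: always start in the first state.
pi = np.array([1.0, 0.0, 0.0])

# Transition probabilities a_ij = P(next state = j | current state = i).
# A left-to-right topology with self-loops, as is typical for speech HMMs.
A = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])

# Emission probabilities b_j(o) = P(observation symbol o | state j).
B = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.10, 0.25, 0.60],
])

# Each row of A and B is a probability distribution and must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```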
HMMs capture the temporal dependencies in speech through state transitions and model acoustic variability via emission probabilities. Training often involved the Expectation-Maximization (EM) algorithm, specifically the Baum-Welch algorithm, to estimate parameters A and B using Maximum Likelihood Estimation (MLE). The Viterbi algorithm was typically used during recognition (decoding) to find the most likely sequence of hidden states (and thus words) given the observations.
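As an illustration of the decoding step, here is a compact log-domain Viterbi implementation for a discrete-emission HMM like the toy one above. It is a simplified sketch that ignores practical matters such as beam pruning, state tying, and continuous emission densities.

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Return the most likely hidden-state path and its log-probability.

    pi: (S,) initial state probabilities
    A:  (S, S) transition probabilities a_ij
    B:  (S, V) discrete emission probabilities b_j(o)
    observations: sequence of observation symbol indices
    """
    S, T = len(pi), len(observations)
    with np.errstate(divide="ignore"):      # log(0) -> -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    log_delta = np.empty((T, S))            # best log-prob of a path ending in state j at time t
    backptr = np.zeros((T, S), dtype=int)   # best predecessor state for traceback

    log_delta[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        # scores[i, j]: best path ending in i at t-1, then taking transition i -> j
        scores = log_delta[t - 1][:, None] + log_A
        backptr[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + log_B[:, observations[t]]

    # Trace back from the best final state to recover the full path.
    path = [int(log_delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(log_delta[-1].max())

# Usage with the toy pi, A, B arrays from the previous sketch:
# path, log_p = viterbi(pi, A, B, observations=[0, 0, 1, 1, 3, 3])
```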
While powerful, traditional HMM-GMM systems have limitations: the first-order Markov assumption and the assumption that observations are conditionally independent given the current state restrict how much temporal context the model can capture, and Gaussian mixtures over hand-crafted features are a relatively inefficient way to represent the complex, correlated structure of real acoustic data.
These limitations motivated the shift towards deep learning models, which can learn more complex temporal dependencies and feature representations directly from data. However, HMM concepts remain relevant, particularly in hybrid systems (Chapter 2) where neural networks estimate HMM emission probabilities.
The other piece of the ASR puzzle is the language model (LM), P(W). Traditionally, this was handled by N-gram models. An N-gram model approximates the probability of the next word wi given the entire preceding history by conditioning only on the previous N−1 words:
$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$$

For example, a trigram model (N=3) conditions only on the two preceding words, approximating the probability of "recognition" given the entire history by P(recognition ∣ automatic, speech). These probabilities are typically estimated by counting occurrences of word sequences in large text corpora using MLE, often with smoothing techniques (like Kneser-Ney) to handle unseen N-grams.
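The sketch below shows how such estimates can be obtained from raw counts. It uses a tiny invented corpus and add-one smoothing as a simple stand-in for Kneser-Ney, whose discounting and back-off machinery is beyond the scope of this example.

```python
from collections import Counter

# A tiny, invented corpus; real language models use far larger text collections.
corpus = [
    "we use automatic speech recognition",
    "automatic speech recognition systems improve",
    "speech recognition needs good language models",
]

trigram_counts, context_counts = Counter(), Counter()
vocab = set()
for sentence in corpus:
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]   # pad with sentence boundaries
    vocab.update(tokens)
    for i in range(2, len(tokens)):
        trigram_counts[tuple(tokens[i - 2 : i + 1])] += 1   # count (w_{i-2}, w_{i-1}, w_i)
        context_counts[tuple(tokens[i - 2 : i])] += 1       # count (w_{i-2}, w_{i-1})

def p_trigram(word, prev2, prev1, k=1.0):
    # MLE count ratio with add-k smoothing so unseen trigrams keep nonzero probability.
    numerator = trigram_counts[(prev2, prev1, word)] + k
    denominator = context_counts[(prev2, prev1)] + k * len(vocab)
    return numerator / denominator

print(p_trigram("recognition", "automatic", "speech"))   # P(recognition | automatic, speech)
```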
While simple and computationally efficient, N-gram models suffer from data sparsity (many plausible word sequences never appear in the training data) and struggle to capture long-range dependencies in language. Modern ASR systems increasingly use neural language models (covered in Chapter 3) which overcome many of these limitations, but understanding N-grams provides context for LM integration techniques.
How do these statistical concepts relate to the deep learning architectures we'll focus on?
This review serves as a bridge. While we will predominantly use deep learning tools, remembering the underlying statistical problems they are designed to solve (modeling sequence probabilities, handling uncertainty, estimating likelihoods) provides a valuable perspective for understanding their architecture, training, and evaluation. We are essentially finding more powerful ways to estimate P(A∣W) and P(W) for ASR, or P(A∣T) (where T is text) for TTS, often learning representations and dependencies directly from data rather than relying on strong simplifying assumptions.