As discussed in the chapter introduction, while the acoustic model provides the probability of the audio features given a word sequence, P(X∣W), the language model (LM) provides the prior probability of the word sequence itself, P(W). Combining these helps the ASR system identify sequences that are not only acoustically plausible but also linguistically likely. Traditional LMs, like n-gram models, estimate P(W) by looking at short, fixed-length histories of preceding words (e.g., trigrams consider only the previous two words). While computationally efficient and effective to some extent, n-gram models suffer from two main limitations: they cannot assign sensible probabilities to word sequences that never appear in the training data (data sparsity), and they cannot capture dependencies that extend beyond their fixed context window.
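To make the contrast concrete, the sketch below shows how a count-based trigram model estimates next-word probabilities. The function and variable names are illustrative, and a real n-gram LM would add smoothing (e.g., Kneser-Ney) to soften, though not eliminate, the sparsity problem.

```python
from collections import defaultdict

# Minimal trigram LM sketch: estimate P(w | w_prev2, w_prev1) from counts.
trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

def train(sentences):
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            trigram_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bigram_counts[(padded[i - 2], padded[i - 1])] += 1

def trigram_prob(w_prev2, w_prev1, w):
    """Maximum-likelihood estimate; returns 0.0 for unseen trigrams,
    which is exactly the sparsity problem described above."""
    denom = bigram_counts[(w_prev2, w_prev1)]
    return trigram_counts[(w_prev2, w_prev1, w)] / denom if denom else 0.0

train([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(trigram_prob("the", "cat", "sat"))  # 1.0
print(trigram_prob("the", "cat", "ran"))  # 0.0 -> never observed
```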
Neural Language Models (NLMs) address these limitations by using continuous representations of words (embeddings) and neural network architectures to model the probability distribution of the next word given a potentially much longer history.
Recurrent Neural Networks are naturally suited for sequence modeling tasks like language modeling. An RNN-LM processes the input sequence word by word, maintaining a hidden state vector that summarizes the information seen so far.
At each time step $t$, the RNN takes the embedding of the current word $w_t$ and the previous hidden state $h_{t-1}$ as input and computes the new hidden state $h_t$. This hidden state is then typically passed through a linear layer followed by a softmax function to produce a probability distribution over the entire vocabulary for the next word, $w_{t+1}$.
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$P(w_{t+1} \mid w_1, \dots, w_t) = \mathrm{softmax}(W_{hy} h_t + b_y)$$
Here, $x_t$ is the embedding for word $w_t$, $f$ is a non-linear activation function (like tanh or ReLU), and $W_{hh}$, $W_{xh}$, $W_{hy}$, $b_h$, $b_y$ are learnable weight matrices and biases.
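These two equations translate almost directly into code. The NumPy sketch below computes one RNN-LM step for a toy prefix; the dimensions, initialization, and token ids are arbitrary choices of ours, not values from the text.

```python
import numpy as np

V, E, H = 10_000, 128, 256  # assumed vocabulary, embedding, and hidden sizes
rng = np.random.default_rng(0)

# Learnable parameters (randomly initialized here; learned during training)
embed = rng.normal(scale=0.01, size=(V, E))   # word embedding table
W_xh = rng.normal(scale=0.01, size=(H, E))
W_hh = rng.normal(scale=0.01, size=(H, H))
W_hy = rng.normal(scale=0.01, size=(V, H))
b_h = np.zeros(H)
b_y = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(w_t, h_prev):
    """One step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h); P = softmax(W_hy h_t + b_y)."""
    x_t = embed[w_t]                                   # embedding of current word
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)    # new hidden state
    p_next = softmax(W_hy @ h_t + b_y)                 # distribution over next word
    return h_t, p_next

h = np.zeros(H)
for word_id in [42, 7, 1999]:      # token ids of a toy prefix
    h, p_next = rnn_lm_step(word_id, h)
print(p_next.shape)                # (10000,), sums to 1
```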
Variations like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) are commonly used instead of simple RNNs because they incorporate gating mechanisms that help mitigate the vanishing gradient problem, allowing them to learn longer-range dependencies more effectively.
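In practice such a model is usually built with a deep learning framework rather than by hand. Below is a minimal PyTorch sketch of an LSTM-based language model; the class name, layer sizes, and usage are our own illustrative choices.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal LSTM-LM: embed tokens, run an LSTM, project to vocabulary logits."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len) integer ids
        x = self.embedding(token_ids)
        out, state = self.lstm(x, state)    # out: (batch, seq_len, hidden_dim)
        logits = self.proj(out)             # (batch, seq_len, vocab_size)
        return logits, state                # state can be cached between calls

model = LSTMLanguageModel(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (1, 5))
logits, state = model(tokens)
log_probs = torch.log_softmax(logits[:, -1], dim=-1)  # log P(w_{t+1} | prefix)
```

Returning the recurrent state from `forward` lets the caller cache it between calls, which becomes relevant when the LM is queried incrementally during decoding (see below).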
A basic recurrent neural network (RNN) structure for language modeling. The hidden state h(t) (blue) is computed from the current input word w(t) (cyan) and the previous hidden state h(t-1). This state is then used to predict the probability distribution P(w(t+1)|...) (orange) for the subsequent word.
RNN-LMs offer significant advantages over n-grams: the hidden state lets them condition on the entire preceding word history rather than a fixed window, and their learned word embeddings let them generalize to word sequences never seen verbatim in the training data.
More recently, Transformer architectures, based entirely on self-attention mechanisms, have become dominant in many NLP tasks, including language modeling. Unlike RNNs, which process sequences sequentially, Transformers can process all words in the context in parallel (during training).
The self-attention mechanism allows the model to weigh the importance of different words in the context when predicting the next word, regardless of their distance. This makes Transformers particularly adept at capturing very long-range dependencies. A typical Transformer LM uses a stack of decoder layers, where each layer applies multi-head self-attention followed by feed-forward networks.
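A decoder-only Transformer LM can be sketched in a few lines of PyTorch. Here we emulate the stack of masked self-attention layers with `nn.TransformerEncoder` plus a causal mask, which behaves like a decoder-only stack without cross-attention; all names and hyperparameters below are our own illustrative choices.

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Decoder-only Transformer LM sketch: embeddings + causal self-attention stack."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        batch, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.embedding(token_ids) + self.pos_embedding(positions)
        # Causal mask so position t can only attend to positions <= t
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(token_ids.device)
        x = self.layers(x, mask=mask)
        return self.proj(x)      # logits over the vocabulary at every position

model = TransformerLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (1, 8)))   # shape (1, 8, 10000)
```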
Transformer LMs often achieve state-of-the-art results in perplexity (a common measure of LM performance) and have demonstrated remarkable capabilities in text generation.
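Perplexity is the exponential of the average per-token negative log-likelihood, so it can be computed directly from the cross-entropy loss. A minimal sketch, assuming logits and target token ids shaped as in the models above:

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    """Perplexity = exp(mean negative log-likelihood per token).
    logits: (batch, seq_len, vocab_size), targets: (batch, seq_len)."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="mean")
    return torch.exp(nll)

# With random logits the perplexity is on the order of the vocabulary size;
# a trained LM is dramatically lower.
logits = torch.randn(1, 8, 10_000)
targets = torch.randint(0, 10_000, (1, 8))
print(perplexity(logits, targets).item())
```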
Regardless of whether an RNN or Transformer architecture is used, the NLM's primary role in ASR is to provide the P(W) score during the decoding process (typically beam search). The ASR system seeks the word sequence W∗ that maximizes a combination of the acoustic model score and the language model score:
$$W^* = \arg\max_W P(X \mid W)\, P(W)^{\lambda}$$

or, more commonly in log-space:

$$W^* = \arg\max_W \left[ \log P(X \mid W) + \lambda \log P(W) \right]$$
Here, λ is the language model weight (sometimes called the LM scale factor), a hyperparameter that controls the influence of the LM relative to the acoustic model. It is typically tuned on a development set.
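The log-space combination is a one-line computation; the scores below are hypothetical and only illustrate how λ trades off the two models.

```python
def combined_score(log_p_acoustic, log_p_lm, lm_weight):
    """Log-space decoding score: log P(X|W) + lambda * log P(W)."""
    return log_p_acoustic + lm_weight * log_p_lm

# Hypothetical scores for two competing hypotheses (values made up for illustration)
hyp_a = combined_score(log_p_acoustic=-120.4, log_p_lm=-35.2, lm_weight=0.8)
hyp_b = combined_score(log_p_acoustic=-118.9, log_p_lm=-48.7, lm_weight=0.8)
print(hyp_a, hyp_b)   # the hypothesis with the higher (less negative) score wins
```

In this toy example, the first hypothesis wins despite a slightly worse acoustic score because its LM score is much better; the size of that effect is exactly what λ controls.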
During beam search, whenever a partial hypothesis W_prefix is extended with a potential next word w_next, the NLM is queried to compute P(w_next | W_prefix). Querying a large NLM for every possible next word for every hypothesis in the beam can be computationally expensive compared to querying an n-gram model (which often involves simpler lookups in a pre-compiled structure). This computational cost is a significant consideration when integrating NLMs into ASR systems, especially for real-time applications. Techniques like caching NLM states for common prefixes or using specialized hardware can help mitigate this, as in the sketch below.
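The following sketch shows one way such queries could be organized during a beam-search expansion step, caching the recurrent LM state per hypothesis so that only the newest word has to be fed through the NLM. It assumes an incremental LM interface like the LSTMLanguageModel above; the surrounding decoder, the source of the acoustic candidate scores, and all names are hypothetical simplifications.

```python
import torch

def extend_hypotheses(beam, lm_model, acoustic_scores, lm_weight, beam_size):
    """One beam-search expansion step with NLM scoring.
    beam: list of (prefix_ids, total_score, lm_state), where lm_state caches the
    recurrent state after consuming prefix_ids[:-1], so only the newest word of
    the prefix is fed to the NLM here.
    acoustic_scores: dict mapping candidate word id -> acoustic log-probability.
    """
    candidates = []
    for prefix, score, lm_state in beam:
        last_word = torch.tensor([[prefix[-1]]])
        with torch.no_grad():
            logits, new_state = lm_model(last_word, lm_state)   # one-step NLM query
        lm_log_probs = torch.log_softmax(logits[0, -1], dim=-1)
        for w_next, ac_log_p in acoustic_scores.items():
            new_score = score + ac_log_p + lm_weight * lm_log_probs[w_next].item()
            candidates.append((prefix + [w_next], new_score, new_state))
    # Keep only the best hypotheses for the next step
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```

In a real decoder the candidate words and their acoustic scores come from the acoustic model's output at the current frame or step; the point here is only the per-hypothesis state caching that avoids re-running the NLM over the full prefix.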
The methods for combining these scores, such as shallow fusion and deep fusion, will be discussed in the next section. These techniques define how and at what stage the NLM score is incorporated into the overall ASR decoding graph or search process. Using NLMs generally leads to substantial reductions in Word Error Rate (WER) compared to traditional n-gram models, especially for tasks with complex language or long sentences, justifying the added computational requirements in many applications.