While simple Recurrent Neural Networks (RNNs) are designed for sequential data, they encounter significant limitations when processing the long and variable sequences found in speech. Their main weakness is the difficulty of learning and retaining information over extended time steps, often described as the problem of long-range dependencies. This issue arises primarily from the vanishing gradient problem, where the influence of past inputs on the current output diminishes exponentially as the sequence gets longer.
For an ASR system, this is a serious drawback. The model must be able to connect sounds that occurred several seconds apart to form a coherent word or phrase. For instance, to correctly transcribe the end of a sentence, the model might need to recall information from the very beginning. Simple RNNs are not well-equipped for this task. To overcome these challenges, we turn to more sophisticated recurrent architectures: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
LSTMs were explicitly designed to remember information for long periods. They introduce a cell state that acts as a conveyor belt of information, running straight down the entire sequence with only minor linear interactions. This structure makes it much easier for information to flow unchanged.
The magic of LSTMs lies in their ability to regulate this information flow using three specialized components called gates. These gates are small neural networks that learn which information is important to add, keep, or remove from the cell state.
Forget Gate: This gate decides what information to discard from the previous cell state. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$ and outputs a number between 0 and 1 for each number in the previous cell state $C_{t-1}$. A 1 represents "completely keep this," while a 0 represents "completely get rid of this." In speech, this could mean forgetting the acoustic signature of a brief silence between words.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Input Gate: This gate determines what new information will be stored in the cell state. It has two parts: a sigmoid layer that decides which values we'll update (the input gate $i_t$), and a tanh layer that creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. This is how the model incorporates new sounds, like the beginning of a new phoneme.
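In the same notation as the forget gate, these two parts are:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$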
Output Gate: This gate determines the network's output. The output will be a filtered version of the cell state. First, we run a sigmoid layer to decide which parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
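In equation form, the output gate and the resulting hidden state $h_t$ are:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$

where $C_t$ is the new cell state computed below.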
The new cell state $C_t$ is calculated by combining the old state (multiplied by the forget gate) and the new candidate values (multiplied by the input gate).
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

This gated mechanism allows the LSTM to selectively remember or forget information, making it exceptionally powerful for modeling the complex temporal dynamics of speech.
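To make the gating concrete, here is a minimal NumPy sketch of a single LSTM time step that follows the equations above. The weight shapes, feature dimension, and hidden size are illustrative assumptions, not values from a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations above."""
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)        # input gate: which candidates to admit
    c_tilde = np.tanh(W_C @ z + b_C)    # candidate values
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose

    c_t = f_t * c_prev + i_t * c_tilde  # new cell state
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t

# Illustrative sizes (assumed): 40-dim acoustic features, 64-dim hidden state
input_size, hidden_size = 40, 64
rng = np.random.default_rng(0)
shapes = [(hidden_size, hidden_size + input_size), hidden_size] * 4  # W, b for each gate
params = [rng.standard_normal(s) * 0.1 for s in shapes]

x_t = rng.standard_normal(input_size)
h0 = np.zeros(hidden_size)
c0 = np.zeros(hidden_size)
h1, c1 = lstm_step(x_t, h0, c0, *params)
print(h1.shape, c1.shape)  # (64,) (64,)
```

In a real acoustic model this step would be applied frame by frame across an utterance, with $h_t$ and $C_t$ carried forward at each step.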
An LSTM cell. The top line represents the cell state, which carries information across time steps. The gates (in red) control how information is added to or removed from this state.
The Gated Recurrent Unit, or GRU, is a more recent and slightly simpler alternative to the LSTM. It combines the forget and input gates into a single update gate and merges the cell state and hidden state. This results in a model that is more computationally efficient.
A GRU cell has two main gates:

Update Gate: This gate plays a role similar to the combined forget and input gates of the LSTM. It decides how much of the previous hidden state to keep and how much of the new candidate state to mix in.

Reset Gate: This gate controls how much of the previous hidden state is used when computing the new candidate state. When it is close to 0, the unit effectively ignores the past and starts fresh from the current input.
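For reference, one common formulation of the GRU updates, written in the same notation as the LSTM equations above, is:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$

Here the update gate $z_t$ interpolates between keeping the previous hidden state and adopting the new candidate $\tilde{h}_t$, while the reset gate $r_t$ controls how much past information feeds into that candidate.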
The simplified structure of GRUs means they have fewer parameters than LSTMs, which can make them faster to train and less prone to overfitting on smaller datasets. In many ASR tasks, GRUs have been shown to deliver performance comparable to LSTMs.
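As a quick sanity check of the parameter savings, here is a short PyTorch comparison; the layer sizes are arbitrary choices for illustration.

```python
import torch.nn as nn

# Same input/hidden sizes for both; values chosen only for illustration
input_size, hidden_size = 80, 256

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(f"LSTM parameters: {count_params(lstm):,}")  # 4 gate weight sets per layer
print(f"GRU parameters:  {count_params(gru):,}")   # 3 gate weight sets, about 25% fewer
```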
A GRU cell. It simplifies the LSTM design by combining gates and merging the cell and hidden states, leading to a more streamlined architecture.
To further enhance our model's ability to understand context, we can employ two additional strategies: bidirectionality and stacking.
Bidirectional RNNs

In speech, context is not just historical; it is also forward-looking. The pronunciation or meaning of a word can be influenced by the words that follow it. For example, to distinguish between the two pronunciations of "read" in "I read the book" and "I will read the book," the model benefits from seeing the entire sentence.
A bidirectional RNN (Bi-LSTM or Bi-GRU) processes the input sequence in two directions. One recurrent layer processes the sequence from start to finish (forward pass), while a second, independent layer processes it from finish to start (backward pass). At each time step, the outputs from both layers are concatenated to form the final representation. This allows the model to have a complete picture of the surrounding context for every point in the sequence.
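A minimal PyTorch sketch of this idea, assuming 80-dimensional acoustic features and an arbitrary hidden size:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: 80-dim feature vectors, 128-dim hidden state
feature_dim, hidden_size = 80, 128
bi_lstm = nn.LSTM(feature_dim, hidden_size, batch_first=True, bidirectional=True)

# A batch of 4 utterances, each 200 frames long
frames = torch.randn(4, 200, feature_dim)
outputs, _ = bi_lstm(frames)

# Forward and backward outputs are concatenated at every time step
print(outputs.shape)                       # torch.Size([4, 200, 256])
forward_out = outputs[..., :hidden_size]   # start-to-finish pass
backward_out = outputs[..., hidden_size:]  # finish-to-start pass
```

Note that the output dimension doubles: each time step now carries both past and future context.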
Stacked RNNs

Just like with other types of neural networks, we can increase the depth of our model by stacking recurrent layers on top of one another. In a stacked RNN, the output sequence of the first layer becomes the input sequence for the second layer, and so on.
This approach allows the network to learn hierarchical representations of the data. The first layer might learn to detect basic acoustic features and phonemes. The second layer could then learn to combine these phoneme representations into syllables or word fragments, and subsequent layers could learn even higher-level linguistic structures. A typical ASR acoustic model might use anywhere from 2 to 6 stacked bidirectional LSTM or GRU layers.
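Putting the two ideas together, here is a sketch of a stacked bidirectional acoustic model in PyTorch. The layer count, sizes, and label vocabulary are placeholder assumptions; the real values depend on your dataset and the CTC setup covered next.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stacked Bi-LSTM that maps acoustic frames to per-frame label scores."""
    def __init__(self, feature_dim=80, hidden_size=256, num_layers=4, num_labels=29):
        super().__init__()
        # num_layers stacked bidirectional LSTM layers, dropout between layers
        self.encoder = nn.LSTM(
            feature_dim, hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.1,
        )
        # Project the concatenated forward/backward states to label scores
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, frames):
        # frames: (batch, time, feature_dim)
        encoded, _ = self.encoder(frames)   # (batch, time, 2 * hidden_size)
        return self.classifier(encoded)     # (batch, time, num_labels)

model = AcousticModel()
dummy = torch.randn(2, 300, 80)  # 2 utterances, 300 frames of 80-dim features
print(model(dummy).shape)        # torch.Size([2, 300, 29])
```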
By using LSTMs or GRUs, often in a stacked, bidirectional configuration, we create an acoustic model that is highly effective at capturing the complex, long-range patterns of human speech. This powerful sequence-to-sequence architecture is the foundation upon which we will build our CTC-based training pipeline in the next section.