While LSTMs and GRUs introduced gating mechanisms to combat the severe vanishing gradient problem found in simple RNNs, they still face inherent difficulties in capturing relationships between elements that are very far apart in a sequence. This limitation stems directly from the sequential nature of processing information.
Think of the hidden state $h_t$ in an RNN as a running summary or memory of the sequence seen up to time step $t$. To influence the output or state at a much later time step $N$, information from an early time step $t$ must be successfully propagated through all intermediate steps $t+1, t+2, \dots, N-1$. This sequential path acts as a bottleneck.
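To make this concrete, here is a minimal sketch of a plain recurrent loop. The sizes, random weights, and tanh update are illustrative assumptions chosen for this sketch, not code from a particular library or a trained model; the point is that the only way the first input can affect the final state is by surviving every intermediate update.

```python
import numpy as np

# Minimal recurrent loop. Sizes and weights are arbitrary, chosen only to
# illustrate the sequential dependency; this is not a trained model.
hidden_size, input_size, seq_len = 16, 8, 100
rng = np.random.default_rng(0)

W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input projection
inputs = rng.normal(size=(seq_len, input_size))                # stand-in input sequence

h = np.zeros(hidden_size)
for x_t in inputs:                     # strictly sequential: step t needs the state from t-1
    h = np.tanh(W_h @ h + W_x @ x_t)   # information from step 1 must survive every update

# The final hidden state is the only route through which the first input
# can influence an output produced at the last time step.
print(h[:4])
```

LSTMs and GRUs replace the tanh update with gated updates, but the loop itself, and therefore the length of this path, is unchanged.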
Mathematically, this relates back to the propagation of gradients during backpropagation through time. The gradient of the loss $L$ with respect to an early hidden state $h_t$ depends on a product of Jacobian matrices representing the state transitions at each step:
$$
\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_N} \prod_{k=t+1}^{N} \frac{\partial h_k}{\partial h_{k-1}}
$$

Even with LSTMs and GRUs designed to keep the norms of these Jacobians closer to 1 on average, propagating information over a very large number of steps (large $N - t$) remains challenging. The gates provide control over information flow, allowing the network to potentially preserve important information over longer durations than a simple RNN. However, this control isn't perfect.
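To see why the length of this product matters, the sketch below accumulates the per-step Jacobians of a plain tanh RNN with small random weights (an assumption made purely for illustration) and prints the spectral norm of the running product, which shrinks rapidly as the number of steps grows. LSTM and GRU gating keeps the per-step factors closer to 1 on average, but the product over $N - t$ steps has the same structure.

```python
import numpy as np

# Illustrative only: a plain tanh RNN with small random weights, not a trained model.
hidden_size, seq_len = 16, 60
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

h = np.zeros(hidden_size)
J_prod = np.eye(hidden_size)                       # running product of dh_k/dh_{k-1}
for k in range(1, seq_len + 1):
    pre = W_h @ h + rng.normal(size=hidden_size)   # stand-in for the input term W_x x_k
    h = np.tanh(pre)
    J_k = np.diag(1.0 - h**2) @ W_h                # Jacobian of one tanh update w.r.t. h_{k-1}
    J_prod = J_k @ J_prod
    if k % 10 == 0:
        # Spectral norm of the accumulated product: a measure of how strongly
        # a change in the early state can still affect the state k steps later.
        print(f"N - t = {k:3d}   spectral norm of product = {np.linalg.norm(J_prod, 2):.2e}")
```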
Consider tasks where this limitation becomes apparent: resolving a pronoun whose antecedent appeared many sentences earlier, or answering a question whose supporting evidence sits at the start of a long document. In both cases, the relevant information must survive a large number of intermediate updates before it is needed.
The following diagram illustrates the sequential dependency path inherent in recurrent models. Information from Input 1 must pass through every intermediate hidden state to influence Output N.
The sequential path length in RNNs for information flow between time steps $t$ and $N$ is proportional to $N - t$.
While LSTMs and GRUs were significant improvements, this persistent difficulty in modeling very long-range dependencies directly motivated the development of architectures that could create shorter paths between distant sequence elements. The Transformer model, primarily through its self-attention mechanism, provides a way to directly model relationships between any two positions in the sequence, regardless of their distance, overcoming this fundamental limitation of recurrent processing. We will examine this mechanism in detail in the next chapter.