While simple Recurrent Neural Networks (RNNs), as discussed in the previous section, provide a mechanism for processing sequences by maintaining a hidden state, they encounter significant difficulties when dealing with longer sequences. These limitations were a primary driver for the development of more complex architectures like LSTMs and GRUs, and ultimately, the Transformer. The core issues stem from how gradients propagate through the network over many time steps during training.
Training RNNs typically involves Backpropagation Through Time (BPTT). This algorithm unfolds the RNN over the sequence length and applies standard backpropagation. To calculate the gradient of the loss function with respect to parameters used early in the sequence (e.g., weights influencing the hidden state $h_k$), the chain rule requires multiplying many Jacobian matrices corresponding to the recurrent state transition.
Let $L$ be the total loss over the sequence. The gradient of the loss with respect to an early hidden state $h_k$ depends on the gradients from all later states $h_t$ ($t > k$):

$$\frac{\partial L}{\partial h_k} = \sum_{t=k+1}^{T} \frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}$$

where $T$ is the sequence length and $L_t$ is the loss at time step $t$. The term $\frac{\partial h_t}{\partial h_k}$ represents the influence of $h_k$ on $h_t$. This influence is calculated by chaining the Jacobians of the state transition:

$$\frac{\partial h_t}{\partial h_k} = \frac{\partial h_t}{\partial h_{t-1}}\,\frac{\partial h_{t-1}}{\partial h_{t-2}}\cdots\frac{\partial h_{k+1}}{\partial h_k}$$

Each Jacobian $\frac{\partial h_i}{\partial h_{i-1}}$ depends on the recurrent weight matrix $W_{hh}$ and the derivative of the activation function (e.g., $\tanh'$). If the magnitudes (specifically, the singular values) of these Jacobians are consistently less than 1, their product can shrink exponentially as the distance $t-k$ increases.
Illustration of Backpropagation Through Time (BPTT): gradients from later time steps (such as step $T$) must flow backward through the recurrent connections (influenced by $W_{hh}$) to update parameters affecting earlier states such as $h_k$. The repeated multiplication of Jacobians during this backward pass is the source of vanishing and exploding gradients.
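To make the effect of chained Jacobians concrete, the following sketch multiplies $T$ matrices whose singular values all equal a constant $\gamma$ and reports the spectral norm of the product, which scales as $\gamma^T$. It is an illustrative assumption, not part of the original example: the sizes, the sequence length, and the use of scaled orthogonal matrices as stand-ins for the per-step Jacobians are all arbitrary choices.

import torch

torch.manual_seed(0)
n, T = 32, 50   # hidden size and sequence length (arbitrary choices)

def chained_norm(gamma):
    # Scaled random orthogonal matrices: every singular value equals gamma,
    # so the product's spectral norm is exactly gamma ** T.
    product = torch.eye(n)
    for _ in range(T):
        Q, _ = torch.linalg.qr(torch.randn(n, n))   # random orthogonal matrix
        product = (gamma * Q) @ product
    return torch.linalg.matrix_norm(product, ord=2).item()

print("gamma = 0.9:", chained_norm(0.9))   # about 0.9**50 ~ 5e-3 (vanishing)
print("gamma = 1.1:", chained_norm(1.1))   # about 1.1**50 ~ 1.2e2 (exploding)

Real RNN Jacobians are not exactly scaled orthogonal matrices, but the same exponential behavior appears whenever their singular values sit consistently below or above 1.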
This phenomenon is known as the vanishing gradient problem. The consequence is that the error signal from later time steps becomes too weak to significantly update the weights responsible for capturing information from much earlier steps. Effectively, the RNN struggles to learn long-range dependencies in the data. It becomes difficult for the model to connect events or information separated by many time steps in the sequence.
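One way to observe this empirically is to unroll a vanilla RNN cell, retain the gradient of each intermediate hidden state, and inspect how much signal from a loss at the final step reaches earlier steps. The sketch below uses torch.nn.RNNCell with arbitrary sizes; the exact numbers printed depend on the random initialization, but the decay toward early steps is the typical outcome.

import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=8, hidden_size=32)   # tanh activation by default
seq_len = 60
inputs = torch.randn(seq_len, 1, 8)               # (time, batch, features)

h = torch.zeros(1, 32)
hiddens = []
for t in range(seq_len):
    h = cell(inputs[t], h)
    h.retain_grad()          # keep dL/dh_t for inspection after backward()
    hiddens.append(h)

# The loss depends only on the final hidden state, so any gradient reaching an
# early step must travel backward through every recurrent transition.
loss = hiddens[-1].pow(2).sum()
loss.backward()

for t in [0, 15, 30, 45, 59]:
    print(f"step {t:2d}  ||dL/dh_t|| = {hiddens[t].grad.norm().item():.2e}")

With the default initialization, the gradient norms at the earliest steps are usually several orders of magnitude smaller than at the final step, which is the vanishing gradient problem in action.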
Conversely, if the Jacobians $\frac{\partial h_i}{\partial h_{i-1}}$ have magnitudes consistently greater than 1, their product can grow exponentially large. This leads to the exploding gradient problem.
When gradients explode, the weight updates become excessively large, potentially causing the model's parameters to diverge toward infinity or become NaN (Not a Number). This destabilizes the training process, often producing sudden spikes in the loss function or complete training failure.
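Because an explosion typically shows up as a sudden jump in the overall gradient norm, one simple way to catch it is to measure that norm after every backward pass. The snippet below is a hypothetical monitoring sketch: the model, the random data, and the 100.0 threshold are illustrative assumptions, not values from the text.

import torch
import torch.nn as nn

def total_grad_norm(parameters):
    # Square root of the sum of squared per-parameter gradient norms.
    norms = [p.grad.norm() for p in parameters if p.grad is not None]
    return torch.sqrt(sum(n ** 2 for n in norms)).item()

torch.manual_seed(0)
model = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
outputs, _ = model(torch.randn(4, 100, 8))
loss = outputs.pow(2).mean()
loss.backward()

norm = total_grad_norm(model.parameters())
if norm > 100.0:   # illustrative threshold
    print(f"warning: gradient norm {norm:.1f} suggests exploding gradients")
else:
    print(f"gradient norm {norm:.3f} is in a reasonable range")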
While exploding gradients are often easier to detect and can sometimes be mitigated with techniques like gradient clipping (scaling gradients down when their norm exceeds a chosen threshold), clipping is more of a bandage than a cure. It prevents catastrophic divergence, but it does not solve the underlying difficulty of propagating an informative gradient signal across long temporal distances, which is precisely what the vanishing gradient problem prevents.
import torch
import torch.nn as nn

# Gradient clipping example in PyTorch with a small RNN and random data.
# The model, tensor shapes, and learning rate are illustrative placeholders.
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(4, 50, 8)      # (batch, seq_len, features)
targets = torch.randn(4, 50, 16)    # dummy targets matching the output shape

# 1. Forward pass and loss
outputs, _ = model(inputs)
loss = nn.functional.mse_loss(outputs, targets)
# 2. Calculate gradients
loss.backward()
# 3. Define the maximum gradient norm threshold and clip gradients in-place
max_grad_norm = 1.0
nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
# 4. Apply the optimizer step with the clipped gradients
optimizer.step()
Simple PyTorch example demonstrating gradient clipping. After the gradients are computed, clip_grad_norm_ scales them down in-place if their overall norm exceeds max_grad_norm.
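PyTorch also offers an element-wise alternative, clip_grad_value_, which caps each gradient component at plus or minus clip_value rather than rescaling the whole gradient vector; unlike norm clipping, this can change the gradient's direction. The model and the 0.5 threshold below are illustrative assumptions.

import torch
import torch.nn as nn

# Element-wise value clipping (illustrative model and threshold).
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
out, _ = model(torch.randn(2, 30, 8))
out.pow(2).mean().backward()

nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)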
These gradient issues severely limit the practical effectiveness of simple RNNs on tasks that require modeling long sequences or capturing dependencies spanning many steps, for instance relating a word to context that appeared much earlier in a document, or connecting distant events in a long time series.
The inability to reliably learn these long-range dependencies means simple RNNs often perform poorly compared to models designed to overcome these gradient propagation challenges. This sets the stage for the development of LSTMs and GRUs, which incorporate gating mechanisms specifically designed to better control the flow of information and gradients through time, as we will see in the following sections.