While the sequential nature of RNNs allows them to process sequences step-by-step, it creates significant hurdles during training, particularly for deep networks or long sequences. The primary mechanism for training RNNs is Backpropagation Through Time (BPTT), which essentially unfolds the recurrent network across the sequence length and applies the standard backpropagation algorithm. This process involves calculating gradients of the loss function with respect to the network's parameters by repeatedly applying the chain rule backward through each time step.
Consider the calculation of the gradient of the loss $L$ at the final time step $T$ with respect to the hidden state $h_k$ at some earlier time step $k$. Using the chain rule, this involves multiplying Jacobians across the intermediate time steps:
$$
\frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_T} \frac{\partial h_T}{\partial h_{T-1}} \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_{k+1}}{\partial h_k}
$$

This equation reveals the core issue. The term $\frac{\partial h_t}{\partial h_{t-1}}$ represents how the hidden state at time $t$ changes with respect to the hidden state at time $t-1$. This Jacobian depends on the recurrent weight matrix $W_{hh}$ and the derivative of the activation function used in the recurrent transition. The gradient signal from the final loss must propagate backward through a product of these Jacobian matrices.
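To make the structure of this Jacobian concrete, here is a minimal NumPy sketch for a vanilla tanh RNN cell, assuming the common update $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$, for which $\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - h_t^2)\, W_{hh}$. The dimensions and random weights are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

# Illustrative (assumed) parameters of a vanilla tanh RNN cell.
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

def step(h_prev, x):
    """One recurrent transition: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b)

def jacobian(h_t):
    """Jacobian dh_t/dh_{t-1} = diag(1 - h_t^2) @ W_hh for the tanh cell."""
    return np.diag(1.0 - h_t**2) @ W_hh

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
h_t = step(h_prev, x_t)

J_t = jacobian(h_t)
print("largest singular value of J_t:", np.linalg.svd(J_t, compute_uv=False)[0])
```

The largest singular value of $J_t$ bounds how much a single backward step can stretch or shrink the gradient; the backward pass multiplies one such Jacobian per time step.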
If the norms (or more formally, singular values) of these Jacobian matrices $\frac{\partial h_t}{\partial h_{t-1}}$ are consistently less than 1, their product will shrink exponentially as it propagates backward through time ($T-k$ steps).
$$
\frac{\partial L}{\partial h_k} \approx \frac{\partial L}{\partial h_T} \prod_{t=k+1}^{T} J_t, \qquad \text{where } J_t = \frac{\partial h_t}{\partial h_{t-1}}
$$

If $\|J_t\| < 1$ on average, then $\left\| \prod_{t=k+1}^{T} J_t \right\|$ approaches zero very quickly as $T-k$ increases. This means the gradient signal effectively vanishes before reaching the earlier time steps.
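The sketch below illustrates this behavior under a simplifying assumption: each per-step Jacobian is modeled as a scaled orthogonal matrix with a fixed spectral norm of 0.8 or 1.2, so the norm of the accumulated product is exactly $0.8^{n}$ or $1.2^{n}$ after $n$ steps (the second value previews the exploding case discussed below). The matrices are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8

def random_orthogonal(n):
    """Random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def backward_product_norm(jacobian_norm, num_steps):
    """Spectral norm of prod_t J_t when each J_t = jacobian_norm * Q_t (Q_t orthogonal)."""
    product = np.eye(hidden_size)
    for _ in range(num_steps):
        J_t = jacobian_norm * random_orthogonal(hidden_size)
        product = J_t @ product  # accumulate the backward product of Jacobians
    return np.linalg.norm(product, 2)

for steps in (5, 20, 50):
    print(f"{steps:>3} steps | norm 0.8 -> {backward_product_norm(0.8, steps):.2e}"
          f" | norm 1.2 -> {backward_product_norm(1.2, steps):.2e}")
```

With a per-step norm of 0.8 the product falls to roughly $10^{-5}$ after 50 steps, while a per-step norm of 1.2 grows into the thousands over the same horizon.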
Consequences:
- Parameters that influence the early time steps receive almost no gradient signal, so they are barely updated during training.
- The network fails to learn dependencies that span many time steps, because the error at the end of the sequence cannot be attributed to states far in the past.
The choice of activation functions like hyperbolic tangent (tanh) or sigmoid, commonly used in earlier RNNs, exacerbates this problem: the sigmoid's derivative is at most 0.25, and the tanh derivative is at most 1, reaching that maximum only at a single point (an input of zero). Away from zero, both derivatives shrink toward zero, so the factors entering the Jacobian product are typically well below 1.
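A quick, illustrative check of these derivative bounds (a small sketch, not from the original text):

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 10001)

# Derivatives of the two classic recurrent activations.
tanh_grad = 1.0 - np.tanh(x) ** 2                # tanh'(x) = 1 - tanh(x)^2
sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)         # sigma'(x) = sigma(x)(1 - sigma(x))

print("max tanh derivative    :", tanh_grad.max())     # ~1.0, attained only at x = 0
print("max sigmoid derivative :", sigmoid_grad.max())   # ~0.25, attained only at x = 0
```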
Conversely, if the norms of the Jacobian matrices $\frac{\partial h_t}{\partial h_{t-1}}$ are consistently greater than 1, their product can grow exponentially large as it propagates backward.
Consequences:
- Parameter updates become extremely large, causing the loss to oscillate wildly or diverge.
- Numerical overflow can produce NaN or infinite values, which halts training entirely.
While exploding gradients are often easier to detect and mitigate (e.g., using gradient clipping, where gradients exceeding a certain threshold are scaled down), they still pose a significant challenge to stable training.
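To illustrate the clipping idea mentioned above, here is a minimal NumPy sketch of clipping by global norm; the threshold and gradient values are made-up examples:

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        gradients = [g * scale for g in gradients]
    return gradients

# Made-up gradients for two parameter tensors, one of them very large.
grads = [np.array([0.3, -0.2]), np.array([[150.0, -90.0], [40.0, 10.0]])]
clipped = clip_by_global_norm(grads, max_norm=5.0)

print("norm before:", np.sqrt(sum(np.sum(g ** 2) for g in grads)))
print("norm after :", np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Clipping rescales the entire gradient vector rather than truncating individual entries, so the update direction is preserved while its magnitude is bounded.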
The chart shows how repeated multiplication of values slightly less than 1 (blue line, norm 0.8) leads to exponential decay (vanishing), while values slightly greater than 1 (red line, norm 1.2) lead to exponential growth (exploding) as gradients propagate backward through time steps. The y-axis is clipped to visualize both trends.
These gradient problems fundamentally limit the ability of simple RNN architectures to model sequences effectively, especially when long-term patterns are present. This limitation was a major driving force behind the development of more sophisticated recurrent units like LSTMs and GRUs, which we examine next.