Training recurrent networks typically involves an algorithm called Backpropagation Through Time (BPTT). You can think of this as "unrolling" the recurrent network through its sequence steps. This creates what looks like a very deep feedforward network, but with a twist: the weights are shared across all the time steps (layers in the unrolled view). When calculating gradients, the error signal must flow backward through this entire unrolled structure, from the final output all the way back to the initial steps.
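To make the unrolled view concrete, here is a minimal sketch of a vanilla tanh RNN applied step by step to a toy sequence. The cell form, names, and dimensions are illustrative assumptions, not taken from this text; the point is that the same weight matrices are reused at every time step.

```python
# Minimal sketch of "unrolling" a vanilla RNN over a sequence.
# All names and dimensions here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 10

W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (shared)
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                    # initial hidden state
hidden_states = []

for x_t in xs:                              # one "layer" per time step in the unrolled view
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)  # the exact same parameters are applied at every step
    hidden_states.append(h)

# BPTT now applies the chain rule backward through all `seq_len` steps,
# accumulating gradients for the single shared set of parameters.
```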
Propagating gradients back through potentially many time steps in BPTT introduces significant challenges. The two most prominent issues are the vanishing gradient problem and the exploding gradient problem.
The core idea behind gradient calculation is the chain rule from calculus. In BPTT, the gradient of the loss function with respect to parameters or hidden states at earlier time steps (e.g., $h_t$) involves multiplying many Jacobian matrices (matrices of partial derivatives) together, one for each step the gradient flows backward through:

$$
\frac{\partial L}{\partial h_t} \approx \frac{\partial L}{\partial h_T} \left( \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} \right)
$$

Each term $\frac{\partial h_k}{\partial h_{k-1}}$ represents how the hidden state at step $k$ changes with respect to the hidden state at step $k-1$. This depends on the recurrent weights and the derivatives of the activation functions used.
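As a concrete (though assumed) instance, for the common vanilla RNN update $h_k = \tanh(W_{hh} h_{k-1} + W_{xh} x_k + b)$, which is not spelled out above, each factor works out to

$$
\frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}\!\left(1 - h_k \odot h_k\right) W_{hh},
$$

so the backward pass repeatedly multiplies by the same recurrent matrix $W_{hh}$, scaled by the $\tanh$ derivative, which never exceeds 1.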
If the values in these Jacobian matrices tend to be small (specifically, if their norms or leading eigenvalues are consistently less than 1), this long product of matrices will shrink exponentially as the gradient propagates further back in time (as $T - t$ increases). The gradients effectively "vanish" to near-zero values before they reach the layers corresponding to much earlier time steps.
What's the consequence? The network becomes incapable of learning relationships between elements that are far apart in the sequence. If an output at time $T$ depends heavily on an input or state at a much earlier time $t$, the vanishing gradient prevents the error signal from meaningfully informing the weights connected to that early input. The network essentially forgets the beginning of the sequence by the time it processes the end. This severely limits the RNN's ability to model long-range dependencies, which is often the primary goal when dealing with sequential data like long sentences or extended time series.
The opposite problem occurs if the values in the Jacobian matrices $\frac{\partial h_k}{\partial h_{k-1}}$ are consistently large (norms or leading eigenvalues greater than 1). In this scenario, the long product of matrices grows exponentially as gradients propagate backward through time. The gradients "explode," reaching extremely large numerical values.
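The toy sketch below (illustrative code, not from this text) makes both regimes concrete: it repeatedly multiplies a stand-in gradient vector by the transpose of a recurrent matrix whose spectral radius is set just below or just above 1, ignoring the activation derivative for simplicity.

```python
# Toy illustration of vanishing vs. exploding gradients under a long
# product of identical Jacobian-like matrices. Dimensions and scales
# are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, steps = 8, 100

base = rng.normal(size=(hidden_dim, hidden_dim))
spectral_radius = max(abs(np.linalg.eigvals(base)))

for scale, label in [(0.9, "vanishing (radius < 1)"),
                     (1.1, "exploding (radius > 1)")]:
    W_hh = base * (scale / spectral_radius)   # rescale so the spectral radius equals `scale`
    grad = np.ones(hidden_dim)                # stand-in for dL/dh_T
    for _ in range(steps):
        grad = W_hh.T @ grad                  # one backward step through time
    print(f"{label}: gradient norm after {steps} steps = {np.linalg.norm(grad):.3e}")
```

Running this shows the norm shrinking toward zero in the first case and growing by orders of magnitude in the second, mirroring the two regimes in the illustration below.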
Conceptual illustration of how gradient magnitudes might change as they propagate backward through time steps in BPTT. Vanishing gradients shrink towards zero, while exploding gradients grow rapidly. (Note: Log scale on y-axis).
What does this mean for training? Exploding gradients result in excessively large updates to the network's weights during optimization. This destabilizes the learning process, often causing the model's parameters to oscillate wildly or diverge entirely. You might observe the training loss suddenly shooting up or becoming NaN (Not a Number) as values exceed the limits of standard floating-point representations.
Exploding gradients are generally easier to detect and mitigate than vanishing gradients. A common technique is gradient clipping, where the gradients are manually scaled down if their overall norm exceeds a predefined threshold before the weight update step. This prevents the huge updates that destabilize training.
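As a sketch of the idea, the hypothetical helper below (not from this text) rescales a set of gradients whenever their combined L2 norm exceeds a threshold; deep learning frameworks provide built-in equivalents, for example PyTorch's `torch.nn.utils.clip_grad_norm_`.

```python
# Minimal sketch of gradient clipping by global norm. The helper name and
# the example gradients are assumptions for illustration only.
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradient arrays in `grads` if their global L2 norm exceeds `max_norm`."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)   # shrink all gradients by the same factor
        grads = [g * scale for g in grads]
    return grads

# Example: one very large gradient is scaled down before the weight update,
# while small gradients pass through essentially unchanged.
grads = [np.array([300.0, -400.0]), np.array([0.01, 0.02])]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print([float(np.linalg.norm(g)) for g in clipped])
```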
Although clipping helps with explosions, the vanishing gradient problem remained a more fundamental obstacle for simple RNNs attempting to learn long temporal patterns. These inherent difficulties in training basic RNNs spurred the development of more sophisticated recurrent architectures, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These models incorporate gating mechanisms specifically designed to better control the flow of information and gradients over extended sequences; we provide a conceptual overview of them next.