While the vanishing gradient problem causes learning signals to fade away over long sequences, its counterpart, the exploding gradient problem, presents a different kind of numerical instability during the training of Recurrent Neural Networks (RNNs). Instead of diminishing to zero, gradients can grow extremely large, sometimes resulting in values that overflow standard numerical representations, leading to NaN (Not a Number) outputs and a complete breakdown of the training process.
The mechanism behind exploding gradients is closely related to the process of Backpropagation Through Time (BPTT) and the repeated application of the recurrent weight matrix. Recall that BPTT involves calculating gradients by propagating the error signal backward through the unrolled network, step by step.
Consider the update rule for the hidden state $h_t$ in a simple RNN:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

When calculating the gradient of the loss $L$ with respect to a hidden state $h_k$ far back in time, BPTT involves computing terms like $\frac{\partial h_t}{\partial h_{t-1}}$, $\frac{\partial h_{t-1}}{\partial h_{t-2}}$, and so on, back to $\frac{\partial h_{k+1}}{\partial h_k}$. Each of these partial derivatives involves multiplying by the recurrent weight matrix $W_{hh}$ (and the derivative of the activation function).
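As a concrete sketch, for the tanh RNN above the single-step Jacobian is $\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - h_t^2)\, W_{hh}$. The dimensions and weight scales below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for the sketch, not from the text)
hidden_size, input_size = 4, 3
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One forward step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def step_jacobian(h_t):
    """Jacobian dh_t/dh_{t-1} = diag(1 - h_t**2) @ W_hh for the tanh RNN."""
    return np.diag(1.0 - h_t**2) @ W_hh

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
h_t = rnn_step(x_t, h_prev)
J = step_jacobian(h_t)
print(J.shape)  # (4, 4)
```

BPTT multiplies one such Jacobian per time step, which is exactly where repeated scaling by $W_{hh}$ enters.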
The gradient contribution from a later time step $t$ to an earlier time step $k$ ($k < t$) involves a product of Jacobian matrices:

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

Each term $\frac{\partial h_i}{\partial h_{i-1}}$ is approximately proportional to $W_{hh}$. If the magnitudes of the leading eigenvalues (or, more generally, singular values) of the recurrent weight matrix $W_{hh}$ are consistently greater than 1, this repeated multiplication across many time steps ($t - k$) can cause the resulting gradient values to increase exponentially.
Think of it like compound interest: if you repeatedly multiply a value by a factor slightly larger than 1, it grows very quickly. In BPTT, the gradient components associated with Whh are repeatedly multiplied. If these components tend to amplify the signal (magnitude > 1), the overall gradient can explode.
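This growth is easy to demonstrate numerically. The toy matrix below is a scaled rotation, so every singular value equals the chosen factor and the spectral norm of the $t$-fold product is exactly $\text{factor}^t$; real $W_{hh}$ Jacobians also include the tanh derivative, which this sketch deliberately ignores:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Scaled rotation: all singular values of W equal `factor` (toy setup).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

norms = {}
for factor in (1.2, 0.9):
    W = factor * Q
    P = np.eye(n)
    for _ in range(50):
        P = W @ P  # the repeated multiplication BPTT performs over 50 steps
    norms[factor] = np.linalg.norm(P, 2)  # spectral norm of the product

# factor 1.2 explodes (~9.1e3); factor 0.9 vanishes (~5.2e-3)
print(norms)
```

The same mechanism that explodes gradients for factors above 1 makes them vanish for factors below 1, which is why the two problems are two faces of the same repeated multiplication.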
Exploding gradients manifest in several disruptive ways, most visibly as NaN values appearing in the loss or weights.

*Figure: The gradient norm remains relatively stable during initial training iterations, then suddenly increases dramatically, indicating an exploding gradient event. Note the logarithmic scale on the y-axis, used to accommodate the large spike.*
Unlike vanishing gradients, which silently hinder the learning of long-range dependencies, exploding gradients often cause more obvious and catastrophic training failures. Fortunately, their dramatic effects make them easier to detect (e.g., by monitoring gradient norms or observing sudden NaN losses). We will explore techniques like gradient clipping in the next section, which provide effective ways to manage this instability.
© 2025 ApX Machine Learning