While the vanishing gradient problem causes learning signals to fade away over long sequences, its counterpart, the exploding gradient problem, presents a different kind of numerical instability during the training of Recurrent Neural Networks (RNNs). Instead of diminishing to zero, gradients can grow extremely large, sometimes resulting in values that overflow standard numerical representations, leading to NaN (Not a Number) outputs and a complete breakdown of the training process.
The mechanism behind exploding gradients is closely related to the process of Backpropagation Through Time (BPTT) and the repeated application of the recurrent weight matrix. Recall that BPTT involves calculating gradients by propagating the error signal backward through the unrolled network, step by step.
Consider the update rule for the hidden state $h_t$ in a simple RNN:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

When calculating the gradient of the loss $L$ with respect to a hidden state $h_k$ far back in time, BPTT involves computing terms like $\frac{\partial h_t}{\partial h_{t-1}}$, $\frac{\partial h_{t-1}}{\partial h_{t-2}}$, and so on, back to $\frac{\partial h_{k+1}}{\partial h_k}$. Each of these partial derivatives involves multiplying by the recurrent weight matrix $W_{hh}$ (and the derivative of the activation function).
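As a concrete sketch, for the tanh RNN above the single-step Jacobian is $\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - h_t^2)\, W_{hh}$. The dimensions and weight scales below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for the sketch, not from the text)
hidden_size, input_size = 4, 3
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One forward step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def step_jacobian(h_t):
    """Jacobian dh_t/dh_{t-1} = diag(1 - h_t**2) @ W_hh for the tanh RNN."""
    return np.diag(1.0 - h_t**2) @ W_hh

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
h_t = rnn_step(x_t, h_prev)
J = step_jacobian(h_t)
print(J.shape)  # (4, 4)
```

BPTT multiplies one such Jacobian per time step, which is exactly where repeated scaling by $W_{hh}$ enters.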
The gradient contribution from a later time step $t$ to an earlier time step $k$ ($k < t$) involves a product of Jacobian matrices:

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

Each term $\frac{\partial h_i}{\partial h_{i-1}}$ is approximately proportional to $W_{hh}$. If the magnitudes of the leading eigenvalues (or, more generally, singular values) of the recurrent weight matrix $W_{hh}$ are consistently greater than 1, this repeated multiplication across many time steps ($t - k$) can cause the resulting gradient values to increase exponentially.
Think of it like compound interest: if you repeatedly multiply a value by a factor slightly larger than 1, it grows very quickly. In BPTT, the gradient components associated with Whh are repeatedly multiplied. If these components tend to amplify the signal (magnitude > 1), the overall gradient can explode.
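This growth is easy to demonstrate numerically. The toy matrix below is a scaled rotation, so every singular value equals the chosen factor and the spectral norm of the $t$-fold product is exactly $\text{factor}^t$; real $W_{hh}$ Jacobians also include the tanh derivative, which this sketch deliberately ignores:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Scaled rotation: all singular values of W equal `factor` (toy setup).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

norms = {}
for factor in (1.2, 0.9):
    W = factor * Q
    P = np.eye(n)
    for _ in range(50):
        P = W @ P  # the repeated multiplication BPTT performs over 50 steps
    norms[factor] = np.linalg.norm(P, 2)  # spectral norm of the product

# factor 1.2 explodes (~9.1e3); factor 0.9 vanishes (~5.2e-3)
print(norms)
```

The same mechanism that explodes gradients for factors above 1 makes them vanish for factors below 1, which is why the two problems are two faces of the same repeated multiplication.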
Exploding gradients manifest in several disruptive ways, most visibly as NaN values appearing in the loss or weights.

*Figure: The gradient norm remains relatively stable during initial training iterations, then suddenly increases dramatically, indicating an exploding gradient event. Note the logarithmic scale on the y-axis, used to accommodate the large spike.*
Unlike vanishing gradients, which silently hinder the learning of long-range dependencies, exploding gradients often cause more obvious and catastrophic training failures. Fortunately, their dramatic effects make them easier to detect (e.g., by monitoring gradient norms or observing sudden NaN losses). We will explore techniques like gradient clipping in the next section, which provide effective ways to manage this instability.
© 2025 ApX Machine Learning