Training deep neural networks often involves navigating treacherous optimization terrain. As gradients propagate backward through many layers, their magnitudes can undergo dramatic changes, leading to significant training difficulties. Two primary issues arise: exploding gradients, where gradient values become excessively large, and vanishing gradients, where they become infinitesimally small. Understanding and mitigating these problems is essential for successfully training deep models, particularly recurrent neural networks (RNNs).
Backpropagation computes gradients via the chain rule, multiplying derivatives layer by layer. In a deep network with $L$ layers, the gradient of the loss $J$ with respect to the first-layer parameters $\theta_1$ involves a product of $L$ Jacobian matrices (or scalars in simpler cases):

$$
\frac{\partial J}{\partial \theta_1} \propto \frac{\partial h_L}{\partial h_{L-1}} \, \frac{\partial h_{L-1}}{\partial h_{L-2}} \cdots \frac{\partial h_2}{\partial h_1} \, \frac{\partial h_1}{\partial \theta_1}
$$
Here, $h_i$ represents the activation of layer $i$. If the terms in this product (determined by the weights and the activation function derivatives) are consistently greater than 1, the overall gradient magnitude can grow exponentially, leading to exploding gradients. Conversely, if these terms are consistently less than 1, the gradient magnitude can shrink exponentially, resulting in vanishing gradients.
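To make the exponential effect concrete, here is a minimal sketch with purely illustrative numbers: it treats each layer as contributing a single constant factor and multiplies that factor across 50 layers.

```python
# Minimal sketch with illustrative values: the gradient reaching the first
# layer scales like a product of per-layer factors (weights times activation
# derivatives). A factor slightly above or below 1, repeated over many
# layers, explodes or vanishes.
num_layers = 50

for factor in (1.2, 0.8):
    gradient_scale = factor ** num_layers
    print(f"per-layer factor {factor}: scale after {num_layers} layers = {gradient_scale:.2e}")

# per-layer factor 1.2: scale after 50 layers = 9.10e+03  (exploding)
# per-layer factor 0.8: scale after 50 layers = 1.43e-05  (vanishing)
```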
When gradients become extremely small during backpropagation, the weight updates for the earlier layers (closer to the input) become negligible. This effectively stops these layers from learning meaningful representations.
While techniques like ReLU activation functions, residual connections, and careful initialization (discussed elsewhere) are primary defenses against vanishing gradients, they are not foolproof.
Exploding gradients manifest as extremely large gradient values. When these large gradients are used in the parameter update step, they can cause dramatic changes in the weights, potentially undoing previous learning or leading to numerical overflow (represented as NaN or Inf loss values).
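The following sketch, using purely illustrative values, shows how a single update driven by an exploded gradient can throw a weight far from its previous value and overflow the next loss computation.

```python
import numpy as np

# Minimal sketch with illustrative values: one SGD step driven by an exploded
# gradient moves the weight to a huge magnitude, and the next loss computation
# overflows float32 (NumPy emits an overflow warning and returns inf).
w = np.float32(0.5)        # a reasonable weight
grad = np.float32(1e22)    # an exploded gradient
lr = np.float32(0.01)

w = w - lr * grad          # catastrophic update: w is now roughly -1e20
loss = w * w               # a squared-error-style term overflows float32 -> inf
print(w, loss)             # approximately -1e+20 inf
```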
Gradient flow visualization. During backpropagation (red path), gradients are computed by multiplying local derivatives. Repeated multiplication can cause the final gradient (e.g., $\partial J / \partial \theta_1$) to become extremely large or small.
While vanishing gradients require architectural or initialization solutions, exploding gradients can often be managed directly using a technique called gradient clipping. The core idea is simple: if the overall magnitude (norm) or individual values of the gradients exceed a predefined threshold during backpropagation, they are rescaled or capped to stay within bounds. This prevents single large gradients from causing catastrophic updates.
This is the most common form of gradient clipping. It rescales the entire gradient vector $g$ (containing the gradients $\frac{\partial J}{\partial \theta}$ for all parameters) if its L2 norm $\|g\| = \sqrt{\sum_i g_i^2}$ exceeds a threshold $c$.

The rule is: if $\|g\| > c$, then $g \leftarrow \frac{c}{\|g\|} \, g$.
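As a sketch of how this rule translates to code, the following NumPy function (a hypothetical clip_by_global_norm helper, not a framework API) computes the global L2 norm over all parameter gradients and rescales them when that norm exceeds the threshold.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Hypothetical helper: rescale a list of gradient arrays so that their
    global L2 norm is at most max_norm (the clip-by-norm rule)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# A spiky gradient (norm 50) is rescaled to norm 5; its direction is preserved.
spiky = [np.array([30.0, 40.0])]
print(clip_by_global_norm(spiky, 5.0))                     # [array([3., 4.])]

# A small gradient (norm below the threshold) passes through unchanged.
print(clip_by_global_norm([np.array([0.3, 0.4])], 5.0))    # [array([0.3, 0.4])]
```

Note that rescaling the whole vector preserves the gradient's direction and only shrinks its length.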
An alternative approach is to clip each individual gradient component $g_i$ to lie within a specific range $[-c, c]$.

The rule, applied element-wise, is: $g_i \leftarrow \max(-c, \min(g_i, c))$.
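A corresponding sketch (again a hypothetical helper, not a framework API) applies the element-wise rule with NumPy's clip function.

```python
import numpy as np

def clip_by_value(grads, c):
    """Hypothetical helper: clip every gradient component to [-c, c]."""
    return [np.clip(g, -c, c) for g in grads]

# Only the components outside [-1, 1] are changed.
g = [np.array([0.5, -3.0, 10.0])]
print(clip_by_value(g, 1.0))   # [array([ 0.5, -1. ,  1. ])]
```

Unlike clipping by norm, element-wise clipping generally changes the direction of the overall gradient vector, which is one reason clip-by-norm is the more common choice.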
Gradient clipping is typically applied after computing the gradients for all parameters on a batch (or mini-batch) but before the optimizer (such as Adam or SGD) uses those gradients to update the weights. Most deep learning frameworks provide built-in functions for gradient clipping, most commonly clipping by norm.
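As one concrete illustration, here is a minimal sketch of a PyTorch training step; the model, data, learning rate, and threshold are placeholder assumptions, while torch.nn.utils.clip_grad_norm_ is PyTorch's built-in clip-by-norm function, called between the backward pass and the optimizer step.

```python
import torch
import torch.nn as nn

# Placeholder model, optimizer, and loss chosen only for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x, y, max_norm=5.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                                               # 1. compute gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # 2. clip by norm
    optimizer.step()                                              # 3. apply the update
    return loss.item()

# One step on random data, just to show the call order.
print(train_step(torch.randn(8, 10), torch.randn(8, 1)))
```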
Gradient L2 norm over training iterations. Without clipping (blue dashed line), occasional spikes (exploding gradients) occur. With clipping by norm at a threshold of 5.0 (red solid line), these spikes are capped, preventing instability while leaving smaller gradients untouched.
Exploding and vanishing gradients are significant hurdles in training deep networks, and both arise from the multiplicative nature of backpropagation.
Gradient clipping is a pragmatic tool, especially vital for RNNs, to ensure training stability. While it doesn't solve the underlying reasons for exploding gradients, it effectively prevents them from derailing the optimization process. Understanding both phenomena and their respective solutions is part of the toolkit required for effectively training complex deep learning models.