Recall from our discussion of Backpropagation Through Time (BPTT) that training an RNN involves calculating the gradient of the loss function with respect to the network's weights. This gradient tells us how to adjust the weights to improve performance. BPTT achieves this by unrolling the network through time and applying the chain rule backward from the final time step to the first.
The vanishing gradient problem arises directly from the mechanics of this backward pass, particularly due to the repeated application of the chain rule over many time steps. To understand why, let's consider how the gradient information flows. The gradient of the loss $L$ with respect to the hidden state at an early time step $k$, denoted $\frac{\partial L}{\partial h_k}$, depends on the gradient at a later time step $t$, $\frac{\partial L}{\partial h_t}$. The connection is established through the chain rule, involving the derivatives of the hidden state transitions between steps $k$ and $t$:
$$\frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial h_k}$$

The term $\frac{\partial h_t}{\partial h_k}$ represents how a change in the hidden state at step $k$ affects the hidden state at step $t$. This itself is a product of intermediate Jacobians (matrices of partial derivatives) for each step from $k+1$ to $t$:
$$\frac{\partial h_t}{\partial h_k} = \frac{\partial h_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{k+1}}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

Each intermediate Jacobian $\frac{\partial h_i}{\partial h_{i-1}}$ depends on the recurrent weight matrix $W_{hh}$ and the derivative of the activation function used in the RNN cell. For a simple RNN with activation function $f$, the hidden state update is $h_i = f(a_i)$ with $a_i = W_{hh} h_{i-1} + W_{xh} x_i + b_h$, so the Jacobian is $\frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\big(f'(a_i)\big)\, W_{hh}$. Equivalently, during the backward pass each gradient vector is multiplied by $W_{hh}^T$ and scaled element-wise by the activation derivative $f'$.
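As a concrete illustration, the short NumPy sketch below builds the one-step Jacobian for a simple tanh RNN and multiplies those Jacobians across a window of time steps. The dimensions, random inputs, and the small initialization scale are illustrative assumptions (chosen so the spectral norm of $W_{hh}$ stays below 1), not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, T = 16, 8, 50     # illustrative sizes, not from the text

# Simple RNN parameters. The small weight scale keeps the spectral norm of W_hh
# below 1, which is exactly the regime that produces vanishing gradients.
W_hh = rng.normal(scale=0.3 / np.sqrt(hidden_size), size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=1.0 / np.sqrt(input_size), size=(hidden_size, input_size))
b_h = np.zeros(hidden_size)

def step(h_prev, x):
    """One forward step; returns the new state and the Jacobian dh_i/dh_{i-1}."""
    a = W_hh @ h_prev + W_xh @ x + b_h
    h = np.tanh(a)
    # d tanh(a)/da = 1 - tanh(a)^2, applied element-wise, then chained through W_hh
    jacobian = np.diag(1.0 - h**2) @ W_hh
    return h, jacobian

h = np.zeros(hidden_size)
product = np.eye(hidden_size)              # accumulates the product dh_t/dh_k
for _ in range(T):
    h, J = step(h, rng.normal(size=input_size))
    product = J @ product

# Spectral norm of the accumulated Jacobian product: how much a change at step k
# can still influence the hidden state T steps later.
print(f"||dh_t/dh_k|| after {T} steps: {np.linalg.norm(product, 2):.2e}")
```

With the weight scale used here, the printed norm is already tiny after 50 steps, which is the behavior the next paragraphs analyze.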
Now, consider what happens if the "magnitude" (more formally, the largest singular value or spectral radius) of these Jacobian matrices is consistently less than 1. When we multiply these matrices together many times, as required when the time gap $t - k$ is large, the overall magnitude of the resulting product $\frac{\partial h_t}{\partial h_k}$ shrinks exponentially.
Think of it like repeatedly multiplying a number by 0.9: the value rapidly approaches zero (after 50 multiplications, $0.9^{50} \approx 0.005$).
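The same effect can be checked directly for matrices. The snippet below rescales a random matrix so its largest singular value is exactly 0.9 (a stand-in for a Jacobian, chosen purely for illustration) and shows that the norm of its repeated product decays at least as fast as $0.9^n$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Jacobian stand-in: a random matrix rescaled so its largest
# singular value (spectral norm) is exactly 0.9, mirroring the 0.9 analogy.
M = rng.normal(size=(16, 16))
M *= 0.9 / np.linalg.norm(M, 2)

product = np.eye(16)
for n in range(1, 51):
    product = M @ product
    if n % 10 == 0:
        # By submultiplicativity, ||M^n|| <= 0.9**n, so the decay is geometric.
        print(f"n={n:2d}  ||M^n||_2={np.linalg.norm(product, 2):.2e}  0.9^n={0.9**n:.2e}")
```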
Activation functions like the hyperbolic tangent (tanh) or the sigmoid function, commonly used in early RNNs, have derivatives that are strictly less than 1 for most inputs (tanh's derivative satisfies $\tanh'(x) \le 1$, and the sigmoid's derivative satisfies $\sigma'(x) \le 0.25$). When these small derivatives are multiplied repeatedly during the backward pass through many time steps, potentially combined with recurrent weight matrices $W_{hh}$ whose norms might also be less than 1 (perhaps due to initialization or regularization), the Jacobians $\frac{\partial h_i}{\partial h_{i-1}}$ often have magnitudes less than 1.
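Those bounds are easy to verify numerically; the few lines below evaluate both derivatives on a grid of inputs. This is just a verification sketch, not part of any training code.

```python
import numpy as np

x = np.linspace(-10, 10, 10001)

tanh_grad = 1.0 - np.tanh(x)**2            # d/dx tanh(x), maximum 1 at x = 0
sig = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sig * (1.0 - sig)           # d/dx sigmoid(x), maximum 0.25 at x = 0

print("max tanh' :", tanh_grad.max())      # -> 1.0
print("max sigmoid':", sigmoid_grad.max()) # -> 0.25
```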
Illustration of gradient flow during BPTT. The gradient signal (red arrows, thickness indicating magnitude) diminishes as it propagates backward through time steps (from right to left). This decay is caused by repeated multiplication by Jacobian terms, often having magnitudes less than 1 due to activation function derivatives ($f'$) and weight matrices ($W_{hh}$).
The practical consequence is severe: the gradient signal originating from errors at late time steps becomes incredibly small, or "vanishes", by the time it reaches the network layers corresponding to early time steps. These near-zero gradients mean that the weights ($W_{hh}$, $W_{xh}$, $b_h$) responsible for processing the sequence at those early steps receive almost no update signal based on long-term outcomes.
Essentially, the network becomes unable to learn correlations between events separated by long intervals in the sequence. If correctly predicting the sentiment at the end of a long paragraph depends critically on a word used near the beginning, a simple RNN suffering from vanishing gradients will likely fail to capture this dependency. The initial word's influence on the final prediction error gets lost as the gradient signal fizzles out on its way back. This significantly undermines the primary motivation for using recurrent networks: to model temporal dependencies, including long-range ones.
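The sketch below makes this concrete. It runs a small tanh RNN forward over a sequence, backpropagates a loss defined only on the final hidden state, and records the gradient norm $\lVert \partial L / \partial h_k \rVert$ at each step on the way back. All sizes, initialization scales, and the squared-error loss are illustrative assumptions chosen so the vanishing effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(42)
hidden_size, input_size, T = 32, 4, 60     # illustrative dimensions

W_hh = rng.normal(scale=0.4 / np.sqrt(hidden_size), size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=1.0 / np.sqrt(input_size), size=(hidden_size, input_size))
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(T, input_size))
target = rng.normal(size=hidden_size)

# Forward pass: store the hidden states for the backward pass.
hs = [np.zeros(hidden_size)]
for x in xs:
    a = W_hh @ hs[-1] + W_xh @ x + b_h
    hs.append(np.tanh(a))

# Illustrative loss on the final state only: L = 0.5 * ||h_T - target||^2
grad_h = hs[-1] - target                   # dL/dh_T

# Backward pass (BPTT): dL/dh_{i-1} = W_hh^T @ (f'(a_i) * dL/dh_i)
grad_norms = []
for i in range(T, 0, -1):
    grad_norms.append(np.linalg.norm(grad_h))
    local = (1.0 - hs[i]**2) * grad_h      # element-wise tanh derivative
    grad_h = W_hh.T @ local

# grad_norms[0] is at the final step, grad_norms[-1] at the earliest step.
for steps_back, g in enumerate(grad_norms):
    if steps_back % 10 == 0:
        print(f"{steps_back:2d} steps back from the loss: ||dL/dh|| = {g:.2e}")
```

Running this, the gradient norm drops by several orders of magnitude as it travels back toward the start of the sequence, so whatever happened at the earliest steps contributes essentially nothing to the weight updates driven by the final-step error.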
Understanding this vanishing gradient phenomenon is fundamental because it directly motivated the development of more sophisticated recurrent architectures, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), which we will explore in subsequent chapters. These architectures incorporate specific mechanisms, often called "gates", designed to regulate information flow and allow gradients to propagate more effectively over long durations, mitigating the vanishing gradient problem.