As discussed, the vanishing and exploding gradient problems pose significant hurdles when training Recurrent Neural Networks, especially over many time steps. Backpropagation Through Time (BPTT) involves repeated multiplication by the same recurrent weight matrix ($W_{hh}$ in our simple RNN formulation). If the magnitudes of the weights (more precisely, the eigenvalues or singular values of the weight matrix) are consistently less than 1, gradients shrink exponentially as they flow backward; if they are consistently greater than 1, gradients grow exponentially. The way we initialize the network's weights at the start of training therefore plays a significant role in setting the stage for stable gradient flow.
Initializing weights randomly is standard practice, but the scale and distribution of these random values matter greatly. Naive initialization strategies, like drawing weights from a uniform distribution with a very small range (e.g., $U[-0.01, 0.01]$) or a Gaussian distribution with a tiny standard deviation, often lead directly to vanishing gradients, particularly in deep networks or RNNs unrolled over many steps. Similarly, initializing with excessively large values can trigger exploding gradients right from the start.
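To see why scale matters, here is a small, self-contained NumPy sketch (an illustrative toy, not part of any framework API) that repeatedly applies a randomly initialized recurrent matrix to a hidden-state vector, mimicking the repeated matrix products of an unrolled RNN and its backward pass. The scale values are arbitrary choices for the demonstration.

```python
import numpy as np

def norm_after_steps(scale, hidden_size=64, steps=50, seed=0):
    """Apply a random recurrent matrix `steps` times and return the final vector norm."""
    rng = np.random.default_rng(seed)
    W_hh = rng.normal(0.0, scale, size=(hidden_size, hidden_size))
    h = rng.normal(size=hidden_size)
    for _ in range(steps):
        h = W_hh @ h  # repeated multiplication, as in an unrolled RNN
    return np.linalg.norm(h)

print(norm_after_steps(scale=0.01))  # norm collapses toward 0 (vanishing)
print(norm_after_steps(scale=1.0))   # norm blows up (exploding)
```

The same exponential shrinking or growth affects the gradients flowing backward through the unrolled network, which is why the initial scale of the recurrent weights is so consequential.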
Modern deep learning relies on more principled initialization techniques designed to maintain signal variance as information propagates forward through layers and gradient variance as errors propagate backward.
Glorot (Xavier) initialization, proposed by Glorot and Bengio (2010), aims to keep the variance of activations and gradients approximately constant across layers. It assumes a symmetric activation function such as the hyperbolic tangent (tanh), which is common in simple RNNs and in LSTMs/GRUs. The core idea is to scale the initial weights based on the number of input ($n_{in}$) and output ($n_{out}$) units for a given layer (or weight matrix), for example drawing from a Gaussian with variance $\frac{2}{n_{in} + n_{out}}$ or from a uniform distribution on $\left[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}\right]$.
For an RNN cell, $n_{in}$ would typically be the size of the input feature vector plus the size of the hidden state, and $n_{out}$ would be the size of the hidden state. Many deep learning frameworks use Glorot initialization as the default for dense and recurrent layers.
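As a concrete illustration, the following is a minimal NumPy sketch of Glorot-uniform initialization for the input-to-hidden and hidden-to-hidden matrices of a simple RNN cell; the helper name and dimensions are made up for this example.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Draw weights from U[-limit, limit] with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(42)
input_size, hidden_size = 100, 128

# Input-to-hidden weights: n_in = input_size, n_out = hidden_size
W_xh = glorot_uniform(input_size, hidden_size, rng)

# Hidden-to-hidden (recurrent) weights: square, n_in = n_out = hidden_size
W_hh = glorot_uniform(hidden_size, hidden_size, rng)
```

In Keras, for instance, recurrent layers pair a Glorot-initialized input kernel with an orthogonally initialized recurrent matrix by default, which anticipates the orthogonal strategy discussed below.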
He initialization, developed by He et al. (2015), is designed specifically for layers using the Rectified Linear Unit (ReLU) activation function and its variants (Leaky ReLU, ELU). Since ReLU is not symmetric and zeros out negative inputs, the variance dynamics differ from tanh; He initialization accounts for this by using a variance of $\frac{2}{n_{in}}$, roughly doubling the scale to compensate for the activations that ReLU sets to zero.
Notice that He initialization only considers the number of input units ($n_{in}$). While ReLU is less common as the primary recurrent activation within the cell (due to the potential for unbounded activations leading to instability), it might be used in feedforward connections within more complex cells or in output layers. If using ReLU-like activations heavily, He initialization is often preferred.
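For completeness, here is a matching NumPy sketch of He (normal) initialization, again with purely illustrative names and sizes.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """Draw weights from N(0, 2 / n_in), suited to ReLU-style activations."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(42)

# Example: a ReLU feedforward projection applied after the recurrent cell
W_proj = he_normal(128, 64, rng)
```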
Specific to RNNs, the recurrent weight matrix ($W_{hh}$) is repeatedly applied at each time step. Orthogonal matrices have the property that they preserve the norm (length) of the vectors they multiply. Initializing $W_{hh}$ to be (or be close to) an orthogonal matrix can theoretically help preserve the gradient norm during BPTT, mitigating both vanishing and exploding gradients.
Achieving perfect orthogonality during initialization often involves computing the Singular Value Decomposition (SVD) of an initial random matrix, setting the singular values to 1, and reconstructing the matrix. While computationally slightly more involved than Glorot or He, it can be particularly effective for the recurrent weights in vanilla RNNs or LSTMs/GRUs, especially when dealing with very long sequences. Frameworks offer this as a specific initializer option (e.g., tf.keras.initializers.Orthogonal or torch.nn.init.orthogonal_).
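The SVD-based construction described above can be sketched in a few lines of NumPy; in practice you would typically use the framework initializers named above, but the sketch makes the norm-preserving property easy to verify.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Build an orthogonal matrix: SVD of a random matrix, singular values set to 1."""
    A = rng.normal(size=(n, n))
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt  # discard the singular values, keep the rotation

rng = np.random.default_rng(0)
W_hh = orthogonal_init(128, rng)

# Orthogonality preserves vector norms: W_hh @ h has the same length as h
h = rng.normal(size=128)
print(np.linalg.norm(h), np.linalg.norm(W_hh @ h))  # the two norms match
```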
While sophisticated initialization helps establish a good starting point for training, it doesn't completely eliminate gradient problems, especially for very deep networks or long sequences. It acts as a crucial preventative measure, often used in conjunction with techniques like gradient clipping (discussed next) and architectures like LSTMs and GRUs (covered in subsequent chapters) that are inherently more robust to these issues. Choosing an appropriate initialization strategy is a fundamental step in configuring your RNN model for successful training.