Training deep neural networks often feels like navigating a complex, high-dimensional landscape blindfolded. Where you start your descent, determined by the initial values of the network's weights, profoundly influences the path optimization takes and whether you reach a good solution efficiently. Setting all weights to zero might seem simple, but it leads to a critical failure: symmetry. If all neurons in a layer start with the same weights, they will compute the same output and receive the same gradient during backpropagation. They remain identical, preventing the network from learning diverse features. Initializing with small random numbers breaks symmetry, but the scale matters immensely.
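The symmetry failure is easy to demonstrate. The following sketch, assuming PyTorch is available, initializes every parameter of a small network to the same constant and inspects the gradient of the first layer: every row comes out identical, so the hidden units can never diverge from one another.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny network where every weight and bias gets the same constant value.
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.1)

x = torch.randn(8, 4)              # small random batch
loss = net(x).pow(2).mean()        # arbitrary loss, just to produce gradients
loss.backward()

# All three rows are identical: the hidden neurons compute the same output,
# receive the same gradient, and therefore never differentiate from each other.
print(net[0].weight.grad)
```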
Consider forward propagation through a layer: $y = Wx + b$. If the weights $W$ are too large, the output $y$ can grow exponentially as it passes through successive layers, leading to exploding activations and gradients. Conversely, if the weights are too small, the output can shrink exponentially, causing vanishing activations and gradients. This is especially problematic in deep networks, where signals must traverse many layers. Vanishing gradients halt learning, as updates become infinitesimally small, while exploding gradients cause instability, often resulting in NaN values during training.
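This scale sensitivity is easy to observe. The sketch below, assuming NumPy and using a hypothetical helper `final_layer_std`, pushes a random vector through a stack of purely linear layers and reports how the signal's standard deviation behaves for weight scales that are too small, variance-preserving, and too large.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_layer_std(weight_std, n_layers=50, width=256):
    """Push a random vector through a stack of linear layers (no bias,
    no activation) and return the std of the signal at the last layer."""
    x = rng.standard_normal(width)
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(width, width))
        x = W @ x
    return x.std()

# Too small, variance-preserving (1/sqrt(width)), and too large:
for std in (0.01, 1 / np.sqrt(256), 0.1):
    print(f"weight std {std:.4f} -> final activation std {final_layer_std(std):.3e}")
```

With the small scale the signal collapses toward zero, with the large scale it blows up by many orders of magnitude, and only the $1/\sqrt{\text{width}}$ scale keeps it near its original size.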
The core challenge is to initialize weights such that the variance of the outputs of a layer remains roughly equal to the variance of its inputs. Similarly, during backpropagation, we want the variance of the gradients to remain stable as they flow backward through the network. Maintaining this signal variance helps ensure that information propagates effectively, preventing gradients from vanishing or exploding purely due to the initialization scale.
Fortunately, principled approaches exist to address this scaling problem. These methods select initial weights from distributions whose variance is carefully chosen based on the layer's dimensions.
Proposed by Glorot and Bengio in 2010, Xavier initialization is designed for layers using activation functions that are roughly linear and symmetric around zero, like the hyperbolic tangent (tanh), and to a lesser extent the logistic sigmoid (whose output is not zero-centered). The goal is to keep the variance of activations and backpropagated gradients approximately constant across layers.
Assuming the inputs $x$ and gradients $\frac{\partial L}{\partial y}$ have zero mean and are independent of each other and of the weights $W$, the variance of the layer's output $y_i = \sum_{j=1}^{n_{in}} W_{ij} x_j$ (ignoring the bias for the variance calculation) is:
$$\mathrm{Var}(y_i) = n_{in}\,\mathrm{Var}(W_{ij})\,\mathrm{Var}(x_j)$$
Similarly, for the gradient with respect to the layer's input, $\frac{\partial L}{\partial x_j}$, during backpropagation we have:
$$\mathrm{Var}\!\left(\frac{\partial L}{\partial x_j}\right) = n_{out}\,\mathrm{Var}(W_{ij})\,\mathrm{Var}\!\left(\frac{\partial L}{\partial y_i}\right)$$
where $n_{in}$ is the number of input neurons (fan-in) and $n_{out}$ is the number of output neurons (fan-out) for the layer.
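The forward relation is easy to verify numerically. A quick NumPy check, with arbitrary choices for the dimensions and variances, compares the empirical variance of $y = Wx$ against the prediction $n_{in}\,\mathrm{Var}(W_{ij})\,\mathrm{Var}(x_j)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary sizes and variances, chosen only for this check.
n_in, n_out = 512, 256
var_w, var_x = 0.01, 4.0

W = rng.normal(0.0, np.sqrt(var_w), size=(n_out, n_in))
x = rng.normal(0.0, np.sqrt(var_x), size=(n_in, 10_000))   # 10k sample vectors

y = W @ x
print("empirical Var(y):", y.var())                # close to the prediction below
print("predicted       :", n_in * var_w * var_x)   # 512 * 0.01 * 4.0 = 20.48
```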
To ensure $\mathrm{Var}(y_i) \approx \mathrm{Var}(x_j)$ and $\mathrm{Var}\!\left(\frac{\partial L}{\partial x_j}\right) \approx \mathrm{Var}\!\left(\frac{\partial L}{\partial y_i}\right)$, we need:
$$n_{in}\,\mathrm{Var}(W_{ij}) \approx 1 \quad \text{and} \quad n_{out}\,\mathrm{Var}(W_{ij}) \approx 1$$
A compromise is to average these conditions, leading to the desired variance for the weights:
$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{in} + n_{out}}$$
Based on this variance, two common initialization strategies emerge:

- Xavier normal: draw weights from a normal distribution $\mathcal{N}\!\left(0, \frac{2}{n_{in}+n_{out}}\right)$.
- Xavier uniform: draw weights from a uniform distribution $\mathcal{U}\!\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right)$, which has the same variance of $\frac{2}{n_{in}+n_{out}}$.
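As a minimal sketch, both variants can be implemented directly in NumPy (the function names `xavier_normal` and `xavier_uniform` here are only illustrative; frameworks such as PyTorch and Keras provide equivalents like `torch.nn.init.xavier_uniform_` and `GlorotUniform`).

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_normal(n_in, n_out):
    """Weights drawn from N(0, 2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out):
    """Weights drawn from U(-a, a) with a = sqrt(6 / (n_in + n_out)),
    whose variance a**2 / 3 equals 2 / (n_in + n_out)."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W = xavier_uniform(784, 256)
print(W.var(), 2.0 / (784 + 256))   # empirical variance vs. target
```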
Xavier initialization significantly improves training stability for networks with symmetric activations.
The assumptions made by Xavier initialization (especially the linearity assumption around zero) break down for Rectified Linear Units (ReLU) and its variants (Leaky ReLU, etc.). ReLU units output zero for negative inputs, effectively zeroing out roughly half of the activations (and their gradients) and roughly halving the output variance relative to the pre-activation.
He et al. (2015) proposed an initialization strategy specifically tailored for ReLU activations. Analyzing the variance propagation through a ReLU layer, they found that setting the variance of the weights to:
$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{in}}$$
helps maintain the variance of the activations during the forward pass. The derivation accounts for the fact that ReLU zeros out negative inputs, which roughly halves the variance passed on by each layer compared to a linear unit; the extra factor of 2 in the weight variance compensates for this loss.
The corresponding initialization strategies are:

- He normal: draw weights from a normal distribution $\mathcal{N}\!\left(0, \frac{2}{n_{in}}\right)$.
- He uniform: draw weights from a uniform distribution $\mathcal{U}\!\left(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\right)$, which has the same variance of $\frac{2}{n_{in}}$.
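The practical difference shows up once signals pass through many ReLU layers. The sketch below, assuming NumPy and treating $n_{in} = n_{out} = \text{width}$ so the Xavier variance reduces to $1/\text{width}$, compares the root-mean-square activation after a deep ReLU stack under the two scales.

```python
import numpy as np

rng = np.random.default_rng(7)

def relu_stack_rms(weight_std, n_layers=30, width=512):
    """Root-mean-square activation after a stack of ReLU layers."""
    x = rng.standard_normal(width)
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(width, width))
        x = np.maximum(W @ x, 0.0)      # ReLU
    return np.sqrt(np.mean(x ** 2))

# Treating n_in = n_out = width, so the Xavier variance reduces to 1/width.
print("Xavier scale:", relu_stack_rms(np.sqrt(1.0 / 512)))   # shrinks ~sqrt(2)x per layer
print("He scale    :", relu_stack_rms(np.sqrt(2.0 / 512)))   # stays near 1
```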
He initialization is generally the preferred method when using ReLU or its variants, which are common in modern deep learning architectures.
Most deep learning frameworks apply sensible variance-aware defaults automatically (for example, to Dense/Linear and Conv2D layers when the activation isn't specified or is ReLU). However, understanding the principles allows for informed customization when needed.

The effect of proper initialization on optimization is profound. A good starting point, provided by variance-aware initialization, places the model in a region of the loss landscape where gradients are more informative. This helps optimizers like SGD, Adam, or RMSprop make meaningful progress from the very first iterations, accelerating convergence and increasing the likelihood of finding a high-quality minimum. Poor initialization, in contrast, can lead to extremely slow learning or complete stagnation, regardless of the sophistication of the optimization algorithm used.
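As a concrete example of the customization mentioned above, here is a minimal PyTorch sketch (the layer sizes are arbitrary) that overrides the default Linear initialization with He initialization to match the network's ReLU activations.

```python
import torch.nn as nn

# Hypothetical layer sizes; the pattern is what matters.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

def init_weights(module):
    # Apply He (Kaiming) initialization to every Linear layer,
    # matching the ReLU activations used above.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model.apply(init_weights)
```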
Consider the conceptual difference in early training stages:
Comparison of training loss progression. Proper initialization (green) leads to steady convergence. Poor initialization might result in stalled training due to vanishing gradients (light red) or instability from exploding gradients (dark red, often leading to NaN/Inf values indicated by the break).
In summary, while advanced optimization algorithms handle many difficulties, they rely on a reasonable starting point. Variance-aware initialization techniques like Xavier and He are fundamental tools for enabling stable and efficient training of deep neural networks, directly addressing the challenges posed by signal propagation in deep architectures.