As introduced, initializing the weights of a deep network requires careful consideration. Simply drawing weights from a standard normal or uniform distribution often leads to unstable training dynamics. Gradients can shrink exponentially as they propagate backward (vanishing gradients), or grow exponentially (exploding gradients), preventing the model from learning effectively.
The Xavier initialization method, proposed by Xavier Glorot and Yoshua Bengio in 2010, directly addresses this by aiming to keep the variance of activations and gradients roughly constant across layers. Maintaining consistent variance helps ensure that the signal (activations during the forward pass and gradients during the backward pass) doesn't vanish or explode as it traverses the network's depth.
Consider a single linear layer in a neural network performing the operation $y = Wx + b$, where $W$ is the weight matrix, $x$ is the input vector, $b$ is the bias vector, and $y$ is the output before the activation function. Xavier initialization analyzes how the variance propagates through this layer.
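To see why the weight scale matters, here is a minimal sketch (the batch size, width, and depth are illustrative, and the layers are bias-free with no activation) showing how the activation scale explodes or collapses through a stack of such layers when the weights are drawn naively from a Gaussian:

```python
import torch

torch.manual_seed(0)
n, depth = 512, 20
x = torch.randn(1024, n)  # batch of unit-variance inputs

for w_std in (1.0, 0.01):                  # weight scale too large vs. too small
    h = x
    for _ in range(depth):
        W = torch.randn(n, n) * w_std      # naive Gaussian initialization at this scale
        h = h @ W.T                        # linear layer (bias and activation omitted)
    print(f"weight std {w_std}: activation std after {depth} layers = {h.std().item():.2e}")
```

With a standard deviation of 1 the activations blow up by roughly $\sqrt{n}$ per layer; with 0.01 they collapse toward zero. Xavier initialization picks the scale that sits between these two failure modes.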
The core assumptions are:
- The weights $W_{ij}$ are independent and identically distributed with zero mean.
- The inputs $x_j$ are independent and identically distributed with zero mean, and are independent of the weights.
- The activation function is approximately linear around zero with $f'(0) \approx 1$, so the analysis of the pre-activations carries over to the activations.
- Biases are initialized to zero and contribute no variance.
Under these assumptions, the variance of the layer's output $y_i$ (for a single neuron $i$) can be related to the input variance and the weight variance:
$$\mathrm{Var}(y_i) = n_{\text{in}} \, \mathrm{Var}(W_{ij}) \, \mathrm{Var}(x_j)$$
Here, $n_{\text{in}}$ represents the number of input connections to the neuron (the "fan-in"). To keep the output variance $\mathrm{Var}(y_i)$ equal to the input variance $\mathrm{Var}(x_j)$, we need:
$$n_{\text{in}} \, \mathrm{Var}(W_{ij}) = 1 \quad \Longrightarrow \quad \mathrm{Var}(W_{ij}) = \frac{1}{n_{\text{in}}}$$
This condition ensures that the variance of the activations doesn't significantly change during the forward pass.
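A quick numerical check of this forward-pass condition (a sketch with arbitrary example dimensions; the exact numbers vary from run to run):

```python
import torch

torch.manual_seed(0)
n_in, n_out = 512, 256

x = torch.randn(10_000, n_in)             # inputs with unit variance
W = torch.randn(n_out, n_in) / n_in**0.5  # Var(W_ij) = 1 / n_in
y = x @ W.T                               # pre-activations

print(f"Input variance:  {x.var().item():.3f}")   # ~1.0
print(f"Output variance: {y.var().item():.3f}")   # ~1.0 as well
```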
Similarly, considering the backward pass, we analyze the variance of the gradients. The variance of the gradient with respect to the previous layer's activations depends on the fan-out $n_{\text{out}}$ (the number of output units of the current layer). To maintain the gradient variance, the condition becomes:
$$n_{\text{out}} \, \mathrm{Var}(W_{ij}) = 1 \quad \Longrightarrow \quad \mathrm{Var}(W_{ij}) = \frac{1}{n_{\text{out}}}$$
Xavier initialization seeks a compromise between these two conditions (maintaining activation variance in the forward pass and gradient variance in the backward pass) by averaging the two fan values in the denominator:
$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$
This variance forms the basis for the initialization strategy.
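For a concrete sense of scale, using the dimensions that appear in the code example later in this section ($n_{\text{in}} = 512$, $n_{\text{out}} = 256$):

$$\mathrm{Var}(W_{ij}) = \frac{2}{512 + 256} = \frac{2}{768} \approx 0.0026, \qquad \text{std} = \sqrt{\tfrac{2}{768}} \approx 0.051$$

This is the theoretical standard deviation the code below prints for comparison.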
Weights are typically drawn from either a uniform or a normal distribution scaled according to this derived variance.
Xavier Uniform Initialization: Weights are sampled from a uniform distribution $U[-a, a]$, where $a$ is chosen so that the variance equals $\frac{2}{n_{\text{in}} + n_{\text{out}}}$. The variance of $U[-a, a]$ is $\frac{(2a)^2}{12} = \frac{a^2}{3}$. Setting this equal to the target variance:
$$\frac{a^2}{3} = \frac{2}{n_{\text{in}} + n_{\text{out}}} \quad \Longrightarrow \quad a^2 = \frac{6}{n_{\text{in}} + n_{\text{out}}} \quad \Longrightarrow \quad a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}$$
So, the weights are initialized using:
$$W \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$$
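With the same example dimensions ($n_{\text{in}} = 512$, $n_{\text{out}} = 256$):

$$a = \sqrt{\frac{6}{512 + 256}} = \sqrt{\frac{6}{768}} \approx 0.088$$

so each weight is drawn uniformly from roughly $[-0.088, 0.088]$.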
Xavier Normal Initialization: Weights are sampled from a normal distribution $\mathcal{N}(0, \sigma^2)$ with mean 0 and the target variance
$$\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$
So, the weights are initialized using:
$$W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$
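Both sampling rules are easy to implement directly. A minimal sketch (using plain tensor operations rather than `nn.init`, with the same example dimensions) that draws each variant and compares the empirical standard deviation to the target $\sqrt{2 / (n_{\text{in}} + n_{\text{out}})}$:

```python
import torch

torch.manual_seed(0)
n_in, n_out = 512, 256

target_std = (2.0 / (n_in + n_out)) ** 0.5   # sqrt(2 / (n_in + n_out))

# Xavier uniform: U[-a, a] with a = sqrt(6 / (n_in + n_out))
a = (6.0 / (n_in + n_out)) ** 0.5
w_uniform = torch.empty(n_out, n_in).uniform_(-a, a)

# Xavier normal: N(0, 2 / (n_in + n_out))
w_normal = torch.randn(n_out, n_in) * target_std

print(f"Target std:              {target_std:.4f}")
print(f"Empirical std (uniform): {w_uniform.std().item():.4f}")
print(f"Empirical std (normal):  {w_normal.std().item():.4f}")
```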
In both cases, $n_{\text{in}}$ is the number of inputs to the layer (fan-in) and $n_{\text{out}}$ is the number of outputs from the layer (fan-out). Biases are typically initialized to zero.
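The fan values follow from the layer's weight shape: for `nn.Linear`, fan-in is `in_features` and fan-out is `out_features`; for convolutions, each channel count is multiplied by the kernel's receptive-field size. A small sketch of that calculation (written out here for illustration; PyTorch's `nn.init` functions compute the fans from the weight tensor automatically):

```python
import torch.nn as nn

def fan_in_and_fan_out(weight):
    """Compute fan-in / fan-out from a weight tensor's shape."""
    fan_out, fan_in = weight.shape[0], weight.shape[1]
    if weight.dim() > 2:                        # conv weights: (out_ch, in_ch, *kernel)
        receptive_field = weight[0][0].numel()  # kernel height * width (* depth)
        fan_in *= receptive_field
        fan_out *= receptive_field
    return fan_in, fan_out

print(fan_in_and_fan_out(nn.Linear(512, 256).weight))               # (512, 256)
print(fan_in_and_fan_out(nn.Conv2d(3, 64, kernel_size=3).weight))   # (27, 576)
```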
Xavier initialization works well for layers followed by activation functions that are symmetric and roughly linear around zero, such as tanh and the logistic sigmoid. For tanh the assumption $f'(0) \approx 1$ holds exactly; the logistic sigmoid is symmetric about its inflection point but has $f'(0) = 0.25$, so the match is looser. However, the scheme is poorly suited to Rectified Linear Units (ReLU) and their variants, which are asymmetric and have a derivative of 0 for negative inputs. This limitation motivated the development of Kaiming initialization, which we will discuss next.
In PyTorch, you can easily apply Xavier initialization using the `torch.nn.init` module.
```python
import torch
import torch.nn as nn

# Define example layer dimensions
fan_in, fan_out = 512, 256  # Example dimensions for a linear layer

# Create a linear layer
linear_layer = nn.Linear(fan_in, fan_out, bias=True)

# --- Xavier Uniform Initialization ---
# Apply Xavier uniform initialization to the weights
nn.init.xavier_uniform_(linear_layer.weight)

# Typically initialize biases to zero
if linear_layer.bias is not None:
    nn.init.constant_(linear_layer.bias, 0)

print(f"Layer: Linear({fan_in}, {fan_out})")
print("Initialization: Xavier Uniform")
print(f"Weight Mean: {linear_layer.weight.mean():.4f}, Std: {linear_layer.weight.std():.4f}")

# Check theoretical standard deviation for uniform
# Var = (sqrt(6/(fan_in+fan_out)))^2 / 3 = 2 / (fan_in+fan_out)
# Std = sqrt(2 / (fan_in+fan_out))
theoretical_std_uniform = (2.0 / (fan_in + fan_out)) ** 0.5
print(f"Theoretical Std Dev (Uniform based): {theoretical_std_uniform:.4f}\n")

# --- Xavier Normal Initialization ---
# Re-create the layer or re-initialize
linear_layer_norm = nn.Linear(fan_in, fan_out, bias=True)

# Apply Xavier normal initialization to the weights
nn.init.xavier_normal_(linear_layer_norm.weight)

# Initialize biases to zero
if linear_layer_norm.bias is not None:
    nn.init.constant_(linear_layer_norm.bias, 0)

print("Initialization: Xavier Normal")
print(
    f"Weight Mean: {linear_layer_norm.weight.mean():.4f}, "
    f"Std: {linear_layer_norm.weight.std():.4f}"
)

# Theoretical standard deviation for normal is sqrt(2 / (fan_in+fan_out))
theoretical_std_normal = (2.0 / (fan_in + fan_out)) ** 0.5
print(f"Theoretical Std Dev (Normal): {theoretical_std_normal:.4f}")
```
Example output showing weight statistics after Xavier initialization. The actual standard deviation of the initialized weights should be close to the theoretical value derived from the fan-in and fan-out.
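In a full model, you typically apply the initializer to every relevant layer at once rather than configuring each layer by hand. A minimal sketch using `Module.apply` (the three-layer tanh network here is purely illustrative):

```python
import torch.nn as nn

def init_xavier(module):
    # Apply Xavier uniform to every linear layer's weights; zero the biases.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0)

# Illustrative feed-forward network with tanh activations
model = nn.Sequential(
    nn.Linear(784, 512), nn.Tanh(),
    nn.Linear(512, 256), nn.Tanh(),
    nn.Linear(256, 10),
)
model.apply(init_xavier)  # recursively visits every submodule
```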
Xavier initialization provides a principled way to set initial weights, promoting more stable signal propagation in deep networks, particularly those using symmetric activation functions. While it represented a significant improvement over naive random initialization, its assumptions don't perfectly match all modern network architectures, especially those heavily reliant on ReLU activations. This leads us to the next technique, Kaiming initialization, designed specifically for such cases.