Once you've defined the architecture of your neural network, specifying the layers and their connections, the next important step before training can begin is setting the initial values for the weights (and biases) in each layer. You might wonder, why not just start them all at zero? Or perhaps random numbers? It turns out that the way you initialize these parameters can have a significant impact on how effectively your network trains. Poor initialization can lead to slow convergence or prevent the network from learning altogether.
Let's consider a couple of straightforward approaches and why they often fall short:
Zero Initialization: Initializing all weights to zero might seem like a clean slate, but it introduces a critical problem: symmetry. If all weights into a layer are zero, then during the forward pass, all neurons in that layer will produce the same output. Consequently, during the backward pass (backpropagation), they will all receive the same gradient signal and update their weights identically. Effectively, all neurons in the layer behave as one, negating the benefit of having multiple neurons. The network fails to break symmetry and learn diverse features. The short example after this list demonstrates the effect.
Large Random Values: Okay, so zero is bad. What about initializing weights to large random values? This breaks symmetry, but it can cause its own problems. If the weights are too large, the input to activation functions (like Sigmoid or Tanh) can fall into their saturation regions very quickly. In these regions, the gradient is close to zero. This leads to the "vanishing gradient" problem, where the gradients become extremely small, making weight updates negligible and stalling the learning process. Large weights can also sometimes lead to "exploding gradients," where gradients become excessively large, causing unstable updates and divergence.
Small Random Values: Initializing weights to small random numbers (e.g., sampled from a Gaussian distribution with a small standard deviation) is better. It breaks symmetry and reduces the chance of immediate saturation. However, if the values are too small, the variance of the activations shrinks progressively as they propagate through the layers, so the forward signal, and with it the gradients, can effectively vanish, especially in deep networks.
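To make the symmetry problem concrete, here is a minimal sketch (the layer sizes, the constant initial value of 0.5, and the squared-error loss are arbitrary choices for illustration). When every weight starts at the same value, all hidden neurons compute the same output, receive the same gradient, and therefore remain copies of each other after the update; with all-zero weights the rows likewise stay identical, since their gradients are exactly zero here.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny network whose weights all start at the same constant value
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.constant_(layer.weight, 0.5)
        nn.init.zeros_(layer.bias)

# One forward/backward pass and a plain SGD step on random data
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p -= 0.1 * p.grad

# Every row of the first layer's weight matrix is still identical:
# the three hidden neurons never differentiate from one another.
w = model[0].weight
print(torch.allclose(w, w[0].expand_as(w)))  # True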
The challenge is finding a "sweet spot": an initialization scheme that breaks symmetry while keeping the signal (activations and gradients) flowing effectively through the network's layers.
Modern weight initialization strategies are designed with a specific goal in mind: to maintain the variance of activations and gradients as they propagate through the network. If the variance remains roughly constant from layer to layer during the forward pass (activations) and backward pass (gradients), the network is less likely to suffer from vanishing or exploding signals. This helps ensure that all layers learn at a reasonable rate.
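A quick numerical sketch illustrates the idea (the width, depth, tanh activation, and weight scales are arbitrary choices for this illustration). Random data is pushed through a stack of layers: with a weight standard deviation that is too small, the activation spread collapses toward zero layer by layer, while a scale of $\sqrt{1/\text{width}}$, which is what the methods below arrive at for equal fan-in and fan-out, keeps it roughly constant.
import torch

torch.manual_seed(0)
width, depth = 256, 10
x = torch.randn(1024, width)

def activation_stds(weight_std):
    """Record the activation standard deviation after each tanh layer."""
    h, stds = x, []
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std
        h = torch.tanh(h @ w)
        stds.append(round(h.std().item(), 4))
    return stds

print("std=0.01 (too small):", activation_stds(0.01))
print("std=sqrt(1/width):   ", activation_stds((1.0 / width) ** 0.5))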
Two widely adopted initialization methods based on this principle are Xavier/Glorot initialization and He initialization.
Xavier (Glorot) initialization, proposed by Xavier Glorot and Yoshua Bengio in 2010, is designed to work well with activation functions that are symmetric around zero and have outputs bounded within a certain range, like Sigmoid and Tanh.
It sets the weights by drawing them from a distribution with zero mean and a carefully chosen variance. The variance depends on the number of input units ($n_{in}$, the fan-in) and output units ($n_{out}$, the fan-out) of the layer.
Normal Distribution: Weights are drawn from a normal distribution with mean 0 and variance $\sigma^2$:

$$\sigma^2 = \frac{2}{n_{in} + n_{out}}$$

Uniform Distribution: Weights are drawn from a uniform distribution in the range $[-r, r]$, where:

$$r = \sqrt{\frac{6}{n_{in} + n_{out}}}$$

The reasoning is that this scaling helps keep the variance of a layer's outputs roughly equal to the variance of its inputs, and similarly for the gradients during backpropagation.
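As a sketch of these formulas (the fan-in and fan-out values are just example dimensions), the snippet below draws Xavier-initialized weights manually and then does the same with PyTorch's built-in helpers:
import math
import torch
import torch.nn as nn

fan_in, fan_out = 128, 64

# Manual draws following the formulas above
std = math.sqrt(2.0 / (fan_in + fan_out))
w_normal = torch.randn(fan_out, fan_in) * std              # N(0, 2/(fan_in + fan_out))
r = math.sqrt(6.0 / (fan_in + fan_out))
w_uniform = torch.empty(fan_out, fan_in).uniform_(-r, r)   # U(-r, r)

# Equivalent built-in initializers applied in place to a layer's weight tensor
layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_normal_(layer.weight)
# or: nn.init.xavier_uniform_(layer.weight)

# Both should have a standard deviation close to sqrt(2 / (fan_in + fan_out)), about 0.102 here
print(w_normal.std().item(), layer.weight.std().item())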
While Xavier initialization works well for Sigmoid and Tanh, it is less well suited to the Rectified Linear Unit (ReLU) activation function and its variants (Leaky ReLU, ELU, etc.). Because ReLU sets all negative inputs to zero, roughly half of a layer's pre-activations are zeroed out, which changes the variance dynamics compared to symmetric functions like Tanh.
Kaiming He et al. proposed an initialization method specifically tailored for ReLU-based networks in 2015. It accounts for the non-linearity of ReLU by adjusting the variance scaling.
Normal Distribution: Weights are drawn from a normal distribution with mean 0 and variance $\sigma^2$:

$$\sigma^2 = \frac{2}{n_{in}}$$

Uniform Distribution: Weights are drawn from a uniform distribution in the range $[-r, r]$, where:

$$r = \sqrt{\frac{6}{n_{in}}}$$

Notice that He initialization considers only the number of input units ($n_{in}$). This adjustment helps prevent the signal variance from decreasing too rapidly through layers of ReLU units.
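The difference matters once ReLU enters the picture. The sketch below (width, depth, and batch size are arbitrary) sends random data through a stack of ReLU layers and compares the final activation spread under the Xavier scale and the He scale; the He-scaled stack retains a healthy spread while the Xavier-scaled one shrinks noticeably.
import math
import torch

torch.manual_seed(0)
width, depth = 256, 10
x = torch.randn(1024, width)

def relu_stack_std(weight_std):
    """Activation standard deviation after `depth` ReLU layers at the given weight scale."""
    h = x
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std
        h = torch.relu(h @ w)
    return round(h.std().item(), 4)

xavier_scale = math.sqrt(2.0 / (width + width))  # sqrt(1/width) for square layers
he_scale = math.sqrt(2.0 / width)

print("Xavier scale after 10 ReLU layers:", relu_stack_std(xavier_scale))
print("He scale after 10 ReLU layers:    ", relu_stack_std(he_scale))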
Deep learning frameworks like PyTorch and TensorFlow/Keras make implementing these strategies straightforward. When defining layers, you often have options to specify the desired weight initialization method.
For example, in PyTorch, you might define a linear layer and then apply He initialization:
import torch
import torch.nn as nn

# Example layer dimensions
in_features = 128
out_features = 64

# Define a linear layer
linear_layer = nn.Linear(in_features, out_features)

# Apply He initialization (normal distribution), scaled by fan-in for ReLU
nn.init.kaiming_normal_(linear_layer.weight, mode='fan_in', nonlinearity='relu')

# Initialize biases (often to zero or a small constant)
if linear_layer.bias is not None:
    nn.init.constant_(linear_layer.bias, 0)

print(f"Initialized weight shape: {linear_layer.weight.shape}")
print(f"Sample weights (first 5):\n{linear_layer.weight[0, :5]}")
print(f"\nInitialized bias shape: {linear_layer.bias.shape}")
print(f"Bias values (first 5):\n{linear_layer.bias[:5]}")
Many frameworks default to using either He or Xavier initialization depending on the context or layer type, but knowing how to explicitly set them is valuable for fine-tuning or when implementing custom layers.
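For comparison, in Keras the initializer is usually passed when the layer is constructed, either as a string shorthand or as an initializer object (the layer sizes and activations here are just illustrative):
import tensorflow as tf

# He normal initialization for a ReLU layer; Glorot uniform is the Keras default
dense_relu = tf.keras.layers.Dense(
    64,
    activation='relu',
    kernel_initializer='he_normal',
    bias_initializer='zeros',
)

# The same idea with an explicit initializer object, suited to a Tanh layer
dense_tanh = tf.keras.layers.Dense(
    64,
    activation='tanh',
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
)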
What about biases? They are often initialized to zero. However, for ReLU units, initializing biases to a small positive value (e.g., 0.01) is sometimes done to ensure that most ReLU units are active initially, but zero initialization remains common and effective.
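If you do want the small positive bias variant for a ReLU layer, it is a one-line change, shown here reusing the linear_layer from the PyTorch snippet above:
# Small positive bias so most ReLU units start in their active region
nn.init.constant_(linear_layer.bias, 0.01)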
While these initialization methods provide excellent starting points, they don't eliminate the need for other techniques like Batch Normalization (covered later), which further help stabilize training dynamics. Nonetheless, proper weight initialization is a foundational step for building deep networks that train efficiently. It sets the stage for gradient descent to effectively navigate the loss landscape and find a good solution.