Initializing the weights of a neural network might seem like a minor detail, but it's a critical step that significantly impacts the training process. As discussed in the chapter introduction, choosing the right starting point for weights helps prevent issues like vanishing or exploding gradients, where the signal either becomes too weak or too strong as it propagates through the network layers. Poor initialization can lead to slow convergence or prevent the network from learning altogether.
Think about training as rolling a ball down a complex, bumpy landscape (the loss surface) towards the lowest point. Where you initially place the ball affects how easily and quickly it finds the bottom. If you start it on a flat plateau or near a steep cliff, it might get stuck or overshoot wildly. Smart initialization places the ball in a region where it's more likely to roll smoothly downhill.
Early approaches often used simple random initialization, drawing weights from a standard normal distribution (mean 0, variance 1) or a small uniform distribution. However, this doesn't account for the number of inputs or outputs of a neuron. In deep networks, this naive approach can cause the variance of the outputs of a layer to grow or shrink exponentially with the number of layers, leading precisely to the exploding or vanishing gradient problems we want to avoid.
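To see this effect concretely, here is a minimal sketch (the layer width, depth, and batch size are arbitrary choices for illustration) that pushes random data through a stack of matrix multiplications whose weights are drawn from a standard normal distribution; the activation scale blows up rapidly with depth:

import torch

torch.manual_seed(0)

n_features = 256   # assumed layer width
n_layers = 8       # assumed depth
x = torch.randn(1024, n_features)  # batch of random inputs

for layer in range(n_layers):
    # Naive initialization: weights ~ N(0, 1), ignoring fan-in and fan-out
    w = torch.randn(n_features, n_features)
    x = x @ w
    print(f"Layer {layer + 1}: activation std = {x.std().item():.2e}")

With a width of 256, the standard deviation grows by roughly a factor of 16 per layer; shrinking the weight scale too far produces the opposite problem, with activations collapsing towards zero.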
To address this, specific initialization strategies were developed to maintain a reasonable variance of activations and gradients throughout the network. Two of the most popular and effective methods are Xavier (also known as Glorot) initialization and He initialization.
Proposed by Xavier Glorot and Yoshua Bengio in 2010, Xavier (Glorot) initialization aims to keep the variance of activations and gradients roughly constant across layers. The core idea is to scale the initial weights based on the number of input ($n_{in}$) and output ($n_{out}$) units of the layer.
This method works particularly well with activation functions that are symmetric around zero and have a derivative close to 1 near zero, such as the hyperbolic tangent (tanh) or the logistic sigmoid function.
The key insight is to set the variance of the weights W for a layer as:
$$\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$$
To implement this, you typically sample weights from either a normal distribution or a uniform distribution scaled according to this variance.
Normal Distribution: Sample weights from $\mathcal{N}(0, \sigma^2)$, where the standard deviation $\sigma$ is:
$$\sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$$
Uniform Distribution: Sample weights from $\mathcal{U}[-\text{limit}, \text{limit}]$, where the limit is:
$$\text{limit} = \sqrt{\frac{6}{n_{in} + n_{out}}}$$
(The factor of 6 comes from the relationship between the variance of a uniform distribution and its limits, $\mathrm{Var}(\mathcal{U}[-a, a]) = a^2/3$: setting $a^2/3 = 2/(n_{in} + n_{out})$ gives $a = \sqrt{6/(n_{in} + n_{out})}$.)
Xavier initialization helps ensure that signals don't vanish or explode too quickly when using activations like tanh or sigmoid, promoting more stable training.
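As a concrete illustration (the layer dimensions below are arbitrary assumptions, not tied to any particular model), a single weight matrix could be drawn manually with either variant:

import math
import torch

n_in, n_out = 512, 256  # assumed layer dimensions

# Xavier (Glorot) normal: W ~ N(0, 2 / (n_in + n_out))
std = math.sqrt(2.0 / (n_in + n_out))
w_normal = torch.randn(n_out, n_in) * std

# Xavier (Glorot) uniform: W ~ U[-limit, limit] with limit = sqrt(6 / (n_in + n_out))
limit = math.sqrt(6.0 / (n_in + n_out))
w_uniform = torch.empty(n_out, n_in).uniform_(-limit, limit)

# Both should have empirical variance close to 2 / (512 + 256) ≈ 0.0026
print(w_normal.var().item(), w_uniform.var().item())

In practice you rarely write this by hand; frameworks expose these samplers directly (for example, PyTorch's nn.init.xavier_normal_ and nn.init.xavier_uniform_, shown later in this section).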
While Xavier initialization works well for symmetric activations, the rise of the Rectified Linear Unit (ReLU) activation function presented a challenge. ReLU sets all negative inputs to zero (f(x)=max(0,x)), meaning it's not symmetric around zero and effectively "kills" half of the gradients during backpropagation for neurons with negative inputs. This non-linearity changes the variance dynamics.
Kaiming He et al. observed this issue in 2015 and proposed a modification specifically tailored for ReLU and its variants (like Leaky ReLU and PReLU). Since ReLU discards negative values, roughly halving the variance contributed by the inputs, He initialization compensates by doubling the target variance: the fan-in-only version of the Xavier derivation gives $1/n_{in}$, and He initialization uses $2/n_{in}$ instead.
The recommended variance for He initialization is:
$$\mathrm{Var}(W) = \frac{2}{n_{in}}$$
Similar to Xavier, you can implement this using either a normal or uniform distribution:
Normal Distribution: Sample weights from $\mathcal{N}(0, \sigma^2)$, where the standard deviation $\sigma$ is:
$$\sigma = \sqrt{\frac{2}{n_{in}}}$$
Note: Some implementations might use $\sigma = \sqrt{1/n_{in}}$, but the paper's derivation points towards $2/n_{in}$ for ensuring forward-propagation variance stability. Framework defaults often use the $2/n_{in}$ version.
Uniform Distribution: Sample weights from $\mathcal{U}[-\text{limit}, \text{limit}]$, where the limit is:
$$\text{limit} = \sqrt{\frac{6}{n_{in}}}$$
He initialization helps maintain a healthy signal variance when using ReLU-based activations, which is crucial for training the very deep networks common today.
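The same kind of manual sketch works for He initialization (again with assumed layer dimensions; the PyTorch example later in this section uses the built-in initializers instead):

import math
import torch

n_in, n_out = 512, 256  # assumed layer dimensions

# He normal: W ~ N(0, 2 / n_in)
std = math.sqrt(2.0 / n_in)
w_normal = torch.randn(n_out, n_in) * std

# He uniform: W ~ U[-limit, limit] with limit = sqrt(6 / n_in)
limit = math.sqrt(6.0 / n_in)
w_uniform = torch.empty(n_out, n_in).uniform_(-limit, limit)

# Both should have empirical variance close to 2 / 512 ≈ 0.0039
print(w_normal.var().item(), w_uniform.var().item())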
Let's imagine passing data through a few layers using ReLU activations. We can visualize how the distribution of activations changes depending on the initialization strategy. Ideally, we want the variance to remain somewhat stable, not collapsing to zero or exploding.
Simplified representation of activation distributions after passing through several layers using ReLU activations. Zero initialization leads to zero activations. Naive random normal initialization often sees variance shrink significantly. Xavier improves this but is not optimal for ReLU. He initialization is designed to maintain variance better with ReLU, preventing activations from vanishing too quickly.
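The sketch below approximates that comparison for the Xavier and He scaling rules (the width, depth, and batch size are arbitrary assumptions). With Xavier scaling, the activation standard deviation shrinks layer by layer under ReLU, while He scaling keeps it roughly constant:

import math
import torch

torch.manual_seed(0)

width, depth = 512, 10  # assumed layer width and depth

def run(weight_std, label):
    # Pass random inputs through `depth` linear layers followed by ReLU,
    # drawing each weight matrix with the given standard deviation,
    # and report the activation standard deviation at every layer.
    x = torch.randn(1024, width)
    stds = []
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std
        x = torch.relu(x @ w)
        stds.append(x.std().item())
    print(f"{label}: " + " ".join(f"{s:.3f}" for s in stds))

run(math.sqrt(2.0 / (width + width)), "Xavier")  # Var(W) = 2 / (n_in + n_out)
run(math.sqrt(2.0 / width), "He")                # Var(W) = 2 / n_in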
Most deep learning frameworks provide built-in functions for these initialization strategies. In PyTorch, you can apply them when defining your network or afterwards by iterating through the layers.
Here's a common way to apply He initialization (specifically kaiming_normal_) to linear and convolutional layers in a custom nn.Module:
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.relu2 = nn.ReLU()
        self.layer3 = nn.Linear(hidden_size, output_size)
        # Apply He initialization
        self._initialize_weights()

    def forward(self, x):
        x = self.relu1(self.layer1(x))
        x = self.relu2(self.layer2(x))
        x = self.layer3(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # He initialization for Linear layers
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                # Optionally initialize bias to zero
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Conv2d):
                # He initialization for Conv layers
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            # Add elif for other layer types if needed

# Example usage:
input_dim = 784   # Example: Flattened MNIST image
hidden_dim = 128
output_dim = 10   # Example: 10 classes for MNIST

model = SimpleNet(input_dim, hidden_dim, output_dim)
print("Weights of layer1 initialized using He normal:")
print(model.layer1.weight.data)
In this example:

- The _initialize_weights method iterates through all modules (self.modules()).
- isinstance checks whether a module is a Linear or Conv2d layer.
- nn.init.kaiming_normal_ applies He initialization using a normal distribution. mode='fan_in' uses $n_{in}$ in the calculation, which is standard for feedforward networks, and nonlinearity='relu' is specified because He initialization is designed for ReLU.
- nn.init.constant_(m.bias, 0) sets biases to zero, a common practice.

If you were using tanh activations, you would replace nn.init.kaiming_normal_ with nn.init.xavier_normal_ or nn.init.xavier_uniform_ and pass an appropriate gain (the Xavier functions take a gain argument rather than nonlinearity), as sketched below.
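As a rough sketch of that change (the helper name here is hypothetical; the gain value comes from nn.init.calculate_gain):

import torch.nn as nn

def init_weights_for_tanh(model):
    # Hypothetical helper: Xavier (Glorot) initialization for a tanh-based network.
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            # calculate_gain('tanh') returns 5/3, the gain recommended for tanh.
            nn.init.xavier_normal_(m.weight, gain=nn.init.calculate_gain('tanh'))
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)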
Choosing the right initialization strategy, typically He for ReLU-based networks and Xavier for tanh/sigmoid-based networks, provides a much better starting point for optimization algorithms, leading to faster convergence and more stable training. It's a simple yet powerful technique in the deep learning practitioner's toolkit.