As we construct deeper neural networks, the seemingly simple task of assigning initial values to the network's weights becomes critically important. Poor initialization can dramatically slow down training or even prevent the network from learning altogether. This happens because the scale of activations and gradients can grow or shrink exponentially as they propagate through the layers, leading to exploding or vanishing gradients, respectively. Proper weight initialization aims to mitigate these issues by setting the initial weights in a way that maintains signal propagation and facilitates stable gradient flow.
Imagine passing an input signal through many layers. In each layer, the activations are computed based on a weighted sum of the previous layer's outputs, followed by an activation function. If the weights are consistently too small, the variance of the activations will decrease exponentially layer by layer, eventually becoming negligible. This is the vanishing activation problem. Gradients computed during backpropagation will also vanish, meaning the weights in earlier layers learn extremely slowly, if at all.
Conversely, if weights are consistently too large, the variance of activations can explode exponentially, leading to massive values. This can cause numerical overflow issues and also leads to the exploding gradient problem during backpropagation, where gradients become enormous, causing unstable updates and divergence.
Consider a simple linear network. After L layers, the output variance is roughly the input variance multiplied, layer by layer, by a factor of nin · Var(W) for each layer. If that per-layer factor consistently deviates from 1, the output variance will either vanish or explode exponentially with depth. Non-linear activation functions complicate the analysis, but the core problem remains.
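To see this effect numerically, here is a minimal sketch (using PyTorch; the depth, width, and weight scales are arbitrary choices for illustration) that pushes a batch of random inputs through a stack of purely linear layers and reports the resulting activation variance:

import torch

def simulate_variance(weight_std, depth=20, width=512):
    # Propagate a batch of unit-variance inputs through `depth` linear
    # layers whose weights have standard deviation `weight_std`.
    x = torch.randn(1024, width)
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std
        x = x @ w  # purely linear propagation, no activation
    return x.var().item()

for std in (0.01, (1.0 / 512) ** 0.5, 0.1):
    print(f"std={std:.4f} -> output variance ~ {simulate_variance(std):.3e}")

# std=0.01 drives the variance toward zero (vanishing), std=0.1 blows it
# up (exploding), while std=sqrt(1/width) keeps it roughly constant.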
A conceptual view of signal propagation. Good initialization helps maintain signal variance through the network layers.
Proposed by Glorot and Bengio in 2010, Xavier initialization is designed to keep the variance of activations and gradients approximately equal across layers, assuming linear or symmetrically saturating activation functions such as tanh or sigmoid.
The core idea is to scale the weights based on the number of input (nin) and output (nout) units of a given layer, drawing them with variance Var(W) = 2 / (nin + nout).
This strategy balances the signal variance during the forward pass and the gradient variance during the backward pass. It was a significant improvement for training deeper networks with symmetric activation functions.
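For reference, a minimal sketch of the resulting scales (the layer sizes below are arbitrary): the normal variant of Xavier/Glorot uses standard deviation sqrt(2 / (nin + nout)), while the uniform variant samples from U(-limit, +limit) with limit = sqrt(6 / (nin + nout)):

import math

n_in, n_out = 512, 256  # example layer sizes (arbitrary)

xavier_std = math.sqrt(2.0 / (n_in + n_out))    # Xavier/Glorot normal std
xavier_limit = math.sqrt(6.0 / (n_in + n_out))  # Xavier/Glorot uniform limit

print(f"Glorot normal std:    {xavier_std:.4f}")    # ~0.0510
print(f"Glorot uniform limit: {xavier_limit:.4f}")  # ~0.0884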
While Xavier initialization works well for tanh and sigmoid, it's less ideal for Rectified Linear Units (ReLU) and its variants (Leaky ReLU, PReLU). ReLU sets all negative inputs to zero, which affects the variance statistics differently than symmetric functions.
He initialization, proposed by He et al. in 2015, specifically accounts for the properties of ReLU. Since ReLU zeroes out roughly half of the activations (the negative part), the variance needs to be adjusted accordingly. He initialization scales weights based only on the number of input units (nin), drawing them with variance Var(W) = 2 / nin, to preserve variance in the forward pass.
This approach helps prevent the variance from decreasing too rapidly through layers composed of ReLU units, making it the standard choice for modern deep CNNs that predominantly use ReLU or its variants.
Comparison of standard deviations (σ) used in He and Xavier normal initialization, assuming nout=nin for Xavier. He initialization uses larger initial weights to compensate for ReLU's effect on variance.
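The following short sketch makes that comparison concrete (arbitrary layer widths, assuming nout = nin for Xavier as in the comparison above); He's σ is larger by a constant factor of √2:

import math

for n_in in (64, 256, 1024):                  # arbitrary example widths
    he_std = math.sqrt(2.0 / n_in)            # He normal: Var(W) = 2 / n_in
    xavier_std = math.sqrt(2.0 / (2 * n_in))  # Xavier normal with n_out = n_in
    print(f"n_in={n_in:5d}  He σ={he_std:.4f}  Xavier σ={xavier_std:.4f}  ratio={he_std / xavier_std:.3f}")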
Most deep learning frameworks provide easy access to these initializers.
PyTorch Example:
import torch
import torch.nn as nn
# Example for a Conv2d layer using He initialization
conv_layer = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
nn.init.kaiming_normal_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
# Example for a Linear layer using Xavier initialization
linear_layer = nn.Linear(in_features=512, out_features=256)
nn.init.xavier_uniform_(linear_layer.weight)
# Bias initialization (common practice: zero)
if conv_layer.bias is not None:
    nn.init.constant_(conv_layer.bias, 0)
if linear_layer.bias is not None:
    nn.init.constant_(linear_layer.bias, 0)
TensorFlow/Keras Example:
import tensorflow as tf
from tensorflow.keras import layers
# Example for a Conv2D layer using He initialization
# (Keras defaults to GlorotUniform, so He must be set explicitly for ReLU layers)
conv_layer = layers.Conv2D(
    filters=128,
    kernel_size=3,
    activation='relu',
    kernel_initializer=tf.keras.initializers.HeNormal(),
    bias_initializer='zeros'  # 'zeros' is also the Keras default for biases
)
# Example for a Dense layer using Glorot (Xavier) initialization
dense_layer = layers.Dense(
    units=256,
    activation='tanh',  # tanh pairs well with Glorot initialization
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
    bias_initializer='zeros'
)
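Keras also accepts string names for these initializers, which is often more convenient; a brief sketch of the equivalent shorthand:

# Equivalent, using Keras string aliases instead of initializer objects
conv_layer = layers.Conv2D(128, 3, activation='relu', kernel_initializer='he_normal')
dense_layer = layers.Dense(256, activation='tanh', kernel_initializer='glorot_uniform')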
Notice the mode parameter in PyTorch's kaiming_normal_ (the choice between fan_in and fan_out). fan_in corresponds to the standard He initialization, scaling by nin to preserve activation variance in the forward pass, while fan_out scales by nout to preserve gradient variance in the backward pass. fan_in is generally preferred for forward propagation stability with ReLUs. Keras, by contrast, does not adapt the initializer to the activation function: its layers default to GlorotUniform, so you should specify HeNormal (or HeUniform) explicitly when using ReLU-family activations.
Regarding bias initialization, the most common practice is to initialize biases to zero. This is generally safe and effective. In the past, initializing biases to a small positive constant (e.g., 0.01 or 0.1) was sometimes suggested for ReLU units to ensure they fire initially, but zero initialization is typically sufficient when combined with proper weight initialization and techniques like Batch Normalization.
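If you do want the small-positive-bias variant for ReLU layers, it is a one-line change in PyTorch (0.01 here is an arbitrary example value):

# Optional: small positive bias so ReLU units start in their active region
nn.init.constant_(conv_layer.bias, 0.01)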
Selecting an appropriate weight initialization strategy is a fundamental step in successfully training deep neural networks. While simple initialization might work for shallow networks, deep architectures require methods like Xavier/Glorot (for symmetric activations) or He/Kaiming (for ReLU-based activations) to maintain signal variance and prevent gradient issues. Modern frameworks make implementing these strategies straightforward, significantly improving the likelihood of stable and efficient training. Remember to choose the initializer that matches your network's primary activation function.