Once you've defined the structure of your neural network (the layers, the number of neurons in each layer, and the activation functions they use), the next step is to give the network's parameters, its weights (W) and biases (b), their initial values. This might seem like a minor detail, but the way you initialize these parameters can significantly impact whether your network trains effectively or struggles to learn at all. Initialization sets the starting point in the high-dimensional loss landscape from which gradient descent will begin its search for a minimum. A poor starting point can lead to slow convergence, getting stuck in poor local minima, or numerical instability during training.
The simplest approach might seem to be initializing all weights and biases to zero. Let's consider why this is problematic. If all weights (W) connected to a hidden layer are zero, then during the first forward pass, every neuron in that layer will compute the exact same output, regardless of the input. This is because the linear transformation (z=Wx+b) will yield the same value (just the bias, if non-zero, or zero if biases are also zero) for all neurons in the layer.
Consequently, when applying the activation function, say a=g(z), all neurons in the layer still produce identical activations. During backpropagation, the calculated gradients (∂L/∂W) will also be identical for all weights connecting to that layer. This means that during the weight update step (W = W − α ∂L/∂W), all weights will be updated by the same amount. The symmetry is never broken; all neurons in the layer remain identical throughout training, effectively reducing the capacity of the layer to that of a single neuron. Initializing all weights to the same non-zero constant value suffers from the exact same symmetry problem.
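You can see the symmetry problem directly by running one forward and backward pass on a tiny layer whose weights all start at the same constant. The sketch below is a minimal NumPy illustration with made-up layer sizes, a dummy target, and a constant value of 0.5 (all illustrative choices, not part of any particular library): every hidden neuron computes the same activation and receives the same gradient, so no update can ever tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative setup: 4 inputs -> 3 hidden (tanh) -> 1 linear output.
# All weights start at the same constant, so the hidden neurons are clones.
x = rng.normal(size=(4, 1))            # one input example
c = 0.5
W1 = np.full((3, 4), c); b1 = np.zeros((3, 1))
W2 = np.full((1, 3), c); b2 = np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1                       # every hidden neuron sees the same z
a1 = np.tanh(z1)                       # ... and produces the same activation
y_hat = W2 @ a1 + b2

# Backward pass for a squared-error loss against a dummy target
y = np.array([[1.0]])
dz2 = y_hat - y                        # dL/dz2
da1 = W2.T @ dz2                       # identical for every hidden neuron
dz1 = da1 * (1 - a1 ** 2)              # tanh derivative, still identical
dW1 = dz1 @ x.T                        # every row of dW1 is the same

print(a1.ravel())   # three identical activations
print(dW1)          # identical rows: the symmetry is never broken
```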
Initializing biases to zero is generally acceptable and quite common, especially when weights are initialized randomly. However, initializing all parameters to zero prevents the network from learning effectively.
To overcome the symmetry problem, we need to initialize weights with different values. The standard approach is to initialize weights randomly, typically drawing values from a probability distribution. A common starting point is to use small random numbers, for example values drawn from a zero-mean Gaussian distribution and scaled by a small constant such as 0.01.
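As a quick sketch of what this looks like in practice, the NumPy snippet below draws weights from a standard normal distribution and scales them down. The helper name `init_small_random`, the 0.01 scale, and the layer sizes are illustrative choices for this example, not fixed conventions.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_small_random(n_in, n_out, scale=0.01):
    """Small random weights break symmetry; biases can safely start at zero."""
    W = rng.normal(loc=0.0, scale=1.0, size=(n_out, n_in)) * scale
    b = np.zeros((n_out, 1))
    return W, b

# Example: a hidden layer with 256 inputs and 128 neurons
W, b = init_small_random(n_in=256, n_out=128)
print(W.std())   # roughly 0.01
```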
Why small values? If weights are too large, the pre-activation input to functions like Sigmoid or Tanh (z=Wx+b) can become very large in magnitude (positive or negative). These activation functions saturate (flatten out) for large inputs, meaning their derivatives become close to zero. During backpropagation, these near-zero gradients cause the updates to preceding layers' weights to become vanishingly small, effectively stopping learning in those layers. This is known as the vanishing gradient problem. While initializing with small random numbers breaks symmetry and avoids immediate saturation, it might not be optimal for deeper networks.
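To make the saturation effect concrete before turning to deeper networks, the short check below evaluates the sigmoid derivative at a few pre-activation values (the specific values are just illustrative): the derivative peaks at 0.25 near z=0 and collapses toward zero once |z| grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")
# z =   0.0  -> 0.250000
# z =  10.0  -> about 0.000045: gradients effectively vanish
```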
As networks become deeper, ensuring that the signal (activations during forward propagation) and gradients (during backward propagation) flow properly through all layers becomes more important. If activations or gradients consistently shrink or grow layer by layer, training can become unstable or stall. Initialization methods like Xavier/Glorot and He were developed specifically to address this. They aim to maintain the variance of activations and gradients roughly constant across layers.
These methods scale the initial random weights based on the number of input connections (fan-in, n_in) and output connections (fan-out, n_out) of a neuron or layer.
A neuron j receives input connections from n neurons in the previous layer (fan-in) and sends output connections to m neurons in the next layer (fan-out). Initialization strategies like Xavier and He use these values to scale initial weights.
Xavier (Glorot) initialization, proposed by Glorot and Bengio (2010), works well with symmetric activation functions like Tanh and Sigmoid. It aims to keep the variance of activations roughly constant across layers by scaling weights based on both fan-in and fan-out: the normal variant draws weights with variance 2 / (n_in + n_out), and the uniform variant draws from the range ±sqrt(6 / (n_in + n_out)).
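A minimal NumPy sketch of both common Xavier variants is shown below. The function names and layer sizes are illustrative; deep learning frameworks provide their own built-in versions of these initializers.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    """Xavier/Glorot normal: Var(W) = 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out):
    """Xavier/Glorot uniform: W ~ U(-limit, +limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

# Example: a Tanh layer with 512 inputs and 256 neurons
W = xavier_normal(n_in=512, n_out=256)
print(W.std(), np.sqrt(2.0 / (512 + 256)))   # empirical std vs. target std
```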
He initialization, proposed by He et al. (2015), is designed specifically for the ReLU activation function and its variants (such as Leaky ReLU). ReLU is not symmetric and effectively "kills" half of the activations (it outputs zero for negative inputs). He initialization accounts for this by considering only the fan-in, drawing weights with variance 2 / n_in. This leads to slightly larger initial weights than Xavier, which helps counteract the variance reduction caused by ReLU.
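He initialization only changes the scaling factor. A sketch, again with illustrative names and layer sizes, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(n_in, n_out):
    """He et al.: Var(W) = 2 / n_in, intended for ReLU-family activations."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

# Example: a ReLU hidden layer with 512 inputs and 256 neurons
W = he_normal(n_in=512, n_out=256)
b = np.zeros((256, 1))               # biases can still start at zero
print(W.std(), np.sqrt(2.0 / 512))   # empirical std vs. target std
```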
Choosing the right initialization strategy is an important step in setting up your network for successful training. While simple random initialization might work for shallow networks, methods like Xavier/Glorot and He provide a more principled approach that helps maintain signal propagation and gradient flow, especially in deeper architectures. With the network architecture defined and parameters initialized, we are ready to start the core training process, beginning with the forward pass to generate predictions using these initial weights and biases.