Training a deep neural network involves updating the parameters (weights and biases) across all its layers based on the calculated gradients. Consider a specific layer, say Layer L, situated somewhere in the middle of the network. How its parameters are adjusted depends heavily on the data it receives as input from the preceding layer, L−1.
However, here's the challenge: during training, the parameters of Layer L−1, and indeed all layers before it, are also constantly changing. As these earlier layers update their weights and biases, the distribution of the outputs they produce shifts. Consequently, the distribution of the inputs arriving at Layer L doesn't remain fixed throughout the training process. This phenomenon, where the distribution of inputs to internal layers changes during training, is known as Internal Covariate Shift.
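To see this concretely, here is a minimal sketch (using PyTorch, with a small hypothetical two-layer network) that records the mean and standard deviation of the activations entering the second layer before and after a single gradient step. The batch of data is held fixed; only the parameters change, yet the statistics of the second layer's input shift.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny hypothetical network: layer1 feeds layer2 (our "Layer L").
layer1 = nn.Linear(20, 50)
layer2 = nn.Linear(50, 1)
act = nn.Tanh()

x = torch.randn(256, 20)   # a fixed batch of inputs
y = torch.randn(256, 1)    # arbitrary regression targets
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(list(layer1.parameters()) + list(layer2.parameters()), lr=0.5)

def layer_l_input_stats():
    # Statistics of the activations that Layer L (layer2) receives as input.
    with torch.no_grad():
        h = act(layer1(x))
    return h.mean().item(), h.std().item()

print("before update:", layer_l_input_stats())

# One training step changes layer1's parameters...
loss = loss_fn(layer2(act(layer1(x))), y)
opt.zero_grad()
loss.backward()
opt.step()

# ...so the distribution of Layer L's inputs changes, even for the same data.
print("after update: ", layer_l_input_stats())
```

In a real training run this happens at every step, for every layer after the first, which is exactly the shifting described above.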
This constant shifting poses several difficulties for training deep networks:
Each layer in the network is trying to learn a useful representation or transformation of its input. When the distribution of that input keeps changing, the layer is essentially chasing a moving target: it must continuously adapt not only to learn the desired mapping for the task but also to compensate for the shifting statistics of its incoming data. This makes learning less efficient and often forces the use of smaller learning rates to keep parameter updates stable, which significantly slows overall training.
Many activation functions, particularly the sigmoid and hyperbolic tangent (tanh), saturate when their inputs become large in magnitude, whether strongly positive or strongly negative. If internal covariate shift pushes the inputs to a layer into these saturated regions, the gradient of the activation function becomes close to zero. During backpropagation, these small gradients hinder the flow of information back to earlier layers, leading to the vanishing gradient problem and stalling learning in those layers. While non-saturating activations like ReLU (f(x)=max(0,x)) avoid saturation for positive inputs, unstable shifts in input distributions can still lead to issues like large gradients or neurons that effectively "die" (always output zero).
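As a quick numeric illustration (a standalone sketch, not tied to any particular network), the sigmoid's derivative, sigmoid(z)·(1 − sigmoid(z)), collapses toward zero as the magnitude of the pre-activation z grows, while ReLU's derivative is exactly zero for negative pre-activations:

```python
import torch

z = torch.tensor([-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0])

sigmoid = torch.sigmoid(z)
sigmoid_grad = sigmoid * (1 - sigmoid)   # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
relu_grad = (z > 0).float()              # d/dz max(0, z) = 1 for z > 0, else 0

for zi, sg, rg in zip(z, sigmoid_grad, relu_grad):
    print(f"z = {zi:6.1f}   sigmoid grad = {sg:.6f}   relu grad = {rg:.0f}")
```

At z = ±10 the sigmoid gradient is smaller than 0.0001, so any error signal passing back through such a unit is almost entirely lost.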
Because the initial distributions heavily influence the subsequent shifts, the choice of initial weights and biases becomes quite significant. A poor initialization might start the network in a state where layer inputs are already poorly distributed (e.g., causing immediate saturation or zero activations), making it difficult for training to even begin effectively. The network might get stuck or require much longer to find a good solution path.
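The following sketch (hypothetical layer sizes, tanh activation) compares a modest initialization scale with an overly large one. With large initial weights, most units start out saturated, so their gradients are nearly zero from the very first step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1024, 100)   # a batch of inputs to a hypothetical hidden layer

for scale in (0.05, 2.0):
    layer = nn.Linear(100, 100)
    nn.init.normal_(layer.weight, std=scale)   # control the initialization scale
    nn.init.zeros_(layer.bias)

    with torch.no_grad():
        h = torch.tanh(layer(x))
    saturated = (h.abs() > 0.99).float().mean().item()
    print(f"init std = {scale}: fraction of saturated tanh units = {saturated:.1%}")
```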
To maintain stability while the input distributions are shifting, optimizers often need to use smaller learning rates. If the learning rate is too high, a parameter update based on the current input distribution might be inappropriate or even detrimental once the distribution shifts in the next step, leading to oscillations or divergence. This constraint limits how quickly we can traverse the loss landscape.
Let's consider a simplified view of how the statistics of the inputs to a layer might drift. The following plot conceptually illustrates how the mean and standard deviation of activations fed into a particular layer could vary across training batches if no normalization is applied.
A conceptual plot showing potential fluctuations in the mean and standard deviation of a layer's input activations across successive training batches. These shifts illustrate internal covariate shift.
As the plot suggests, a layer might receive inputs with significantly different statistical properties from one batch to the next. This instability necessitates adaptation by the layer and complicates the overall optimization dynamics.
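To reproduce a plot like this yourself, the following sketch (using PyTorch and matplotlib, with hypothetical layer sizes and synthetic data) trains a small unnormalized network and records the mean and standard deviation of the second layer's inputs at every batch:

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

# A small unnormalized network; we monitor the activations fed into the second Linear layer.
net = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

means, stds = [], []
for step in range(200):
    x = torch.randn(128, 20)            # synthetic batch of inputs
    y = 3.0 * x[:, :1] + 0.5            # simple synthetic regression target

    with torch.no_grad():
        h = net[1](net[0](x))           # activations arriving at the second Linear layer
    means.append(h.mean().item())
    stds.append(h.std().item())

    loss = loss_fn(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

plt.plot(means, label="mean of layer inputs")
plt.plot(stds, label="std of layer inputs")
plt.xlabel("training batch")
plt.legend()
plt.show()
```

The exact curves depend on the data, architecture, and learning rate, but the point is the drift itself: with no normalization, nothing constrains these statistics to stay put.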
Techniques designed to counteract internal covariate shift aim to stabilize these distributions. By normalizing the inputs to each layer, we can create a more consistent learning environment, potentially allowing for faster training, higher learning rates, and reduced sensitivity to initialization. Batch Normalization, which we will explore next, is one of the most popular and effective methods for achieving this stabilization.
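As a preview of the idea, Batch Normalization standardizes each feature of a layer's input using the mini-batch mean and variance, then applies a learnable scale (gamma) and shift (beta) so the network keeps its expressive power. A minimal sketch of the training-time forward computation (ignoring the running statistics used at inference) might look like this:

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift.

    x:     (batch_size, num_features) activations entering a layer
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mean = x.mean(dim=0)                 # per-feature mean over the mini-batch
    var = x.var(dim=0, unbiased=False)   # per-feature variance over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

# The normalized output has roughly zero mean and unit variance per feature,
# regardless of how the incoming distribution has drifted.
x = 5.0 + 3.0 * torch.randn(128, 10)
out = batch_norm_forward(x, torch.ones(10), torch.zeros(10))
print(out.mean(dim=0)[:3], out.std(dim=0)[:3])
```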