Internal covariate shift describes the changing distribution of layer inputs during training, which can hinder learning. Batch Normalization (BN) aims to mitigate this by normalizing the inputs to a layer for each mini-batch. This section examines how that normalization is computed during the forward pass.
The core idea is to take the activations arriving at the Batch Normalization layer for a given mini-batch and transform them so they have approximately zero mean and unit variance. This standardization happens independently for each feature or channel. However, simply forcing a zero mean and unit variance might limit the layer's representational power. Therefore, BN introduces two learnable parameters per feature, $\gamma$ (gamma) and $\beta$ (beta), that allow the network to scale and shift the normalized values. This means the network can learn the optimal scale and mean for the inputs to the next layer.
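In symbols, for the activations $x_1, \dots, x_m$ of a single feature in a mini-batch of size $m$, the transform from the original Batch Normalization formulation is:

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2
$$

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

The steps below walk through each of these quantities in turn.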
Consider a mini-batch of activations for a specific feature (e.g., the output of a single neuron for different examples in the mini-batch). The Batch Normalization forward pass involves these steps:
Calculate Mini-Batch Mean ($\mu_B$): Compute the average value of the activations for this feature across the mini-batch.
Calculate Mini-Batch Variance ($\sigma_B^2$): Compute the variance of the activations for this feature across the mini-batch.
Normalize ($\hat{x}_i$): Normalize each activation $x_i$ in the mini-batch using the calculated mean and variance. A small constant $\epsilon$ (epsilon, commonly on the order of $10^{-5}$) is added to the variance inside the square root for numerical stability, preventing division by zero when the variance is very small.
After this step, the normalized activations for the mini-batch will have a mean close to 0 and a variance close to 1.
Scale and Shift ($y_i$): Transform the normalized activation using the learnable parameters $\gamma$ and $\beta$: $y_i = \gamma \hat{x}_i + \beta$. These parameters are initialized (often to $\gamma = 1$ and $\beta = 0$, respectively) and updated during backpropagation just like other network weights.
The output $y_i$ is the final result of the Batch Normalization layer for the input activation $x_i$, and it is passed on to the subsequent layer (typically followed by a non-linear activation function). The full sequence of steps is sketched in code below.
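As a concrete illustration, here is a minimal NumPy sketch of the forward pass for a fully connected layer's activations. The function name `batchnorm_forward`, the mini-batch shape, and the `eps` default are assumptions for this example, not tied to any particular framework.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass for a mini-batch x of shape (N, D).

    gamma, beta: learnable scale and shift, each of shape (D,).
    Returns the output y and the normalized activations x_hat.
    """
    # Step 1: mini-batch mean per feature, shape (D,)
    mu = x.mean(axis=0)
    # Step 2: mini-batch variance per feature, shape (D,)
    var = x.var(axis=0)
    # Step 3: normalize; eps keeps the denominator away from zero
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 4: scale and shift with the learnable parameters
    y = gamma * x_hat + beta
    return y, x_hat

# Example: 4 examples in the mini-batch, 3 features
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
gamma = np.ones(3)   # common initialization: scale of 1
beta = np.zeros(3)   # common initialization: shift of 0
y, x_hat = batchnorm_forward(x, gamma, beta)
print(x_hat.mean(axis=0))  # close to 0 for each feature
print(x_hat.var(axis=0))   # close to 1 for each feature
```

Printing the per-feature mean and variance of `x_hat` confirms the standardization, while `gamma` and `beta` determine the final scale and shift of `y`.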
Flow of the Batch Normalization forward pass for a single feature across a mini-batch. Inputs $x_i$ are used to compute the mini-batch statistics ($\mu_B$, $\sigma_B^2$), which then normalize each $x_i$ to $\hat{x}_i$. Finally, learnable parameters $\gamma$ and $\beta$ scale and shift $\hat{x}_i$ to produce the output $y_i$.
It's important to remember that $\gamma$ and $\beta$ are learned per feature. If a fully connected layer has $D$ output neurons, there will be $D$ pairs of $(\gamma, \beta)$. If a convolutional layer has $C$ output channels, there will be $C$ pairs of $(\gamma, \beta)$, used to normalize across the batch, height, and width dimensions for each channel.
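For convolutional activations, the same computation can be sketched with a per-channel reduction over the batch and spatial axes; the (N, C, H, W) layout and the reshape used for broadcasting below are assumptions of this example.

```python
import numpy as np

def batchnorm_forward_conv(x, gamma, beta, eps=1e-5):
    """Per-channel Batch Normalization for x of shape (N, C, H, W).

    gamma, beta: one value per channel, each of shape (C,).
    """
    # Reduce over batch, height, and width; keep the channel axis
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Reshape gamma and beta so they broadcast across N, H, and W
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(1).normal(size=(8, 16, 5, 5))  # N=8, C=16, H=W=5
gamma, beta = np.ones(16), np.zeros(16)
y = batchnorm_forward_conv(x, gamma, beta)
print(y.shape)  # (8, 16, 5, 5)
```

Only $C$ pairs of parameters are stored, regardless of the batch size or the spatial dimensions.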
This process gives the inputs to the next layer a stable distribution during training, controlled by the learned $\gamma$ and $\beta$ parameters, which helps stabilize and accelerate training.