As discussed, internal covariate shift describes the changing distribution of layer inputs during training, which can hinder the learning process. Batch Normalization (BN) aims to mitigate this by normalizing the inputs to a layer for each mini-batch. Let's examine how this normalization is calculated during the forward pass.
The core idea is to take the activations arriving at the Batch Normalization layer for a given mini-batch and transform them so they have approximately zero mean and unit variance. This standardization happens independently for each feature or channel. However, simply forcing a zero mean and unit variance might limit the layer's representational power. Therefore, BN introduces two learnable parameters per feature, γ (gamma) and β (beta), that allow the network to scale and shift the normalized values. This means the network can learn the optimal scale and mean for the inputs to the next layer.
Consider a mini-batch $B = \{x_1, x_2, \dots, x_m\}$ of activations for a specific feature (e.g., the output of a single neuron for $m$ different examples in the mini-batch). The Batch Normalization forward pass involves these steps:
1. Calculate Mini-Batch Mean ($\mu_B$): Compute the average value of the activations for this feature across the mini-batch.

   $$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

2. Calculate Mini-Batch Variance ($\sigma_B^2$): Compute the variance of the activations for this feature across the mini-batch.

   $$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

3. Normalize ($\hat{x}_i$): Normalize each activation $x_i$ in the mini-batch using the calculated mean and variance. A small constant $\epsilon$ (epsilon, e.g., 1e-5) is added to the variance inside the square root for numerical stability, preventing division by zero if the variance is very small.

   $$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

   After this step, the normalized activations $\hat{x}_i$ for the mini-batch have a mean close to 0 and a variance close to 1.

4. Scale and Shift ($y_i$): Transform the normalized activation $\hat{x}_i$ using the learnable parameters $\gamma$ and $\beta$. These parameters are initialized (often to 1 and 0, respectively) and updated during backpropagation just like other network weights.

   $$y_i = \gamma \hat{x}_i + \beta$$

   The output $y_i$ is the final result of the Batch Normalization layer for the input activation $x_i$, and it is passed on to the subsequent layer (typically followed by a non-linear activation function).
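To make these four steps concrete, here is a minimal NumPy sketch of the training-time forward pass for a single feature across a mini-batch. The function name `batchnorm_forward` and the example values are illustrative, not taken from any particular library:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass for one feature.

    x: activations of shape (m,), one value per example in the mini-batch.
    gamma, beta: learnable scale and shift for this feature.
    """
    mu = x.mean()                           # step 1: mini-batch mean
    var = x.var()                           # step 2: mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)   # step 3: normalize, with epsilon for stability
    return gamma * x_hat + beta             # step 4: scale and shift

# Four activations of one feature across a mini-batch of m=4 examples
x = np.array([0.5, 2.0, -1.0, 3.5])
y = batchnorm_forward(x, gamma=1.0, beta=0.0)
print(y.mean(), y.var())  # approximately 0 and 1 when gamma=1, beta=0
```

With $\gamma = 1$ and $\beta = 0$ (their usual initial values), the output is simply the standardized activations; during training, backpropagation adjusts $\gamma$ and $\beta$ along with the rest of the network's weights.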
Flow of the Batch Normalization forward pass for a single feature across a mini-batch. Inputs $x_i$ are used to compute mini-batch statistics ($\mu_B$, $\sigma_B^2$), which then normalize each $x_i$ to $\hat{x}_i$. Finally, learnable parameters $\gamma$ and $\beta$ scale and shift $\hat{x}_i$ to produce the output $y_i$.
It's important to remember that $\gamma$ and $\beta$ are learned per feature. If a fully connected layer has $N$ output neurons, there will be $N$ pairs of $(\gamma, \beta)$. If a convolutional layer has $C$ output channels, there will be $C$ pairs of $(\gamma, \beta)$, used to normalize across the batch, height, and width dimensions for each channel.
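As a rough illustration of those shapes, the NumPy sketch below (with made-up dimensions) normalizes a fully connected output of shape (m, N) per feature and a convolutional output of shape (m, C, H, W) per channel, computing the statistics over the batch axis in the first case and over the batch, height, and width axes in the second:

```python
import numpy as np

eps = 1e-5

# Fully connected output: shape (m, N) -> N pairs of (gamma, beta)
fc_out = np.random.randn(32, 64)               # m=32 examples, N=64 features
gamma_fc, beta_fc = np.ones(64), np.zeros(64)
mu = fc_out.mean(axis=0)                       # one mean per feature
var = fc_out.var(axis=0)                       # one variance per feature
fc_bn = gamma_fc * (fc_out - mu) / np.sqrt(var + eps) + beta_fc

# Convolutional output: shape (m, C, H, W) -> C pairs of (gamma, beta)
conv_out = np.random.randn(32, 16, 8, 8)       # m=32, C=16, H=W=8
gamma_c = np.ones((1, 16, 1, 1))               # one gamma per channel, shaped for broadcasting
beta_c = np.zeros((1, 16, 1, 1))
mu_c = conv_out.mean(axis=(0, 2, 3), keepdims=True)   # statistics per channel
var_c = conv_out.var(axis=(0, 2, 3), keepdims=True)
conv_bn = gamma_c * (conv_out - mu_c) / np.sqrt(var_c + eps) + beta_c
```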
This process gives the inputs to the next layer a more stable distribution during training, controlled by the learned $\gamma$ and $\beta$ parameters, which helps the network train faster and more reliably.
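Deep learning frameworks provide this as a built-in layer. For example, PyTorch's `nn.BatchNorm1d` stores the per-feature $\gamma$ and $\beta$ as its `weight` and `bias` parameters; the short check below (with an arbitrary feature count of 64) is just a sketch of how to inspect them:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=64)    # eps defaults to 1e-5
print(bn.weight.shape, bn.bias.shape)   # gamma and beta: torch.Size([64]) each

x = torch.randn(32, 64)                 # mini-batch of 32 examples, 64 features
y = bn(x)                               # training mode: uses mini-batch statistics
print(y.mean(dim=0)[:3])                # each feature is roughly zero-mean at initialization
```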