As discussed, internal covariate shift describes the changing distribution of layer inputs during training, which can hinder the learning process. Batch Normalization (BN) aims to mitigate this by normalizing the inputs to a layer for each mini-batch. Let's examine how this normalization is calculated during the forward pass.

The core idea is to take the activations arriving at the Batch Normalization layer for a given mini-batch and transform them so they have approximately zero mean and unit variance. This standardization happens independently for each feature or channel. However, simply forcing a zero mean and unit variance might limit the layer's representational power. Therefore, BN introduces two learnable parameters per feature, $ \gamma $ (gamma) and $ \beta $ (beta), that allow the network to scale and shift the normalized values. This means the network can learn the optimal scale and mean for the inputs to the next layer.

Consider a mini-batch $ \mathcal{B} = \{x_1, x_2, ..., x_m\} $ of activations for a specific feature (e.g., the output of a single neuron for $ m $ different examples in the mini-batch). The Batch Normalization forward pass involves these steps:

1. **Calculate Mini-Batch Mean ($ \mu_{\mathcal{B}} $):** Compute the average value of the activations for this feature across the mini-batch.
   $$ \mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i $$

2. **Calculate Mini-Batch Variance ($ \sigma^2_{\mathcal{B}} $):** Compute the variance of the activations for this feature across the mini-batch.
   $$ \sigma^2_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 $$

3. **Normalize ($ \hat{x}_i $):** Normalize each activation $ x_i $ in the mini-batch using the calculated mean and variance. A small constant $ \epsilon $ (epsilon, e.g., $ 1e-5 $) is added to the variance inside the square root for numerical stability, preventing division by zero if the variance is very small.
   $$ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} $$
   After this step, the normalized activations $ \hat{x}_i $ for the mini-batch have a mean close to 0 and a variance close to 1.

4. **Scale and Shift ($ y_i $):** Transform the normalized activation $ \hat{x}_i $ using the learnable parameters $ \gamma $ and $ \beta $. These parameters are initialized (often to 1 and 0, respectively) and updated during backpropagation just like other network weights.
   $$ y_i = \gamma \hat{x}_i + \beta $$
   The output $ y_i $ is the final result of the Batch Normalization layer for the input activation $ x_i $, and it is passed on to the subsequent layer (typically followed by a non-linear activation function).
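To make the four steps concrete, here is a minimal NumPy sketch of the training-time forward pass for a fully connected layer's activations. The function name `batchnorm_forward` and the 2D input shape `(m, num_features)` are illustrative choices, not taken from any particular framework.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass (training mode) for a 2D input.

    x     : array of shape (m, num_features) -- mini-batch of activations
    gamma : array of shape (num_features,)   -- learnable scale per feature
    beta  : array of shape (num_features,)   -- learnable shift per feature
    """
    # Step 1: mini-batch mean per feature (mu_B)
    mu = x.mean(axis=0)

    # Step 2: mini-batch variance per feature (sigma^2_B)
    var = x.var(axis=0)

    # Step 3: normalize to roughly zero mean and unit variance
    x_hat = (x - mu) / np.sqrt(var + eps)

    # Step 4: scale and shift with the learnable parameters
    return gamma * x_hat + beta

# Example: a mini-batch of m=4 examples with 3 features
x = np.random.randn(4, 3) * 5.0 + 2.0            # arbitrary scale and offset
gamma, beta = np.ones(3), np.zeros(3)            # typical initialization
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0))   # ~0 per feature
print(y.var(axis=0))    # ~1 per feature (since gamma=1, beta=0)
```

With $ \gamma = 1 $ and $ \beta = 0 $ the output is simply the standardized activations; as training updates these parameters, the network can recover whatever scale and shift work best for the next layer.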
[Figure: mini-batch inputs $x_1, ..., x_m$ → compute mean $ \mu_{\mathcal{B}} $ and variance $ \sigma^2_{\mathcal{B}} $ → normalize (using $ \mu_{\mathcal{B}}, \sigma^2_{\mathcal{B}} + \epsilon $) → scale ($ \gamma $) and shift ($ \beta $) → outputs $y_1, ..., y_m$.]

Flow of the Batch Normalization forward pass for a single feature across a mini-batch. Inputs $x_i$ are used to compute mini-batch statistics ($ \mu_{\mathcal{B}}, \sigma^2_{\mathcal{B}} $), which then normalize each $x_i$ to $ \hat{x}_i $. Finally, learnable parameters $ \gamma $ and $ \beta $ scale and shift $ \hat{x}_i $ to produce the output $y_i$.

It's important to remember that $ \gamma $ and $ \beta $ are learned per feature. If a fully connected layer has $ N $ output neurons, there will be $ N $ pairs of $ (\gamma, \beta) $. If a convolutional layer has $ C $ output channels, there will be $ C $ pairs of $ (\gamma, \beta) $, with the statistics for each channel computed across the batch, height, and width dimensions (see the sketch at the end of this section).

This process keeps the distribution of the inputs to the next layer stable during training, controlled by the learned $ \gamma $ and $ \beta $ parameters, which helps significantly in stabilizing and accelerating training.
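To illustrate the per-channel case, here is a small NumPy sketch for a convolutional feature map, assuming an NCHW layout; the function name `batchnorm_forward_conv` is illustrative. The point is that a layer with $ C $ output channels needs exactly $ C $ $ (\gamma, \beta) $ pairs, with statistics reduced over the batch and spatial dimensions.

```python
import numpy as np

def batchnorm_forward_conv(x, gamma, beta, eps=1e-5):
    """Per-channel Batch Normalization for a conv feature map (training mode).

    x     : array of shape (N, C, H, W) -- mini-batch of feature maps
    gamma : array of shape (C,)         -- one learnable scale per channel
    beta  : array of shape (C,)         -- one learnable shift per channel
    """
    # Statistics are computed over batch, height, and width,
    # leaving one (mu, sigma^2) pair per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)

    # Reshape gamma/beta so they broadcast across N, H, W for each channel.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# A layer with C=8 output channels needs exactly 8 (gamma, beta) pairs,
# regardless of the batch size or spatial resolution.
x = np.random.randn(16, 8, 32, 32)
y = batchnorm_forward_conv(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.shape)                    # (16, 8, 32, 32)
print(y.mean(axis=(0, 2, 3)))     # ~0 per channel
```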