The changing distributions of layer inputs during training, known as internal covariate shift, can significantly complicate the training process for deep networks. This phenomenon forces the use of smaller learning rates and careful parameter initialization, slowing down convergence and making training more fragile.

Batch Normalization (BN) was introduced by Sergey Ioffe and Christian Szegedy in 2015 as a technique to directly address this problem. The core idea is straightforward yet effective: normalize the inputs to a layer for each mini-batch during training. Instead of letting the distribution of inputs to a layer shift wildly, BN aims to keep the mean and variance of these inputs more stable.

Think about how we often standardize the input features to a machine learning model (e.g., subtracting the mean and dividing by the standard deviation). Batch Normalization applies a similar principle, but it does so inside the network, for the inputs to specific layers.

Specifically, for a given layer, Batch Normalization performs the following steps during training on a mini-batch:

1. Calculate Mini-Batch Statistics: It computes the mean ($ \mu_{\mathcal{B}} $) and variance ($ \sigma^2_{\mathcal{B}} $) of the layer's inputs over the current mini-batch $ \mathcal{B} $.
2. Normalize: It normalizes each input $ x_i $ in the mini-batch using these statistics: $$ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} $$ Here, $ \epsilon $ is a small constant added for numerical stability (to avoid division by zero if the variance is very small).
3. Scale and Shift: The normalized value $ \hat{x}_i $ is then scaled by a learnable parameter $ \gamma $ (gamma) and shifted by another learnable parameter $ \beta $ (beta): $$ y_i = \gamma \hat{x}_i + \beta $$ These $ \gamma $ and $ \beta $ parameters are learned alongside the network's weights during training. They allow the network to potentially undo the normalization if that's optimal for the representation needed by the next layer. If the network learns $ \gamma = \sqrt{\sigma^2_{\mathcal{B}} + \epsilon} $ and $ \beta = \mu_{\mathcal{B}} $, it can effectively recover the original activation. However, in practice, the network learns optimal $ \gamma $ and $ \beta $ values that help stabilize training.

This entire operation is typically inserted just before the activation function of a layer.
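To make the three steps concrete, here is a minimal NumPy sketch of the training-time forward computation. The function name `batchnorm_forward` and the toy mini-batch are our own illustration, not code from the original paper; the backward pass and the running statistics used at inference are covered in later sections.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time Batch Normalization for a mini-batch x of shape (N, D)."""
    # 1. Mini-batch statistics, computed per feature (over the batch axis)
    mu = x.mean(axis=0)                 # shape (D,)
    var = x.var(axis=0)                 # shape (D,)

    # 2. Normalize to zero mean, unit variance (eps guards against division by zero)
    x_hat = (x - mu) / np.sqrt(var + eps)

    # 3. Scale and shift with the learnable parameters gamma and beta
    y = gamma * x_hat + beta
    return y

# Example: a mini-batch of 4 samples with 3 features each
x = np.random.randn(4, 3) * 10 + 5      # deliberately badly scaled inputs
gamma = np.ones(3)                       # gamma typically initialized to 1
beta = np.zeros(3)                       # beta typically initialized to 0
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0), y.var(axis=0))     # roughly 0 mean and unit variance per feature
```

With $ \gamma = 1 $ and $ \beta = 0 $, the output is simply the normalized activations; during training, gradient descent then adjusts $ \gamma $ and $ \beta $ just like any other weights.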
For example, in a fully connected layer, the sequence might become: Linear transformation -> Batch Normalization -> Activation function (e.g., ReLU).

```dot
digraph G {
  rankdir=LR;
  node [shape=box, style=rounded, fontname="Helvetica"];
  edge [fontname="Helvetica"];

  subgraph cluster_0 {
    label = "Standard Layer";
    style=dashed;
    color="#adb5bd";
    Input -> Linear [label="z = Wx + b"];
    Linear -> Activation [label="a = g(z)"];
    Activation -> Output;
  }

  subgraph cluster_1 {
    label = "Layer with Batch Normalization";
    style=dashed;
    color="#adb5bd";
    Input_BN [label="Input"];
    Linear_BN [label="Linear"];
    BN [label="Batch Norm\n(Normalize, Scale+Shift)", shape=ellipse, style=filled, fillcolor="#a5d8ff"];
    Activation_BN [label="Activation"];
    Output_BN [label="Output"];
    Input_BN -> Linear_BN [label="z = Wx + b"];
    Linear_BN -> BN [label="Input to BN: z"];
    BN -> Activation_BN [label="Input to Activation: y = γẑ + β"];
    Activation_BN -> Output_BN [label="a = g(y)"];
  }
}
```

Comparison of data flow through a standard layer versus a layer incorporating Batch Normalization before the activation function.

By normalizing the inputs within each mini-batch, Batch Normalization helps stabilize the learning process. This stabilization often allows for the use of much higher learning rates, significantly speeding up training. Furthermore, it reduces the dependence on careful initialization and can act as a form of regularization, sometimes reducing the need for other techniques like Dropout.

We will look into the precise calculations for the forward and backward passes, its behavior during testing (inference), and its various benefits in the following sections. For now, understand that Batch Normalization is a powerful tool inserted into the network architecture to regulate the internal statistics of activations and promote more stable and efficient training.
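As a practical illustration of the Linear -> Batch Normalization -> Activation ordering shown in the diagram, here is a small PyTorch sketch. The layer sizes and batch size are arbitrary choices for the example; `nn.BatchNorm1d` also maintains the running statistics used at inference, which we discuss in a later section.

```python
import torch
import torch.nn as nn

# A small fully connected block with Batch Normalization placed
# between the linear transformation and the ReLU activation.
model = nn.Sequential(
    nn.Linear(20, 64),      # z = Wx + b
    nn.BatchNorm1d(64),     # normalize z over the mini-batch, then scale and shift
    nn.ReLU(),              # a = g(y)
    nn.Linear(64, 10),
)

x = torch.randn(32, 20)     # mini-batch of 32 samples, 20 features each
model.train()               # training mode: use mini-batch statistics
out = model(x)
print(out.shape)            # torch.Size([32, 10])

# The scale (gamma) and shift (beta) are learnable parameters of the BN layer:
bn = model[1]
print(bn.weight.shape, bn.bias.shape)   # gamma and beta, each of shape [64]
```

Because the learnable shift $ \beta $ can absorb any constant offset, the linear layer preceding a Batch Normalization layer is often created without its own bias term in practice.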