As we discussed in the previous section, the changing distributions of layer inputs during training, known as internal covariate shift, can significantly complicate the training process for deep networks. It forces us to use smaller learning rates and careful parameter initialization, slowing down convergence and making training more fragile.
Batch Normalization (BN) was introduced by Sergey Ioffe and Christian Szegedy in 2015 as a technique to directly address this problem. The core idea is straightforward yet effective: normalize the inputs to a layer for each mini-batch during training. Instead of letting the distribution of inputs to a layer shift wildly, BN aims to keep the mean and variance of these inputs more stable.
Think about how we often standardize the input features to a machine learning model (e.g., subtracting the mean and dividing by the standard deviation). Batch Normalization applies a similar principle, but it does so inside the network, for the inputs to specific layers.
Specifically, for a given layer, Batch Normalization performs the following steps during training on a mini-batch:

1. Compute the mean of the layer's inputs over the current mini-batch.
2. Compute the variance of those inputs over the same mini-batch.
3. Normalize each input by subtracting the mini-batch mean and dividing by the square root of the mini-batch variance (plus a small constant for numerical stability).
4. Scale and shift the normalized values using two learnable parameters, often called gamma and beta, so the network can still represent the original distribution if that is what training favors.
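To make these steps concrete, here is a minimal NumPy sketch of the per-mini-batch computation. The function name, the epsilon value, and the (batch_size, num_features) layout are illustrative assumptions, not part of any particular framework; the precise forward and backward passes are covered in the following sections.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the Batch Normalization forward pass for one mini-batch.

    x:     mini-batch of layer inputs, shape (batch_size, num_features)
    gamma: learnable scale, shape (num_features,)
    beta:  learnable shift, shape (num_features,)
    """
    mu = x.mean(axis=0)                   # 1. mini-batch mean, per feature
    var = x.var(axis=0)                   # 2. mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps) # 3. normalize to ~zero mean, unit variance
    return gamma * x_hat + beta           # 4. scale and shift with learnable parameters

# Example: a mini-batch of 4 samples with 3 features, with an arbitrary mean and scale
x = np.random.randn(4, 3) * 10 + 5
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))  # close to 0 for each feature
print(out.std(axis=0))   # close to 1 for each feature
```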
This entire operation is typically inserted just before the activation function of a layer. For example, in a fully connected layer, the sequence might become: Linear transformation -> Batch Normalization -> Activation function (e.g., ReLU).
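As an illustration of that ordering, the sketch below uses PyTorch, though the idea is framework-agnostic; the layer sizes (256 and 128) are arbitrary choices for the example.

```python
import torch.nn as nn

# Fully connected block with Batch Normalization inserted before the activation:
# Linear transformation -> Batch Normalization -> Activation (ReLU)
layer = nn.Sequential(
    nn.Linear(256, 128),   # linear transformation
    nn.BatchNorm1d(128),   # normalize the 128 pre-activation values per mini-batch
    nn.ReLU(),             # activation applied to the normalized, scaled values
)
```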
Diagram: Comparison of data flow through a standard layer versus a layer incorporating Batch Normalization before the activation function.
By normalizing the inputs within each mini-batch, Batch Normalization helps stabilize the learning process. This stabilization often allows for the use of much higher learning rates, significantly speeding up training. Furthermore, it reduces the dependence on careful initialization and can act as a form of regularization, sometimes reducing the need for other techniques like Dropout.
We will look into the precise calculations for the forward and backward passes, its behavior during testing (inference), and its various benefits in the following sections. For now, understand that Batch Normalization is a powerful tool inserted into the network architecture to regulate the internal statistics of activations and promote more stable and efficient training.