Having established how Batch Normalization (BN) works during the forward pass, normalizing inputs using mini-batch statistics and then scaling and shifting them, we now turn to the backward pass. To train the network with gradient descent, we need to compute how the loss $L$ changes with respect to the BN layer's inputs $x_i$ and its learnable parameters $\gamma$ and $\beta$. This involves applying the chain rule through the BN transformation.
Let's recall the forward pass for a single activation $x_i$ within a mini-batch $B = \{x_1, \dots, x_m\}$, with mini-batch mean $\mu_B = \frac{1}{m}\sum_{j=1}^{m} x_j$ and variance $\sigma_B^2 = \frac{1}{m}\sum_{j=1}^{m} (x_j - \mu_B)^2$:
Normalize the input:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
(where $\epsilon$ is a small constant for numerical stability)
Scale and shift:
$$y_i = \gamma \hat{x}_i + \beta$$
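As a concrete reference for the derivation below, here is a minimal sketch of this forward pass written with plain PyTorch tensor operations (the function and variable names are illustrative, not a framework API). It also returns the intermediate values that the backward pass will reuse:

import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (m, num_features); statistics are taken over the batch dimension
    mu = x.mean(dim=0)                        # mini-batch mean, mu_B
    var = x.var(dim=0, unbiased=False)        # biased mini-batch variance, sigma_B^2
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta                  # scale and shift
    return y, x_hat, mu, var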
During backpropagation, we receive the gradient of the loss with respect to the BN layer's output, $\frac{\partial L}{\partial y_i}$, from the subsequent layer. Our goal is to compute $\frac{\partial L}{\partial x_i}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \beta}$.
Gradients for Learnable Parameters (γ and β)
These are the most straightforward gradients to compute using the chain rule:
Gradient with respect to $\beta$: The parameter $\beta$ is added directly to each output $y_i$, so $\frac{\partial y_i}{\partial \beta} = 1$ and the incoming gradients simply sum over the mini-batch:
$$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}$$
Gradient with respect to $\gamma$: Since $\frac{\partial y_i}{\partial \gamma} = \hat{x}_i$, the gradient for $\gamma$ is the sum of the incoming gradients, each weighted by the corresponding normalized input $\hat{x}_i$:
$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \, \hat{x}_i$$
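In code, with dL_dy holding the upstream gradient $\frac{\partial L}{\partial y}$ for the whole mini-batch and x_hat cached from the forward pass (illustrative names again), each of these gradients is a single reduction over the batch dimension:

def bn_param_grads(dL_dy, x_hat):
    # dL_dy and x_hat both have shape (m, num_features)
    dL_dbeta = dL_dy.sum(dim=0)              # sum the upstream gradients over the batch
    dL_dgamma = (dL_dy * x_hat).sum(dim=0)   # weight each upstream gradient by x_hat
    return dL_dgamma, dL_dbeta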
Gradient with respect to the Input ($x_i$)
Calculating the gradient with respect to the input $x_i$ is more involved, because $x_i$ influences the output $y_i$ along multiple paths:
Directly through the numerator $(x_i - \mu_B)$ in $\hat{x}_i$.
Indirectly through the mini-batch mean $\mu_B$, which depends on all $x_j$ in the batch.
Indirectly through the mini-batch variance $\sigma_B^2$, which also depends on all $x_j$ (including $x_i$) and on $\mu_B$.
We need to apply the chain rule carefully, accounting for all of these paths. Let $\sigma_{B,\epsilon} = \sqrt{\sigma_B^2 + \epsilon}$ for brevity. The gradient computation proceeds backward through the operations:
Gradient w.r.t. the normalized input $\hat{x}_i$:
$$\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma$$
Gradients w.r.t. $\mu_B$ and $\sigma_B^2$: These require summing contributions from all $\hat{x}_j$ in the mini-batch, as both statistics affect every normalized input:
$$\frac{\partial L}{\partial \sigma_B^2} = \sum_{j=1}^{m} \frac{\partial L}{\partial \hat{x}_j} \cdot (x_j - \mu_B) \cdot \left(-\tfrac{1}{2}\right) (\sigma_B^2 + \epsilon)^{-3/2}$$
$$\frac{\partial L}{\partial \mu_B} = \left( \sum_{j=1}^{m} \frac{\partial L}{\partial \hat{x}_j} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{1}{m} \sum_{j=1}^{m} -2\,(x_j - \mu_B)$$
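These two sums translate directly into code. The sketch below assumes the upstream gradient dL_dxhat ($\frac{\partial L}{\partial \hat{x}}$) plus x, mu, and var from the forward-pass sketch above (illustrative names, not a framework API):

import torch

def bn_stat_grads(dL_dxhat, x, mu, var, eps=1e-5):
    # Gradients w.r.t. the shared mini-batch statistics; both are sums over the batch
    m = x.shape[0]
    dL_dvar = (dL_dxhat * (x - mu) * -0.5 * (var + eps) ** -1.5).sum(dim=0)
    dL_dmu = (dL_dxhat * (-1.0 / torch.sqrt(var + eps))).sum(dim=0) \
             + dL_dvar * (-2.0 / m) * (x - mu).sum(dim=0)
    return dL_dmu, dL_dvar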
Combining everything and simplifying (the full derivation is quite detailed and is often found in the appendices of papers or textbooks), the result can be expressed more compactly. A common form is:
$$\frac{\partial L}{\partial x_i} = \frac{\gamma}{m \, \sigma_{B,\epsilon}} \left( m \frac{\partial L}{\partial y_i} - \sum_{j=1}^{m} \frac{\partial L}{\partial y_j} - \hat{x}_i \sum_{j=1}^{m} \frac{\partial L}{\partial y_j} \, \hat{x}_j \right)$$
The main takeaway is that the gradient $\frac{\partial L}{\partial x_i}$ depends not only on the gradient $\frac{\partial L}{\partial y_i}$ for that specific activation, but also on the gradients and values of all other activations in the mini-batch ($j = 1, \dots, m$) through the shared mean and variance calculations.
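Because the compact form is easy to get subtly wrong, the following sketch (again with illustrative names, not a framework API) implements it per feature and checks it numerically against PyTorch's autograd:

import torch

def batchnorm_backward_input(dL_dy, x_hat, gamma, var, eps=1e-5):
    # Compact form of dL/dx: all three terms share the factor 1 / (m * sqrt(var + eps))
    m = dL_dy.shape[0]
    dL_dxhat = dL_dy * gamma                              # dL/dx_hat = dL/dy * gamma
    inv_std = 1.0 / torch.sqrt(var + eps)                 # 1 / sigma_{B,eps}
    return (inv_std / m) * (m * dL_dxhat
                            - dL_dxhat.sum(dim=0)
                            - x_hat * (dL_dxhat * x_hat).sum(dim=0))

# Numerical check against autograd on a small random batch
eps = 1e-5
x = torch.randn(4, 10, requires_grad=True)
gamma, beta = torch.ones(10), torch.zeros(10)
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)                        # biased variance, as in BN
x_hat = (x - mu) / torch.sqrt(var + eps)
y = gamma * x_hat + beta
upstream = torch.randn_like(y)                            # arbitrary dL/dy from the next layer
y.backward(upstream)
manual = batchnorm_backward_input(upstream, x_hat.detach(), gamma, var.detach())
print(torch.allclose(x.grad, manual, atol=1e-5))          # expected: True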
Visualizing the Gradient Flow
The dependencies during the backward pass can be visualized. Consider the computation graph for a single output $y_i$ and how the loss gradient flows back to the input $x_i$, incorporating the influence of the shared $\mu_B$ and $\sigma_B^2$.
This diagram illustrates the dependencies in the Batch Normalization calculations and the flow of gradients during backpropagation. Notice how the input $x_i$ receives gradient contributions directly from $\hat{x}_i$ and indirectly through the mini-batch statistics $\mu_B$ and $\sigma_B^2$.
Implementation in Frameworks
Fortunately, you rarely need to implement this backward pass manually. Deep learning frameworks like PyTorch and TensorFlow use automatic differentiation (autograd) to compute these gradients automatically when you define a model with BN layers. For example, in PyTorch:
import torch
import torch.nn as nn
# Example setup
batch_size = 4
num_features = 10
input_tensor = torch.randn(batch_size, num_features, requires_grad=True)
# Define a Batch Norm layer (affine=True means learnable gamma and beta)
bn_layer = nn.BatchNorm1d(num_features=num_features, affine=True)
# Forward pass
output = bn_layer(input_tensor)
# Assume some dummy loss for demonstration
loss = output.mean()
# Backward pass
loss.backward()
# Gradients are now computed and stored
# Gradient w.r.t input: input_tensor.grad
# Gradient w.r.t gamma (weight): bn_layer.weight.grad
# Gradient w.r.t beta (bias): bn_layer.bias.grad
print("Shape of input gradient:", input_tensor.grad.shape)
print("Shape of gamma gradient:", bn_layer.weight.grad.shape)
print("Shape of beta gradient:", bn_layer.bias.grad.shape)
# >>> Shape of input gradient: torch.Size([4, 10])
# >>> Shape of gamma gradient: torch.Size([10])
# >>> Shape of beta gradient: torch.Size([10])
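Continuing from the snippet above, the computed gradients can be tied back to the formulas derived earlier. Because the dummy loss is output.mean(), every upstream gradient $\frac{\partial L}{\partial y_i}$ equals 1 / (batch_size * num_features), so summing over the batch should give a $\beta$ gradient of batch_size / (batch_size * num_features) = 0.1 for every feature:

# Sanity check: dL/dbeta is the per-feature sum of the constant upstream gradient
expected_beta_grad = torch.full((num_features,), batch_size / (batch_size * num_features))
print(torch.allclose(bn_layer.bias.grad, expected_beta_grad))
# >>> True

By the same reasoning, bn_layer.weight.grad comes out numerically close to zero here, because the $\gamma$ gradient sums $\hat{x}_i$ values that have zero mean over the batch.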
While the framework handles the mechanics, understanding the underlying calculations, especially the dependence of the input gradient $\frac{\partial L}{\partial x_i}$ on the entire mini-batch, is valuable for interpreting model behavior and diagnosing issues during training. It also helps explain why BN affects training dynamics and generalization performance, which we will discuss next.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive and authoritative textbook covering the theoretical foundations and practical aspects of deep learning, including a discussion of Batch Normalization in the context of optimization.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alex Smola, 2023 (Cambridge University Press) - An interactive, open-source textbook that provides detailed explanations and step-by-step mathematical derivations for various deep learning components, including a dedicated section on Batch Normalization's forward and backward passes.