Proper normalization is essential for stable and efficient training of deep neural networks, and diffusion models are no exception. While Batch Normalization (BatchNorm) is ubiquitous in many vision tasks, its reliance on batch statistics can be problematic for diffusion models due to:

- Small batch sizes: high-resolution diffusion training is memory intensive, so the per-device batch is often too small for reliable batch statistics.
- Heterogeneous batches: the samples in a batch are noised to different timesteps, so batch statistics average over very different feature distributions.
- Train/inference mismatch: running statistics accumulated during training may not match the activations seen during iterative sampling, where the effective batch can be a single image.
Therefore, alternative normalization techniques that are independent of the current batch are preferred within the U-Net architectures used for diffusion. Two prominent methods are Group Normalization and Adaptive Layer Normalization.
Group Normalization (GroupNorm) offers a middle ground between Layer Normalization (which normalizes across all channels) and Instance Normalization (which normalizes each channel independently). It operates by dividing the channels into a predefined number of groups and computing the mean and variance for normalization within each group.
Let x be the input feature map with shape (N,C,H,W), where N is the batch size, C is the number of channels, and H,W are spatial dimensions. GroupNorm divides the C channels into G groups, each containing C/G channels (assuming C is divisible by G). Normalization is then applied independently to each group across the (C/G,H,W) dimensions.
The mean $\mu_g$ and standard deviation $\sigma_g$ for a group $g$ (computed separately for each sample $n$) are calculated as:

$$
\mu_g = \frac{1}{(C/G)HW} \sum_{k \in S_g} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{n,k,h,w}
$$

$$
\sigma_g = \sqrt{\frac{1}{(C/G)HW} \sum_{k \in S_g} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x_{n,k,h,w} - \mu_g\right)^2 + \epsilon}
$$

where $S_g$ is the set of channel indices belonging to group $g$, and $\epsilon$ is a small constant for numerical stability. The normalized output $\hat{x}$ is then:

$$
\hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_g}{\sigma_g}
$$

where channel $c$ belongs to group $g$. Finally, learnable scale ($\gamma$) and shift ($\beta$) parameters are applied per channel:

$$
y_{n,c,h,w} = \gamma_c \, \hat{x}_{n,c,h,w} + \beta_c
$$

The key advantage of GroupNorm is its independence from the batch size $N$. Its statistics are computed only over channel groups and spatial dimensions for each sample independently, making it robust to the small or fluctuating batch sizes common in diffusion model training. It is typically inserted within residual blocks, often before the non-linear activation function. The number of groups $G$ is a hyperparameter, often set to a value like 32.
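To make the computation concrete, here is a minimal sketch (assuming PyTorch) that reproduces the GroupNorm statistics by reshaping the channels into groups, then checks the result against `torch.nn.GroupNorm` with its affine parameters at their default initialization ($\gamma = 1$, $\beta = 0$):

```python
import torch
import torch.nn as nn

N, C, H, W, G = 2, 8, 4, 4, 4      # C must be divisible by G
eps = 1e-5
x = torch.randn(N, C, H, W)

# Reshape channels into G groups of C // G channels, then compute the
# mean and variance per sample and per group over (C // G, H, W).
xg = x.view(N, G, C // G, H, W)
mu = xg.mean(dim=(2, 3, 4), keepdim=True)
var = xg.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
x_hat = ((xg - mu) / torch.sqrt(var + eps)).view(N, C, H, W)

# Per-channel affine parameters gamma_c and beta_c (defaults: 1 and 0).
gamma = torch.ones(1, C, 1, 1)
beta = torch.zeros(1, C, 1, 1)
y_manual = gamma * x_hat + beta

# Reference implementation for comparison.
gn = nn.GroupNorm(num_groups=G, num_channels=C, eps=eps)
y_ref = gn(x)

print(torch.allclose(y_manual, y_ref, atol=1e-5))  # True
```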
Data flow for Group Normalization, highlighting the grouping of channels before computing statistics.
While GroupNorm provides stable normalization, diffusion models often benefit from conditional normalization, where the normalization process itself is modulated by conditioning information, most importantly the diffusion timestep t. Adaptive Layer Normalization (AdaLN), often used in the form of AdaLN-Zero, achieves this.
AdaLN builds upon Layer Normalization (LayerNorm), which normalizes features across all channels for a given sample. The core idea is to make the learnable scale ($\gamma$) and shift ($\beta$) parameters functions of the conditioning input, such as a timestep embedding $e_t$.
Standard LayerNorm applies:
$$
y = \gamma \, \frac{x - \mu}{\sigma} + \beta
$$

where $\mu$ and $\sigma$ are computed across all channels $C$ (and across spatial dimensions if applied to feature maps).
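As a quick point of comparison with the GroupNorm sketch above, the per-sample LayerNorm statistics for a feature map can be computed over all channel and spatial positions (a minimal sketch, again assuming PyTorch):

```python
import torch

x = torch.randn(2, 8, 4, 4)                # (N, C, H, W) feature map
eps = 1e-5

# One mean and variance per sample, taken over all channels and positions.
mu = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + eps)   # gamma and beta are then applied elementwise
```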
In AdaLN, the scale and shift are predicted from the conditioning embedding, for example by a small MLP applied to the timestep embedding $e_t$:

$$
(\gamma_t, \beta_t) = \mathrm{MLP}(e_t), \qquad y = \gamma_t \odot \hat{x} + \beta_t
$$

where $\hat{x}$ is the LayerNorm-normalized input.
A common variant, particularly in diffusion models, is AdaLN-Zero. This initializes the modulation parameters such that they initially have no effect, promoting stability early in training. The output block applying the modulation is often structured as:
$$
y = \hat{x} \odot (1 + \gamma_t) + \beta_t
$$

Here, the linear layer predicting $\gamma_t$ and $\beta_t$ is initialized to output zeros. Thus, initially $y = \hat{x}$, and the network learns to modulate the features gradually as training progresses.
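A minimal AdaLN-Zero module along these lines might look like the following sketch (assuming PyTorch; the class name and argument names are illustrative, not taken from a specific library). The final linear layer is zero-initialized so the module starts out as plain LayerNorm:

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """LayerNorm whose scale and shift are predicted from a conditioning embedding."""

    def __init__(self, num_features: int, cond_dim: int):
        super().__init__()
        # LayerNorm without its own affine parameters; the affine transform
        # comes entirely from the conditioning embedding.
        self.norm = nn.LayerNorm(num_features, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_features)
        # Zero-init: gamma_t = beta_t = 0 at the start, so initially y = x_hat.
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (N, ..., num_features), cond: (N, cond_dim)
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        x_hat = self.norm(x)
        # Broadcast gamma and beta over any intermediate dimensions of x.
        while gamma.dim() < x_hat.dim():
            gamma, beta = gamma.unsqueeze(1), beta.unsqueeze(1)
        return x_hat * (1 + gamma) + beta
```

For example, `AdaLNZero(num_features=256, cond_dim=512)` applied to a `(batch, tokens, 256)` tensor and a `(batch, 512)` conditioning embedding returns a tensor with the same shape as the input.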
AdaLN (or AdaLN-Zero) is particularly effective in diffusion U-Nets because it allows the network to tailor its feature processing based on the noise level (timestep t). It's frequently used within residual blocks, often applied after the main convolution and normalization (like GroupNorm), specifically controlling the output of the residual branch before it's added back to the identity connection.
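To illustrate that placement, below is a hedged sketch of a residual block (assuming PyTorch and 2D feature maps) in which GroupNorm provides batch-independent normalization and a zero-initialized linear layer predicts a per-channel scale and shift from the timestep embedding to modulate the residual branch. The class name `ResBlockAdaGN` and the argument `time_dim` are illustrative, not from a particular codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlockAdaGN(nn.Module):
    """Residual block combining GroupNorm with timestep-conditioned scale/shift."""

    def __init__(self, channels: int, time_dim: int, groups: int = 32):
        super().__init__()
        # channels must be divisible by groups
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Second norm has no affine parameters; the timestep embedding supplies them.
        self.norm2 = nn.GroupNorm(groups, channels, affine=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Predict per-channel (scale, shift) from the timestep embedding, zero-initialized.
        self.to_scale_shift = nn.Linear(time_dim, 2 * channels)
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(F.silu(self.norm1(x)))
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        # Broadcast the (N, C) parameters over the spatial dimensions.
        h = self.norm2(h) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = self.conv2(F.silu(h))
        return x + h  # modulated residual branch added back to the identity path
```

With the zero initialization, the modulation starts as a no-op, so early in training the block behaves like a standard GroupNorm residual block.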
Data flow for Adaptive Layer Normalization (AdaLN-Zero variant). The conditioning embedding is used to predict scale and shift parameters that modulate the normalized features.
In advanced U-Net architectures for diffusion, the two techniques are typically combined:

- GroupNorm provides batch-size-independent normalization inside residual (and attention) blocks, keeping activations stable.
- AdaLN (usually the AdaLN-Zero variant) modulates the normalized features with scale and shift parameters predicted from the timestep embedding and any other conditioning signals.

Using both GroupNorm for general stability and AdaLN for conditional modulation provides a powerful combination for building high-performance diffusion models. The exact placement and configuration (the number of groups for GroupNorm, the architecture of the MLP predicting the AdaLN parameters) are design choices that can affect model performance and training dynamics.