The effectiveness of diffusion models hinges significantly on the neural network architecture used to predict the noise (or score) at each step of the reverse process. Given a noisy input xt and the timestep t, the network, denoted ϵθ, must learn to estimate the noise ϵ that was added to reach xt from xt−1 (or more commonly, from the original x0 as used in the simplified loss function). This task requires an architecture capable of processing spatial information effectively while being conditioned on the time variable t. The predominant architecture choice for this purpose, particularly in image generation, is the U-Net.
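Concretely, ϵθ is just a function of (xt, t) trained with a mean-squared error against the true noise. The sketch below illustrates this simplified objective in PyTorch; the linear beta schedule, tensor shapes, and the stand-in model are illustrative assumptions, not a prescribed setup:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # toy linear noise schedule (assumption)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative product, i.e. alpha-bar_t

def simplified_loss(eps_theta, x0):
    """Simplified DDPM objective: E || eps - eps_theta(x_t, t) ||^2."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                # sample a random timestep per example
    eps = torch.randn_like(x0)                   # the noise the network must recover
    ab = alpha_bars[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward process in closed form from x0
    return ((eps - eps_theta(x_t, t)) ** 2).mean()

# Stand-in "network" that ignores t, just to exercise the loss:
loss = simplified_loss(lambda x, t: torch.zeros_like(x), torch.randn(2, 3, 8, 8))
```

With the zero predictor, the loss reduces to the mean squared norm of the sampled noise; a trained U-Net would replace the lambda.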
The U-Net Architecture for Noise Prediction
The U-Net architecture was originally developed for biomedical image segmentation but has proven exceptionally well-suited for the conditional generation task within diffusion models. Its structure facilitates the handling of image-like data where the output (predicted noise) must have the same spatial dimensions as the input (noisy image).
Key characteristics make the U-Net suitable:
- Encoder-Decoder Structure: The U-Net consists of a downsampling path (encoder) that captures contextual information at progressively lower spatial resolutions but with increasing feature channels, and an upsampling path (decoder) that gradually reconstructs the spatial resolution.
- Skip Connections: Crucially, skip connections link feature maps from the encoder path directly to corresponding layers in the decoder path. These connections allow the decoder to access high-resolution features from the encoder, which are essential for accurately predicting fine-grained details in the noise pattern and ultimately generating sharp, detailed images. Without them, much of the spatial information lost during downsampling would be difficult to recover.
- Multi-Scale Processing: The hierarchical nature of the U-Net allows it to process information at multiple scales simultaneously, effectively capturing both local textures and global structures within the image data.
A typical U-Net implementation for diffusion models involves:
- Downsampling Path: A series of convolutional blocks (often including convolutions, normalization layers, and activation functions) interspersed with downsampling operations (like max pooling or strided convolutions).
- Bottleneck: One or more convolutional blocks at the lowest spatial resolution, connecting the encoder and decoder.
- Upsampling Path: A series of blocks involving upsampling operations (like transposed convolutions or bilinear/nearest neighbor upsampling followed by a convolution), concatenation of features from the corresponding skip connection, and further convolutional layers.
- Output Layer: A final convolutional layer (e.g., a 1×1 convolution) maps the features from the last decoder stage to the desired output shape, which is the predicted noise ϵθ, having the same dimensions as the input xt.
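Putting these four pieces together, a deliberately tiny PyTorch sketch might look as follows; the single resolution level and the channel widths are illustrative choices, and timestep conditioning is omitted here for brevity:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net sketch: one downsampling level, a bottleneck, one
    upsampling level with a skip connection, and a 1x1 output convolution.
    Depth and widths are illustrative, not a recommended configuration."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.SiLU())
        self.down = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)   # strided-conv downsample
        self.mid = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.SiLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)  # upsample
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.SiLU())
        self.out = nn.Conv2d(base, in_ch, 1)   # 1x1 conv -> predicted noise, same shape as input

    def forward(self, x):
        h = self.enc(x)                        # high-resolution encoder features
        m = self.mid(self.down(h))             # bottleneck at half resolution
        u = torch.cat([self.up(m), h], dim=1)  # skip connection: concatenate encoder features
        return self.out(self.dec(u))

x = torch.randn(1, 3, 32, 32)
eps_pred = TinyUNet()(x)   # same spatial shape as the input
```

Real implementations stack several such levels (each halving the resolution and widening the channels) and add the timestep conditioning and attention discussed below.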
Diagram illustrating the U-Net architecture commonly used in diffusion models. Input xt goes through an encoder path (blue), a bottleneck (green), and a decoder path (red) to produce the predicted noise ϵθ. Skip connections (dashed lines) pass feature maps from the encoder to corresponding layers in the decoder. H, W, C are height, width, channels; F is the base number of features.
Incorporating Timestep Information
The diffusion process evolves over time t, and the characteristics of the noise to be predicted change accordingly (e.g., higher noise levels at larger t). Therefore, the U-Net must be conditioned on the current timestep t. A common and effective method is to use timestep embeddings.
Inspired by positional embeddings used in Transformers, the discrete timestep t (ranging from 0 to T) is first transformed into a high-dimensional vector representation. The standard approach uses sinusoidal embeddings:
PE(t, 2i) = sin(t / 10000^(2i/demb))
PE(t, 2i+1) = cos(t / 10000^(2i/demb))
where demb is the dimension of the embedding vector, and i indexes the components of the vector (0≤i<demb/2). This fixed embedding provides a unique representation for each timestep that the network can learn to interpret. Often, this sinusoidal embedding is further processed through a small multi-layer perceptron (MLP) before being integrated into the U-Net.
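A possible PyTorch implementation of this embedding is sketched below. As in many implementations, it concatenates all sine components followed by all cosine components rather than interleaving them at even/odd indices; both conventions give each timestep a unique representation:

```python
import math
import torch

def timestep_embedding(t, d_emb):
    """Sinusoidal timestep embedding (sketch).
    t: 1-D tensor of integer timesteps; returns a tensor of shape (len(t), d_emb)."""
    half = d_emb // 2
    # Frequencies 1 / 10000^(2i / d_emb) = 10000^(-i / half) for i = 0..half-1
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]      # (len(t), half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 100, 999]), 128)   # shape (3, 128)
```

In practice this output would then pass through the small MLP mentioned above before being injected into the U-Net blocks.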
How is this time embedding vector emb(t) used within the U-Net?
- Addition/Concatenation: A simple method is to broadcast the embedding vector spatially and add or concatenate it to the feature maps at various points within the network, typically within the residual blocks of the U-Net.
- Adaptive Group Normalization (AdaGN): The embedding can be used to predict scale and shift parameters applied after normalization layers (like Group Normalization), effectively modulating the activations based on the timestep.
- FiLM Layers: Feature-wise Linear Modulation layers use the time embedding to predict a scale (γ) and bias (β) vector per feature channel. These are then applied element-wise to the feature map h: γ(emb(t))⊙h+β(emb(t)). This allows the timestep to exert fine-grained control over the network's processing.
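A minimal sketch of a FiLM layer in PyTorch, assuming the time embedding emb(t) has already been computed, might look like:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """FiLM conditioning (sketch): predict a per-channel scale (gamma) and
    shift (beta) from the time embedding and apply them to a feature map."""
    def __init__(self, emb_dim, channels):
        super().__init__()
        self.proj = nn.Linear(emb_dim, 2 * channels)   # -> gamma and beta, stacked

    def forward(self, h, emb):
        gamma, beta = self.proj(emb).chunk(2, dim=-1)  # each (B, C)
        # Broadcast the (B, C) vectors over the spatial dims of (B, C, H, W)
        return gamma[:, :, None, None] * h + beta[:, :, None, None]

film = FiLM(emb_dim=128, channels=64)
h = torch.randn(2, 64, 16, 16)
out = film(h, torch.randn(2, 128))   # same shape as h, modulated per channel
```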
Attention Mechanisms
While convolutional layers excel at capturing local patterns, they can struggle with long-range dependencies, especially in high-resolution images. To address this, self-attention mechanisms are often incorporated into the U-Net architecture for diffusion models.
- Self-Attention Layers: These layers allow different spatial locations (pixels or patches) in the feature maps to attend to each other, enabling the model to capture relationships between distant parts of the image. They are typically inserted at lower-resolution stages of the U-Net (e.g., 16×16 or 8×8 feature map resolutions) where the computational cost is more manageable. Adding attention helps the model generate more globally coherent structures.
- Cross-Attention: For conditional diffusion models (e.g., text-to-image), cross-attention layers are used. Here, the image feature maps supply the queries, which attend to the conditioning information, such as text embeddings, acting as keys and values. This allows the generation process to be guided by the conditioning input: cross-attention is a standard way to inject conditioning vectors into the network, and it combines naturally with sampling techniques such as classifier-free guidance.
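A self-attention layer over spatial positions can be sketched with PyTorch's built-in multi-head attention; the head count, normalization placement, and residual wiring here are illustrative choices:

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over the spatial positions of a feature map (sketch).
    Flattens HxW into a sequence, applies multi-head attention, reshapes back."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        b, c, hgt, wid = x.shape
        seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per position
        out, _ = self.attn(*[self.norm(seq)] * 3)  # every position attends to every other
        return x + out.transpose(1, 2).view(b, c, hgt, wid)  # residual connection

x = torch.randn(1, 64, 8, 8)    # attention is typically applied at low resolutions like 8x8
y = SpatialSelfAttention(64)(x)
```

For cross-attention, the same module would instead take the image sequence as the query and a conditioning sequence (e.g., text embeddings) as the keys and values.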
Other Architectural Details
Several other design choices contribute to the performance of U-Nets in diffusion models:
- Activation Functions: Standard ReLU activations are often replaced with smoother alternatives like Swish (also called SiLU) and GELU, which have been found to perform well in deep generative models.
- Normalization: Group Normalization is frequently preferred over Batch Normalization. Batch Normalization's statistics depend on the batch, which can be problematic as the statistics of noisy images xt change significantly with the timestep t. Group Normalization computes statistics independently for each sample over groups of channels, making it more stable in this context.
- Residual Connections: Within each convolutional block of the U-Net (in addition to the main skip connections), residual connections are commonly used. These help stabilize training and allow for deeper networks by mitigating vanishing gradients.
- Network Size: The depth of the U-Net, the number of feature channels at each level, and the number of attention heads are important hyperparameters. Larger models generally yield better sample quality but require significantly more computational resources and data for training.
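Several of these choices (Group Normalization, SiLU activations, inner residual connections, and time-embedding injection) typically meet inside a single residual block. A sketch, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block sketch combining GroupNorm, SiLU, a time-embedding
    injection, and an inner residual connection."""
    def __init__(self, channels, emb_dim, groups=8):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, channels)   # per-sample statistics, stable across t
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, channels)  # time embedding -> per-channel bias
        self.norm2 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, emb):
        h = self.conv1(self.act(self.norm1(x)))
        h = h + self.emb_proj(emb)[:, :, None, None]  # broadcast over H and W
        h = self.conv2(self.act(self.norm2(h)))
        return x + h                                  # inner residual connection

block = ResBlock(channels=32, emb_dim=128)
out = block(torch.randn(2, 32, 16, 16), torch.randn(2, 128))
```

Injecting the embedding as an additive bias is the simplest of the conditioning options above; swapping the addition for a FiLM-style scale and shift is a common refinement.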
In summary, the U-Net architecture, enhanced with timestep embeddings, skip connections, residual blocks, suitable normalization and activation functions, and often self-attention mechanisms, provides a powerful and flexible framework for the core noise (or score) prediction task in diffusion models. While variations and alternative architectures are subjects of ongoing research, this adapted U-Net remains the cornerstone for many state-of-the-art diffusion models, particularly in the image domain. The hands-on practical section later in this chapter will guide you through implementing a basic version of this architecture.