The U-Net architecture, originally developed for biomedical image segmentation, has become a workhorse for diffusion models because it handles image-like data well, where spatial structure matters. In the context of diffusion, the U-Net typically serves as the core neural network ϵ_θ that predicts the noise added to an image x_t at a given timestep t.
A standard U-Net consists of two main paths connected by a bottleneck:
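Encoder (Contracting Path): This path follows a typical convolutional neural network structure. It progressively reduces the spatial resolution of the input while increasing the number of feature channels. Each stage usually consists of convolutional layers with nonlinear activations, followed by a downsampling operation such as pooling or a strided convolution.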
Bottleneck: This is the lowest resolution layer connecting the encoder and decoder paths. It typically consists of convolutional layers that process the highly compressed feature representation.
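Decoder (Expanding Path): This path mirrors the encoder. It gradually increases the spatial resolution while decreasing the number of feature channels. Each stage typically involves an upsampling operation (such as a transposed convolution or interpolation), concatenation with the encoder features from the matching resolution, and convolutional layers that process the combined features.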
Skip Connections: These direct links between encoder and decoder stages at the same spatial resolution are fundamental. They allow the decoder to access high-resolution features from the encoder that might be lost during downsampling. This is especially important for diffusion models, which need to generate fine details by accurately predicting the noise pattern across all spatial locations.
Final Output Layer: A final convolution (often 1x1) maps the feature channels from the last decoder stage to the desired output shape, which usually matches the input image dimensions (e.g., 3 channels for RGB image noise prediction).
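To make the components above concrete, here is a minimal PyTorch sketch of a two-level U-Net. It is an illustration under stated assumptions, not a reference implementation: the names `TinyUNet` and `conv_block`, the channel widths, the SiLU activations, max pooling for downsampling, and transposed convolutions for upsampling are all choices made for this example. Timestep conditioning is deliberately left out.

```python
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions; padding=1 keeps the spatial size unchanged."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.SiLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.SiLU(),
    )


class TinyUNet(nn.Module):
    """Hypothetical two-level U-Net (no timestep input, for illustration only)."""

    def __init__(self, in_ch: int = 3, base_ch: int = 32):
        super().__init__()
        # Encoder: each stage is followed by 2x downsampling.
        self.enc1 = conv_block(in_ch, base_ch)
        self.enc2 = conv_block(base_ch, base_ch * 2)
        self.down = nn.MaxPool2d(2)
        # Bottleneck: lowest resolution, most channels.
        self.mid = conv_block(base_ch * 2, base_ch * 4)
        # Decoder: upsample, concatenate the skip features, then convolve.
        self.up2 = nn.ConvTranspose2d(base_ch * 4, base_ch * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base_ch * 4, base_ch * 2)  # 2C (upsampled) + 2C (skip)
        self.up1 = nn.ConvTranspose2d(base_ch * 2, base_ch, kernel_size=2, stride=2)
        self.dec1 = conv_block(base_ch * 2, base_ch)      # C (upsampled) + C (skip)
        # Final 1x1 convolution maps back to the input channel count.
        self.out = nn.Conv2d(base_ch, in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.enc1(x)                          # (B, C,  H,   W)
        s2 = self.enc2(self.down(s1))              # (B, 2C, H/2, W/2)
        h = self.mid(self.down(s2))                # (B, 4C, H/4, W/4)
        h = self.dec2(torch.cat([self.up2(h), s2], dim=1))  # (B, 2C, H/2, W/2)
        h = self.dec1(torch.cat([self.up1(h), s1], dim=1))  # (B, C,  H,   W)
        return self.out(h)                         # (B, in_ch, H, W)


model = TinyUNet()
x = torch.randn(2, 3, 32, 32)   # height and width must be divisible by 4 here
y = model(x)
print(y.shape)                  # same shape as the input
```

Note how the decoder blocks take twice the channels produced by the upsampling layer: the extra half comes from the concatenated skip features. Practical diffusion U-Nets add more levels, normalization, residual blocks, and timestep conditioning.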
In a standard diffusion setup (like DDPM), the U-Net ϵ_θ(x_t, t) takes the noisy image x_t and the current timestep t as input. Its objective is to predict the noise ϵ that was added to the original clean image x_0 to produce x_t according to the forward diffusion process schedule. The output of the U-Net is a tensor representing this predicted noise, with the same spatial dimensions and channel count as the input x_t.
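The noise-prediction objective described above can be sketched in a few lines of PyTorch. This is a hedged illustration, not a full training loop: it assumes the DDPM closed-form forward process x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ϵ with a linear beta schedule over 1000 steps, and the simple mean-squared-error loss between true and predicted noise. The names `q_sample`, `noise_prediction_loss`, and `PlaceholderModel` are hypothetical, and the placeholder network is a single convolution that ignores t, standing in for a real U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha-bar_t for each t


def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t from x_0 in one step using the closed-form forward process."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)     # broadcast over (C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise


def noise_prediction_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """MSE between the noise that was added and the noise the model predicts."""
    t = torch.randint(0, T, (x0.shape[0],))         # one random timestep per image
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)


class PlaceholderModel(nn.Module):
    """Stand-in for the U-Net: one convolution, timestep ignored (assumption)."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.conv(x_t)                       # output shape equals input shape


x0 = torch.randn(4, 3, 16, 16)                      # stand-in batch of clean images
loss = noise_prediction_loss(PlaceholderModel(), x0)
print(loss.item())
```

The only requirement this places on the network is the one stated in the text: its output must have the same shape as x_t so it can be compared element-wise with the sampled noise.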
A simplified diagram of the U-Net architecture commonly used in diffusion models. Arrows indicate data flow; dashed lines represent skip connections that concatenate features from the encoder with the corresponding decoder stages. The label t indicates that timestep information is typically incorporated, though the mechanism will be detailed later.
The U-Net's structure is well-suited for the noise prediction task in diffusion models for several reasons:
While this standard U-Net forms a solid foundation, its performance and capabilities within diffusion models can be significantly enhanced by incorporating attention mechanisms, refining the integration of timestep and conditioning information, and adopting architectural variations for better efficiency and stability, as we will examine in the following sections.
© 2025 ApX Machine Learning