To effectively reverse the diffusion process, we need a model capable of estimating the noise that was added to an image x0 to produce a noisy version xt at a specific timestep t. Given xt and t, the model's task is to predict the noise ϵ that was sampled from a Gaussian distribution during the forward process for that step. The standard neural network architecture chosen for this task, particularly in image generation, is the U-Net.
Originally developed for biomedical image segmentation, the U-Net architecture has proven remarkably effective for diffusion models. Its design is well-suited for tasks where the input and output share the same spatial dimensions (like an image and its corresponding noise map) and where preserving fine spatial details while also considering broader context is important.
The U-Net gets its name from its characteristic U-shape when visualized. It consists of three main parts:

1. The Encoder (downsampling path)
2. The Bottleneck
3. The Decoder (upsampling path)

Crucially, it also employs Skip Connections that bridge corresponding layers between the downsampling and upsampling paths. Let's examine each part:
The encoder path functions like a typical convolutional neural network used for classification or feature extraction. It takes the input (the noisy image xt, along with timestep information t, which we'll discuss how to integrate in the next section) and processes it through a series of levels. Each level typically consists of:

- One or more convolutional layers (commonly two 3x3 convolutions, each followed by a nonlinearity) that extract features at the current resolution.
- A downsampling operation (such as 2x2 max pooling or a strided convolution) that halves the spatial dimensions.
The purpose of the encoder is to gradually reduce spatial resolution while increasing the semantic complexity of the learned features. By downsampling, the network gains a larger receptive field in deeper layers, allowing it to capture contextual information from a wider area of the input image. This is necessary for understanding the overall structure and content, which helps in predicting the appropriate noise pattern.
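As a concrete sketch of one encoder level (assuming PyTorch; the channel counts, SiLU activation, and class name are illustrative choices, not a specific published implementation):

```python
import torch
import torch.nn as nn

class EncoderLevel(nn.Module):
    """One downsampling level: two 3x3 convs, then 2x2 max pooling.
    Returns both the pre-pool features (saved for the skip connection)
    and the pooled features (passed to the next, deeper level)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.convs(x)          # same spatial size as the input
        return skip, self.pool(skip)  # pooled output: half the spatial size

level = EncoderLevel(3, 64)
skip, down = level(torch.randn(1, 3, 64, 64))
print(skip.shape, down.shape)
# torch.Size([1, 64, 64, 64]) torch.Size([1, 64, 32, 32])
```

Note that each level produces two outputs: the full-resolution features are kept for the matching decoder level, while the pooled features continue down the encoder.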
This is the lowest point in the 'U' shape, connecting the encoder and decoder paths. It typically consists of one or more convolutional layers. The bottleneck represents the input image in a highly compressed, low-spatial-resolution, high-level feature representation. It captures the most salient, abstract information learned by the encoder.
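Continuing the sketch (again with illustrative channel counts), the bottleneck is simply another convolutional block operating at the lowest resolution; it mixes features without changing the spatial size:

```python
import torch
import torch.nn as nn

# Bottleneck: convolutions at the lowest spatial resolution. No further
# down- or upsampling happens here, only feature mixing at high channel depth.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv2d(512, 512, kernel_size=3, padding=1),
    nn.SiLU(),
)

x = torch.randn(1, 256, 8, 8)  # e.g., the deepest encoder output for a 64x64 input
y = bottleneck(x)
print(y.shape)  # torch.Size([1, 512, 8, 8])
```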
The decoder path works to gradually increase the spatial resolution of the feature maps back to the original input size, ultimately producing the predicted noise map ϵθ. Each level in the decoder typically involves:

- An upsampling operation (such as a transposed convolution, or interpolation followed by a convolution) that doubles the spatial dimensions.
- Concatenation with the feature map passed over from the corresponding encoder level via a skip connection.
- One or more convolutional layers that merge and refine the combined features.
The decoder essentially learns to reconstruct the detailed noise map by progressively combining the high-level information passed up from the bottleneck with the fine-grained, high-resolution features provided by the skip connections from the encoder.
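A matching decoder level can be sketched as follows (assuming PyTorch; a transposed convolution is used for upsampling here, though interpolation plus a convolution is an equally common choice):

```python
import torch
import torch.nn as nn

class DecoderLevel(nn.Module):
    """One upsampling level: transposed-conv upsample, concatenation with
    the encoder's skip features at the same resolution, then two 3x3 convs."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # double the spatial size
        x = torch.cat([x, skip], dim=1)  # channel-wise concat with skip features
        return self.convs(x)

level = DecoderLevel(in_ch=128, skip_ch=64, out_ch=64)
out = level(torch.randn(1, 128, 16, 16),  # features from the level below
            torch.randn(1, 64, 32, 32))   # skip features from the encoder
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The concatenation along the channel dimension is where the skip connection enters: it requires the encoder and decoder feature maps to share the same spatial resolution, which is exactly what "corresponding layers" means.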
Skip connections are direct links that pass feature maps from layers in the downsampling path (encoder) to corresponding layers in the upsampling path (decoder). "Corresponding" usually means layers with the same spatial resolution.
Why are these so important for noise prediction in diffusion models?

- Preserving fine detail: the predicted noise map must be accurate at the pixel level, and the encoder's early, high-resolution feature maps carry exactly the fine-grained spatial information that downsampling discards.
- Easing optimization: skip connections provide shorter paths for gradients to flow from the output back to early layers, which helps deep networks train stably.
Without skip connections, the decoder would only receive information from the highly compressed bottleneck representation, making it very difficult to reconstruct a precise, pixel-level noise map that respects the original image structure.
Diagram illustrating the U-Net structure. Arrows indicate the flow of data. The encoder path progressively reduces spatial dimensions while the decoder path increases them. Skip connections (dashed violet lines) transfer high-resolution features from the encoder to the decoder.
In summary, the U-Net architecture provides an effective combination of contextual understanding (through the encoder and bottleneck) and precise spatial localization (enabled by the decoder and skip connections). This makes it highly suitable for the task of predicting a noise map ϵθ that has the same dimensions as the input noisy image xt and accurately reflects the noise pattern corresponding to the image content and the noise level indicated by timestep t. The final layer of the U-Net is typically a convolution (e.g., 1x1 or 3x3) that maps the feature representation to the desired number of output channels (e.g., 3 for an RGB image's noise).
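Putting the pieces together, here is a minimal two-level U-Net sketch (assuming PyTorch; timestep conditioning is omitted for brevity, and the `TinyUNet` name and channel counts are illustrative). It demonstrates the key property from this section: the output has exactly the same shape as the input noisy image.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.SiLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.SiLU(),
    )

class TinyUNet(nn.Module):
    """Minimal two-level U-Net for noise prediction (no timestep embedding)."""
    def __init__(self, img_ch=3):
        super().__init__()
        self.enc1 = conv_block(img_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 upsampled + 64 skip channels in
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 upsampled + 32 skip channels in
        self.out = nn.Conv2d(32, img_ch, 1)  # final 1x1 conv to noise channels

    def forward(self, x):
        s1 = self.enc1(x)                   # full resolution
        s2 = self.enc2(self.pool(s1))       # 1/2 resolution
        b = self.bottleneck(self.pool(s2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.out(d1)                 # predicted noise, same shape as x

model = TinyUNet()
xt = torch.randn(2, 3, 64, 64)  # batch of noisy images
eps = model(xt)
print(eps.shape)  # torch.Size([2, 3, 64, 64]) — matches the input
```

A practical model would add the timestep embedding discussed in the next section, more levels, and attention or residual blocks, but the encoder/bottleneck/decoder/skip skeleton is the same.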
© 2025 ApX Machine Learning