The U-Net architecture is a common backbone for diffusion models, capturing spatial hierarchies through its encoder-decoder structure with skip connections. Effective as it is, a standard U-Net requires modifications to handle the specific demands of the diffusion process and of complex generation tasks.
This chapter examines enhancements to the U-Net architecture tailored for diffusion models. We will analyze the integration of attention mechanisms, specifically self-attention and cross-attention, to improve feature representation and to incorporate conditioning information. You will learn methods for injecting timestep embeddings (t) and for handling advanced conditioning inputs beyond simple class labels. We will also discuss architectural variations aimed at improving computational efficiency and training stability, including normalization techniques such as Group Normalization and Adaptive Layer Normalization (AdaLN). By the end of this chapter, you will understand how to implement and analyze these U-Net variants to build more capable diffusion models.
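To ground these ideas before the detailed sections, the sketch below shows one common pattern: a sinusoidal timestep embedding fed into a residual block that modulates the GroupNorm output with a learned per-channel scale and shift (FiLM/AdaLN-style conditioning). This is a minimal PyTorch illustration, not a complete diffusion U-Net; the names `timestep_embedding` and `TimeConditionedBlock` are illustrative, and the specific modulation scheme is one of several options covered later in the chapter.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps: (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    )
    args = t.float()[:, None] * freqs[None, :]                    # (B, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (B, dim)


class TimeConditionedBlock(nn.Module):
    """Residual block that injects the time embedding as a per-channel
    scale and shift after GroupNorm (FiLM/AdaLN-style modulation)."""

    def __init__(self, channels, t_dim, groups=8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.to_scale_shift = nn.Linear(t_dim, 2 * channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t_emb):
        # Project the time embedding to one scale and one shift per channel.
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        h = self.norm(x)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return x + self.conv(F.silu(h))


# Usage: condition a batch of feature maps on sampled diffusion timesteps.
x = torch.randn(4, 64, 32, 32)      # (B, C, H, W) feature maps
t = torch.randint(0, 1000, (4,))    # one integer timestep per sample
t_emb = timestep_embedding(t, dim=128)
block = TimeConditionedBlock(channels=64, t_dim=128)
out = block(x, t_emb)               # same shape as x
```

Stacking blocks like this at each resolution, with attention layers interleaved at the lower resolutions, yields the kind of conditioned U-Net examined in the sections that follow.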
2.1 The Standard U-Net in Diffusion Models
2.2 Attention Mechanisms in U-Nets (Self-Attention, Cross-Attention)
2.3 Integrating Time Embeddings
2.4 Advanced Conditioning Input Integration
2.5 Architectural Variants for Efficiency (Depth, Width, Pooling)
2.6 Normalization Techniques (GroupNorm, AdaLN)
2.7 Hands-on Practical: Modifying a U-Net with Attention