While U-Net architectures based on convolutional neural networks (CNNs) have been the standard backbone for many successful diffusion models, the transformer architecture, which first rose to prominence in natural language processing, has shown significant promise for generative tasks, including image synthesis. Transformers excel at modeling long-range dependencies across an entire input, in contrast to the locality bias inherent in CNNs, whose convolutions capture relationships primarily within local neighborhoods.
This chapter examines how transformer architectures can be used effectively within the diffusion model framework. We will cover:

3.1 Motivation for Transformers in Generative Modeling
3.2 Adapting Transformers for Image Data (ViT, Patch Embeddings)
3.3 Diffusion Transformers (DiT): Architecture Overview
3.4 Conditioning in Diffusion Transformers
3.5 Comparison: U-Nets vs. Transformers for Diffusion
3.6 Implementation Considerations for DiTs
3.7 Hands-on Practical: Building a Simple DiT Block

By the end of this chapter, you will understand the structure and function of transformer-based diffusion models and be prepared to analyze and implement them.
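As a small preview of the ideas in Sections 3.2 and 3.7, the sketch below shows one common way to turn an image into a token sequence for a transformer: a ViT-style patch embedding. This is a minimal illustration assuming PyTorch; the `PatchEmbed` name and the hyperparameters are illustrative choices, not the exact implementation developed later in the chapter.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each
    patch to a token embedding, ViT-style. The hyperparameters here
    (patch_size=4, embed_dim=256) are illustrative, not prescriptive."""
    def __init__(self, in_channels=3, patch_size=4, embed_dim=256):
        super().__init__()
        # A strided convolution with kernel_size == stride == patch_size
        # is equivalent to slicing out each patch and applying a shared
        # linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, embed_dim, H/patch, W/patch)
        x = self.proj(x)
        # Flatten the spatial grid into a token sequence:
        # (B, embed_dim, h, w) -> (B, h*w, embed_dim)
        return x.flatten(2).transpose(1, 2)

tokens = PatchEmbed()(torch.randn(1, 3, 32, 32))
print(tokens.shape)  # torch.Size([1, 64, 256]): 8x8 patches, 256-dim tokens
```

The resulting `(batch, num_tokens, embed_dim)` sequence is exactly the shape a standard transformer block expects, which is what makes this adaptation of image data so direct.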