Convolutional Neural Networks (CNNs), particularly within U-Net structures, have been foundational in diffusion models, achieving remarkable success in image generation. Their strength lies in exploiting spatial locality and translation equivariance through convolutional filters and pooling operations. This inductive bias is highly effective for tasks where local patterns and textures are significant. However, this inherent focus on local neighborhoods can become a limitation when generating images that require understanding and modeling relationships between distant parts of the scene. Capturing global context, complex compositional structure, or long-range dependencies purely through stacked convolutional layers is difficult and may require very deep networks.

Enter the transformer architecture. Originally developed for sequence processing tasks in natural language processing (NLP), transformers demonstrated an extraordinary ability to model dependencies between elements in a sequence, regardless of their distance. The mechanism powering this capability is self-attention. Unlike convolution, which operates on a fixed local receptive field, self-attention allows every element (e.g., a word in a sentence, or a patch in an image) to directly attend to and weigh the importance of all other elements when computing its representation.

```dot
digraph G {
    rankdir=LR;
    node [shape=box, style=filled, fontname="sans-serif", margin=0.2];
    subgraph cluster_cnn {
        label = "CNN Receptive Field";
        bgcolor="#e9ecef";
        node [fillcolor="#a5d8ff"];
        p_cnn [label="Output Pixel"];
        i1 [label="Input\nPatch 1"];
        i2 [label="Input\nPatch 2"];
        i3 [label="Input\nPatch 3"];
        i4 [label="Input\nPatch 4"];
        i5 [label="Input\nPatch 5"];
        {i2, i3, i4} -> p_cnn [color="#1c7ed6"];
        i1 -> i2 [style=invis];
        i4 -> i5 [style=invis];
    }
    subgraph cluster_transformer {
        label = "Transformer Self-Attention";
        bgcolor="#e9ecef";
        node [fillcolor="#ffc9c9"];
        p_tf [label="Output Token"];
        t1 [label="Input\nToken 1"];
        t2 [label="Input\nToken 2"];
        t3 [label="Input\nToken 3"];
        t4 [label="Input\nToken 4"];
        t5 [label="Input\nToken 5"];
        {t1, t2, t3, t4, t5} -> p_tf [color="#f03e3e"];
    }
}
```

Diagram comparing the local receptive field of a CNN operation versus the global context considered by transformer self-attention for a single output element.

Why is this global modeling capability appealing for generative modeling, especially for diffusion models tasked with synthesizing complex data like high-resolution images?

- Global Coherence: Generating realistic images often requires ensuring consistency across large distances. For example, lighting and shadows should be consistent across an entire scene, and the texture of a large object should be uniform. Self-attention provides a direct mechanism to enforce such long-range coherence by allowing different parts of the image representation to interact directly.
- Modeling Complex Relationships: Images frequently contain intricate relationships between objects or regions. Transformers may be better suited to learning these complex, non-local interactions than CNNs, which build up hierarchical features more gradually.
- Scalability and Parameter Efficiency: While attention mechanisms can be computationally intensive, particularly for high-resolution images (more on adapting this later), transformers have shown remarkable scaling properties: as models and datasets grow, they have often demonstrated continued performance improvements, suggesting potential advantages for large-scale generative tasks.
Furthermore, architectures like the Vision Transformer (ViT) have shown that transformers can achieve strong performance, sometimes with fewer parameters than comparable CNNs, although they often require larger datasets for effective pre-training.
- Different Inductive Biases: Transformers possess a weaker spatial inductive bias than CNNs. While CNNs inherently assume that locality and translation equivariance are important, transformers make fewer assumptions about the input structure. This flexibility can be advantageous when the optimal way to process the data is not known a priori or doesn't fit neatly into a convolutional framework. However, it often means transformers require more data or specific pre-training strategies to learn effectively.

Therefore, the motivation for exploring transformers in diffusion models stems from the desire to overcome potential limitations of CNNs in modeling global context and long-range dependencies. The success of self-attention in capturing complex relationships in other domains suggests its potential to enhance the quality, coherence, and expressiveness of generative models operating on high-dimensional data like images. The following sections will examine how these powerful architectures are specifically adapted and integrated into the diffusion framework.
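To make the contrast between a convolution's local receptive field and self-attention's global context concrete, here is a minimal, illustrative sketch. It assumes PyTorch; the patch size, embedding dimension, and single-head formulation are arbitrary choices for illustration, not taken from any particular diffusion architecture. It splits an image into patch tokens, in the spirit of ViT, and computes scaled dot-product self-attention so that every patch can attend directly to every other patch.

```python
import torch

# Illustrative sizes only (not from any specific model).
batch, channels, height, width = 1, 3, 32, 32
patch_size, embed_dim = 8, 64

image = torch.randn(batch, channels, height, width)

# 1. Patchify: split the image into non-overlapping patches and flatten each
#    into a vector, giving a sequence of tokens (like words in a sentence).
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(
    batch, -1, channels * patch_size * patch_size
)
num_tokens = patches.shape[1]  # (32/8) * (32/8) = 16 patch tokens

# 2. Embed each patch token into a shared representation space.
embed = torch.nn.Linear(channels * patch_size * patch_size, embed_dim)
tokens = embed(patches)                      # (batch, num_tokens, embed_dim)

# 3. Single-head self-attention: every token produces a query, key, and value;
#    the softmax weights let each token look at all other tokens at once.
to_qkv = torch.nn.Linear(embed_dim, 3 * embed_dim)
q, k, v = to_qkv(tokens).chunk(3, dim=-1)
attn = torch.softmax(q @ k.transpose(-2, -1) / embed_dim**0.5, dim=-1)
out = attn @ v                               # (batch, num_tokens, embed_dim)

# Each row of `attn` is a distribution over all 16 patches, so even the
# top-left patch can directly influence the bottom-right one -- the global
# receptive field that a single convolutional layer lacks.
print(attn.shape)  # torch.Size([1, 16, 16])
print(out.shape)   # torch.Size([1, 16, 64])
```

Because every token attends to every other token, the attention matrix grows quadratically with the number of patches, which is the computational cost concern noted above for high-resolution images.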