Building on the adaptation of transformers to image data exemplified by the Vision Transformer (ViT), the Diffusion Transformer (DiT) architecture represents a significant shift in how the backbone network of a diffusion model is designed. Proposed by Peebles and Xie (2022), DiT replaces the commonly used convolutional U-Net with a pure transformer architecture, demonstrating strong performance and scalability, particularly for image generation tasks.
The core idea is to treat the diffusion process, which operates on noisy images $x_t$ at timestep $t$, as a sequence modeling problem suitable for transformers. Instead of relying on convolutional layers' inductive biases (like locality and translation equivariance), DiT leverages the transformer's ability to model long-range dependencies across the image.
Input Processing: Preparing Images for the Transformer
Like ViT, DiT cannot directly process raw pixel grids. The input noisy image $x_t \in \mathbb{R}^{H \times W \times C}$ must first be converted into a sequence of tokens:
- Patchification: The input image $x_t$ is divided into a grid of non-overlapping patches. For an image of height $H$ and width $W$, using a patch size of $P \times P$ results in $N = HW / P^2$ patches.
- Linear Embedding: Each patch is flattened into a vector and then linearly projected into a token embedding of dimension $D$. This creates an initial sequence of $N$ tokens: $z_0 \in \mathbb{R}^{N \times D}$.
- Positional Embeddings: Since transformers are permutation-invariant, positional information must be added explicitly. Standard 1D or 2D positional embeddings $E_{\text{pos}} \in \mathbb{R}^{N \times D}$ (learned, or fixed sinusoidal as in the original DiT) are added to the patch embeddings: $z_0' = z_0 + E_{\text{pos}}$. This sequence $z_0'$ forms the input to the transformer blocks.
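To make the tokenization concrete, here is a minimal PyTorch sketch of the patchify-and-embed step. The class and parameter names (PatchEmbed, img_size, dim, etc.) are illustrative choices for this article rather than the official DiT code, and learned positional embeddings are assumed for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed them as tokens."""
    def __init__(self, img_size=32, patch_size=4, in_channels=3, dim=512):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = H*W / P^2
        # A strided convolution is equivalent to flatten-then-linear per patch.
        self.proj = nn.Conv2d(in_channels, dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learned positional embeddings (zero-initialized here for brevity).
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                      # x: (B, C, H, W), the noisy image x_t
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D) token sequence z_0
        return x + self.pos_embed               # z_0' = z_0 + E_pos
```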
Core Transformer Blocks
The main body of the DiT consists of a stack of $L$ transformer blocks. Each block processes the sequence of tokens, refining the representation. A standard DiT block typically includes:
- Layer Normalization (LN): Applied before the attention or MLP layers to stabilize activations.
- Multi-Head Self-Attention (MHSA): Allows each token to attend to all other tokens in the sequence, capturing global relationships between image patches.
- Layer Normalization (LN): Applied again before the feed-forward network.
- Feed-Forward Network (FFN): Usually a simple Multi-Layer Perceptron (MLP) with two linear layers and a non-linear activation function (e.g., GeLU), applied independently to each token.
Residual connections are used around both the MHSA and FFN sub-layers, ensuring smooth gradient flow and enabling the training of deep transformers.
$$z_l'' = \text{MHSA}(\text{LN}(z_{l-1}')) + z_{l-1}'$$
$$z_l' = \text{FFN}(\text{LN}(z_l'')) + z_l''$$
where $z_{l-1}'$ is the input to block $l$ and $z_l'$ is its output.
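These two update equations translate almost line for line into code. Below is a minimal pre-LN block sketch in PyTorch; the names are illustrative, and the conditioning mechanism covered in the next section is deliberately omitted here.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LN block: z'' = MHSA(LN(z)) + z,  z' = FFN(LN(z'')) + z''."""
    def __init__(self, dim=512, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                                   # z: (B, N, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MHSA + residual
        z = z + self.mlp(self.norm2(z))                      # FFN + residual
        return z
```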
Conditioning: Incorporating Timestep and Context
A diffusion model must be conditioned on the current timestep t and potentially other contextual information c (like class labels, text embeddings, etc.). DiT needs mechanisms to inject this conditioning information into the transformer blocks.
- Time and Condition Embeddings: The timestep $t$ is first converted into a vector embedding $e_t$, often using sinusoidal embeddings followed by an MLP. Similarly, the context $c$ is mapped to an embedding $e_c$.
- Adaptive Layer Normalization (adaLN / adaLN-Zero): This is a prevalent technique in DiTs. Instead of standard Layer Normalization, adaptive normalization layers dynamically compute scale ($\gamma$) and shift ($\beta$) parameters from the embeddings $e_t$ and $e_c$. These parameters modulate the normalized activations within each transformer block, typically before the MHSA and FFN layers (a concrete sketch follows after this list).
$$\text{adaLN}(h, e_t, e_c) = \gamma \cdot \text{LayerNorm}(h) + \beta$$
where $\gamma$ and $\beta$ are produced by projecting $e_t$ and $e_c$ (often concatenated or summed) through linear layers. The "adaLN-Zero" variant additionally regresses per-channel gating parameters $\alpha$ that scale each residual branch and initializes the projection producing them to zero, so that every block initially acts as an identity mapping; this aids training stability in deep DiTs.
- Other Methods: While adaLN is common, alternatives exist:
- Adding $e_t$ and $e_c$ directly to the token sequence (e.g., concatenating them as extra tokens or adding them to each token embedding).
- Using cross-attention mechanisms if the conditioning c is a sequence itself (like text embeddings).
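Returning to the adaLN approach, the snippet below is a simplified sketch of a sinusoidal timestep embedding together with an adaLN-Zero-style block in PyTorch. All names and shapes are illustrative assumptions rather than the reference DiT implementation; the zero-initialized projection and the gating parameters $\alpha$ implement the identity-at-initialization behavior described above.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a batch of timesteps t (shape (B,)); usually followed by an MLP."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)   # (B, dim)

class AdaLNZeroBlock(nn.Module):
    """Transformer block whose LayerNorms are modulated by the conditioning embedding."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One projection produces shift/scale/gate for both the attention and MLP branches.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[1].weight)   # adaLN-Zero: block starts as the identity
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, z, cond):              # cond = e_t (plus e_c), shape (B, D)
        beta1, gamma1, alpha1, beta2, gamma2, alpha2 = self.ada(cond).chunk(6, dim=-1)
        # Scale is parameterized as (1 + gamma) so zero-initialized gamma gives identity scaling.
        h = self.norm1(z) * (1 + gamma1[:, None]) + beta1[:, None]
        z = z + alpha1[:, None] * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(z) * (1 + gamma2[:, None]) + beta2[:, None]
        z = z + alpha2[:, None] * self.mlp(h)
        return z
```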
Output Processing: Predicting the Diffusion Target
After processing through $L$ transformer blocks, the output token sequence $z_L'$ must be converted back into the format required by the diffusion objective (typically predicting the noise $\epsilon$ added at step $t$, or predicting the original image $x_0$).
- Final Modulation: The conditioning embeddings $(e_t, e_c)$ often modulate the output one last time using a final adaLN layer or similar mechanism applied to $z_L'$.
- Linear Projection: A final linear layer projects each output token embedding back to the dimension of a flattened patch (e.g., $P \times P \times C$ values).
- Unpatchification / Reshaping: The output tokens are rearranged ("unpatched") to reconstruct the final output tensor, which has the same dimensions as the input image and represents the predicted noise $\epsilon_\theta(x_t, t, c)$ or $x_0$.
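A minimal PyTorch sketch of this output head is shown below. The names and shapes are illustrative assumptions: each token is projected back to a $P \times P \times C$ patch, and the grid of patches is reassembled into an image-shaped tensor.

```python
import torch.nn as nn

class FinalLayer(nn.Module):
    """Project tokens back to patch pixels and reassemble the image-shaped output."""
    def __init__(self, dim=512, patch_size=4, out_channels=3):
        super().__init__()
        self.patch_size = patch_size
        self.out_channels = out_channels
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))   # final shift/scale
        self.proj = nn.Linear(dim, patch_size * patch_size * out_channels)

    def forward(self, z, cond, grid_h, grid_w):       # z: (B, N, D), N = grid_h * grid_w
        beta, gamma = self.ada(cond).chunk(2, dim=-1)
        z = self.norm(z) * (1 + gamma[:, None]) + beta[:, None]   # final adaLN modulation
        x = self.proj(z)                               # (B, N, P*P*C)
        B = x.shape[0]
        P, C = self.patch_size, self.out_channels
        # Unpatchify: (B, N, P*P*C) -> (B, C, grid_h*P, grid_w*P)
        x = x.reshape(B, grid_h, grid_w, P, P, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, grid_h * P, grid_w * P)
        return x                                       # predicted eps_theta (or x_0)
```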
Overall Architecture Diagram
The following diagram illustrates the high-level flow of information within a Diffusion Transformer:
Overall data flow in a Diffusion Transformer (DiT). The input image $x_t$ is patchified and embedded. Transformer blocks process these tokens, modulated by the timestep $t$ and context $c$ embeddings via adaLN. The final tokens are projected and reshaped to produce the diffusion target (e.g., the noise $\epsilon_\theta$).
DiTs offer a powerful alternative to U-Nets. By operating on sequences of patches, they can effectively model global image structure. Their scalability, demonstrated by performance improvements with larger models, aligns well with trends observed in large language models, making them a significant architecture in the ongoing development of diffusion-based generative modeling. The next sections will compare DiTs with U-Nets and discuss practical implementation details.