Integrating conditioning information is fundamental for guiding the generative process of diffusion models, allowing us to control the output based on specific inputs like class labels, text descriptions, or other guiding signals. While Chapter 2 discussed conditioning within U-Net architectures, Diffusion Transformers (DiTs) offer distinct mechanisms tailored to the transformer's structure, primarily leveraging the manipulation of token embeddings and modifications within the transformer blocks themselves.
Unlike U-Nets where conditioning is often injected via concatenation in the input layer or through cross-attention mechanisms interspersed with convolutional blocks, DiTs integrate conditioning more intrinsically within their core building blocks. Since DiTs operate on sequences of image patch embeddings, conditioning signals are typically processed and then used to influence the computations within each transformer block, applied uniformly across all patch tokens for global guidance.
Several strategies have proven effective for incorporating conditioning into DiT architectures. The choice often depends on the type of conditioning signal and the desired level of control.
A highly effective method, introduced in the original DiT paper ("Scalable Diffusion Models with Transformers" by Peebles and Xie), involves modulating the transformer block's internal activations with scale and shift parameters derived from the conditioning information, a form of adaptive layer normalization (AdaLN). The zero-initialized variant used in DiT is referred to as AdaLN-Zero.
It works as follows:
1. The timestep embedding and the condition embedding (for example, a class-label embedding) are combined, typically by summation, into a single conditioning vector.
2. A small MLP maps this vector to per-block modulation parameters: a scale and shift applied after each normalization layer, plus a gating scale applied to each residual branch.
3. The block's standard LayerNorm layers are replaced by adaptive LayerNorm: the normalized activations are scaled and shifted by the predicted parameters before the self-attention and MLP sub-layers.
4. The final layer of the modulation MLP is initialized to zero, so every residual branch starts out as the identity function. This "zero" initialization distinguishes AdaLN-Zero from plain AdaLN and helps stabilize training of deep DiTs.
This approach allows the conditioning signal to dynamically adjust the feature scales and shifts throughout the network, effectively steering the denoising process towards the desired condition.
Diagram illustrating the AdaLN-Zero conditioning mechanism within a Diffusion Transformer (DiT) block. Time and condition embeddings are combined and processed by an MLP to generate modulation parameters (α,β), which then influence the main data path after normalization layers.
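To make the mechanism concrete, here is a minimal PyTorch sketch of a DiT-style block using AdaLN-Zero modulation. The class name DiTBlock and its interface are illustrative assumptions for this example rather than the reference implementation, but the structure follows the design above: an MLP maps the combined conditioning vector to shift, scale, and gate parameters, and the gates are zero-initialized so each block starts as the identity.

```python
# Minimal sketch of AdaLN-Zero conditioning in a DiT-style block.
# Names (DiTBlock, adaLN_modulation, ...) are illustrative, not the reference code.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        # elementwise_affine=False: scale/shift come from the conditioning MLP instead
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Conditioning MLP: maps the combined (timestep + condition) embedding to
        # shift, scale, and gate parameters for both sub-layers (6 * dim values).
        self.adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # "Zero" initialization: gates start at 0, so each block begins as identity.
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch tokens, c: (batch, dim) conditioning vector
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = \
            self.adaLN_modulation(c).chunk(6, dim=-1)

        # Attention sub-layer with adaptive LayerNorm and gated residual
        h = self.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        x = x + gate_a.unsqueeze(1) * attn_out

        # MLP sub-layer, modulated the same way
        h = self.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)
        return x
```

Because the gating parameters multiply the residual branches and start at zero, the block initially passes its input through unchanged; the influence of the conditioning signal grows as training proceeds.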
A simpler alternative is to inject conditioning directly into the input sequence fed to the transformer, either by appending the conditioning embedding as one or more extra tokens alongside the patch tokens (in-context conditioning) or by adding it to the timestep embedding before it enters the network.
While simpler, these methods might not propagate the conditioning signal as effectively through the network's depth compared to adaptive normalization, which repeatedly reinforces the condition in every block.
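As a concrete illustration of the token-appending variant, the sketch below embeds a class label and prepends it to the patch sequence. The module name and interface are assumptions made for this example, not a standard API.

```python
# Hedged sketch of in-context conditioning: the condition embedding is
# appended to the patch sequence as an extra token (names are illustrative).
import torch
import torch.nn as nn


class InContextConditioning(nn.Module):
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # Learnable embedding table for class labels (a "null" class for
        # classifier-free guidance could be added; omitted here).
        self.class_embed = nn.Embedding(num_classes, dim)

    def forward(self, patch_tokens: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), labels: (batch,)
        cond_token = self.class_embed(labels).unsqueeze(1)  # (batch, 1, dim)
        # Prepend the condition token; the transformer then processes it like
        # any other token, alongside the image patches.
        return torch.cat([cond_token, patch_tokens], dim=1)
```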
Although the canonical DiT architecture primarily relies on self-attention, introducing cross-attention layers is another viable approach, especially for complex conditioning like text descriptions.
In this setup, specific layers within the transformer blocks would incorporate cross-attention where:
- Queries are computed from the image patch token representations.
- Keys and values are computed from the conditioning sequence, such as text token embeddings produced by a pretrained text encoder.
This allows each image patch representation to directly attend to relevant parts of the conditioning information. This closely mirrors the mechanism used in Stable Diffusion's U-Net but adapted for a transformer backbone. Implementing this requires modifying the standard DiT block to include these cross-attention layers, potentially increasing computational cost but offering fine-grained alignment between the image and the condition.
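Below is a minimal sketch of such a cross-attention layer, again with illustrative names. The important detail is the asymmetry of the attention inputs: queries come from the image patch tokens, while keys and values come from the conditioning sequence.

```python
# Minimal sketch of a cross-attention layer for text-like conditioning.
# Queries come from image patch tokens; keys and values come from the
# conditioning sequence (e.g., text encoder outputs). Names are illustrative.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) image patch tokens
        # cond: (batch, num_cond_tokens, cond_dim) conditioning tokens
        attn_out, _ = self.attn(query=self.norm(x), key=cond, value=cond)
        # Residual connection; in a full block this layer would typically sit
        # between the self-attention and MLP sub-layers.
        return x + attn_out
```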
By employing these techniques, particularly adaptive normalization schemes like AdaLN-Zero, Diffusion Transformers can effectively incorporate conditioning information, enabling the generation of high-fidelity images guided by various inputs like class labels or potentially more complex modalities. This adaptability is a significant factor contributing to their scalability and strong performance on large-scale image generation tasks.