Integrating conditioning information is fundamental for guiding the generative process of diffusion models, allowing us to control the output based on specific inputs like class labels, text descriptions, or other guiding signals. While Chapter 2 discussed conditioning within U-Net architectures, Diffusion Transformers (DiTs) offer distinct mechanisms tailored to the transformer's structure, primarily leveraging the manipulation of token embeddings and modifications within the transformer blocks themselves.
Unlike U-Nets where conditioning is often injected via concatenation in the input layer or through cross-attention mechanisms interspersed with convolutional blocks, DiTs integrate conditioning more intrinsically within their core building blocks. Since DiTs operate on sequences of image patch embeddings, conditioning signals are typically processed and then used to influence the computations within each transformer block, applied uniformly across all patch tokens for global guidance.
Several strategies have proven effective for incorporating conditioning into DiT architectures. The choice often depends on the type of conditioning signal and the desired level of control.
A highly effective method, introduced in the original DiT paper ("Scalable Diffusion Models with Transformers" by Peebles and Xie), involves modulating the transformer block's internal activations with scale and shift parameters derived from the conditioning information, a form of adaptive layer normalization (AdaLN). The zero-initialized variant used in DiT is referred to as AdaLN-Zero.
It works as follows:
1. The timestep embedding and the condition embedding (for example, a class-label embedding) are combined, typically by summation, into a single conditioning vector.
2. A small MLP maps this vector to per-block modulation parameters: a scale and shift applied after each normalization layer, plus a gating scale applied to each residual branch.
3. The block's standard LayerNorm layers are replaced by adaptive LayerNorm: the normalized activations are scaled and shifted by the predicted parameters before the self-attention and MLP sub-layers.
4. The final layer of the modulation MLP is initialized to zero, so every residual branch starts out as the identity function. This "zero" initialization distinguishes AdaLN-Zero from plain AdaLN and helps stabilize training of deep DiTs.
This approach allows the conditioning signal to dynamically adjust the feature scales and shifts throughout the network, effectively steering the denoising process towards the desired condition.
Diagram illustrating the AdaLN-Zero conditioning mechanism within a Diffusion Transformer (DiT) block. Time and condition embeddings are combined and processed by an MLP to generate modulation parameters (α,β), which then influence the main data path after normalization layers.
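To make the mechanism concrete, here is a minimal PyTorch sketch of a DiT-style block using AdaLN-Zero modulation. The class name DiTBlock and its interface are illustrative assumptions for this example rather than the reference implementation, but the structure follows the design above: an MLP maps the combined conditioning vector to shift, scale, and gate parameters, and the gates are zero-initialized so each block starts as the identity.

```python
# Minimal sketch of AdaLN-Zero conditioning in a DiT-style block.
# Names (DiTBlock, adaLN_modulation, ...) are illustrative, not the reference code.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        # elementwise_affine=False: scale/shift come from the conditioning MLP instead
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Conditioning MLP: maps the combined (timestep + condition) embedding to
        # shift, scale, and gate parameters for both sub-layers (6 * dim values).
        self.adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # "Zero" initialization: gates start at 0, so each block begins as identity.
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch tokens, c: (batch, dim) conditioning vector
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = \
            self.adaLN_modulation(c).chunk(6, dim=-1)

        # Attention sub-layer with adaptive LayerNorm and gated residual
        h = self.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        x = x + gate_a.unsqueeze(1) * attn_out

        # MLP sub-layer, modulated the same way
        h = self.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)
        return x
```

Because the gating parameters multiply the residual branches and start at zero, the block initially passes its input through unchanged; the influence of the conditioning signal grows as training proceeds.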
A simpler alternative is to inject conditioning directly into the input sequence fed to the transformer, either by appending the conditioning embedding as one or more extra tokens alongside the patch tokens (in-context conditioning) or by adding it to the timestep embedding before it enters the network.
While simpler, these methods might not propagate the conditioning signal as effectively through the network's depth compared to adaptive normalization, which repeatedly reinforces the condition in every block.
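As a concrete illustration of the token-appending variant, the sketch below embeds a class label and prepends it to the patch sequence. The module name and interface are assumptions made for this example, not a standard API.

```python
# Hedged sketch of in-context conditioning: the condition embedding is
# appended to the patch sequence as an extra token (names are illustrative).
import torch
import torch.nn as nn


class InContextConditioning(nn.Module):
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # Learnable embedding table for class labels (a "null" class for
        # classifier-free guidance could be added; omitted here).
        self.class_embed = nn.Embedding(num_classes, dim)

    def forward(self, patch_tokens: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), labels: (batch,)
        cond_token = self.class_embed(labels).unsqueeze(1)  # (batch, 1, dim)
        # Prepend the condition token; the transformer then processes it like
        # any other token, alongside the image patches.
        return torch.cat([cond_token, patch_tokens], dim=1)
```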
Although the canonical DiT architecture primarily relies on self-attention, introducing cross-attention layers is another viable approach, especially for complex conditioning like text descriptions.
In this setup, specific layers within the transformer blocks would incorporate cross-attention where:
- Queries are computed from the image patch token representations.
- Keys and values are computed from the conditioning sequence, such as text token embeddings produced by a pretrained text encoder.
This allows each image patch representation to directly attend to relevant parts of the conditioning information. This closely mirrors the mechanism used in Stable Diffusion's U-Net but adapted for a transformer backbone. Implementing this requires modifying the standard DiT block to include these cross-attention layers, potentially increasing computational cost but offering fine-grained alignment between the image and the condition.
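Below is a minimal sketch of such a cross-attention layer, again with illustrative names. The important detail is the asymmetry of the attention inputs: queries come from the image patch tokens, while keys and values come from the conditioning sequence.

```python
# Minimal sketch of a cross-attention layer for text-like conditioning.
# Queries come from image patch tokens; keys and values come from the
# conditioning sequence (e.g., text encoder outputs). Names are illustrative.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) image patch tokens
        # cond: (batch, num_cond_tokens, cond_dim) conditioning tokens
        attn_out, _ = self.attn(query=self.norm(x), key=cond, value=cond)
        # Residual connection; in a full block this layer would typically sit
        # between the self-attention and MLP sub-layers.
        return x + attn_out
```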
By employing these techniques, particularly adaptive normalization schemes like AdaLN-Zero, Diffusion Transformers can effectively incorporate conditioning information, enabling the generation of high-fidelity images guided by various inputs like class labels or potentially more complex modalities. This adaptability is a significant factor contributing to their scalability and strong performance on large-scale image generation tasks.