Scalable Diffusion Models with Transformers, William Peebles, Saining Xie, 2023. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE. DOI: 10.1109/ICCV51070.2023.01168 - Introduces Diffusion Transformers (DiTs) and the AdaLN-Zero conditioning mechanism, which is central to the section's discussion.
High-Resolution Image Synthesis with Latent Diffusion Models, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer, 2022. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.1109/CVPR52688.2022.00196 - Details cross-attention for text conditioning in diffusion models, offering a point of comparison for the cross-attention in DiTs and U-Net architectures discussed in the text.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc. DOI: 10.5555/3295222.3295349 - The foundational paper introducing the Transformer architecture, which forms the basis of Diffusion Transformers.
FiLM: Visual Reasoning with a General Conditioning Layer, Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville, 2018. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, Association for the Advancement of Artificial Intelligence. DOI: 10.1609/aaai.v32i1.11671 - Introduces Feature-wise Linear Modulation (FiLM), a technique that provides broader context for adaptive normalization methods like the AdaLN-Zero used in DiTs.