U-Net: Convolutional Networks for Biomedical Image Segmentation, Olaf Ronneberger, Philipp Fischer, Thomas Brox, 2015. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. DOI: 10.48550/arXiv.1505.04597 - This foundational paper introduced the U-Net architecture, a common backbone for image generation models, including diffusion models, due to its ability to capture hierarchical features and spatial details.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS 2017), Vol. 30. DOI: 10.48550/arXiv.1706.03762 - This paper introduced the Transformer architecture and the self-attention mechanism, which are central to Diffusion Transformers (DiTs) for modeling global dependencies.