An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2021, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.2010.11929 - Introduces the Vision Transformer (ViT) and the method of converting an image into a sequence of patches for a Transformer to process, which is the basis of DiT's input processing.