An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2010.11929 - The seminal paper that introduced the Vision Transformer (ViT) architecture, detailing components such as image patching, linear embedding, the class token, and the Transformer encoder used for image classification.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30. DOI: 10.48550/arXiv.1706.03762 - The foundational paper that introduced the Transformer architecture, from which ViT borrows heavily, particularly its encoder structure with multi-head self-attention and feed-forward networks.
A Survey of Vision Transformers, Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao, 2022. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 45 (IEEE). DOI: 10.1109/TPAMI.2022.3152247 - A comprehensive survey that gives an overview of Vision Transformer architectures, their variants, applications, and challenges, supplying background beyond the original ViT paper.
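Vision Transformer, Aston Zhang, Zachary C. Lipton, Mu Li, Alex Smola, 2024. Dive into Deep Learning - A chapter in a widely praised open-source deep learning textbook, offering a structured and accessible explanation of the Vision Transformer architecture and its components.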