An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2010.11929 - This paper introduced the Vision Transformer (ViT) architecture, demonstrating how Transformers, previously dominant in natural language processing, could be applied effectively to image classification, forming a basis for comparing ViTs with CNNs.
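The core idea referenced in the title is patchification: the image is split into non-overlapping 16x16 patches that are flattened into a token sequence for a standard Transformer encoder. The following is a minimal NumPy sketch of that step only; the function name, shapes, and the absence of the learned linear projection and position embeddings are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

# Sketch of ViT-style patchification: a 224x224 RGB image becomes a sequence
# of (224/16)^2 = 196 tokens, each a flattened 16x16x3 patch. In the actual
# model a learned linear projection then maps each token to the model width.
def image_to_patch_tokens(image, patch_size=16):
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Cut the image into a grid of non-overlapping patches.
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)                  # (gh, gw, p, p, C)
    tokens = patches.reshape(-1, patch_size * patch_size * C)   # (N, p*p*C)
    return tokens

img = np.random.rand(224, 224, 3).astype(np.float32)
print(image_to_patch_tokens(img).shape)  # (196, 768): 196 "words" of length 16*16*3
```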
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo, 2021. IEEE International Conference on Computer Vision (ICCV). DOI: 10.48550/arXiv.2103.14030 - This work introduced a hierarchical Vision Transformer that addresses the quadratic computational complexity of standard ViTs by computing self-attention within local windows, making it more efficient for diverse vision tasks and high-resolution images.
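The efficiency gain described above comes from restricting self-attention to non-overlapping MxM windows, so the cost grows linearly with the number of tokens rather than quadratically. Below is a minimal NumPy sketch of plain (unshifted) window attention under assumed shapes, without the learned query/key/value projections, relative position bias, or the shifted-window step; it is an illustration of the partitioning idea, not the paper's implementation.

```python
import numpy as np

# Sketch of window-based self-attention: tokens on an H x W grid attend only
# within non-overlapping M x M windows, so cost scales with (H*W) * M^2
# instead of (H*W)^2.
def window_self_attention(x, window=7):
    H, W, D = x.shape
    gh, gw = H // window, W // window
    # Partition the grid into gh*gw windows of window*window tokens each.
    wins = x.reshape(gh, window, gw, window, D).transpose(0, 2, 1, 3, 4)
    wins = wins.reshape(gh * gw, window * window, D)
    # Scaled dot-product attention inside each window (no learned projections).
    scores = wins @ wins.transpose(0, 2, 1) / np.sqrt(D)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ wins
    # Reverse the partitioning back to the H x W grid.
    out = out.reshape(gh, gw, window, window, D).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, D)

x = np.random.rand(56, 56, 96).astype(np.float32)
print(window_self_attention(x).shape)  # (56, 56, 96)
```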
Transformers in Vision: A Survey, Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah, 2021. ACM Computing Surveys, Vol. 54. DOI: 10.1145/3505244 - This survey offers a comprehensive overview of Vision Transformers, including a comparison with convolutional neural networks that covers their architectural differences, performance, and applications.