An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit and Neil Houlsby, 2020International Conference on Learning Representations (ICLR)DOI: 10.48550/arXiv.2010.11929 - Introduces the Vision Transformer (ViT), demonstrating how transformers can be adapted for image data through patchification and sequence processing.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017Advances in Neural Information Processing Systems 30 (NeurIPS)DOI: https://doi.org/10.48550/arXiv.1706.03762 - Introduces the transformer architecture, laying the groundwork for self-attention mechanisms and sequence-to-sequence models without recurrence or convolutions.