ViT Architecture: Patches, Embeddings, Transformer Encoder
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2010.11929 - The seminal paper introducing the Vision Transformer (ViT) architecture, detailing its components such as image patching, linear embeddings, class tokens, and the Transformer encoder for image classification.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30. DOI: 10.48550/arXiv.1706.03762 - The foundational paper that introduced the Transformer architecture, which ViT heavily adapts, particularly its encoder structure with multi-head self-attention and feed-forward networks.
A Survey of Vision Transformers, Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao, 2022. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 45 (IEEE). DOI: 10.1109/TPAMI.2022.3152247 - A comprehensive review offering a general overview of the Vision Transformer architecture, its variants, applications, and challenges, providing context beyond the original ViT paper.
Vision Transformer, Aston Zhang, Zachary C. Lipton, Mu Li, Alex Smola, 2024. Dive into Deep Learning - A chapter from a well-regarded open-source deep learning textbook, providing a structured and accessible explanation of the Vision Transformer architecture and its components.
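The ViT front end these references describe (splitting an image into 16x16 patches, linearly projecting each patch, prepending a class token, and adding position embeddings) can be sketched minimally in NumPy. Sizes follow ViT-Base (224x224 input, 768-dimensional tokens); the random arrays here stand in for parameters that would be learned in a real model:

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an (N, patch*patch*C) array with N = (H/patch) * (W/patch).
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # Reshape into a grid of patches, then flatten each patch into a vector.
    grid = img.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))            # dummy 224x224 RGB image
patches = patchify(img)                          # (196, 768): 14x14 patches of 16x16x3

d_model = 768                                    # token dimension (ViT-Base)
E = rng.normal(size=(patches.shape[1], d_model)) * 0.02   # stand-in for the learned projection
tokens = patches @ E                             # (196, 768) patch embeddings

cls = rng.normal(size=(1, d_model)) * 0.02       # stand-in for the learned [class] token
pos = rng.normal(size=(197, d_model)) * 0.02     # stand-in for learned position embeddings
x = np.concatenate([cls, tokens], axis=0) + pos  # (197, 768) sequence fed to the encoder
print(x.shape)                                   # (197, 768)
```

The resulting 197-token sequence is what the standard Transformer encoder (multi-head self-attention plus feed-forward blocks, from Vaswani et al.) operates on; classification reads out the final class-token representation.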