Adapting Transformers for Image Data (ViT, Patch Embeddings)
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and its self-attention mechanism, establishing the foundation for sequence-processing models.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, 2021. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2010.11929 - Presents the Vision Transformer (ViT), detailing how images are converted into sequences of patches and classified with a standard transformer encoder (see the patch-embedding sketch after this list).
Scalable Diffusion Models with Transformers, William Peebles and Saining Xie, 2023. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). DOI: 10.48550/arXiv.2212.09748 - Introduces Diffusion Transformers (DiTs), demonstrating how ViT-style input processing and transformer architectures scale to state-of-the-art diffusion models.
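To make the patch-embedding idea from Dosovitskiy et al. concrete, here is a minimal sketch assuming PyTorch. The class name PatchEmbedding and the default sizes (224x224 RGB images, 16x16 patches, 768-dimensional embeddings, matching ViT-Base) are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and linearly projects
    each patch, turning the image into a token sequence for a transformer."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel_size == stride == patch_size is equivalent
        # to cutting the image into patches and applying one shared linear map.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

# A 224x224 RGB image becomes 196 tokens of dimension 768.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

In the full ViT, a learnable classification token is prepended and position embeddings are added to this sequence before it enters a standard transformer encoder.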