An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2010.11929 - Introduces the Vision Transformer (ViT) architecture, which serves as the base for applying Mixture-of-Experts to image data.
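As a companion to this entry, the following is a minimal sketch of the patchification step the title alludes to: an image is split into fixed-size patches (e.g. 16x16), each patch is linearly projected to a token embedding, and a class token plus learned position embeddings are prepended before the standard Transformer encoder. The class `PatchEmbed` and all sizes are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of ViT patch embedding (not the paper's official code).
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split into patches and
        # linearly project each patch".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend [CLS] token
        return x + self.pos_embed            # add learned position embeddings


tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                          # torch.Size([2, 197, 768])
```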
Vision MoE: An Empirical Study of Scaling Laws for MoE in Vision. William Fedus, Jeff Dean, Zhifeng Chen, Yuanzhong Xu, Anna Goldie, Basil Mustafa, Anushan Fernando, George Tucker, Yonghui Wu, David R. So, Blake Hechtman, Barret Zoph, Aditya Sharma, Hieu Pham, Quoc V. Le, Paul Barham, Daniel N. Freeman, Albin Cassirer, Jiantao Jiao, Shibo Wang, Claire Cui, Ewa Dominowska, H. Yang, A. Mirhoseini. 2022. International Conference on Machine Learning (ICML). DOI: 10.48550/arXiv.2203.05605 - Investigates the application of Mixture-of-Experts to Vision Transformers, detailing scaling and performance for large vision models.
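To make the idea behind this entry concrete, here is a hedged sketch of replacing the dense MLP in a ViT encoder block with a sparsely gated Mixture-of-Experts layer, so each image token is routed to one expert MLP. The class `MoEMLP`, the expert count, and the top-1 routing are assumptions for illustration; vision MoE models in practice also use expert capacity limits and auxiliary load-balancing losses, which are omitted here.

```python
# Illustrative token-level MoE MLP (not the paper's exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEMLP(nn.Module):
    def __init__(self, dim=768, hidden=3072, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (B, N, dim) image tokens
        B, N, D = x.shape
        flat = x.reshape(B * N, D)
        gates = F.softmax(self.router(flat), dim=-1) # (B*N, num_experts)
        weight, expert_idx = gates.max(dim=-1)       # top-1 routing per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():                           # only tokens routed to expert e
                out[mask] = weight[mask].unsqueeze(1) * expert(flat[mask])
        return out.reshape(B, N, D)


y = MoEMLP()(torch.randn(2, 197, 768))
print(y.shape)                                       # torch.Size([2, 197, 768])
```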