Sparse Mixture of Experts (MoE) layers, which replace dense feed-forward networks (FFNs), can be effectively applied in computer vision. This approach is particularly relevant within the Vision Transformer (ViT) architecture. Adapting MoEs for ViTs allows for the creation of models with a massive parameter count, capable of learning a rich hierarchy of visual features, while keeping the computational cost for inference and training manageable.
In a standard ViT, an input image is first divided into a sequence of fixed-size patches. These patches are flattened, linearly projected into an embedding space, and then processed by a series of Transformer encoder blocks. Each encoder block contains two main sub-layers: a multi-head self-attention (MHSA) mechanism and a position-wise feed-forward network (FFN), which is typically a multi-layer perceptron (MLP).
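To make the patchify-and-embed step concrete, here is a minimal sketch. The class name PatchEmbed, the Conv2d-based projection, and the specific sizes (224-pixel images, 16-pixel patches, 768-dimensional embeddings) are illustrative choices, not something prescribed by this section; a strided convolution with kernel size equal to the stride is simply one common way to implement the shared linear projection of non-overlapping patches.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and project each patch to a d_model vector."""

    def __init__(self, img_size: int = 224, patch_size: int = 16,
                 in_chans: int = 3, dim: int = 768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == patch_size: each output position corresponds
        # to one non-overlapping patch, linearly projected to `dim`
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        return x

# A 224x224 RGB image becomes 196 patch tokens of dimension 768
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])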
The FFN is the primary consumer of parameters and computation within the block. It is this component that we target for replacement with a Mixture of Experts layer.
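A quick back-of-the-envelope count supports this claim. The numbers below assume a ViT-Base-sized block (embedding dimension 768, MLP ratio 4) and ignore biases and LayerNorm parameters; they are illustrative rather than exact.

# Rough per-block parameter count for a ViT-Base-sized block
d, mlp_ratio = 768, 4

attn_params = 4 * d * d               # Q, K, V and output projections
ffn_params = 2 * d * (mlp_ratio * d)  # two linear layers: d -> 4d -> d

print(f"attention: {attn_params / 1e6:.2f}M parameters")  # ~2.36M
print(f"ffn:       {ffn_params / 1e6:.2f}M parameters")   # ~4.72M

The FFN accounts for roughly two thirds of the block's weights, and its per-token matrix multiplications dominate compute at typical sequence lengths, which is why it is the natural target for sparsification.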
The diagram illustrates the architectural modification. The dense MLP (FFN) in a standard ViT block is substituted with a sparse MoE layer, while the self-attention mechanism and residual connections remain unchanged.
In the context of a ViT, a "token" corresponds to an embedded image patch. The gating network in the MoE layer learns to route each patch embedding to the experts best suited to process it. This leads to a fascinating form of learned specialization: during training, different experts may evolve to recognize distinct visual concepts, such as particular textures, shapes, or object categories.
This specialization allows the model to dedicate parameters to a wide array of visual patterns without requiring every patch to be processed by every parameter. An image of a cat in a field would primarily activate experts for fur, grass, and organic shapes, while an image of a skyscraper would activate experts for straight lines, glass, and geometric patterns.
The integration of an MoE layer into a ViT block is straightforward from a code perspective. The gating network is a simple linear layer that takes a patch embedding of dimension d_model and outputs logits for the N experts.
logits = GatingNetwork(patch_embedding), where GatingNetwork is typically torch.nn.Linear(d_model, N). The TopK routing mechanism then selects the experts, and the final output is a weighted sum of the outputs from the selected experts, just as in the language-based Transformers.
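The routing computation can be sketched in a few lines. The names below (gating_network, patch_embeddings) are hypothetical, and applying a softmax over only the selected logits is one common weighting choice rather than the only one.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts, top_k = 768, 8, 2
num_patches = 196

gating_network = nn.Linear(d_model, num_experts)
patch_embeddings = torch.randn(num_patches, d_model)

logits = gating_network(patch_embeddings)           # (num_patches, num_experts)
topk_logits, topk_idx = logits.topk(top_k, dim=-1)  # per-patch expert choices
weights = F.softmax(topk_logits, dim=-1)            # combination weights for the k experts

# Each patch is then dispatched only to its top_k experts, and the layer output is
# the weights-weighted sum of those experts' outputs for that patch.
print(topk_idx[0], weights[0])  # experts and weights chosen for the first patch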
A simplified PyTorch implementation of a ViTMoEBlock highlights this substitution.
import torch
import torch.nn as nn

# Assume MoELayer is defined as in previous chapters and returns
# (output, auxiliary load-balancing loss):
# class MoELayer(nn.Module): ...

class ViTMoEBlock(nn.Module):
    """A Transformer encoder block whose dense FFN is replaced by a sparse MoE layer."""

    def __init__(
        self,
        dim: int,
        num_heads: int,
        num_experts: int,
        top_k: int,
        mlp_ratio: float = 4.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # batch_first=True so inputs are (batch, num_patches, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

        # Replace the standard MLP with an MoE layer; each expert is a
        # standard FFN with hidden size dim * mlp_ratio
        self.moe_layer = MoELayer(
            input_dim=dim,
            num_experts=num_experts,
            top_k=top_k,
            expert_hidden_dim=int(dim * mlp_ratio),
        )

    def forward(self, x: torch.Tensor):
        # Multi-head self-attention sub-layer (pre-norm, residual connection)
        x_norm = self.norm1(x)
        attn_output, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + attn_output

        # Sparse MoE sub-layer in place of the dense FFN (pre-norm, residual connection)
        moe_output, aux_loss = self.moe_layer(self.norm2(x))
        x = x + moe_output

        return x, aux_loss
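Assuming an MoELayer with the interface used above (it preserves the embedding dimension and returns an auxiliary load-balancing loss alongside its output), the block can be exercised on dummy patch tokens as follows; the shapes and hyperparameters are illustrative.

block = ViTMoEBlock(dim=768, num_heads=12, num_experts=8, top_k=2)

# A batch of 4 images, each already embedded into 196 patch tokens of size 768
patch_tokens = torch.randn(4, 196, 768)

out, aux_loss = block(patch_tokens)
print(out.shape)  # torch.Size([4, 196, 768])

# During training, the auxiliary loss is added to the task loss so the router
# learns to spread patches evenly across experts, e.g.:
# total_loss = task_loss + aux_loss_weight * aux_loss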
The application of MoEs to vision has yielded significant results. Research has shown that ViT-MoE models can match or exceed the performance of dense models with a similar computational budget (FLOPs) while being trained for far fewer steps. For example, ViT-MoE models with billions of parameters have been trained to high accuracy on large-scale datasets such as ImageNet-21k and JFT-300M, demonstrating that sparse models are an effective path toward scaling up vision architectures.
The core trade-off remains central: you accept a large increase in the memory required to store the model's parameters in exchange for a computationally efficient forward pass. This makes ViT-MoEs particularly well-suited for scenarios where a highly capable model is needed but inference latency and cost must be controlled.
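The size of this trade-off is easy to estimate for the FFN portion of a single block. The expert count and top_k below are illustrative, and the count ignores biases and the router's own (tiny) linear layer.

# Parameters stored vs. parameters active per token for the FFN sub-layer
d, mlp_ratio = 768, 4
num_experts, top_k = 32, 2

dense_ffn = 2 * d * (mlp_ratio * d)       # one dense FFN: ~4.7M parameters
moe_ffn_stored = num_experts * dense_ffn  # all experts live in memory: ~151M
moe_ffn_active = top_k * dense_ffn        # only top_k experts run per token: ~9.4M

print(f"stored: {moe_ffn_stored / 1e6:.0f}M, active per token: {moe_ffn_active / 1e6:.1f}M")

The model stores roughly 32x more FFN parameters than its dense counterpart, yet each token only pays the compute cost of two experts.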