With the high-level components of gating and expert networks in place, we can now assemble them into a precise mathematical model. This section details the complete forward pass of a single token through a sparse MoE layer as a sequence of concrete computational steps.

The Gating Network: From Input to Routing Weights

The process begins with the gating network, also known as the router. Its job is to determine which experts should process the current input token. The input is a token embedding, represented as a vector $x \in \mathbb{R}^d$, where $d$ is the model's hidden dimension.

The gating network itself is a simple linear layer, defined by a weight matrix $W_g \in \mathbb{R}^{d \times N}$, where $N$ is the total number of experts. This layer projects the input token into a space of dimension $N$, producing a logit for each expert:

$$ h(x) = x \cdot W_g $$

The resulting vector $h(x)$ contains $N$ raw scores. To convert these scores into a probability distribution, we apply the softmax function:

$$ g(x) = \text{softmax}(h(x)) $$

The output, $g(x)$, is a dense $N$-dimensional vector where each element $g(x)_i$ represents the router's confidence in assigning the token to expert $i$. The elements of $g(x)$ sum to 1.

Enforcing Sparsity with Top-K Gating

A dense $g(x)$ vector implies that every expert would contribute to the output, which defeats the computational efficiency goal of MoEs. To enforce sparsity, we apply a Top-K operation: instead of using all experts, we select a small, fixed number $k$ of the highest-scoring experts.

For a given token, we identify the indices of the top $k$ values in $g(x)$ and set all other gating values to zero. This produces a sparse gating vector, $G(x)$. The choice of $k$ is a critical hyperparameter. In Switch Transformers, $k=1$, meaning each token is routed to a single expert. A more common choice is $k=2$, which provides a path for learning more complex functions and adds a degree of redundancy.

This operation effectively prunes the computational graph for each token. If $k=2$ and we have $N=64$ experts, we only need to perform the forward pass for 2 of them, ignoring the other 62.

The Expert Networks

Each of the $N$ experts is typically an independent feed-forward network (FFN). While they all share the same architecture, they do not share weights: each expert $E_i$ has its own set of parameters. A standard two-layer FFN expert can be written as:

$$ E_i(x) = \text{ReLU}(x \cdot W_{1,i}) \cdot W_{2,i} $$

Here, $W_{1,i}$ and $W_{2,i}$ are the weight matrices for the first and second linear layers of expert $i$, respectively. It is this collection of independent expert weights that produces the dramatic increase in the model's total parameter count.

The Complete Forward Pass

We can now combine these steps to define the final output, $y(x)$, of the MoE layer. The output is the weighted sum of the outputs of the selected experts, using the sparse gating weights from the Top-K operation:

$$ y(x) = \sum_{i=1}^{N} G(x)_i \cdot E_i(x) $$

Since $G(x)$ is sparse, with only $k$ non-zero values, this summation is computationally efficient: we only need to evaluate $E_i(x)$ for the $k$ experts selected by the router.
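To make these steps concrete, here is a minimal PyTorch sketch of the forward pass, assuming the formulation above. The class and argument names (SparseMoELayer, Expert, d_hidden, and so on) are illustrative, the per-token Python loop is written for readability rather than the batched per-expert dispatch used in production systems, and the top-$k$ weights from the dense softmax are used directly, without the renormalization discussed later in this section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard two-layer FFN expert: E_i(x) = ReLU(x W_{1,i}) W_{2,i}."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.relu(self.w1(x)))


class SparseMoELayer(nn.Module):
    """Router (linear + softmax) -> Top-K selection -> weighted sum of k experts."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # W_g
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                  # h(x), shape (num_tokens, N)
        probs = F.softmax(logits, dim=-1)      # g(x), dense routing weights
        topk_vals, topk_idx = torch.topk(probs, self.top_k, dim=-1)

        outputs = []
        for t in range(x.size(0)):             # per-token loop, for clarity only
            token = x[t]
            # y(x) = sum_i G(x)_i * E_i(x), evaluated only for the k selected experts
            y_t = sum(w * self.experts[int(i)](token)
                      for w, i in zip(topk_vals[t], topk_idx[t]))
            outputs.append(y_t)
        return torch.stack(outputs)


# Example: with 8 experts and k=2, only 2 expert FFNs run per token.
layer = SparseMoELayer(d_model=512, d_hidden=2048, num_experts=8, top_k=2)
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

Production implementations replace the inner loop with a gather/scatter of tokens into per-expert batches, but the algebra is exactly the weighted sum shown here.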
The entire data flow for a single token can be visualized as follows:

```dot
digraph G {
  rankdir=TB;
  splines=ortho;
  node [shape=box, style="rounded,filled", fontname="sans-serif", margin="0.2,0.1"];
  edge [fontname="sans-serif", fontsize=10];

  subgraph cluster_input {
    label="Input Token"; style=invis;
    x [label="Input x", fillcolor="#a5d8ff"];
  }
  subgraph cluster_gating {
    label="Gating Network (Router)"; bgcolor="#ffec99"; style=rounded;
    gating_linear [label="Linear (Wg)", fillcolor="#ffe066"];
    softmax [label="Softmax", fillcolor="#ffe066"];
    topk [label="Top-K Selection\n(k=2)", fillcolor="#ffe066"];
  }
  subgraph cluster_experts {
    label="Expert Networks"; bgcolor="#e9ecef"; style=rounded;
    e1 [label="Expert 1 (FFN)", fillcolor="#dee2e6", style="dashed"];
    e2 [label="Expert 2 (FFN)", fillcolor="#b2f2bb"];
    e_dots [label="...", shape=none];
    eN [label="Expert N (FFN)", fillcolor="#b2f2bb"];
  }
  subgraph cluster_output {
    label="Final Output"; style=invis;
    sum_op [label="Weighted Sum Σ", shape=circle, fillcolor="#ffc9c9"];
    y [label="Output y(x)", fillcolor="#a5d8ff"];
  }

  // Connections
  x -> gating_linear [label="d-dim vector"];
  gating_linear -> softmax [label="N-dim logits"];
  softmax -> topk [label="N-dim weights"];
  topk -> sum_op [label="G(x) (Sparse Weights)", color="#f03e3e", fontcolor="#f03e3e"];
  x -> e1 [style=dashed];
  x -> e2;
  x -> e_dots [style=invis];
  x -> eN;
  e1 -> sum_op [label="E₁(x)", style=dashed];
  e2 -> sum_op [label="E₂(x)"];
  eN -> sum_op [label="Eɴ(x)"];
  sum_op -> y;

  // Invisible edges for alignment
  e_dots -> sum_op [style=invis];

  // Annotations
  {rank=same; e1; e2; e_dots; eN;}
  note [label="In this example, Expert 2 and Expert N\nare selected by the Top-K gating.\nExpert 1 is inactive (dashed lines).", shape=note, fillcolor="#fff9db", style=filled, align=left];
  topk -> note [style=invis];
}
```

Data flow for a single token through an MoE layer. The input x is sent to the gating network to produce sparse weights and, in parallel, to the selected experts for processing.

A Note on Differentiability and Weight Renormalization

The Top-K function is non-differentiable, which poses a problem for backpropagation. In practice, this is handled by a straight-through estimator: during the forward pass, we apply the discrete Top-K selection; during the backward pass, we pass the gradients through the top $k$ gates as if the selection had been a simple multiplication. The dense gating output $g(x)$ is used to compute the gradients for the gating weights $W_g$.

Additionally, after selecting the top $k$ values from the initial softmax output $g(x)$, their sum is no longer guaranteed to be 1. To form a proper convex combination, these $k$ values are often re-normalized. This is typically done by applying a second softmax over only the selected top $k$ logits from $h(x)$, which ensures that the weights used in the final summation reflect their relative importance and sum to 1 (a short sketch of this step appears at the end of this section).

This formulation provides a model with a massive number of parameters but a constant computational cost per token, determined by $k$ rather than $N$. However, this elegant structure introduces a significant challenge: if the gating network learns to route most tokens to only a few experts, the other experts will not receive training signals. This leads to the problem of expert collapse, which we address next by introducing load balancing losses.
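As referenced above, here is a small sketch of the top-$k$ weight renormalization, again in PyTorch under the same assumptions as the earlier example; the helper name renormalize_top_k is illustrative. It applies a softmax over only the selected logits from $h(x)$, so the $k$ routing weights form a proper convex combination.

```python
import torch
import torch.nn.functional as F


def renormalize_top_k(logits: torch.Tensor, k: int = 2):
    """Select the top-k logits per token and softmax over them alone,
    so the k routing weights sum to 1."""
    topk_logits, topk_idx = torch.topk(logits, k, dim=-1)  # (num_tokens, k)
    weights = F.softmax(topk_logits, dim=-1)               # renormalized weights
    return weights, topk_idx


# With N = 4 experts and k = 2, the two weights per token sum to 1.
h = torch.tensor([[2.0, 0.5, 1.5, -1.0]])  # raw router logits h(x) for one token
w, idx = renormalize_top_k(h, k=2)
print(idx)         # tensor([[0, 2]])
print(w.sum(-1))   # sums to 1 (up to floating point)
```

In the earlier SparseMoELayer sketch, these renormalized weights would replace the raw topk_vals taken from the dense softmax.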