After replacing a dense Feed-Forward Network (FFN) with a Mixture of Experts (MoE) layer, the next architectural decision is where and how often to place these sparse layers. Simply replacing every FFN with an MoE is a valid, but often suboptimal, strategy. It dramatically increases the model's parameter count and can introduce significant communication overhead in distributed settings without a proportional gain in performance. The placement of MoE layers is a critical design choice that balances model capacity, computational cost, and training dynamics.

This decision involves two primary axes: the frequency of MoE layers (e.g., in every block, every other block) and their location within the network's depth (e.g., concentrated in the early, middle, or late stages).

## Frequency of MoE Layers

The most common and empirically successful strategy is to replace the FFNs in an alternating pattern. For example, you might use an MoE layer in every second or every fourth Transformer block, while the other blocks retain their standard dense FFNs. This heuristic provides a good balance between increasing model capacity and managing computational and communication costs.

Architectures like Google's Switch Transformer and GLaM adopted this "every other layer" pattern. The primary motivation is to moderate the all-to-all communication overhead required for expert parallelism. During training, each device sends the tokens destined for a specific expert to the device holding that expert's weights. This is a network-intensive operation. By alternating MoE layers with standard FFNs, which do not require such communication, the overall communication-to-computation ratio remains manageable.
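To make the alternating pattern concrete, here is a minimal PyTorch-style sketch of how the FFN/MoE sub-layers of a Transformer stack might be assembled with an MoE layer every `moe_every` blocks. The `DenseFFN` and `MoELayer` classes are simplified stand-ins (top-1 routing, no load balancing or capacity limits), not the implementation of any particular architecture; the point is only the placement logic.

```python
import torch
import torch.nn as nn


class DenseFFN(nn.Module):
    """Standard position-wise feed-forward sub-layer."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoELayer(nn.Module):
    """Simplified top-1 MoE sub-layer (no load balancing or capacity limits)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            DenseFFN(d_model, d_ff) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route each token to one expert and scale
        # its output by the router probability (top-1, Switch-style routing).
        probs = self.router(x).softmax(dim=-1)      # (batch, seq, num_experts)
        gate, expert_idx = probs.max(dim=-1)        # both (batch, seq)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                  # (batch, seq) boolean
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


def build_ffn_sublayers(
    num_layers: int, d_model: int, d_ff: int, num_experts: int, moe_every: int = 2
) -> nn.ModuleList:
    """Build the FFN/MoE sub-layer for each block (attention omitted for brevity).

    With moe_every=2, blocks 1, 3, 5, ... (0-indexed) receive an MoE layer,
    reproducing the common "every other layer" pattern.
    """
    sublayers = nn.ModuleList()
    for layer_idx in range(num_layers):
        if (layer_idx + 1) % moe_every == 0:
            sublayers.append(MoELayer(d_model, d_ff, num_experts))
        else:
            sublayers.append(DenseFFN(d_model, d_ff))
    return sublayers


# Example: a 6-layer stack with MoE in every second block.
ffn_stack = build_ffn_sublayers(num_layers=6, d_model=512, d_ff=2048, num_experts=8)
```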
The diagram below illustrates three potential frequency strategies for a 6-layer Transformer. The alternating pattern is often the most practical starting point.

[Figure: three 6-layer Transformer stacks. Strategy 1 ("All Layers") places an MoE in every block, Strategy 2 ("Alternating Layers") alternates dense FFN and MoE blocks, and Strategy 3 ("Late Layers Only") uses MoE only in later blocks.]

*Three different MoE placement strategies in a 6-layer model. The alternating pattern (Strategy 2) is a common and effective baseline.*

## Location: Early vs. Late Layers

The depth at which MoE layers are placed influences the type of specialization they learn. In a standard Transformer, earlier layers tend to capture more general, syntactic, or low-level features, while later layers learn more abstract, semantic, and task-specific representations.

This leads to a compelling hypothesis: MoE layers may be more effective in the later stages of a network. The reasoning is that token representations become more distinct and semantically rich in deeper layers, making the routing decision more meaningful. For instance, in a late layer, the router can more reliably distinguish between tokens related to "physics" versus "finance" and route them to experts specialized in those domains. In an early layer, such distinctions may not have fully emerged from the raw embeddings, making it harder for the gating network to learn a useful routing policy.

Conversely, an argument can be made for placing MoE layers early to encourage diverse feature pathways from the start. However, empirical evidence often points to greater benefits from specialization in the middle-to-late layers.

Ultimately, the optimal placement is an empirical question. The goal is to find the sweet spot that maximizes model performance for a given computational budget.
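The choice between early, uniform, and late placement can be expressed as a small helper that maps a strategy name to the set of block indices that receive an MoE layer. This is an illustrative sketch (the strategy names and the uniform-spacing rule are assumptions, not a standard API), meant to be combined with a builder like the one shown earlier.

```python
def moe_layer_indices(num_layers: int, num_moe_layers: int, strategy: str) -> list[int]:
    """Return the 0-indexed blocks that receive an MoE layer.

    Strategies (illustrative naming, not a standard API):
      "early":   pack MoE layers at the bottom of the stack
      "late":    pack MoE layers at the top of the stack
      "uniform": spread MoE layers evenly (the alternating pattern when
                 num_moe_layers == num_layers // 2)
    """
    if strategy == "early":
        return list(range(num_moe_layers))
    if strategy == "late":
        return list(range(num_layers - num_moe_layers, num_layers))
    if strategy == "uniform":
        stride = num_layers // num_moe_layers
        return list(range(stride - 1, num_layers, stride))[:num_moe_layers]
    raise ValueError(f"unknown strategy: {strategy!r}")


# For a 6-layer model with 3 MoE layers:
print(moe_layer_indices(6, 3, "early"))    # [0, 1, 2]
print(moe_layer_indices(6, 3, "uniform"))  # [1, 3, 5]  -> Strategy 2 in the figure
print(moe_layer_indices(6, 3, "late"))     # [3, 4, 5]
```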
The chart below shows an illustrative comparison of performance across placement strategies, suggesting that uniform or slightly late-biased placement often yields the best results.

[Figure: illustrative bar chart of downstream task accuracy (%) for early, uniform/alternating, and late MoE placement, with the uniform/alternating strategy scoring highest.]

*Relationship between the location of MoE layers and model performance. A uniform or alternating placement often provides the best trade-off.*

## Practical Guidelines and Recommendations

When designing your MoE architecture, consider the following points:

- **Start with an alternating pattern:** For most applications, replacing the FFN in every other Transformer block is a sensible baseline. It effectively increases model capacity while keeping communication costs in check.
- **Analyze the task:** If your task requires significant high-level reasoning and specialization (e.g., a multi-domain question-answering system), placing MoEs in the middle-to-late layers, where semantic representations are richer, may be beneficial. For tasks that might benefit from early feature differentiation, experimenting with earlier placement is worthwhile.
- **Consider model depth:** For exceptionally deep models (e.g., 100+ layers), an every-other-layer strategy might still be too frequent. In such cases, a sparser placement, such as one MoE layer every three or four blocks, might be more appropriate to avoid excessive communication overhead.
- **Fine-tuning implications:** During fine-tuning, the later layers of a model are often the most important for adapting to a new task. If you are fine-tuning a pre-trained MoE model, you may find that training only the MoE layers, particularly those in the latter half of the network, is an efficient and effective strategy; a minimal sketch of this appears at the end of the section.

The optimal placement strategy is not universal. It depends on the specific model architecture, the nature of the task, and the constraints of your hardware environment. The principles and patterns discussed here provide a strong starting point for making informed architectural decisions, which should then be refined through experimentation.
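As a concrete illustration of the fine-tuning point above, the sketch below freezes every parameter except those belonging to MoE layers in the latter half of the stack. The parameter-name patterns (`blocks.<idx>.` and `.moe.`) are hypothetical; adapt the string matching to your own model's naming scheme.

```python
import torch.nn as nn


def freeze_all_but_late_moe(model: nn.Module, num_layers: int) -> None:
    """Train only the MoE parameters in the latter half of the network.

    Assumes (hypothetically) that parameters are named like
    "blocks.<layer_idx>.moe.experts.0.net.0.weight"; adjust the matching
    logic to your model's actual naming convention.
    """
    for name, param in model.named_parameters():
        param.requires_grad = False
        if ".moe." in name and "blocks." in name:
            layer_idx = int(name.split("blocks.")[1].split(".")[0])
            if layer_idx >= num_layers // 2:
                param.requires_grad = True


# Usage sketch: freeze, then give the optimizer only the trainable parameters.
# freeze_all_but_late_moe(model, num_layers=24)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```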