Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017, arXiv preprint arXiv:1701.06538, DOI: 10.48550/arXiv.1701.06538 - Introduces the sparsely-gated Mixture-of-Experts (MoE) layer, detailing its architecture, top-k routing, and the auxiliary loss for load balancing. This paper provides the basis for the MoE layer discussed here (a brief sketch follows these references).
torch.nn, PyTorch Documentation, 2024 - Official documentation for the torch.nn module, the powerful building block used here for neural network layers and models such as the Expert and MoELayer classes; provides a detailed API reference and usage examples.
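To connect the two references, below is a minimal sketch of a sparsely-gated MoE layer built with torch.nn, in the spirit of Shazeer et al. (2017). The class names Expert and MoELayer follow the document's usage; all implementation details (layer sizes, the loop-based dispatch, and the simple importance-based auxiliary loss) are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sparsely-gated MoE sketch with top-k routing and a load-balancing
# auxiliary loss. Assumes Expert/MoELayer names from the document; details
# are illustrative, not the reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert network."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoELayer(nn.Module):
    """Sparsely-gated mixture of experts with top-k routing."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # produces routing logits
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        logits = self.gate(x)                        # (tokens, experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        # Softmax over the selected experts only, so weights sum to 1 per token.
        topk_weights = F.softmax(topk_vals, dim=-1)  # (tokens, k)

        output = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                  # expert id per token
            w = topk_weights[:, slot].unsqueeze(-1)  # (tokens, 1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    output[mask] += w[mask] * expert(x[mask])

        # Auxiliary load-balancing term: penalize uneven "importance"
        # (sum of gate probabilities per expert) via its squared coefficient
        # of variation, in the spirit of the paper's importance loss.
        probs = F.softmax(logits, dim=-1)
        importance = probs.sum(dim=0)                # (experts,)
        aux_loss = importance.var(unbiased=False) / (importance.mean() ** 2 + 1e-9)
        return output, aux_loss


if __name__ == "__main__":
    layer = MoELayer(d_model=16, d_hidden=32, num_experts=4, k=2)
    tokens = torch.randn(8, 16)
    out, aux = layer(tokens)
    print(out.shape, aux.item())  # torch.Size([8, 16]) and a scalar aux loss
```

In practice the auxiliary loss would be scaled by a small coefficient and added to the task loss, so the gate is encouraged to spread tokens across experts while top-k routing keeps each token's compute sparse.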