Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1701.06538 - Introduces the sparsely-gated Mixture-of-Experts (MoE) layer, whose conditional routing of tokens to experts is the architectural foundation requiring expert parallelism and All-to-All communication.
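As context for this reference, the sketch below illustrates the noisy top-k gating idea at the heart of the sparsely-gated MoE layer: each token receives non-zero weight for only k experts. This is a minimal, hedged illustration in PyTorch; the class name, dimensions, and the omission of the paper's load-balancing losses are simplifications, not the authors' implementation.

```python
# Minimal sketch of noisy top-k gating in the spirit of Shazeer et al. (2017).
# Illustrative only: no load-balancing auxiliary losses, no capacity limits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # clean gating logits
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # per-expert noise scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model] -> sparse gate weights: [tokens, num_experts]
        clean = self.w_gate(x)
        noise = torch.randn_like(clean) * F.softplus(self.w_noise(x))
        logits = clean + noise
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        # Softmax over only the selected experts; all other gates are exactly zero.
        sparse = torch.full_like(logits, float("-inf"))
        sparse.scatter_(-1, topk_idx, topk_vals)
        return F.softmax(sparse, dim=-1)

gate = NoisyTopKGate(d_model=16, num_experts=8, k=2)
weights = gate(torch.randn(4, 16))  # each row has at most 2 non-zero entries
```

Because each token activates only k experts, the experts can be sharded across devices, and the tokens themselves must then be exchanged between devices via an All-to-All, which is where the NCCL reference below comes in.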
NVIDIA Collective Communications Library (NCCL) Developer Guide, NVIDIA Corporation, 2023 - Official documentation for NVIDIA's NCCL, detailing its optimized collective communication primitives (ncclAllReduce, ncclAllGather, ncclReduceScatter, and grouped ncclSend/ncclRecv, from which All-to-All exchanges are composed) for high-performance GPU-based distributed training.
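The sketch below shows the expert-parallel All-to-All token exchange from the application side, using PyTorch's torch.distributed with the NCCL backend; on GPU this collective is typically lowered to NCCL's grouped point-to-point primitives. The function name, tensor shapes, and equal-split assumption are illustrative, not part of either reference.

```python
# Minimal sketch of an expert-parallel All-to-All exchange over the NCCL backend.
# Assumes a torchrun launch so RANK/WORLD_SIZE/LOCAL_RANK are set in the environment.
import os
import torch
import torch.distributed as dist

def exchange_tokens(local_tokens: torch.Tensor) -> torch.Tensor:
    """Send an equal slice of local_tokens to every rank and receive one slice back.

    local_tokens: [world_size * tokens_per_rank, d_model] on the current GPU,
    already sorted so that contiguous slices are destined for consecutive ranks.
    """
    received = torch.empty_like(local_tokens)
    dist.all_to_all_single(received, local_tokens)  # NCCL collective under the hood
    return received

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()
    tokens = torch.randn(world * 4, 64, device="cuda")  # 4 tokens destined per rank
    routed = exchange_tokens(tokens)
    dist.destroy_process_group()
```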