ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (ACM), DOI: 10.1145/3433701.3433707 - Presents ZeRO, the foundational scheme for sharding parameters, gradients, and optimizer states across data-parallel workers, which informed the design of FSDP.
Distributed arrays and automatic parallelization in JAX, JAX Authors, 2024 - Official JAX documentation explaining the fundamental mechanisms for distributed computation and automatic parallelization, built on sharded `jax.Array` objects and `jit`.