Data Parallelism with Synchronous and Asynchronous Updates
Scaling Distributed Machine Learning with the Parameter Server, Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su, 2014, 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), USENIX Association - This seminal paper introduces the parameter server architecture, a distributed framework that is fundamental to understanding asynchronous data parallelism in large-scale machine learning.
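To make the asynchronous update pattern concrete, here is a minimal toy sketch in the spirit of the parameter server architecture, not the paper's actual system: worker threads pull the current weights and push gradients back without waiting for one another, so each update may be computed from slightly stale weights. All names (ParameterServer, pull, push, worker) and the least-squares task are illustrative assumptions.

```python
import threading
import numpy as np

class ParameterServer:
    """Toy parameter server: workers pull weights and push gradients asynchronously."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()  # protects the shared weight vector

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Apply each (possibly stale) gradient as soon as it arrives.
        with self.lock:
            self.w -= self.lr * grad

def worker(server, X, y, steps):
    for _ in range(steps):
        w = server.pull()                      # weights may already be stale
        grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient on the local shard
        server.push(grad)                      # no barrier: workers never wait for each other

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true

server = ParameterServer(dim=5)
shards = np.array_split(np.arange(100), 4)  # one data shard per worker
threads = [threading.Thread(target=worker, args=(server, X[idx], y[idx], 200))
           for idx in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance from optimum:", np.linalg.norm(server.pull() - w_true))
```

Because pushes interleave freely, the exact trajectory varies from run to run; this staleness trade-off is what the paper's flexible consistency models are designed to manage.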
DistributedDataParallel (DDP), PyTorch Team, 2024 - Official documentation for PyTorch's primary synchronous data parallelism module, providing practical guidance for its usage and configuration.
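For the synchronous side, here is a minimal sketch of DDP training on CPU, assuming launch via torchrun (which supplies the rendezvous environment variables); the toy linear model, random batches, and step count are illustrative, and you would typically choose the "nccl" backend on GPUs:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")  # torchrun sets MASTER_ADDR, RANK, WORLD_SIZE
    model = nn.Linear(10, 1)
    ddp_model = DDP(model)  # broadcasts rank 0's weights, then syncs grads each backward()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(5):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # each rank trains on its own batch
        optimizer.zero_grad()
        loss_fn(ddp_model(x), y).backward()  # all-reduce averages gradients across ranks
        optimizer.step()                     # every rank applies the identical update
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with, for example, `torchrun --nproc_per_node=2 ddp_sketch.py`. Because every rank applies the same averaged gradient, the model replicas stay identical after each step, which is the defining property of synchronous data parallelism.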
Distributed Machine Learning Systems, Jinjun Cai, Song Guo, Min-Hua Huang, Yang Li, Xiaoyong Li, Wenlong Ma, Yong-Min Wang, Jianxiong Xiao, Xiaochao Wang, Xilong Wu, Junyuan Xie, Chunjing Xu, Shuai Zheng, Wenqiang Zhang, 2021, Synthesis Lectures on Data Mining and Knowledge Discovery (Morgan & Claypool Publishers), DOI: 10.2200/S01099ED1V01Y202105DMK019 - A comprehensive book covering various aspects of distributed machine learning systems, including different parallelism strategies, communication patterns, and system designs.