Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, Alexander Sergeev, Mike Del Balso, 2018. arXiv preprint arXiv:1802.05799, DOI: 10.48550/arXiv.1802.05799 - This paper presents Horovod, a distributed training framework that synchronizes gradients with an optimized ring-allreduce operation, offering significant speedups over parameter-server approaches while requiring only minimal changes to single-GPU training code.
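To make the "minimal changes" claim concrete, here is a sketch of the usage pattern the paper documents, written against the TF1-era TensorFlow and Horovod APIs of that period; the toy linear model, learning rate, and step count are illustrative stand-ins for a real training script.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # initialize Horovod's communication layer

# Pin each worker process to one GPU, indexed by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model: least squares on random inputs (illustrative only).
x = tf.random_normal([32, 10])
w = tf.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Scale the learning rate by the worker count, then wrap the optimizer:
# DistributedOptimizer averages gradients across workers via ring-allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# Broadcast rank 0's initial variables so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # one synchronous data-parallel step
```

In the paper's setup such a script is launched with one process per GPU via MPI, e.g. `mpirun -np 4 python train.py`.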
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, 2023 (Cambridge University Press) - This textbook, also freely available online, offers a detailed chapter on distributed training that covers synchronous and asynchronous updates, communication patterns, and strategies for improving efficiency, providing both practical and theoretical foundations.
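Both references center on the ring all-reduce communication pattern. The following single-process NumPy simulation is a minimal sketch of its two phases, scatter-reduce and all-gather; the function name, chunk indexing convention, and toy gradients are illustrative, and real implementations overlap these steps with network transfers.

```python
import numpy as np

def ring_allreduce(grads):
    """Average one gradient array per worker via a simulated ring all-reduce."""
    n = len(grads)
    # Each worker splits its gradient into n chunks; chunks[i][c] is
    # worker i's current copy of chunk c.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1, scatter-reduce: at each step, worker i sends chunk
    # (i - step) mod n to its right neighbor, which accumulates it.
    # After n - 1 steps, worker i holds the full sum of chunk (i + 1) mod n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            dst = (i + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[i][c]

    # Phase 2, all-gather: each fully reduced chunk circulates around the
    # ring so every worker ends up holding every reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            dst = (i + 1) % n
            chunks[dst][c] = chunks[i][c].copy()

    # Divide by the worker count to turn the sum into an average.
    return [np.concatenate(chunks[i]) / n for i in range(n)]

# Four simulated workers, each with a different gradient of length eight.
grads = [np.arange(8.0) * (i + 1) for i in range(4)]
for result in ring_allreduce(grads):
    assert np.allclose(result, np.mean(grads, axis=0))
```

Each worker sends and receives only 2(n - 1)/n of the gradient size in total, independent of the worker count, which is why this pattern scales better than funneling all gradients through a central parameter server.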