Distributed training with TensorFlow, TensorFlow Developers, 2024 - The official guide to tf.distribute.Strategy, explaining its design principles, the strategies it provides, and how to use them with Keras and with custom training loops.
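For orientation, a minimal sketch of the Keras pattern the guide documents: build the model under the strategy's scope so its variables are replicated. The synthetic data here is only an illustrative assumption, not part of the guide.

```python
import tensorflow as tf

# Create the model and optimizer inside strategy.scope() so variables
# are mirrored across all local devices (falls back to one device on CPU).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Toy data standing in for a real input pipeline.
x = tf.random.normal([256, 10])
y = tf.reduce_sum(x, axis=1, keepdims=True)
model.fit(x, y, batch_size=32, epochs=1)  # each batch is split across replicas
```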
Horovod: fast and easy distributed deep learning in TensorFlow, Alexander Sergeev, Mike Del Balso, 2018. arXiv preprint arXiv:1802.05799. DOI: 10.48550/arXiv.1802.05799 - Introduces Horovod, a distributed training framework that implements data parallelism efficiently with ring all-reduce, the same synchronous approach that underlies strategies such as MirroredStrategy.
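The paper's all-reduce pattern, sketched following Horovod's documented TF2 usage (assumes TensorFlow 2.x with Horovod installed, launched as one process per GPU, e.g. `horovodrun -np 4 python train.py`; the toy model and data are illustrative):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR with worker count

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    # DistributedGradientTape averages gradients across workers via ring all-reduce.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Broadcast initial state from rank 0 so all replicas start identically.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss

for step in range(100):
    x = tf.random.normal([32, 10])
    y = tf.reduce_sum(x, axis=1, keepdims=True)
    train_step(x, y, step == 0)
```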
Large Scale Distributed Deep Networks, Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng, 2012. Advances in Neural Information Processing Systems (NIPS) 25, Curran Associates, Inc. - Presents a large-scale distributed training system built on a parameter server architecture, relevant for understanding the principles behind ParameterServerStrategy.
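A hedged sketch of how the paper's asynchronous parameter-server idea (cf. Downpour SGD) maps onto TF2's ParameterServerStrategy with a ClusterCoordinator; the TF_CONFIG-based cluster spec, toy data, and step function are assumptions for illustration, not the paper's system:

```python
import tensorflow as tf

# Variables live on "ps" tasks; "worker" tasks run steps asynchronously.
# Assumes cluster membership is supplied via the TF_CONFIG environment variable.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(0.1)

coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

def dataset_fn():
    x = tf.random.normal([1024, 10])
    y = tf.reduce_sum(x, axis=1, keepdims=True)
    return tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(32)

per_worker_dataset = coordinator.create_per_worker_dataset(dataset_fn)
iterator = iter(per_worker_dataset)

@tf.function
def worker_fn(it):
    def replica_fn(x, y):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(x) - y))
        grads = tape.gradient(loss, model.trainable_variables)
        # Gradients are applied to variables hosted on the parameter servers.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    x, y = next(it)
    return strategy.run(replica_fn, args=(x, y))

for _ in range(100):
    # schedule() returns immediately; workers execute steps asynchronously.
    coordinator.schedule(worker_fn, args=(iterator,))
coordinator.join()
```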
A Domain-Specific Architecture for Deep Neural Networks, Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, 2018. Communications of the ACM, Vol. 61, Association for Computing Machinery (ACM). DOI: 10.1145/3154484 - Describes the architecture of Google's Tensor Processing Unit (TPU) and its evolution, giving insight into the hardware that TPUStrategy targets.
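A brief sketch of pointing Keras at this hardware through TPUStrategy; it assumes a reachable Cloud TPU (on a TPU VM the default resolver arguments suffice) and uses toy data for illustration:

```python
import tensorflow as tf

# Connect to the TPU system and initialize it before building the model.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("TPU cores:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")

# fit() now compiles each step with XLA and shards batches across TPU cores.
x = tf.random.normal([1024, 10])
y = tf.reduce_sum(x, axis=1, keepdims=True)
model.fit(x, y, batch_size=128, epochs=1)
```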