Distributed training with TensorFlow, TensorFlow Authors, 2024 - Provides comprehensive guidance on setting up and using TensorFlow's distributed training strategies, which is essential for understanding and debugging distributed jobs.
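As a quick orientation to the kind of setup that guide covers, here is a minimal sketch of synchronous data-parallel training with tf.distribute.MirroredStrategy; the model, data shapes, and hyperparameters are illustrative placeholders rather than anything taken from the guide.

```python
import tensorflow as tf

# Mirror variables across all visible local GPUs (falls back to CPU).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are replicated on every device,
    # and gradients are all-reduced across replicas during fit().
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data stands in for a real input pipeline.
x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=64, epochs=1)
```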
Profile TensorFlow performance with the Keras Callback and TensorBoard, TensorFlow Authors, 2024 - Details how to use the TensorBoard Profiler, a key debugging tool for identifying bottlenecks and stragglers, for performance analysis in TensorFlow, including distributed scenarios.
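For concreteness, a minimal sketch of the workflow that guide describes, using the Keras TensorBoard callback's profile_batch argument to capture a trace during training; the log directory and batch range are hypothetical choices.

```python
import tensorflow as tf

# Capture a profiling trace for batches 10 through 20 of training.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile_demo",   # hypothetical path
    profile_batch=(10, 20),
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")

x = tf.random.normal((512, 8))
y = tf.random.normal((512, 1))
model.fit(x, y, batch_size=16, epochs=2, callbacks=[tb_callback])
# Then: tensorboard --logdir logs/profile_demo  (open the Profile tab)
```

Viewing the resulting log directory in TensorBoard's Profile tab shows the captured trace, which is where per-device timelines and input-pipeline bottlenecks become visible.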
Deep Learning (Section 12.1: Large-Scale Deep Learning), Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Offers foundational theoretical and practical considerations for parallel and distributed training, including the communication, synchronization, and data/model parallelism challenges that often lead to debugging issues.
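To make the synchronization point concrete, here is a toy illustration of synchronous data parallelism in plain Python/NumPy, not code from the book: each simulated worker computes a gradient on its own data shard, and the averaged gradient drives a single shared update.

```python
import numpy as np

def worker_gradient(w, x_shard, y_shard):
    # Gradient of mean squared error for a linear model y = x @ w.
    pred = x_shard @ w
    return 2 * x_shard.T @ (pred - y_shard) / len(x_shard)

rng = np.random.default_rng(0)
w = np.zeros((4, 1))
x, y = rng.normal(size=(32, 4)), rng.normal(size=(32, 1))
shards = np.array_split(np.arange(32), 4)  # 4 simulated workers

for _ in range(100):
    # Each "worker" computes a local gradient on its shard; the mean
    # plays the role of the all-reduce before one synchronized update.
    grads = [worker_gradient(w, x[s], y[s]) for s in shards]
    w -= 0.05 * np.mean(grads, axis=0)
```

In a real system the averaging step is a cross-device all-reduce, and the cost of that communication relative to local computation is precisely where many of the debugging issues the book anticipates arise.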
Performance Analysis of Distributed Deep Learning Training, Fan Yang, Xiaoxiao Wu, Jinkun Geng, Kaiwei Tu, Zongjian Hu, Jianxun Liu, Xiang Li, and Xiaoming Li, 2020 (arXiv preprint arXiv:2010.02640, DOI: 10.48550/arXiv.2010.02640) - Examines common performance bottlenecks and analysis methods in distributed deep learning, addressing topics like stragglers, communication overheads, and system resource utilization that are central to debugging.
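In the same spirit as the paper's straggler analysis, though not its actual methodology, the following toy sketch flags slow steps from per-step wall-clock timings; run_step, the simulated delays, and the 2x-median threshold are all hypothetical.

```python
import time
import statistics

def run_step(step):
    # Stand-in for a training step; every 7th step is artificially slow.
    time.sleep(0.05 if step % 7 == 0 else 0.01)

durations = []
for step in range(50):
    start = time.perf_counter()
    run_step(step)
    durations.append(time.perf_counter() - start)

# Flag steps far above the median as straggler candidates.
median = statistics.median(durations)
stragglers = [i for i, d in enumerate(durations) if d > 2 * median]
print(f"median step {median * 1e3:.1f} ms; straggler steps: {stragglers}")
```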