Hydra: Understanding and Improving Distributed Checkpointing in Deep Learning, Karki, Hritik and Narayanasamy, Sanjeev and Shah, Nirmit and Chen, Bo and Wang, Yuandong and D'Sa, Renju and Agarwal, Sachin and Chen, Chien-Chung and Chintapalli, Srinivas, 2022. SC'22: International Conference for High Performance Computing, Networking, Storage and Analysis (ACM). DOI: 10.1145/3550209.3552097 - This paper analyzes distributed checkpointing methods and proposes improvements, including strategies for managing the trade-off between consistency and performance, directly addressing the difficulty of choosing between synchronous and asynchronous checkpointing.
Tesseract: A Two-Level Checkpointing Protocol for Large-Scale Deep Learning, Kang, Yu and Zhang, Peifeng and Yang, Hong and Wang, Jiaqi and Zhu, Yanyuan and Wu, You and Zhang, Wei and Liu, Yong and Tian, Jin, 2023. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (Association for Computing Machinery). DOI: 10.1145/3575693.3575702 - Introduces a novel two-level checkpointing protocol that aims to improve the efficiency of large-scale deep learning by combining the strengths of synchronous and asynchronous approaches to address their respective challenges.