VL2: A Scalable and Flexible Data Center Network, Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta, 2009. ACM SIGCOMM, Vol. 39 (Association for Computing Machinery). DOI: 10.1145/1592568.1592576 - A seminal academic paper that applies the Clos (leaf-spine) topology to commodity data center networks, a design that is critical for scalable AI infrastructure.
A Survey on Communication Optimization in Distributed Deep Learning, Haozhao Wang, Jinsong Wu, Shuo Han, Jie Chen, Guoyong Cai, Bin Wu, Yongmei Zhu, 2022. ACM Computing Surveys, Vol. 55 (Association for Computing Machinery). DOI: 10.1145/3547372 - Offers a current and comprehensive overview of communication challenges and optimization strategies in distributed deep learning, directly relevant to minimizing communication overhead.