TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (USENIX Association). DOI: 10.5555/3342371.3342416 - This paper introduces TVM, a deep learning compiler stack that optimizes and deploys models across diverse hardware backends, focusing on heterogeneous execution and hardware-specific optimizations.
A Survey on Deep Learning System Optimization: From Algorithm to Hardware, Yujun Chen, Shuchang Sheng, Xiaolong Ma, Peiyan Dong, Zhiye Tang, Junru Zhao, Yuanjie Xie, and Shaoli Liu, 2020. Proceedings of the IEEE, Vol. 108 (IEEE). DOI: 10.1109/JPROC.2020.2991739 - This survey provides a comprehensive overview of various optimization techniques in deep learning systems, including extensive discussions on heterogeneous scheduling, communication optimization, and memory management.
GSPMD: General and Efficient Parallelism for ML Workloads, Yuanzhong Xu, Jiri Simsa, Jeremy Smith, D. J. Bernstein, Yuan Zhang, Sarah Sirajuddin, and Anna Goldie, 2021. Proceedings of Machine Learning and Systems (MLSys '21), Vol. 3 (MLSys). DOI: 10.48550/arXiv.2105.04694 - This paper introduces GSPMD, a system for automatic and efficient parallelism of ML workloads across heterogeneous devices, incorporating advanced strategies for data movement and device placement.
Ansor: Generating High-Performance Tensor Programs for Deep Learning, Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, Ion Stoica, 2020. 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (USENIX Association). DOI: 10.5555/3474345.3474397 - This paper details an auto-tuning framework for deep learning compilers that dynamically optimizes tensor programs for heterogeneous hardware, relying on accurate performance models and hardware characterization.