TVM: An End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, 2018. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), USENIX Association. DOI: 10.5555/3305417.3305467 - This seminal paper introduces TVM, a deep learning compiler stack that addresses the challenge of deploying ML models on diverse hardware; it is directly relevant to bridging the deployment gap through graph-level optimizations and hardware-specific code generation.
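To make the compile flow concrete, the sketch below shows how a model is typically lowered with TVM's Relay front end and run through the graph executor; the ONNX model path, input name/shape, and target string are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a typical TVM/Relay compile-and-run flow; the ONNX model file,
# input name/shape, and target string are placeholder assumptions for illustration.
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")                 # placeholder model file
shape_dict = {"input": (1, 3, 224, 224)}             # assumed input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Graph-level optimizations and hardware-specific code generation happen here.
target = "llvm"                                      # e.g. "cuda" for an NVIDIA GPU
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()
```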
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Andrey Ogarev, Mark Sandler, Hartwig Adam, and Dmitry Kalenichenko, 2018. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.1109/CVPR.2018.00869 - A widely cited paper from Google that introduced a method for quantizing neural network weights and activations to 8-bit integers for efficient inference on commodity hardware, directly addressing the numerical-precision factor mentioned in the gap.
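As a brief illustration of the affine quantization scheme this paper builds on, r ≈ S·(q − Z) with a floating-point scale S and an integer zero-point Z, the numpy sketch below quantizes a tensor to uint8 and dequantizes it back; the example tensor and its value range are made up for the example.

```python
# Minimal numpy sketch of the asymmetric (affine) 8-bit scheme r ~= S * (q - Z).
# The example tensor and its min/max range are illustrative, not from the paper.
import numpy as np

def quantize(r, qmin=0, qmax=255):
    """Map a float tensor r to uint8 using a per-tensor scale and zero-point."""
    rmin, rmax = min(r.min(), 0.0), max(r.max(), 0.0)   # range must include 0 exactly
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return scale * (q.astype(np.int32) - zero_point)

r = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
q, s, z = quantize(r)
print(q, s, z)
print(dequantize(q, s, z))   # close to r, up to quantization error
```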
Deep Learning Compilers: A New Horizon for Optimizing AI Workloads, Xiaoke Yu, Jiong Luo, Cheng Li, Yunji Chen, and Tianshi Chen, 2020. ACM Computing Surveys, Vol. 53, Association for Computing Machinery (ACM). DOI: 10.1145/3397943 - This survey provides a comprehensive overview of the design and implementation of deep learning compilers, detailing the optimization techniques used to bridge the performance gap between development and deployment across diverse hardware.