TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, 2018. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18) (USENIX Association) - Explains the architecture and optimizations of a representative deep learning compiler, illustrating how high-level operations are transformed into low-level kernels.
XLA: Optimizing Compiler for Machine Learning, TensorFlow Authors, 2024 (TensorFlow) - Describes how XLA transforms computation graphs into fused, optimized executables, providing background on the compilation process that complicates profiling.
NVIDIA Nsight Compute Documentation, NVIDIA Corporation, 2024 (NVIDIA) - Documents a leading kernel-level GPU profiler, showing the low-level hardware metrics available and the challenge of relating them to high-level ML constructs.
Demystifying GPU Performance for Deep Learning, Zhihao Jia, Yi Chung, Liqiang Xie, Yuanzhou Yang, Yuandong Tian, and Mu Li, 2018. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18) (ACM) DOI: 10.1145/3178243.3178253 - Presents methods for analyzing and optimizing deep learning performance on GPUs, which helps readers address the practical challenges of interpreting profiling data.