CUDA C++ Programming Guide, NVIDIA Corporation, 2023 (NVIDIA Corporation) - Provides comprehensive detail on NVIDIA's GPU architecture and programming model; essential for understanding GPU optimization.
A Domain-Specific Architecture for Deep Neural Networks, Norman P. Jouppi, Cliff Young, David Patil, Motohiro Saito, Edward Horowitz, Jeremy Ren, Anna Chen, Zhiyu Li, David Patterson, Ganesh Venkataramanan, Greg Judah, Kevin LaFayette, Aloke Singh, Rajankumar Vaidya, Felix Von Dollen, James Wilcox, Jonathon Wimmer, Brian Yoon, 2017, Communications of the ACM, Vol. 60 (Association for Computing Machinery), DOI: 10.1145/3133917 - Describes the architecture of Google's first-generation Tensor Processing Unit (TPU); a foundational paper on domain-specific ASICs for machine learning.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018, Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18) (USENIX Association) - Introduces TVM, a deep learning compiler framework that optimizes and generates code for a range of hardware backends, and illustrates the compiler design challenges that ML accelerators pose.