A Survey of Machine Learning Accelerators, Mengying Wei, Jacob Reagen, 2021. ACM Computing Surveys, Vol. 54 (ACM). DOI: 10.1145/3472061 - Offers a comprehensive survey of machine learning accelerators, covering the hardware types discussed in the section.
CUDA C++ Programming Guide, NVIDIA Corporation, 2023 (NVIDIA Corporation) - Provides comprehensive details on NVIDIA GPU architecture and programming model, essential for understanding GPU optimization.
A Domain-Specific Architecture for Deep Neural Networks, Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, 2018. Communications of the ACM, Vol. 61 (Association for Computing Machinery). DOI: 10.1145/3154484 - Describes the architecture of Google's first-generation Tensor Processing Unit (TPU), a foundational paper for domain-specific ML ASICs.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18) (USENIX Association) - Presents TVM, a deep learning compiler framework that optimizes and generates code for diverse hardware backends, illustrating compiler design challenges for ML accelerators.