NVIDIA CUDA C++ Programming Guide, NVIDIA Corporation, 2023 (NVIDIA) - Comprehensive guide to CUDA programming, detailing asynchronous execution with streams and events, memory management, and optimization techniques for NVIDIA GPUs.
SYCL 2020 Specification, Khronos Group, 2020 (Khronos Group) - Defines the SYCL standard for heterogeneous computing, describing task graphs, queues, events, and host-device synchronization for portable parallel programming.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (USENIX Association) - Introduces TVM, a deep learning compiler stack that optimizes execution across heterogeneous hardware by transforming and scheduling computation graphs, directly addressing asynchronous execution and resource management for ML workloads.
Programming Massively Parallel Processors: A Hands-on Approach, 3rd edition, David B. Kirk, Wen-mei W. Hwu, 2016 (Morgan Kaufmann) - A foundational textbook on GPU programming, offering detailed explanations of parallel architecture, CUDA streams, events, and advanced techniques for optimizing heterogeneous computation and data transfer overlap.