CUDA C++ Programming Guide, NVIDIA, 2024 (NVIDIA) - This official guide provides a detailed account of the CUDA architecture and its parallel programming model, which is central to understanding how GPUs execute many simple operations concurrently.
ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, 2012, Advances in Neural Information Processing Systems (NIPS 2012), Vol. 25 (Curran Associates, Inc.) - A landmark paper that showcased the effectiveness of GPUs in accelerating deep learning models, significantly contributing to the widespread adoption of GPUs in AI research and applications.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, 2022, NeurIPS 2022, DOI: 10.48550/arXiv.2208.07339 - This research paper addresses the practical challenges of running large language models on GPUs, discussing techniques like 8-bit matrix multiplication to improve efficiency, directly relevant to GPU utilization for LLMs.
Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, 2017 (Elsevier) - A well-regarded textbook that covers the fundamental principles of computer architecture, including detailed comparisons of CPU and GPU designs and the advantages of parallel processing for different workloads.