NVIDIA H100 Tensor Core GPU Architecture, NVIDIA Corporation, 2022 (NVIDIA) - Provides architectural details on Hopper GPUs, including FP8 formats (E4M3, E5M2), Tensor Cores, and matrix multiply-accumulate instructions for low-precision inference.
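As a rough illustration of the E4M3 format this entry mentions, here is a minimal Python decoder. It assumes the OCP FP8 convention used on Hopper (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities, and a single all-ones NaN pattern); it is a sketch of the bit layout, not NVIDIA's implementation.

```python
def decode_e4m3(byte: int) -> float:
    """Decode an 8-bit E4M3 value: 1 sign, 4 exponent, 3 mantissa bits, bias 7.

    Assumes the OCP FP8 convention: no infinities, and the all-ones
    exponent+mantissa pattern (0x7F / 0xFF) encodes NaN.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# The largest finite E4M3 value is 0x7E = 0b0.1111.110 -> 1.75 * 2^8 = 448.0
print(decode_e4m3(0x7E))
```

The narrow exponent range (max magnitude 448) is why E4M3 is typically paired with per-tensor scaling factors, while E5M2 trades mantissa bits for dynamic range.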
Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel Corporation, 2023, Vol. 1 (Intel Corporation) - Details optimization techniques and instruction sets, including AVX512-VNNI instructions (like VPDPBUSD) for accelerating integer dot products on CPUs.
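To make the VPDPBUSD semantics referenced in this entry concrete, the following Python sketch models one 32-bit lane: a dot product of four unsigned 8-bit values with four signed 8-bit values, accumulated into a signed 32-bit destination with wraparound (the saturating variant is VPDPBUSDS). This is an illustrative model, not a binding to the actual intrinsic.

```python
def vpdpbusd_lane(acc: int, a_bytes: list[int], b_bytes: list[int]) -> int:
    """Model one 32-bit lane of VPDPBUSD.

    a_bytes: four unsigned 8-bit values (0..255)
    b_bytes: four signed 8-bit values (-128..127)
    Returns acc + dot(a, b), wrapped to a signed 32-bit integer
    as the hardware accumulator would.
    """
    assert len(a_bytes) == len(b_bytes) == 4
    total = acc + sum(a * b for a, b in zip(a_bytes, b_bytes))
    # Wrap to two's-complement signed 32-bit (VPDPBUSDS would saturate instead)
    total &= 0xFFFFFFFF
    return total - 0x100000000 if total >= 0x80000000 else total
```

Fusing four multiplies and three adds into one instruction per lane is what makes this family attractive for INT8 matrix-multiply inner loops on CPUs.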
MLIR: A Compiler Infrastructure for the End of Moore's Law, Chris Lattner, Vinay Pidatala, Mehdi Amini, and Albert Cohen, 2021, ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 43 (ACM), DOI: 10.1145/3475737 - Introduces the MLIR compiler infrastructure, which provides multi-level IRs and dialects suitable for representing and transforming high-level machine learning operations down to target-specific low-precision instructions.
Deep Learning Compilers: A Comprehensive Survey, Yongwei Zhao, Kaiqi Chen, Peng Li, Jiasheng Xu, Pengcheng Wang, and Shuaiwen Leon Song, 2023, ACM Computing Surveys, Vol. 55 (Association for Computing Machinery), DOI: 10.1145/3544547 - Provides a broad overview of deep learning compilers, including discussions on intermediate representations, optimization techniques, and hardware backends relevant to generating efficient low-precision kernels.