QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023, NeurIPS, DOI: 10.48550/arXiv.2305.14314 - Introduces NF4 quantization, a specific ultra-low precision format, and demonstrates its efficient implementation via the bitsandbytes library, directly addressing kernel optimization for novel formats.
NVIDIA TensorRT Documentation, NVIDIA Corporation, 2024 (NVIDIA Corporation) - Official documentation for NVIDIA's high-performance deep learning inference optimizer, detailing how it uses GPU hardware features and provides optimized kernels for various data types and quantization schemes.
CUTLASS: CUDA Templates for Linear Algebra Subroutines, NVIDIA Corporation, 2024 (NVIDIA Corporation) - Provides highly optimized C++ template kernels for various data types, including low-precision formats, which are critical for making full use of hardware accelerators such as NVIDIA Tensor Cores.