GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh. International Conference on Learning Representations (ICLR), 2023. DOI: 10.48550/arXiv.2210.17323 - This paper introduces GPTQ, a post-training quantization method for LLMs targeting low bit-widths such as INT4, which is relevant to the section's discussion of INT4 quantization and its performance implications.
AWQ: Activation-aware Weight Quantization for LLM Inference. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han. arXiv preprint arXiv:2306.00978, 2023. DOI: 10.48550/arXiv.2306.00978 - This paper presents AWQ, a weight-only quantization technique for LLMs that offers an alternative methodology to GPTQ for achieving efficient low-bit inference on target hardware.
NVIDIA TensorRT-LLM User Guide. NVIDIA Corporation, 2025. - This official documentation covers optimizing and deploying LLMs on NVIDIA GPUs with TensorRT-LLM, including kernel optimization, the supported quantization formats, and hardware-specific performance tuning.
Intel oneAPI Deep Neural Network Library (oneDNN) Documentation. Intel Corporation, 2024. - This documentation describes a highly optimized open-source deep learning library for Intel CPUs, explaining how it leverages instruction set extensions (such as AVX-512 VNNI) for efficient deep learning inference, including quantized operations.