AWQ: Activation-aware Weight Quantization for LLM Inference. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2023. arXiv preprint arXiv:2306.00978. DOI: 10.48550/arXiv.2306.00978 - Presents the AWQ algorithm, which protects the weight channels most important to activations during quantization, preserving accuracy at low bit-widths.
QLoRA: Efficient Finetuning of Quantized LLMs. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Introduces QLoRA and the NF4 (Normalized Float 4-bit) quantization datatype, which is fundamental to bitsandbytes' low-bit capabilities.
Hugging Face Transformers Documentation. Hugging Face. 2024 - Official documentation for the Hugging Face Transformers library, detailing its functionality, including model loading and quantization integration.
NVIDIA TensorRT-LLM Documentation. NVIDIA. 2024 - Official documentation for NVIDIA's toolkit for optimizing and deploying LLMs on NVIDIA GPUs, supporting various quantization formats.