QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Introduces QLoRA, a memory-efficient finetuning approach that backpropagates gradients through a frozen base model stored in the 4-bit NormalFloat (NF4) data type into trainable low-rank adapters (LoRA), using Double Quantization to further shrink the memory overhead of the quantization constants.
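A minimal configuration sketch, not taken from the paper's codebase, showing how a QLoRA-style setup (NF4 storage, double quantization, LoRA adapters on a frozen base) is commonly assembled with Hugging Face transformers, peft, and bitsandbytes; the model id, LoRA rank, and target modules below are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 storage with double quantization; matmuls run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for forward/backward
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable low-rank adapters; the 4-bit base model stays frozen.
lora_config = LoraConfig(
    r=16,                                   # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights receive gradients, so the memory cost is dominated by the 4-bit base weights plus activations, which is what makes finetuning large models feasible on a single GPU.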
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 2023. MLSys 2024. DOI: 10.48550/arXiv.2306.00978 - Proposes AWQ, an activation-aware weight quantization method that identifies salient weight channels from activation magnitudes and protects them via per-channel scaling rather than mixed precision, demonstrating strong accuracy for low-bit LLM inference.
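A simplified, self-contained sketch of the activation-aware scaling idea: channels with large average activation magnitude are scaled up before quantization so they lose less precision, and the inverse scale is folded back afterwards. It assumes plain per-output-channel symmetric uniform quantization rather than AWQ's group-wise scheme, and the function name, alpha value, and calibration statistics are illustrative, not the authors' implementation.

```python
import torch

def awq_style_quantize(weight: torch.Tensor,
                       act_scale: torch.Tensor,
                       n_bits: int = 4,
                       alpha: float = 0.5) -> torch.Tensor:
    """weight: [out_features, in_features]; act_scale: mean |activation| per input channel."""
    # Per-input-channel scale derived from activation statistics
    # (alpha is a knob that AWQ searches over on calibration data).
    s = act_scale.clamp(min=1e-5) ** alpha
    w_scaled = weight * s                       # boost salient input channels before quantizing

    # Simple symmetric per-output-channel uniform quantization of the scaled weights.
    qmax = 2 ** (n_bits - 1) - 1
    step = w_scaled.abs().amax(dim=1, keepdim=True) / qmax
    w_q = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax)

    # Dequantize and undo the channel scaling; at inference the 1/s factor can
    # instead be fused into the preceding operator so inputs arrive pre-divided by s.
    return (w_q * step) / s

# Usage: estimate act_scale from a few calibration batches, then quantize a layer's weight.
W = torch.randn(64, 128)
act_scale = torch.rand(128) + 0.1               # illustrative calibration statistics
W_q = awq_style_quantize(W, act_scale)
print("mean abs error:", (W - W_q).abs().mean().item())
```

The point of the scaling trick is that no weights need to be kept in higher precision: salience information from activations is encoded entirely in the per-channel scales, which keeps the kernel simple and hardware friendly.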