QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Presents a memory-efficient method for finetuning large language models on a single GPU using 4-bit NormalFloat (NF4) quantization, demonstrating a practical mixed-precision approach for LLMs in which weights are stored in 4-bit form while computations are carried out in a higher-precision data type.
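As an illustration of the mixed-precision recipe described above (4-bit NF4 weight storage with higher-precision compute), the following is a minimal sketch using the Hugging Face transformers integration of bitsandbytes; the model name is a placeholder and the configuration values are commonly used settings rather than code from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder model identifier; any causal LM on the Hugging Face Hub works the same way.
model_name = "meta-llama/Llama-2-7b-hf"

# QLoRA-style setup: weights are stored in 4-bit NormalFloat (NF4),
# while matrix multiplications are performed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 storage data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # higher-precision compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```

In the full QLoRA recipe, small LoRA adapter matrices in higher precision are then trained on top of the frozen 4-bit base weights.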
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 2023. arXiv preprint arXiv:2306.00978. DOI: 10.48550/arXiv.2306.00978 - Proposes an activation-aware weight quantization method that identifies salient weight channels from activation statistics and protects them through per-channel scaling, an idea closely tied to mixed-precision strategies that allocate precision according to activation sensitivity; a toy sketch follows.
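To make the activation-aware scaling idea concrete, here is a toy sketch only, not the paper's algorithm or kernels: the function name, the exponent alpha, and the tensor shapes are invented for illustration, and simple per-channel scaling plus round-to-nearest quantization stand in for AWQ's full scale search.

```python
import torch

def activation_aware_quantize(W, act_scale, n_bits=4, alpha=0.5):
    """Toy sketch of activation-aware weight quantization (hypothetical helper).

    W:         weight matrix, shape (out_features, in_features)
    act_scale: mean |activation| per input channel, shape (in_features,)
    alpha:     illustrative exponent controlling how strongly salient channels are scaled
    """
    # Salient input channels are those seeing large activations; scaling them up
    # before quantization reduces their relative quantization error.
    s = act_scale.clamp(min=1e-5) ** alpha
    W_scaled = W * s  # broadcast over the input-channel dimension

    # Per-output-channel symmetric round-to-nearest quantization to n_bits.
    qmax = 2 ** (n_bits - 1) - 1
    w_absmax = W_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    step = w_absmax / qmax
    W_q = torch.clamp(torch.round(W_scaled / step), -qmax - 1, qmax) * step

    # Undo the channel scaling so the result approximates the original W;
    # in a real kernel, 1/s would instead be folded into the preceding operation.
    return W_q / s

# Example usage with random tensors (shapes only, not real model weights):
W = torch.randn(64, 128)
act_scale = torch.rand(128) + 0.1
W_deq = activation_aware_quantize(W, act_scale)
print((W - W_deq).abs().mean())
```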