AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han. 2023. arXiv preprint arXiv:2306.00978. DOI: 10.48550/arXiv.2306.00978 - This work introduces an activation-aware weight quantization method built on the observation that protecting a small fraction of salient weights, identified by activation magnitude rather than weight magnitude, largely mitigates accuracy loss in low-bit weight-only quantization for LLMs.
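The core observation can be illustrated with a minimal NumPy sketch (not the authors' released implementation; the AWQ paper ultimately replaces mixed-precision storage with equivalent per-channel scaling): rank input channels by average activation magnitude, keep the top fraction in full precision, and round-to-nearest quantize the rest. The function names, keep_ratio value, and synthetic data below are illustrative assumptions.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest quantization with a per-output-row scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(w / scale) * scale

def salient_mixed_precision(w, act_stats, keep_ratio=0.01, n_bits=4):
    """w: [out, in] weights; act_stats: per-input-channel mean |activation|."""
    n_keep = max(1, int(keep_ratio * w.shape[1]))
    salient = np.argsort(-act_stats)[:n_keep]   # channels with largest activations
    w_q = quantize_rtn(w, n_bits)               # low-bit quantize everything
    w_q[:, salient] = w[:, salient]             # restore salient channels to full precision
    return w_q

# Synthetic example: quantization error shrinks when salient channels are protected.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 512)).astype(np.float32)
acts = np.abs(rng.normal(size=512)).astype(np.float32)
W_mixed = salient_mixed_precision(W, acts)
print("mean |W - W_mixed|:", np.abs(W - W_mixed).mean())
```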
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer. 2022. NeurIPS 2022. DOI: 10.48550/arXiv.2208.07339 - This research identifies emergent outlier features in LLM activations and addresses them with a mixed-precision decomposition: outlier dimensions are processed in 16-bit while the remaining matrix multiplication runs in 8-bit, preserving accuracy during quantization at scale.
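A minimal sketch of the mixed-precision decomposition idea, assuming a simple absmax 8-bit scheme rather than the bitsandbytes kernels: hidden dimensions whose activations exceed a magnitude threshold take a higher-precision path, while the remaining matrix multiplication runs in int8 with per-row and per-column scales. The threshold value and helper names are assumptions for illustration.

```python
import numpy as np

def int8_absmax(x, axis):
    """Absmax-quantize x to int8 along the given axis; return codes and scales."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(x / scale).astype(np.int8), scale

def mixed_precision_matmul(x, w, threshold=6.0):
    """x: [tokens, hidden], w: [hidden, out]; split outlier vs. regular hidden dims."""
    outlier = np.max(np.abs(x), axis=0) > threshold   # hidden dims with outlier activations
    regular = ~outlier
    # Higher-precision path for the few outlier dimensions.
    y_fp = x[:, outlier] @ w[outlier, :]
    # Int8 path: per-row (per-token) scales for x, per-column scales for w.
    xq, sx = int8_absmax(x[:, regular], axis=1)
    wq, sw = int8_absmax(w[regular, :], axis=0)
    y_int = (xq.astype(np.int32) @ wq.astype(np.int32)) * sx * sw
    return y_fp + y_int

# Synthetic example with a few injected outlier dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 512)).astype(np.float32)
X[:, :5] *= 20.0                                      # make the first dims outliers
W = rng.normal(size=(512, 256)).astype(np.float32)
err = np.abs(X @ W - mixed_precision_matmul(X, W)).mean()
print("mean abs error vs. FP32 matmul:", err)
```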