GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, ICLR 2023, DOI: 10.48550/arXiv.2210.17323 - This paper introduces GPTQ, an efficient and accurate post-training quantization method designed specifically for large language models. It demonstrates that PTQ can achieve significant compression with minimal accuracy loss in LLMs, making it highly relevant to the course's focus. A usage sketch follows the entry.
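For readers who want to try GPTQ without re-implementing the paper, the sketch below shows one common route: the Hugging Face transformers integration backed by optimum/auto-gptq. This is a minimal sketch under assumed package availability; the model name, bit width, and calibration dataset are illustrative choices, not values prescribed by the paper.

```python
# Assumed environment: pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model chosen purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ post-training quantization, calibrated on samples from the "c4" dataset
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs once at load time; no retraining or fine-tuning is needed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# The quantized model is then used exactly like the full-precision one
inputs = tokenizer("Post-training quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```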
Deep Learning with Low-Precision Quantization: A Review, Zechun Cai, Dong Huang, Zizheng Pan, Yunhe Wang, Kai Han, Wenshi Zhang, Errui Ding, Yiping Deng, Yubei Chen, and Xiangyu Zhang, 2020, Journal of Parallel and Distributed Computing, Vol. 144 (Elsevier), DOI: 10.1016/j.jpdc.2020.05.006 - This comprehensive review surveys low-precision quantization techniques for deep learning, covering both post-training quantization (PTQ) and quantization-aware training (QAT). It offers a broad perspective on their methodologies, advantages, and limitations; a small illustrative sketch follows the entry.
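As a minimal illustration of the basic building block both references discuss, the sketch below fake-quantizes a weight tensor with symmetric uniform quantization. The function name, bit widths, and per-tensor scaling are hypothetical choices made for clarity, not details taken from either paper.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization followed by dequantization ("fake quantization")."""
    qmax = 2 ** (num_bits - 1) - 1                    # e.g. 127 for 8-bit, 7 for 4-bit
    scale = w.abs().max() / qmax                      # one scale for the whole tensor (per-tensor PTQ)
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                                  # back to float so the error can be inspected

w = torch.randn(512, 512)
for bits in (8, 4, 2):
    err = (w - fake_quantize(w, bits)).abs().mean().item()
    print(f"{bits}-bit mean abs error: {err:.5f}")    # error grows as precision drops
```

In broad terms, PTQ applies a mapping like this after training, using only a small calibration set, while QAT simulates the rounding during training so the weights can adapt to it; the review compares these two regimes.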