A Survey of Quantization Methods for Efficient Neural Network Inference, Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer, 2021, arXiv preprint. DOI: 10.48550/arXiv.2103.13630 - A comprehensive survey of quantization techniques for neural networks, covering their motivations and the challenges faced by simpler approaches, including sensitivity to outliers.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, 2022, Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2208.07339 - Identifies and addresses the outliers in large language model activations that severely degrade accuracy during quantization, explaining why basic PTQ struggles with LLMs.