LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, NeurIPS 2022, DOI: 10.48550/arXiv.2208.07339 - This work addresses the challenge of quantizing Transformer-based large language models (LLMs) to 8-bit precision. It provides practical methods for quantizing Linear layers, using a mixed-precision decomposition that keeps activation outliers in 16-bit, and is directly relevant to this section's focus on LLM layers. A sketch of applying this method in practice appears below.
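As a brief illustration (not taken from the paper itself), one common way to apply LLM.int8() is through the bitsandbytes integration in Hugging Face transformers. This is a minimal sketch: the model name facebook/opt-1.3b and the threshold value are illustrative choices, and the exact configuration API varies across library versions.

```python
# Sketch: loading a causal LM with LLM.int8() quantization via the
# bitsandbytes integration in Hugging Face transformers.
# Requires a CUDA GPU plus the bitsandbytes and accelerate packages;
# API details may differ between versions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,       # quantize Linear weights to int8
    llm_int8_threshold=6.0,  # activation outliers above this magnitude stay in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",     # illustrative model choice
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
```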
Quantization for PyTorch Models, PyTorch Documentation, 2019 (PyTorch Foundation) - This official documentation details post-training quantization in PyTorch, covering how to apply it to layer types such as Linear and Embedding, and explaining concepts like calibration, observer usage, and the QNNPACK backend. It is an excellent resource for practical application; a sketch of the workflow it describes follows.
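The following is a minimal sketch of the eager-mode post-training static quantization flow those docs describe (observer insertion, calibration, conversion), with a one-line dynamic quantization alternative at the end. The toy model and random calibration data are placeholders; newer PyTorch releases expose the same APIs under torch.ao.quantization.

```python
# Sketch: eager-mode post-training quantization per the PyTorch docs.
# Toy model and random calibration data are placeholders.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # "qnnpack" on ARM
prepared = torch.quantization.prepare(model)  # inserts observers

# Calibration: run representative data so observers record activation ranges.
for _ in range(16):
    prepared(torch.randn(8, 128))

quantized = torch.quantization.convert(prepared)  # swap in quantized modules

# Dynamic quantization (int8 weights, activations quantized on the fly)
# is a one-liner alternative for Linear-heavy models:
dyn = torch.quantization.quantize_dynamic(
    TinyNet().eval(), {nn.Linear}, dtype=torch.qint8
)
```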