Quantization with Hugging Face Transformers and Accelerate
Hugging Face Transformers Documentation, Hugging Face, 2024 - Official documentation for the Hugging Face Transformers library, detailing its quantization features, including the BitsAndBytesConfig and integration with bitsandbytes for loading quantized models.
Hugging Face Accelerate Documentation, Hugging Face, 2024 - Official documentation for the Hugging Face Accelerate library, explaining its capabilities in simplifying distributed execution and automatic device placement, which is crucial for loading large quantized models with device_map='auto'.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, 2022, Advances in Neural Information Processing Systems (NeurIPS), Vol. 35, DOI: 10.48550/arXiv.2208.07339 - Presents the LLM.int8() quantization method, which is specifically designed to handle outlier features in large language models and is utilized for 8-bit quantization through parameters like llm_int8_threshold.
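The pieces described above fit together in a few lines: a BitsAndBytesConfig requests 8-bit (LLM.int8()) quantization, and device_map='auto' lets Accelerate place the quantized layers across available devices. A minimal sketch, assuming transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available (the checkpoint name is an illustrative choice, not one from the sources above):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Request 8-bit quantization via bitsandbytes; llm_int8_threshold is the
# outlier-feature threshold from the LLM.int8() paper (6.0 is the default).
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

if torch.cuda.is_available():
    # device_map="auto" hands layer placement to Accelerate, which is what
    # makes loading models larger than a single GPU's memory practical.
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-350m",  # example checkpoint; substitute your own
        quantization_config=quant_config,
        device_map="auto",
    )
```

Under this setup the linear layers are stored in int8, while outlier activation dimensions (those exceeding llm_int8_threshold) are computed in higher precision, as described in the LLM.int8() paper.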