Quantization, Hugging Face, 2024 - Provides guidance on applying quantization, including 8-bit and 4-bit methods, to models in the transformers library.
torch.compile, PyTorch Contributors, 2024 - Official documentation detailing how to use torch.compile to optimize PyTorch models for faster execution.
Text generation strategies, Hugging Face, 2024 - Explains the parameters and strategies for text generation in the transformers library, including the use_cache argument.
Open Neural Network Exchange (ONNX), ONNX Community, 2024 - The official website and specification for ONNX, an open format for representing machine learning models that enables model portability and optimized inference.