Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE), DOI: 10.1109/CVPR.2018.00696 - Foundational paper on INT8 integer-arithmetic-only inference. It introduces an affine quantization scheme and a simulated-quantization training procedure, now known as Quantization-Aware Training (QAT), and compares it against Post-Training Quantization (PTQ) as a baseline. The approach is widely adopted in TensorFlow Lite.
NVIDIA TensorRT Developer Guide, NVIDIA, 2024 (NVIDIA) - Official documentation for NVIDIA TensorRT, providing details on how to use its capabilities for model optimization, including INT8 and FP16 quantization for GPU deployment.
Quantization with PyTorch, PyTorch Documentation, 2024 (PyTorch) - The official guide for implementing quantization techniques (PTQ and QAT, in both Eager Mode and FX Graph Mode) using the torch.quantization module (now located at torch.ao.quantization), demonstrating framework-specific workflows.