Automatic Mixed Precision for Deep Learning, NVIDIA Developer Documentation, 2023 (NVIDIA) - Explains the principles and benefits of using mixed precision (FP16 and FP32) in deep learning, crucial for understanding memory and performance optimizations on GPUs.
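The core problem that motivates mixed precision's loss scaling is FP16 gradient underflow: values below FP16's smallest subnormal (about 6e-8) round to zero. A minimal sketch of that effect, using only Python's `struct` half-precision round-trip (the variable names and the scale of 1024 are illustrative, not from the NVIDIA documentation):

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a Python float through IEEE 754 half precision
    # ('e' is the half-float format code in the struct module).
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8  # below FP16's smallest subnormal (~5.96e-8)

# Stored directly in FP16, the gradient underflows to zero.
print(to_fp16(tiny_grad))  # -> 0.0

# Loss scaling: multiply before the FP16 step, divide back in FP32.
scale = 1024.0
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)  # close to the original 1e-8
```

This is why AMP frameworks scale the loss before the backward pass and unscale gradients in FP32 before the optimizer update.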
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018 (IEEE/CVF Conference on Computer Vision and Pattern Recognition), DOI: 10.1109/CVPR.2018.00696 - A foundational academic paper introducing techniques for quantizing neural network weights and activations to 8-bit integers (INT8) for efficient inference, directly relevant to memory reduction.
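The paper's central scheme maps real values to integers through an affine transform with a scale and a zero point. A minimal sketch of that mapping for INT8 (the function names and example values here are illustrative, not taken from the paper's reference implementation):

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Affine quantization: real value -> int8 code, clamped to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Inverse map: int8 code -> approximate real value."""
    return scale * (q - zero_point)

# Round-tripping a value loses at most half a quantization step (scale / 2).
q = quantize(0.5, scale=0.01, zero_point=0)
print(q, dequantize(q, scale=0.01, zero_point=0))  # -> 50 0.5
```

The memory payoff is direct: each INT8 code occupies one byte instead of the four bytes of an FP32 value, a 4x reduction for stored weights.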
VRAM estimation for large models, Hugging Face documentation contributors, 2023 (Hugging Face) - A practical guide from a leading LLM platform detailing how to estimate VRAM requirements for large language models, including considerations for different data types and additional memory overheads.
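The baseline estimate such guides start from is parameter count times bytes per element, inflated by an overhead factor for activations, CUDA context, and fragmentation. A rough sketch under those assumptions (the 20% overhead factor and the helper name are illustrative choices, not figures from the Hugging Face documentation):

```python
# Bytes per element for common parameter data types.
BYTES_PER_DTYPE = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_weight_vram_gb(n_params: float, dtype: str, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for holding model weights, in GB.

    overhead=1.2 adds ~20% for runtime allocations on top of the
    raw weight storage; inference-only, no optimizer state.
    """
    return n_params * BYTES_PER_DTYPE[dtype] * overhead / 1e9

# A 7B-parameter model in FP16: 7e9 * 2 bytes * 1.2 ≈ 16.8 GB.
print(round(estimate_weight_vram_gb(7e9, "fp16"), 1))  # -> 16.8
```

Training raises this substantially, since gradients and optimizer state (e.g. two FP32 moments per parameter for Adam) must also fit in VRAM.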