QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Introduces the 4-bit NormalFloat (NF4) data type and an efficient approach for finetuning quantized LLMs, which is crucial for efficient deployment and directly relevant to mixing different numerical formats.
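For orientation, here is a minimal sketch of how NF4 quantization from this paper is typically enabled in practice, assuming the Hugging Face transformers + bitsandbytes integration; the base model name is an illustrative placeholder, not a recommendation from the paper.

```python
# Minimal sketch: loading a base model with 4-bit NF4 weights for QLoRA-style
# finetuning, via the transformers + bitsandbytes integration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # hypothetical choice of base model
    quantization_config=bnb_config,
)
# In the QLoRA recipe, trainable LoRA adapters (e.g. via the peft library)
# would then be attached on top of the frozen 4-bit base model.
```

Note the mixed-format design this enables: weights are stored in NF4 but computation runs in bfloat16, which is why the paper is relevant to combining numerical formats.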
A Survey on Quantization of Neural Networks, Jiya Ren, Jingjing Li, Hongliang Li, Gang Li, Meng Wang, and Jiancheng Lv, 2020. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43. DOI: 10.1109/TPAMI.2020.3006096 - Provides a comprehensive overview of neural network quantization techniques, including mixed-precision approaches, offering broad context for the topic.
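As background for the techniques such surveys cover, the sketch below shows symmetric uniform quantization, the baseline scheme most quantization work builds on; the function names and per-tensor scaling choice are illustrative assumptions, not taken from the survey.

```python
# Minimal sketch: symmetric uniform quantization with a single per-tensor scale.
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    """Map float values to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax                 # one scale per tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_symmetric(weights, num_bits=8)
print(np.abs(weights - dequantize(q, scale)).max())  # max quantization error

# Mixed-precision methods extend this idea by choosing num_bits per layer
# (or per channel), trading accuracy against memory and compute.
```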