QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023, arXiv preprint arXiv:2305.14314, DOI: 10.48550/arXiv.2305.14314 - Introduces QLoRA, a memory-efficient finetuning approach that uses 4-bit quantization and paged optimizers, directly relevant to the memory optimization and performance characteristics discussed.
NVIDIA Nsight Systems Documentation, NVIDIA Corporation, 2024 - Official guide for Nsight Systems, a system-wide profiler that produces detailed CPU-GPU activity timelines, essential for identifying deep performance bottlenecks.
LoRA: Low-Rank Adaptation of Large Language Models, Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2021, International Conference on Learning Representations (ICLR 2022), DOI: 10.48550/arXiv.2106.09685 - Foundational paper introducing Low-Rank Adaptation (LoRA), providing the basis for understanding the performance and memory characteristics of PEFT methods.