FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2205.14135 - Presents the FlashAttention algorithm, which speeds up attention and reduces its memory footprint by reordering the computation into tiles and minimizing memory I/O between GPU high-bandwidth memory and on-chip SRAM, a bottleneck that is particularly relevant for LLMs.
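A minimal sketch of the idea in practice, assuming PyTorch 2.x on a CUDA GPU (tensor shapes are illustrative): `torch.nn.functional.scaled_dot_product_attention` can dispatch to a FlashAttention-style fused kernel, so the full seq_len × seq_len score matrix is never materialized in GPU memory. This is not the paper's reference implementation, only a convenient way to exercise the technique.

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention; when the fused kernel is selected, the attention scores
# are computed tile by tile instead of as one large seq_len x seq_len matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```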
Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. SOSP 2023. DOI: 10.48550/arXiv.2309.06180 - Introduces the vLLM serving engine and PagedAttention, a memory management technique that stores the KV cache in fixed-size blocks, reducing memory fragmentation and enabling higher-throughput LLM inference.
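A minimal usage sketch, assuming the vllm package is installed on a CUDA GPU (the model name is illustrative): the engine applies PagedAttention internally, so no extra configuration is needed to benefit from the block-based KV cache.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")          # illustrative model choice
params = SamplingParams(temperature=0.8, max_tokens=64)

# The KV cache for each request is allocated in fixed-size blocks on demand.
outputs = llm.generate(["The key idea behind PagedAttention is"], params)
for out in outputs:
    print(out.prompt, out.outputs[0].text)
```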
QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Details the QLoRA method, which finetunes low-rank adapters on top of a frozen base model quantized to the 4-bit NormalFloat (NF4) data type; the quantized computation relies on optimized CUDA kernels, typically provided by the bitsandbytes library.
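A minimal sketch of loading a base model with the NF4 quantization that QLoRA builds on, assuming the transformers and bitsandbytes packages and a CUDA GPU; the model name is a hypothetical example, and LoRA adapter setup is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # hypothetical base model
    quantization_config=bnb_config,
    device_map="auto",
)
```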
NVIDIA TensorRT Documentation, NVIDIA Corporation, 2024 - Documents the optimization techniques used in TensorRT, including kernel fusion, graph optimizations, and hardware-specific kernel tuning, which are applied to quantized models for efficient deployment.
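A minimal sketch of building an FP16 TensorRT engine from an ONNX export, assuming a TensorRT 8.x-style Python API; the file paths are placeholders. Kernel fusion, graph optimization, and tactic selection happen inside the engine build step.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder path to an exported model.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)    # allow reduced-precision kernels

serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```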