Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. Proceedings of the ACM Symposium on Operating Systems Principles (SOSP 2023). DOI: 10.48550/arXiv.2309.06180 - Introduces PagedAttention, a memory-management technique that stores the KV cache in fixed-size, non-contiguous blocks to reduce fragmentation, and presents vLLM, a high-throughput serving system built on it.
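For orientation, a minimal Python sketch of the block-table idea behind PagedAttention follows. It is an illustration under simplified assumptions, not vLLM's actual implementation: the KV cache is carved into fixed-size physical blocks, and each sequence's block table maps its logical blocks to physical blocks allocated on demand rather than pre-reserved for the maximum sequence length. The class name, method names, and block size below are all hypothetical.

```python
# Illustrative sketch of the PagedAttention block-table idea (not vLLM's code).
BLOCK_SIZE = 16  # tokens per KV cache block (hypothetical value)

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, seq_len: int) -> int:
        """Ensure the sequence has a physical block for position seq_len
        and return that block's id."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 0:  # current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks; a sequence must be preempted")
            table.append(self.free_blocks.pop())  # allocate one block on demand
        return table[seq_len // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```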
A Survey of Large Language Model Acceleration, Lianmin Zheng, Ying Sheng, et al., 2024. arXiv preprint. DOI: 10.48550/arXiv.2312.15166 - Offers a comprehensive review of techniques for accelerating large language models, addressing memory optimization, computational efficiency, and system design, pertinent to serving challenges.
Large Language Model Inference: The Cost of Waiting, Quentin Lhoest, Lysandre Neis, 2023. Hugging Face Blog - Examines the challenges of LLM inference, focusing on the trade-off between latency and throughput, and clarifies concepts such as continuous batching and efficient KV cache management; a sketch of continuous batching follows.
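The continuous (iteration-level) batching idea mentioned in the entry above can be sketched roughly as follows. This is a toy scheduler under assumed names (Request, decode_one_token, serve), not the blog's or any library's API; the point is that requests join and leave the running batch after every decoding step instead of waiting for a whole batch to finish, which keeps GPU slots occupied and improves throughput without stalling new arrivals.

```python
# Toy illustration of continuous batching; the model call is a stand-in.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list[str] = field(default_factory=list)

def decode_one_token(req: Request) -> str:
    return "<tok>"  # stand-in for a real model forward pass

def serve(waiting: deque, max_batch_size: int = 8) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new requests whenever slots free up (iteration-level scheduling).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decoding step for every running request.
        for req in running:
            req.generated.append(decode_one_token(req))
        # Retire finished requests immediately so their slots are reused.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]
```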