Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 2023). DOI: 10.48550/arXiv.2309.06180 - Introduces PagedAttention, a memory-optimization technique central to KV-cache management in LLM serving, and presents vLLM, an efficient serving system built on it.
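The core idea behind PagedAttention can be sketched roughly as follows: the KV cache is divided into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than as one contiguous region. This is a minimal illustrative sketch under assumed names (`PagedKVCache`, `BLOCK_SIZE`), not vLLM's actual implementation or API.

```python
# Illustrative sketch of a paged KV-cache allocator.
# All names and structure are assumptions, not vLLM's real code.

BLOCK_SIZE = 16  # tokens stored per KV-cache block


class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Map a new token position to (physical block, offset), allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        if logical_block == len(table):  # existing blocks are full
            table.append(self.free_blocks.pop())
        return table[logical_block], position % BLOCK_SIZE

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the free pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


cache = PagedKVCache(num_physical_blocks=8)
for pos in range(20):  # a 20-token sequence needs ceil(20/16) = 2 blocks
    block, offset = cache.append_token("seq0", pos)
print(len(cache.block_tables["seq0"]))  # 2 blocks allocated
cache.free_sequence("seq0")
print(len(cache.free_blocks))  # all 8 blocks free again
```

Because blocks are allocated per token batch rather than reserved up front for the maximum sequence length, many more concurrent sequences fit in the same GPU memory, which is the main throughput win the paper reports.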
A Survey of Large Language Model Acceleration, Lianmin Zheng, Ying Sheng, Hanwen Chang, Wei-Ming Chen, Xiangru Lian, Zhaoyang Zhang, Zhiqiang Xie, Puxin Xu, Yiyuan Dong, Renjie Liu, Xingyao Chen, Hao Zhang, Kaiwen Zhang, Zhuohan Li, Zixuan Wu, Siyuan Zhuang, Joseph Gonzalez, Yi Wu, Michael Mahoney, Archita Sharma, Fan Lai, Yinghui Li, Junjie Liu, Chris Van Durme, Guangxuan Song, Shangguang Wang, Wen-mei Hwu, Yonghong Yan, Zhi Yang, Zhenglian Wu, Yuandong Tian, Zhiruo Wang, Haotian Tang, Hantian Ding, Dawn Song, Michael I. Jordan, 2024. arXiv preprint. DOI: 10.48550/arXiv.2312.15166 - A comprehensive review of techniques for accelerating large language models, covering memory optimization, computational efficiency, and system design, all relevant to serving challenges.