Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30, DOI: 10.48550/arXiv.1706.03762 - The original paper introducing the Transformer architecture and the self-attention mechanism, which is the foundation for KV caching.
Accelerate Inference, Hugging Face, 2024 - Official documentation providing practical guidance on optimizing Transformer inference, including how KV caching is implemented and used within the Hugging Face ecosystem.
Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023, SOSP 2023, DOI: 10.48550/arXiv.2309.06180 - Addresses the memory management challenges of KV caching in large language model serving and introduces PagedAttention, the optimization technique underlying the vLLM serving system.
Attention Mechanisms and Transformers, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, 2023 (Cambridge University Press) - A chapter from Dive into Deep Learning, an open-source deep learning textbook, that clearly explains the Transformer architecture, self-attention, and related concepts in an educational format.