Python toolkit for building production-ready LLM applications. Modular utilities for prompts, RAG, agents, structured outputs, and multi-provider support.
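Since the source gives no concrete API, here is a minimal, purely hypothetical sketch of how such a toolkit's pieces (a provider abstraction, a prompt, a structured JSON output) might compose; every name below is illustrative, not the toolkit's actual interface:

```python
# Hypothetical sketch only: a provider-agnostic completion call with a
# structured (JSON) output. All names are illustrative, not a real API.
import json
from dataclasses import dataclass
from typing import Protocol


class Provider(Protocol):
    """Minimal interface each LLM backend would implement."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stand-in backend so the sketch runs without network access."""

    name: str

    def complete(self, prompt: str) -> str:
        # A real provider adapter would call its vendor API here; we return
        # canned JSON to exercise the structured-output parsing path.
        return json.dumps({"provider": self.name, "answer": "42"})


def structured_complete(provider: Provider, prompt: str) -> dict:
    """Run a prompt and parse the response as JSON (the structured output)."""
    return json.loads(provider.complete(prompt))


if __name__ == "__main__":
    # Swapping providers changes only the backend, not the calling code.
    for backend in (EchoProvider("provider-a"), EchoProvider("provider-b")):
        print(structured_complete(backend, "What is 6 * 7?"))
```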
Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. arXiv preprint arXiv:2309.06180, DOI: 10.48550/arXiv.2309.06180 - Introduces PagedAttention (the technique behind vLLM), a memory-management scheme for the KV cache that significantly improves the efficiency of continuous batching in LLM inference.
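To make the core idea concrete, the following is a toy sketch of the block-table bookkeeping PagedAttention describes: the KV cache is carved into fixed-size blocks, and each sequence maps its logical token positions to physical blocks on demand. The names (`BLOCK_SIZE`, `BlockAllocator`, `Sequence`) and the block size are assumptions for illustration, not vLLM's implementation:

```python
# Toy model of paged KV-cache bookkeeping; not vLLM's actual code.

BLOCK_SIZE = 16  # assumed tokens per KV-cache block


class BlockAllocator:
    """Hands out and reclaims physical cache blocks from a fixed pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; preempt or queue request")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's block table as it generates tokens."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one fills up,
        # so waste is bounded by one partial block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Freed blocks are immediately reusable by other sequences.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4)
    seq = Sequence(alloc)
    for _ in range(40):  # 40 tokens -> ceil(40 / 16) = 3 blocks
        seq.append_token()
    print(seq.block_table, "free:", len(alloc.free_blocks))
    seq.release()
```

Because blocks need not be contiguous, sequences can grow, finish, and release memory independently, which is what lets continuous batching keep the cache densely packed.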
Text Generation Inference: A Production-Ready Framework for LLM Serving, Olivier Dehaene, Félix Marty, João Gante, Quentin Lhoest, Victor Sanh, et al., 2023 (Hugging Face blog) - Details the architecture and features of Hugging Face's Text Generation Inference (TGI), including its optimized continuous-batching implementation for production workloads.
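As a rough illustration of the continuous (iteration-level) batching both systems rely on, the loop below admits waiting requests and retires finished ones between every decode step, rather than waiting for a whole batch to drain. This is a simplified sketch under assumed names (`MAX_BATCH`, `decode_step`), not TGI's actual scheduler:

```python
# Toy continuous-batching loop; batch membership changes every iteration.
import random
from collections import deque

MAX_BATCH = 4  # assumed cap on concurrently decoding sequences


def decode_step(active: list) -> None:
    """Generate one token for every sequence in the running batch."""
    for seq in active:
        seq["generated"] += 1


def serve(requests: deque) -> None:
    active = []
    step = 0
    while requests or active:
        # Admit waiting requests into any free batch slots.
        while requests and len(active) < MAX_BATCH:
            active.append(requests.popleft())
        decode_step(active)
        # Retire sequences that hit their target length; their slots are
        # reusable on the very next iteration instead of at batch end.
        for seq in [s for s in active if s["generated"] >= s["target"]]:
            active.remove(seq)
            print(f"step {step}: finished request {seq['id']}")
        step += 1


if __name__ == "__main__":
    random.seed(0)
    queue = deque(
        {"id": i, "generated": 0, "target": random.randint(2, 8)}
        for i in range(8)
    )
    serve(queue)
```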