A Python toolkit for building production-ready LLM applications, providing modular utilities for prompts, retrieval-augmented generation (RAG), agents, structured outputs, and multi-provider support.
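For orientation, here is a minimal, self-contained sketch of the provider-agnostic client pattern such a toolkit might expose. The `Provider`, `EchoProvider`, and `Client` names are hypothetical placeholders for illustration, not this toolkit's actual API; a stub provider is used so the sketch runs offline.

```python
# Hypothetical sketch of a multi-provider abstraction -- all names here
# are illustrative placeholders, not this toolkit's actual API.
from typing import Protocol


class Provider(Protocol):
    """Common interface every backing LLM provider must satisfy."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stub provider so the sketch runs without network access or keys."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class Client:
    """Routes requests to whichever provider it was configured with,
    so swapping providers is a one-line change at construction time."""

    def __init__(self, provider: Provider) -> None:
        self.provider = provider

    def generate(self, prompt: str) -> str:
        return self.provider.complete(prompt)


client = Client(EchoProvider())
print(client.generate("Summarize attention in one line."))
```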
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - Establishes the Transformer architecture and the self-attention mechanism, the foundation of modern LLMs and the source of the memory and computational costs of autoregressive decoding.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). DOI: 10.48550/arXiv.2205.14135 - Presents an IO-aware exact attention algorithm that significantly reduces memory traffic, addressing a key bottleneck in autoregressive Transformer decoding.
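For context on both entries, here is a minimal sketch of standard scaled dot-product attention as defined in Vaswani et al.; the full n-by-n score matrix it materializes is exactly the memory traffic that FlashAttention's tiled, IO-aware kernel avoids. The use of NumPy and the function name are illustrative assumptions; this shows the reference computation, not the FlashAttention algorithm itself.

```python
# Naive scaled dot-product attention (Vaswani et al., 2017).
# Materializing the full (n, n) score matrix is the memory bottleneck
# that FlashAttention's tiled kernel avoids; this is the reference
# computation, not the FlashAttention algorithm.
import numpy as np


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k, v: (n, d) arrays for a single attention head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (n, n): quadratic in sequence length
    # Numerically stable row-wise softmax over the scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n, d)


rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
print(attention(q, k, v).shape)  # (8, 16)
```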