Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - Foundational paper introducing the Transformer architecture, which underlies modern LLMs and the activations they produce.
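The core operation of the paper is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. A minimal NumPy sketch of that formula follows; the shapes and variable names are illustrative, not taken from the paper's reference code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017):
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (..., seq_q, d_v)

# Toy usage: seq_len=4, d_k=d_v=8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```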
Hardware Requirements and Memory Considerations for Transformers. Hugging Face. 2024. - Official documentation offering practical guidance on memory usage and optimization when running Transformer models, including the effects of batch size and hardware requirements.
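As a back-of-the-envelope illustration of the kind of estimate such documentation discusses, the sketch below sums weight memory with activation memory that grows with batch size and sequence length. The formula and constants are common rules of thumb chosen here as assumptions, not figures quoted from the Hugging Face docs.

```python
def estimate_inference_vram_gib(n_params, bytes_per_param=2,
                                batch_size=1, seq_len=2048,
                                hidden_size=4096, n_layers=32,
                                overhead=1.2):
    """Rough VRAM estimate for Transformer inference.

    Assumptions (illustrative only):
      - weights dominate: n_params * bytes_per_param (fp16 -> 2 bytes)
      - activations scale with batch_size * seq_len * hidden_size * n_layers
      - a fudge factor covers KV cache and framework overhead
    """
    weights = n_params * bytes_per_param
    activations = batch_size * seq_len * hidden_size * n_layers * bytes_per_param
    return (weights + activations) * overhead / 1024**3

# A 7B-parameter model in fp16 at batch size 1:
print(f"{estimate_inference_vram_gib(7e9):.1f} GiB")  # ~16 GiB under these assumptions
```

Doubling batch_size or seq_len grows only the activation term in this model, which is why the documentation treats them as the main inference-time tuning knobs once the weights are fixed.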
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. 2022. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). DOI: 10.48550/arXiv.2205.14135 - Introduces an optimized attention algorithm that sharply reduces HBM (GPU memory) reads/writes and the activation memory footprint, demonstrating how algorithm design affects VRAM usage.
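FlashAttention itself is a fused CUDA kernel, but a practical way to benefit from it is PyTorch's fused attention entry point (available since PyTorch 2.0), which can dispatch to a FlashAttention backend on supported GPUs. The sketch below shows that entry point, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

# Batched multi-head shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# On CUDA with supported dtypes and shapes, PyTorch may dispatch this call
# to a FlashAttention kernel, which tiles the computation so the full
# seq_len x seq_len attention matrix is never materialized in HBM.
# On CPU (as here) it falls back to a standard math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```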