While the size of the model parameters gives us a good starting point for estimating memory needs, it's not the whole story. When an LLM processes your input (like a question) and generates an output (like an answer), it performs a vast number of calculations step-by-step through its layers. The intermediate results of these calculations are called activations.
Think of it like solving a complex math problem on a whiteboard. The model parameters are like the learned formulas and constants you have written down permanently. The activations are like the temporary numbers and results you jot down in the working space as you calculate each step towards the final answer. Just as you need space on the whiteboard for these temporary notes, the GPU needs memory (VRAM) to store these activations while it's working.
Each layer in the neural network takes inputs (either the original input or the activations from the previous layer), processes them using its parameters, and produces new activations as output for the next layer. These activations must be kept in memory until they are no longer needed for subsequent calculations within that specific processing step (often called a "forward pass").
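To make this concrete, here is a minimal PyTorch sketch using a toy two-layer network; the layer sizes and input shape are hypothetical values chosen only to illustrate that each layer's output tensor occupies VRAM while the forward pass continues.

```python
import torch
import torch.nn as nn

hidden_size = 4096                        # assumed hidden dimension, for illustration
layer1 = nn.Linear(hidden_size, hidden_size)
layer2 = nn.Linear(hidden_size, hidden_size)

x = torch.randn(1, 128, hidden_size)      # (batch, sequence length, hidden)

a1 = layer1(x)    # activations from layer 1, held in memory until layer 2 consumes them
a2 = layer2(a1)   # activations from layer 2, the input to whatever layer comes next

# Each activation tensor takes memory on top of the parameters themselves.
print(f"{a1.element_size() * a1.nelement() / 1e6:.1f} MB for one activation tensor")
```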
Crucially, the amount of VRAM needed for activations isn't fixed like the model parameters. It is dynamic and depends heavily on the specifics of the task being performed, including:

- Sequence length (context length): longer prompts and longer generated outputs mean more intermediate values (and a larger key/value cache) to keep in memory.
- Batch size: processing several inputs at once multiplies the activation memory roughly by the number of inputs in the batch.
- Model architecture: the hidden size and number of layers determine how large each layer's activations are.

The sketch below shows how quickly this can grow with sequence length.
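The following rough sketch estimates only the key/value cache for an assumed 7B-class model (32 layers, hidden size 4096, FP16 values). `kv_cache_bytes` is a hypothetical helper written for illustration, not a library function, and it ignores other per-layer intermediates, so treat the numbers as order-of-magnitude estimates.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    # Keys and values: 2 tensors per layer, each holding roughly
    # batch_size * seq_len * hidden_size values.
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_value

# Hypothetical 7B-class dimensions: 32 layers, hidden size 4096, FP16 (2 bytes per value).
for seq_len in (2_048, 8_192, 32_768):
    gb = kv_cache_bytes(num_layers=32, hidden_size=4096,
                        seq_len=seq_len, batch_size=1) / 1e9
    print(f"seq_len={seq_len:>6}: ~{gb:.1f} GB of KV-cache memory")
```

With these assumed dimensions, a single request at a 2K context needs roughly 1 GB of cache, while a 32K context needs over 17 GB, all on top of the parameters themselves.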
So, when estimating the total VRAM required, you need to consider both the static model parameters and the dynamic activations. A more complete (though still simplified) view looks like this:
$$\text{Total VRAM} \approx \text{Memory for Parameters} + \text{Memory for Activations} + \text{Software Overhead}$$
The software overhead accounts for the memory used by the operating system, the GPU driver, and the specific AI framework (like PyTorch or TensorFlow) running the model.
A conceptual breakdown showing that total VRAM usage includes memory for model parameters, activations, and software overhead. The relative sizes are illustrative.
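As a rough sketch of the formula above, the hypothetical helper below simply adds the three components together. The activation and overhead figures passed in are illustrative assumptions, not measurements, and the parameter term uses the familiar "1 billion parameters at N bytes each is about N GB" approximation.

```python
def estimate_total_vram_gb(params_billion, bytes_per_param, activation_gb, overhead_gb=1.5):
    # Mirrors the formula above: parameters + activations + software overhead.
    param_gb = params_billion * bytes_per_param   # 1B params at N bytes each ≈ N GB
    return param_gb + activation_gb + overhead_gb

# Example: a 7B model in FP16 (2 bytes per parameter), with an assumed
# 2 GB for activations and 1.5 GB of framework/driver overhead.
print(estimate_total_vram_gb(7, 2, activation_gb=2.0))   # -> 17.5 (GB)
```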
Precisely calculating the activation memory in advance can be tricky, since it depends on runtime factors, but it is important to account for it. The rule of thumb based purely on parameter count provides only a minimum VRAM estimate. You must always budget additional space for activations and overhead, especially if you plan to use long context lengths or process inputs in batches. This explains why a model that theoretically fits based on parameter size alone can still cause "out-of-memory" errors in practice.