Memory Optimization Techniques

References

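Training Deep Nets with Sublinear Memory Cost, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016, arXiv preprint arXiv:1604.06174. DOI: 10.48550/arXiv.1604.06174 - A foundational paper that introduces gradient checkpointing, which reduces memory consumption during deep neural network training by recomputing intermediate activations in the backward pass instead of storing them.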
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, ICLR 2018. DOI: 10.48550/arXiv.1710.03740 - The seminal paper detailing the mixed-precision training methodology, including the use of 16-bit floating-point numbers and loss scaling to speed up training and reduce memory usage.
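Automatic Mixed Precision package - torch.cuda.amp, PyTorch Contributors, 2024 (PyTorch) - The official PyTorch documentation, with a practical guide and examples for implementing mixed-precision training using the torch.cuda.amp package, including details of loss scaling with GradScaler.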
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, Noam Shazeer, Mitchell Stern, 2018, Proceedings of the 35th International Conference on Machine Learning (ICML), Vol. 80 (PMLR). DOI: 10.5555/3295304.3295415 - Introduces Adafactor, an adaptive learning rate optimizer designed to significantly reduce the memory consumed by optimizer state, making it suitable for training very large models.
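8-bit Optimizers via Block-wise Quantization, Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, 2022, International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2110.02861 - Proposes a method for quantizing optimizer state to 8-bit precision, substantially reducing the memory footprint of optimizers such as Adam while maintaining their performance.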