DeepSpeed: System Optimizations for Large-Scale Model Training, Samyam Rajbhandari, Cong Guo, Jeff Rasley, Shaden Smith, Yuxiong He, 2020, OSDI '20: 14th USENIX Symposium on Operating Systems Design and Implementation (USENIX Association), DOI: 10.5555/3446702.3446726 - Details system optimizations for large-scale model training, including strategies for managing model states and fault tolerance in distributed environments.
Best practices for machine learning operations (MLOps) on Google Cloud, Google Cloud, 2024 (Google Cloud) - Outlines best practices for MLOps on Google Cloud, including guidance on managing and storing model artifacts like checkpoints for efficient and reliable training workflows.