DeepSpeed: Training large models with ease, speed and efficiency, Microsoft, 2024 (Microsoft) - Provides official documentation and tutorials on implementing efficient checkpointing and fault tolerance with the DeepSpeed framework; both capabilities are essential for large-scale distributed training of deep learning models.
FSDP: Fully Sharded Data Parallel, Wei Feng, Will Constable, Yifan Mao, 2024 (PyTorch Foundation) - Offers official PyTorch guidance on FSDP, including how to save and load sharded checkpoints, which is critical when a model's full state is too large to gather on any single device.
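The core idea behind the sharded checkpoints that the FSDP guide describes, each rank persisting only its own slice of the state and a loader reassembling the whole, can be sketched in plain Python. This is an illustrative schematic only, not the real FSDP API (which uses `torch.distributed.checkpoint`); all function names and the round-robin sharding scheme here are hypothetical.

```python
import json
import os


def save_sharded(state, rank, world_size, ckpt_dir):
    """Persist only this rank's shard of a flat parameter list.

    Illustrative stand-in for sharded checkpointing: real FSDP saves
    each rank's flattened parameter shard via torch.distributed.checkpoint.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    shard = state[rank::world_size]  # round-robin sharding, for illustration
    with open(os.path.join(ckpt_dir, f"shard_{rank}.json"), "w") as f:
        json.dump({"rank": rank, "world_size": world_size, "shard": shard}, f)


def load_sharded(ckpt_dir):
    """Reassemble the full state from every rank's shard file."""
    shards = []
    for name in sorted(os.listdir(ckpt_dir)):
        with open(os.path.join(ckpt_dir, name)) as f:
            shards.append(json.load(f))
    world_size = shards[0]["world_size"]
    full = [None] * sum(len(s["shard"]) for s in shards)
    for s in shards:
        # Interleave each shard back into its round-robin positions.
        full[s["rank"]::world_size] = s["shard"]
    return full
```

The design point this illustrates is why sharded checkpoints scale: no single process ever materializes the full state, at the cost of needing all shard files present (or a resharding step) at load time.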
Machine Learning Design Patterns, Valliappa Lakshmanan, Sara Robinson, and Michael Munn, 2020 (O'Reilly Media) - The book's 'Checkpointing' design pattern chapter provides an architectural perspective on implementing checkpointing for resilience and manageability in machine learning pipelines.
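The 'Checkpointing' pattern the book describes, periodically persisting full training state so a job can resume after a failure, can be sketched with the standard library alone. This is a minimal sketch of the pattern's shape, not code from the book; the function name, the JSON state format, and the toy "training" step are all illustrative assumptions.

```python
import json
import os


def train_with_checkpoints(ckpt_path, total_epochs):
    """Run a toy training loop that checkpoints after every epoch.

    Sketch of the Checkpointing design pattern: resume from saved
    state if present, and persist state atomically each epoch.
    """
    # Resume from an existing checkpoint, otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0, "weights": [0.0, 0.0]}

    while state["epoch"] < total_epochs:
        # Stand-in for one real epoch of training.
        state["weights"] = [w + 0.1 for w in state["weights"]]
        state["epoch"] += 1
        # Write to a temp file then rename, so a crash mid-write can
        # never leave a truncated checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state
```

Calling the function again after an interruption picks up from the last completed epoch rather than epoch zero, which is the resilience property the pattern is after; the atomic write-then-rename is a common safeguard against corrupt checkpoints.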
SafeTensors: A more secure and faster way to store and load tensors, Hugging Face, 2024 (Hugging Face) - Explains the SafeTensors format, which addresses security and efficiency concerns in saving and loading large model states, offering a robust alternative to pickle-based serialization such as torch.save.
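Part of what makes SafeTensors robust is how simple the on-disk layout is: an 8-byte little-endian header length, a JSON header mapping tensor names to their dtype, shape, and byte offsets, then the raw tensor bytes. The sketch below writes and reads that layout in pure Python to show the format's structure; in practice you would use the `safetensors` package itself, and the function names here are hypothetical.

```python
import json
import struct


def save_safetensors(path, tensors):
    """Write tensors as {name: (dtype_str, shape, raw_bytes)} in the
    SafeTensors layout: u64 header length, JSON header, raw data."""
    header, data, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        data += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte LE length
        f.write(header_bytes)
        f.write(data)


def load_safetensors(path):
    """Return {name: raw_bytes} by slicing the data buffer at the
    offsets recorded in the JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        buf = f.read()
    return {name: buf[meta["data_offsets"][0]:meta["data_offsets"][1]]
            for name, meta in header.items()}
```

Because the header is plain JSON and the payload is raw bytes, loading never executes arbitrary code (unlike pickle) and individual tensors can be read by offset without deserializing the whole file.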