Training large language models is a resource-intensive process that often spans days or even weeks on large compute clusters. Given this duration, the probability of encountering hardware failures, software glitches, or unexpected interruptions becomes significant. Without a mechanism to save progress, such interruptions could force training to restart from scratch, wasting substantial compute resources and time.
This chapter addresses a practical necessity of fault tolerance in large-scale training: checkpointing. We cover the methods required to periodically save the complete state of a training job and to resume from that saved state after an interruption. You will learn:
19.1 The Need for Checkpointing in Long Training Runs
19.2 Saving Model State (Weights, Optimizer States)
19.3 Handling Distributed Checkpointing
19.4 Asynchronous vs Synchronous Checkpointing
19.5 Checkpoint Frequency and Storage Management
19.6 Resuming Training from Checkpoints
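Before these sections go into detail, the following is a minimal sketch of the core pattern they build on: saving the model weights, optimizer state, and training progress, and restoring them on restart. It assumes a single-process PyTorch training loop; the model, loss, and the CKPT_PATH location are illustrative placeholders, not code from later sections.

```python
# Minimal checkpointing sketch (PyTorch). All names are illustrative:
# a tiny model, a stand-in loss, and a hypothetical CKPT_PATH.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical path; real runs write to durable storage

model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step: int) -> None:
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        CKPT_PATH,
    )

def load_checkpoint() -> int:
    # Restore saved state if a checkpoint exists; otherwise start from step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]

start_step = load_checkpoint()
for step in range(start_step, start_step + 100):
    loss = model(torch.randn(8, 16)).pow(2).mean()  # stand-in for a real loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        save_checkpoint(step)
```

The chapter's sections extend this simple pattern to the realities of large-scale training: coordinating saves across many distributed workers, choosing between synchronous and asynchronous writes, deciding how often to checkpoint and how to manage the resulting storage, and resuming cleanly from a saved state.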