Training a language model is a computationally heavy process that can run for hours or days. Hardware faults, memory limits, or process interruptions can halt the execution unexpectedly. Saving progress at regular intervals prevents data loss and allows you to resume from a known good state. This mechanism is known as checkpointing.
When working with standard supervised learning, a model checkpoint typically includes the entire set of model weights. However, since you are applying Parameter-Efficient Fine-Tuning with techniques like LoRA, the checkpoint behaves differently. Instead of duplicating the massive base model, the checkpoint only stores the updated LoRA adapter weights. This significantly reduces the time and storage required to save the model.
A complete checkpoint contains more than just the adapter weights. It preserves the exact state of the training environment at a specific point in time. This includes the optimizer state, the learning rate scheduler state, the current training step number, and the random number generator states.
Saving these additional elements ensures that if you need to resume training, the mathematical calculations continue exactly as if no interruption occurred. For example, optimization algorithms like AdamW maintain moving averages of gradients. If these momentum buffers are lost, the optimizer must start from scratch, which can temporarily destabilize the training process and alter the convergence path.
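To make the momentum idea concrete, here is a minimal, self-contained sketch of AdamW's first-moment update (an exponential moving average of gradients). The function name and the scalar setup are illustrative only; they are not part of any library API.

```python
# Sketch: AdamW's first-moment estimate m is an exponential moving average
# of past gradients. If a checkpoint drops m, the restored optimizer
# restarts this average from zero, losing the accumulated history.
beta1 = 0.9  # typical first-moment decay rate

def update_first_moment(m: float, grad: float) -> float:
    # One step of the moving-average update: m <- beta1*m + (1-beta1)*grad
    return beta1 * m + (1 - beta1) * grad

m = 0.0
for grad in [1.0, 1.0, 1.0, 1.0, 1.0]:
    m = update_first_moment(m, grad)

# After five identical gradients of 1.0, m has warmed up to 1 - 0.9**5;
# resetting it to zero would discard this warm-up and perturb the next updates.
print(m)
```

The warm-up behavior shown here is exactly what is lost when momentum buffers are not checkpointed: the optimizer behaves as if training had just begun.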
Figure: Components stored within a training checkpoint during the fine-tuning process.

The Hugging Face Trainer class handles state management automatically. This behavior is governed by specific parameters within the training arguments you defined earlier.
The primary setting is the save strategy, which dictates the frequency of your checkpoints. You can configure the strategy to save at the end of every epoch, at a specific number of steps, or not at all. Setting the strategy to steps is generally recommended for fine-tuning small language models, as epochs can sometimes take too long to complete. If a failure occurs near the end of an epoch without step-based checkpointing, you will lose a significant amount of computation time.
If you choose a step-based strategy, you must also define the save steps parameter, which tells the Trainer how many optimizer update steps to complete between writes to disk.
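As a sketch, assuming the transformers library is installed, a step-based strategy can be configured like this. The output directory and the specific step count are illustrative choices, not requirements:

```python
from transformers import TrainingArguments

# Illustrative values: tune save_steps to your run length and your
# tolerance for lost work between checkpoints.
training_args = TrainingArguments(
    output_dir="./lora-checkpoints",  # hypothetical path
    save_strategy="steps",            # checkpoint on a step interval
    save_steps=100,                   # write a checkpoint every 100 update steps
)
```

A smaller save_steps value reduces the maximum amount of lost computation after a crash, at the cost of more frequent disk writes.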
While PEFT adapters are relatively small, optimizer states are not. Every trainable parameter in your adapter requires tracking values in the optimizer. If your LoRA configuration results in a certain number of trainable parameters, represented as $N$, the AdamW optimizer requires $2N$ additional values to store its two moving averages (the first and second moments). Using 32-bit floating point precision, where each value occupies 4 bytes, you can calculate the storage requirement in bytes using a simple formula:

$$\text{optimizer state bytes} = 2 \times N \times 4 = 8N$$
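The arithmetic is straightforward to check. In this sketch, the parameter count is a hypothetical example of a LoRA adapter size, not a figure from the text:

```python
# Estimate the AdamW optimizer state size for a set of trainable parameters.
def adamw_state_bytes(num_trainable: int, bytes_per_value: int = 4) -> int:
    # AdamW keeps two moving averages (first and second moments)
    # per trainable parameter, each stored at bytes_per_value bytes.
    return 2 * num_trainable * bytes_per_value

n = 4_194_304  # hypothetical: a LoRA adapter with ~4.2M trainable parameters
print(adamw_state_bytes(n))           # total optimizer state in bytes (8N)
print(adamw_state_bytes(n) / 2**20)   # the same figure in MiB
```

For this example the optimizer state alone is about 32 MiB per checkpoint, which is why frequent checkpoints without a retention limit can fill a disk quickly.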
If you save a new checkpoint every 100 steps over a long training run, you will quickly exhaust your local disk space. To manage storage effectively, you should configure a total limit. Setting a limit restricts the maximum number of checkpoints kept on the hard drive. Once the limit is reached, the Trainer automatically deletes the oldest checkpoint directory before writing the new one. A common practice is to keep only the two or three most recent checkpoints.
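A retention limit is set alongside the other saving parameters. This sketch assumes the transformers library; the directory name and the limit of three are illustrative:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-checkpoints",  # hypothetical path
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,  # keep only the 3 most recent checkpoint directories
)
```

With save_total_limit=3, once a fourth checkpoint is about to be written, the oldest of the existing three is deleted first, so disk usage stays bounded for the whole run.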
State management proves its value when a training script crashes due to an out-of-memory error or a disconnected remote session. The Trainer allows immediate resumption of the optimization loop.
By passing a boolean flag or a specific directory path to the train method, the script will locate the latest checkpoint directory. It will then load the LoRA weights into the base model, restore the optimizer and scheduler states, and automatically fast-forward the data loader to the correct batch. The training loop then proceeds from the exact step where it was interrupted.
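In code, resumption is a single argument to the train call. This sketch assumes `trainer` is the Trainer instance you configured earlier, and the checkpoint path shown is hypothetical:

```python
# Passing True tells the Trainer to locate the most recent
# checkpoint directory inside output_dir and restore from it.
trainer.train(resume_from_checkpoint=True)

# Alternatively, pass an explicit directory to resume from a
# specific checkpoint rather than the latest one:
# trainer.train(resume_from_checkpoint="./lora-checkpoints/checkpoint-500")
```

The explicit-path form is useful when the latest checkpoint is suspect, for example if the crash corrupted the final write.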
Saving intermediate checkpoints allows you to evaluate different stages of the model later. Sometimes a model reaches its peak performance early in the training loop and begins to overfit the instruction dataset as training continues. By keeping multiple checkpoints, you can load earlier iterations of the weights and test them to find the version that generalizes best to new inputs.
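Because each checkpoint stores a complete LoRA adapter, an earlier training stage can be reloaded for evaluation without retraining. This sketch assumes the peft and transformers libraries; the model name and checkpoint path are placeholders for your own:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Hypothetical identifiers: substitute your base model and a saved
# checkpoint directory from the training run.
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")
model = PeftModel.from_pretrained(base_model, "./lora-checkpoints/checkpoint-300")
model.eval()  # switch to inference mode before evaluating
```

Evaluating several such checkpoints on a held-out set lets you pick the step count where generalization peaked rather than defaulting to the final weights.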