After your fine-tuning process completes and the training loss has converged, the model's newly adapted weights exist only in memory. To make your work permanent and usable for inference, you must save this state to disk. This involves more than just saving the raw parameters; a reproducible model artifact also includes the model's configuration and the specific tokenizer used during training.
The save_pretrained Method
The Hugging Face Transformers library provides a straightforward method for this: save_pretrained(). This function serializes the model's weights and its configuration file (config.json) into a specified directory. For a complete and self-contained artifact, you should also save the tokenizer to the same directory. This practice ensures that anyone using your model will load it with the exact vocabulary and tokenization rules it was trained with.
Consider a fine-tuned model object named model and its corresponding tokenizer. You can save them as follows:
# Assume 'model' and 'tokenizer' are your fine-tuned objects
output_dir = "./my_finetuned_model"
# Save the model's weights and configuration file
model.save_pretrained(output_dir)
# Save the tokenizer's vocabulary and configuration
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")
Executing this code creates a directory containing several files. The primary components are the model weights (e.g., pytorch_model.bin or model.safetensors), the model configuration (config.json), and the tokenizer files (e.g., tokenizer.json, vocab.json).
The directory structure created by save_pretrained(). Storing the model, configuration, and tokenizer together ensures reproducibility.
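To verify the artifact on disk, you can simply list the output directory. The snippet below is a minimal sketch that reuses the output_dir from the earlier example; the exact file names depend on your transformers version and which serialization formats are installed.
import os

output_dir = "./my_finetuned_model"

# Typical entries include config.json, model.safetensors (or pytorch_model.bin),
# tokenizer.json, tokenizer_config.json, and special_tokens_map.json.
for name in sorted(os.listdir(output_dir)):
    size_mb = os.path.getsize(os.path.join(output_dir, name)) / 1e6
    print(f"{name}: {size_mb:.1f} MB")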
Full fine-tuning can be a lengthy and computationally expensive process. A hardware failure or an interruption could cause you to lose hours of progress. To mitigate this risk, it is standard practice to save intermediate versions of your model, known as checkpoints, throughout the training run.
The Hugging Face Trainer API simplifies this through its TrainingArguments. You can configure it to save checkpoints automatically based on either the number of training steps or at the end of each epoch.
For example, to save a checkpoint after every epoch:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./training_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    # Strategy can be "steps" or "epoch"
    save_strategy="epoch",
    save_total_limit=2,  # Optional: only keep the last 2 checkpoints
    logging_dir="./logs",
)
The Trainer will create subdirectories inside ./training_checkpoints (e.g., checkpoint-500, checkpoint-1000) at the specified intervals. Each subdirectory is a complete, loadable model version, allowing you to resume training or evaluate performance from an intermediate point.
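Resuming from an interrupted run is then a single call on the Trainer. The sketch below assumes that model, training_args, and a train_dataset of your own are already defined; it is illustrative rather than a complete training script.
from transformers import Trainer

# Assumes 'model', 'training_args', and 'train_dataset' are already defined.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Passing True tells the Trainer to locate the most recent checkpoint in
# output_dir; you can also pass an explicit path such as
# "./training_checkpoints/checkpoint-500".
trainer.train(resume_from_checkpoint=True)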
Once a model is saved, you can load it back into memory for evaluation or deployment using the corresponding from_pretrained() class method. The AutoModel classes are particularly useful here, as they automatically infer the correct model architecture from the config.json file in the specified directory.
The loading process is symmetric to the saving process:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_path = "./my_finetuned_model"
# Load the fine-tuned model and tokenizer from the directory
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
By pointing from_pretrained() to the directory path, the library automatically locates and loads the weights, configuration, and tokenizer files, reconstructing the exact state you saved.
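A quick way to sanity-check the reloaded model is to run a short generation. This is a minimal sketch assuming a causal language model; the prompt and generation settings are illustrative.
# Run a short generation to confirm the reloaded weights behave as expected
prompt = "The key benefit of fine-tuning is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model.eval()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))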
You may have noticed the file model.safetensors in the directory contents above. This is a modern serialization format designed as a more secure and performant alternative to Python's default pickle format, which is used for pytorch_model.bin. Pickled files can be exploited to execute arbitrary code, posing a security risk.
SafeTensors, in contrast, stores only the tensor data and its metadata, so loading a file cannot trigger arbitrary code execution. It also enables faster model loading, especially for very large models, because tensors can be memory-mapped and read directly without an intermediate deserialization step. Recent versions of the transformers library default to safetensors in save_pretrained() when the safetensors package is installed (pip install safetensors), and using it is the recommended practice for sharing and deploying models.
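If you prefer to be explicit rather than rely on the default, save_pretrained() accepts a safe_serialization flag. A minimal sketch:
# Explicitly request the safetensors format when saving
model.save_pretrained("./my_finetuned_model", safe_serialization=True)

# Setting safe_serialization=False falls back to the pickle-based
# pytorch_model.bin format, which is generally not recommended.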