Saving and packaging fine-tuned models is essential for reliable deployment and sharing, especially after optimization techniques like quantization or pruning have been applied, or PEFT adapters have been merged. Proper serialization ensures that the model's learned parameters, configuration, and associated components like tokenizers are stored correctly and can be loaded consistently in different environments. Packaging bundles the model with its dependencies and documentation, making management and deployment simpler.
While various serialization formats exist, the ecosystem around LLMs has largely converged on a few common practices, primarily driven by libraries like Hugging Face's transformers.
save_pretrained / from_pretrained
This is arguably the most common method for saving and loading transformer-based models, including fine-tuned LLMs. When you call model.save_pretrained(save_directory), the library typically saves several components:
- Model weights: stored in PyTorch's .bin format (pytorch_model.bin) or, increasingly, in the safetensors format (model.safetensors).
- Model configuration: a JSON file (config.json) storing the model's architecture, hyperparameters (like hidden size, number of layers), and other metadata necessary to reconstruct the model structure.
- Tokenizer files: written when you call tokenizer.save_pretrained, these include the vocabulary files (e.g., vocab.json, merges.txt, sentencepiece.model) and the tokenizer configuration (tokenizer_config.json, special_tokens_map.json) required to correctly process text input for the model.
# Example: Saving a model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "gpt2" # Or your fine-tuned model path
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# ... fine-tuning or optimization happens here ...
output_dir = "./my_finetuned_model"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")
# Later, loading the model
loaded_model = AutoModelForCausalLM.from_pretrained(output_dir)
loaded_tokenizer = AutoTokenizer.from_pretrained(output_dir)
print("Model and tokenizer loaded successfully.")
Using save_pretrained ensures that all necessary components are saved in a standardized structure, facilitating easy loading via from_pretrained.
Safetensors (.safetensors) is a secure and fast file format for storing tensors, designed as an alternative to Python's pickle format, which is known to have security vulnerabilities (it can execute arbitrary code). Main advantages include:
- Safety: loading safetensors files does not involve arbitrary code execution.
- Speed: tensors can be memory-mapped and loaded lazily, which typically makes loading faster.
Hugging Face libraries increasingly support and default to safetensors. You can explicitly request it:
# Saving with safetensors explicitly
model.save_pretrained(output_dir, safe_serialization=True)
If a model.safetensors file exists in the directory, from_pretrained will prioritize loading it over pytorch_model.bin.
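As a quick illustration, here is a minimal sketch (assuming the output_dir from above contains a model.safetensors file) of inspecting and loading the weights directly with the safetensors library; the file is opened lazily and nothing is unpickled:
# Sketch: inspecting a safetensors file without loading everything into memory
from safetensors import safe_open
from safetensors.torch import load_file
weights_path = f"{output_dir}/model.safetensors"  # assumes the file saved earlier
# Lazily open the file and list tensor names; no pickle, no arbitrary code execution
with safe_open(weights_path, framework="pt") as f:
    tensor_names = list(f.keys())
    print(f"File contains {len(tensor_names)} tensors, e.g. {tensor_names[:3]}")
# Or load the full state dict as regular PyTorch tensors
state_dict = load_file(weights_path)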
ONNX provides a standardized format for representing machine learning models. Exporting your fine-tuned LLM to ONNX can offer benefits such as framework interoperability, graph-level optimizations, and hardware-specific acceleration through runtimes like ONNX Runtime.
Exporting usually involves tracing the model with sample inputs. Libraries like transformers.onnx provide utilities to simplify this process.
# Example of exporting to ONNX
# Note: newer transformers releases recommend the optimum library for ONNX export,
# and older releases named the first export() argument 'tokenizer' rather than 'preprocessor'.
from pathlib import Path
from transformers.onnx import export, FeaturesManager
# Assume 'model' and 'tokenizer' are loaded
onnx_output_dir = Path("./my_finetuned_model_onnx")
onnx_output_dir.mkdir(parents=True, exist_ok=True)
# Get the feature (e.g., 'causal-lm') and the matching ONNX configuration
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature="causal-lm")
onnx_config = model_onnx_config(model.config)
# Export the model; returns the matched input and output names
onnx_inputs, onnx_outputs = export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    output=onnx_output_dir / "model.onnx",
    opset=onnx_config.default_onnx_opset,  # e.g., 13 or higher
)
print(f"Model exported to ONNX format at {onnx_output_dir}")
print(f"Model exported to ONNX format at {onnx_output_dir}")
Challenges can arise with dynamic shapes, custom operations, or ensuring compatibility with the target ONNX runtime version. Thorough testing after export is necessary.
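For example, a minimal sanity check might compare ONNX Runtime outputs against the original PyTorch model. This is a sketch that assumes the onnxruntime package is installed and that the export above produced the usual causal-lm input names (input_ids, attention_mask) and a logits output:
# Sketch: verify the exported ONNX model against the PyTorch model
import numpy as np
import onnxruntime as ort
import torch
session = ort.InferenceSession(str(onnx_output_dir / "model.onnx"))
text = "Serialization sanity check"
encoded = tokenizer(text, return_tensors="np")
onnx_logits = session.run(None, {"input_ids": encoded["input_ids"],
                                 "attention_mask": encoded["attention_mask"]})[0]
with torch.no_grad():
    torch_logits = model(**tokenizer(text, return_tensors="pt")).logits.numpy()
# Small numerical differences are expected; large ones indicate an export problem
print("Max absolute difference:", np.abs(onnx_logits - torch_logits).max())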
If you used Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA or QLoRA, serialization requires special attention because you typically only saved the adapter weights, not the entire model.
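For context, here is a sketch of how such an adapter-wrapped model might be constructed in the first place; the rank, scaling, and target_modules values below are illustrative and depend on your base model:
# Sketch: wrapping a base model with a LoRA adapter using peft
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    r=8,                        # illustrative rank
    lora_alpha=16,              # illustrative scaling factor
    target_modules=["c_attn"],  # attention projection module name for GPT-2
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # confirms only adapter weights are trainable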
Saving Adapters: PEFT libraries usually have their own saving mechanisms. For Hugging Face's peft, you save the adapter configuration and weights:
# Assuming 'peft_model' is your model with adapters
adapter_output_dir = "./my_lora_adapter"
peft_model.save_pretrained(adapter_output_dir)
print(f"Adapter saved to {adapter_output_dir}")
# This saves adapter_model.bin (or .safetensors) and adapter_config.json
Loading Adapters: To use the adapter, you first load the base pre-trained model and then load the adapter weights onto it:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_name = "gpt2" # The original base model
adapter_dir = "./my_lora_adapter"
# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name) # Often use base tokenizer
# Load the adapter onto the base model
inference_model = PeftModel.from_pretrained(base_model, adapter_dir)
inference_model.eval() # Set to evaluation mode
print("Base model loaded with PEFT adapter.")
This approach keeps the base model weights separate, allowing multiple adapters to be potentially used with the same base model instance.
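If you maintain several adapters, a sketch of switching between them on one base model might look like the following; the second adapter directory ./my_other_adapter and the adapter names are hypothetical:
# Sketch: attaching and switching between multiple adapters on one base model
# The adapter loaded via PeftModel.from_pretrained is named "default"
inference_model.load_adapter("./my_other_adapter", adapter_name="other_task")
inference_model.set_adapter("other_task")  # route forward passes through this adapter
inference_model.set_adapter("default")     # switch back to the first adapter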
Merging and Saving: As discussed in the previous section ("Merging PEFT Adapters"), you can merge the adapter weights into the base model's weights. After merging, you obtain a standard transformers model object. You can then save this merged model using the standard model.save_pretrained(output_dir) method. This simplifies deployment as you only need to manage one set of model artifacts, but you lose the modularity of separate adapters.
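A minimal sketch of this merge-then-save flow (merge_and_unload applies to LoRA-style adapters; the output directory name is illustrative):
# Sketch: merge LoRA weights into the base model and save a standalone checkpoint
merged_model = peft_model.merge_and_unload()  # returns a plain transformers model
merged_dir = "./my_merged_model"
merged_model.save_pretrained(merged_dir, safe_serialization=True)
tokenizer.save_pretrained(merged_dir)
print(f"Merged model saved to {merged_dir}")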
Saving the model weights and configuration is only part of the story. Proper packaging involves bundling everything required for the model to run correctly and be understood by others.
- Dependencies: pin the exact library versions in a requirements.txt file or an environment specification file (like Conda's environment.yml). This includes libraries like torch, transformers, peft, sentencepiece, and protobuf (for ONNX). Inconsistent dependency versions are a frequent source of errors.
- Configuration and tokenizer files: ensure the files (config.json, tokenizer_config.json, etc.) generated during save_pretrained are included alongside the model weights.
- Documentation: a README.md file formatted as a Model Card. This file should document the model's intended use, the base model and fine-tuning data, evaluation results, known limitations, and licensing.
- Containerization: packaging the model and its runtime environment into a container image, where a Dockerfile defines the steps to build this image. This creates a self-contained, reproducible environment that simplifies deployment across different systems.
Figure: components typically included when packaging a fine-tuned LLM for deployment. PEFT adapters might be included separately or merged into the main model weights; containerization provides an isolated deployment environment.
By carefully serializing your optimized model and packaging it with its dependencies and documentation, you create a reliable and portable artifact ready for integration into downstream applications or sharing within the community. This systematic approach minimizes deployment friction and ensures reproducibility.