After optimizing your fine-tuned model through techniques like quantization or pruning, and potentially merging PEFT adapters, the next step is to save and package it reliably for deployment or sharing. Proper serialization ensures that the model's learned parameters, configuration, and associated components like tokenizers are stored correctly and can be loaded consistently in different environments. Packaging bundles the model with its dependencies and documentation, making it easier to manage and deploy.
While various serialization formats exist, the ecosystem around LLMs has largely converged on a few common practices, primarily driven by libraries like Hugging Face's transformers.
save_pretrained / from_pretrained
This is arguably the most common method for saving and loading transformer-based models, including fine-tuned LLMs. When you call model.save_pretrained(save_directory), the library typically saves several components:
- Model weights, either in PyTorch's .bin format (pytorch_model.bin) or, increasingly, in the safetensors format (model.safetensors).
- A configuration file (config.json) storing the model's architecture, hyperparameters (like hidden size, number of layers), and other metadata necessary to reconstruct the model structure.
- Tokenizer files (vocab.json, merges.txt, sentencepiece.model) and the tokenizer configuration (tokenizer_config.json, special_tokens_map.json) required to correctly process text input for the model.
# Example: Saving a model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "gpt2" # Or your fine-tuned model path
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# ... fine-tuning or optimization happens here ...
output_dir = "./my_finetuned_model"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")
# Later, loading the model
loaded_model = AutoModelForCausalLM.from_pretrained(output_dir)
loaded_tokenizer = AutoTokenizer.from_pretrained(output_dir)
print("Model and tokenizer loaded successfully.")
Using save_pretrained ensures that all necessary components are saved in a standardized structure, facilitating easy loading via from_pretrained.
Safetensors (.safetensors) is a secure and fast file format for storing tensors, designed as an alternative to Python's pickle format, which is known to have security vulnerabilities (it can execute arbitrary code). Key advantages include:
- Safety: loading safetensors files does not involve arbitrary code execution, so opening a checkpoint from an untrusted source is safe.
- Speed: tensors can be memory-mapped and loaded lazily, which makes loading large checkpoints fast.
Hugging Face libraries increasingly support and default to safetensors. You can explicitly request it:
# Saving with safetensors explicitly
model.save_pretrained(output_dir, safe_serialization=True)
If a model.safetensors file exists in the directory, from_pretrained will prioritize loading it over pytorch_model.bin.
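Because safetensors files are plain tensor containers, you can also inspect a saved checkpoint directly with the safetensors library, which is useful for verifying that an optimized model was written correctly. Below is a minimal sketch, assuming the model was saved to ./my_finetuned_model as in the example above:
# Inspecting a saved safetensors checkpoint (illustrative sketch)
from safetensors import safe_open

checkpoint_path = "./my_finetuned_model/model.safetensors"  # path from the earlier example
with safe_open(checkpoint_path, framework="pt") as f:
    # Print a few tensor names with their shapes and dtypes
    for name in list(f.keys())[:5]:
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)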
ONNX (Open Neural Network Exchange) provides a standardized format for representing machine learning models. Exporting your fine-tuned LLM to ONNX can offer benefits such as framework-independent deployment, access to optimized inference runtimes like ONNX Runtime, and broader hardware support.
Exporting usually involves tracing the model with sample inputs. Libraries like transformers.onnx provide utilities to simplify this process (note that recent versions of transformers recommend the optimum library for ONNX export).
# Conceptual example of exporting to ONNX (legacy transformers.onnx API)
from pathlib import Path
from transformers.onnx import export, FeaturesManager

# Assume 'model' and 'tokenizer' are loaded
onnx_output_dir = Path("./my_finetuned_model_onnx")
onnx_output_dir.mkdir(parents=True, exist_ok=True)

# Get the feature (e.g., 'causal-lm') and the matching ONNX configuration
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
    model, feature="causal-lm"
)
onnx_config = model_onnx_config(model.config)

# Export the model (arguments: preprocessor/tokenizer, model, config, opset, output path)
onnx_inputs, onnx_outputs = export(
    tokenizer,
    model,
    onnx_config,
    onnx_config.default_onnx_opset,  # e.g., 13 or higher
    onnx_output_dir / "model.onnx",
)
print(f"Model exported to ONNX format at {onnx_output_dir}")
Challenges can arise with dynamic shapes, custom operations, or ensuring compatibility with the target ONNX runtime version. Thorough testing after export is necessary.
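A common sanity check is to run the exported graph with ONNX Runtime and compare its logits against the original PyTorch model. The sketch below assumes the onnxruntime package is installed and that the export produced inputs named input_ids and attention_mask; adjust the input names to match your export configuration.
# Validating the ONNX export against the PyTorch model (illustrative sketch)
import numpy as np
import onnxruntime as ort
import torch

session = ort.InferenceSession(str(onnx_output_dir / "model.onnx"))
encoded = tokenizer("Hello, world!", return_tensors="pt")

# Run the ONNX graph (input names assumed to match the export configuration)
onnx_logits = session.run(
    None,
    {
        "input_ids": encoded["input_ids"].numpy(),
        "attention_mask": encoded["attention_mask"].numpy(),
    },
)[0]

# Run the original PyTorch model and compare the outputs
with torch.no_grad():
    torch_logits = model(**encoded).logits.numpy()
print("Max absolute difference:", np.max(np.abs(onnx_logits - torch_logits)))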
If you used Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA or QLoRA, serialization requires special attention because you typically only saved the adapter weights, not the entire model.
Saving Adapters: PEFT libraries usually have their own saving mechanisms. For Hugging Face's peft, you save the adapter configuration and weights:
# Assuming 'peft_model' is your model with adapters
adapter_output_dir = "./my_lora_adapter"
peft_model.save_pretrained(adapter_output_dir)
print(f"Adapter saved to {adapter_output_dir}")
# This saves adapter_model.bin (or .safetensors) and adapter_config.json
Loading Adapters: To use the adapter, you first load the base pre-trained model and then load the adapter weights onto it:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_name = "gpt2" # The original base model
adapter_dir = "./my_lora_adapter"
# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name) # Often use base tokenizer
# Load the adapter onto the base model
inference_model = PeftModel.from_pretrained(base_model, adapter_dir)
inference_model.eval() # Set to evaluation mode
print("Base model loaded with PEFT adapter.")
This approach keeps the base model weights separate, allowing multiple adapters to be potentially used with the same base model instance.
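If you maintain several adapters for the same base model, the peft API lets you attach them under different names and switch between them at runtime. A brief sketch, assuming a second adapter directory ./my_other_adapter (a hypothetical path) trained on the same base model:
# Loading multiple adapters onto one base model (illustrative sketch)
# './my_other_adapter' is a placeholder for a second adapter directory
inference_model.load_adapter("./my_other_adapter", adapter_name="other_task")

# Switch between adapters without reloading the base weights;
# the adapter loaded via from_pretrained is named "default"
inference_model.set_adapter("other_task")
inference_model.set_adapter("default")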
Merging and Saving: As discussed in the previous section ("Merging PEFT Adapters"), you can merge the adapter weights into the base model's weights. After merging, you obtain a standard transformers model object. You can then save this merged model using the standard model.save_pretrained(output_dir) method. This simplifies deployment, as you only need to manage one set of model artifacts, but you lose the modularity of separate adapters.
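As a quick illustration, a merge followed by a standard save might look like the sketch below, using peft's merge_and_unload and assuming 'peft_model' is a LoRA-wrapped model as in the earlier examples:
# Merging LoRA weights into the base model and saving a standalone checkpoint
merged_model = peft_model.merge_and_unload()  # returns a plain transformers model
merged_model.save_pretrained("./my_merged_model", safe_serialization=True)
tokenizer.save_pretrained("./my_merged_model")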
Saving the model weights and configuration is only part of the story. Proper packaging involves bundling everything required for the model to run correctly and be understood by others.
- Dependencies: Record all required Python packages and their versions in a requirements.txt file or an environment specification file (like Conda's environment.yml). This includes libraries like torch, transformers, peft, sentencepiece, protobuf (for ONNX), etc. Inconsistent dependency versions are a frequent source of errors.
- Configuration files: Ensure all files (config.json, tokenizer_config.json, etc.) generated during save_pretrained are included alongside the model weights.
- Documentation: Include a README.md file formatted as a Model Card. This file should document the model's intended use, training data and procedure, evaluation results, and known limitations.
- Containerization: Optionally package the model, serving code, and dependencies into a container image; a Dockerfile defines the steps to build this image. This creates a self-contained, reproducible environment that simplifies deployment across different systems.
Components typically included when packaging a fine-tuned LLM for deployment. PEFT adapters might be included separately or merged into the main model weights. Containerization provides an isolated deployment environment.
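As a simple illustration of the first three points, the sketch below writes a pinned requirements.txt and a minimal Model Card stub next to the saved model. The package names, version numbers, and card fields are placeholders, not a prescription; pin the exact versions from your own environment.
# Writing a pinned requirements file and a minimal Model Card stub (illustrative sketch)
from pathlib import Path

package_dir = Path("./my_finetuned_model")  # directory produced by save_pretrained above

# Pin the exact versions used during fine-tuning (placeholder versions shown)
requirements = "\n".join([
    "torch==2.3.1",
    "transformers==4.44.0",
    "peft==0.12.0",
    "sentencepiece==0.2.0",
])
(package_dir / "requirements.txt").write_text(requirements + "\n")

# Minimal Model Card stub; expand with intended use, evaluation results, limitations
model_card = """---
license: apache-2.0
base_model: gpt2
---
# My Fine-Tuned Model

Describe the training data, intended use, evaluation results, and known limitations here.
"""
(package_dir / "README.md").write_text(model_card)
print(f"Packaging files written to {package_dir}")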
By carefully serializing your optimized model and packaging it with its dependencies and documentation, you create a reliable and portable artifact ready for integration into downstream applications or sharing within the community. This systematic approach minimizes deployment friction and ensures reproducibility.