Saving a merged model architecture to disk requires a reliable serialization format. Historically, the standard method for saving PyTorch models involved using the Pickle module to serialize Python objects into .bin or .pt files. While functional, this approach introduces significant security vulnerabilities and performance bottlenecks that make it unsuitable for production environments.
Pickle files do not just store data. They can also contain arbitrary executable code. When you load a pickled model using traditional PyTorch methods, you are instructing the Python interpreter to execute whatever instructions are embedded inside that file. In a production setting where models might be shared, downloaded from public hubs, or transferred between servers, this creates an unacceptable security risk. A malicious actor could inject code into a model file that executes the moment the model is loaded into memory.
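To make the risk concrete, here is a minimal, deliberately benign sketch of the mechanism. The `Payload` class is invented for illustration; its `__reduce__` method tells pickle to call an arbitrary function during loading. Here that function is a harmless `eval` of a string literal, but an attacker could substitute any callable, such as `os.system`:

```python
import pickle

class Payload:
    # __reduce__ embeds a (callable, args) pair in the pickle stream.
    # pickle.loads() invokes the callable at load time -- this is the
    # same hook malicious model files abuse.
    def __reduce__(self):
        return (eval, ("'arbitrary code ran at load time'",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the embedded call executes here
print(result)
```

Nothing in the file's appearance warns you: the payload executes as a side effect of loading, before your own code ever inspects the object.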
Furthermore, deserializing a large pickled checkpoint means allocating CPU memory and reconstructing every tensor object. The entire file must be read and unpickled into system RAM before the tensors can be moved to GPU memory. This redundant copying slows down model initialization and frequently causes out-of-memory errors on machines with limited system RAM.
To solve these security and performance problems, Hugging Face developed the Safetensors format. Safetensors is an open-source serialization format specifically designed for storing and loading machine learning tensors. It restricts the saved file to only contain raw tensor data and metadata, completely removing the ability to execute arbitrary code.
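The format itself is simple: an 8-byte little-endian integer giving the length of a JSON header, the header itself (tensor names mapped to dtype, shape, and byte offsets), and then the raw tensor bytes. The sketch below hand-writes and reads back a toy single-tensor file to illustrate the layout; in practice you would use the `safetensors` library rather than this illustrative code:

```python
import json
import os
import struct
import tempfile

def write_minimal_safetensors(path, name, values):
    # Raw little-endian float32 bytes -- no opcodes, no code objects.
    data = b"".join(struct.pack("<f", v) for v in values)
    header = {
        name: {"dtype": "F32", "shape": [len(values)],
               "data_offsets": [0, len(data)]}
    }
    hbytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(hbytes)))  # 8-byte header length
        f.write(hbytes)                          # JSON metadata
        f.write(data)                            # raw tensor payload

def read_safetensors_header(path):
    # Reading metadata never executes anything: it is pure data.
    with open(path, "rb") as f:
        n = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(n))

path = os.path.join(tempfile.mkdtemp(), "toy.safetensors")
write_minimal_safetensors(path, "w", [1.0, 2.0, 3.0])
print(read_safetensors_header(path))
```

Because the header is plain JSON and the body is raw bytes, a loader can validate shapes and offsets up front and reject malformed files without ever running untrusted logic.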
From a performance perspective, Safetensors utilizes memory mapping (mmap). Instead of reading the entire file into CPU memory and then copying it, the operating system maps the file directly into virtual memory. Tensors are loaded into the target device, such as a GPU, only when they are accessed. This bypasses redundant copies in system RAM, resulting in a zero-copy loading process that drastically reduces the time it takes to instantiate a language model.
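The mechanism behind this is the operating system's `mmap` facility, which Python exposes directly. The sketch below (with an invented scratch file standing in for a checkpoint) maps a file and reads four bytes from the middle of it; the OS pages in only the region actually touched, rather than the whole file:

```python
import mmap
import os
import struct
import tempfile

# Invented stand-in for a weights file: 1000 float32 values.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000f", *range(1000)))

with open(path, "rb") as f:
    # Map the file into virtual memory; no bulk read happens here.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Slicing touches only the pages backing this byte range.
chunk = mm[40:44]                  # bytes of the float at index 10
value = struct.unpack("<f", chunk)[0]
print(value)
mm.close()
```

A Safetensors loader applies the same idea at tensor granularity: the header's byte offsets let it map each tensor's region on demand and hand it to the target device without staging the whole checkpoint in system RAM.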
Figure: Memory allocation paths comparing traditional Pickle serialization to Safetensors memory mapping.
The Hugging Face Transformers library provides native support for the Safetensors format. When you are ready to export your merged model, you can use the save_pretrained method and explicitly declare your serialization preference.
import os

# Define the output directory for your deployment-ready model
output_dir = "./my-slm-production"
os.makedirs(output_dir, exist_ok=True)

# Save the merged model using Safetensors
merged_model.save_pretrained(
    output_dir,
    safe_serialization=True,
)

# Save the tokenizer alongside the model
tokenizer.save_pretrained(output_dir)
Setting safe_serialization=True ensures that the model weights are written to .safetensors files instead of pickle-based .bin files. When you examine the contents of your output directory, you will find a model.safetensors file (or several, for larger models) along with the necessary configuration JSON files.
For larger models, the Transformers library automatically chunks the weights into multiple files, such as model-00001-of-00002.safetensors. This chunking process ensures that individual file sizes remain manageable for file systems and network transfers. An index file named model.safetensors.index.json is generated alongside the chunks to map which specific layers are stored in which file.
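The index is ordinary JSON, so you can inspect the shard layout directly. The sketch below builds a toy index file whose structure mirrors what Transformers writes (the tensor names and sizes here are invented for illustration) and then summarizes the layer-to-shard mapping:

```python
import json
import os
import tempfile

# Toy stand-in for model.safetensors.index.json; the "weight_map"
# key maps each parameter name to the shard file that contains it.
index = {
    "metadata": {"total_size": 4_000_000_000},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}
path = os.path.join(tempfile.mkdtemp(), "model.safetensors.index.json")
with open(path, "w") as f:
    json.dump(index, f)

with open(path) as f:
    weight_map = json.load(f)["weight_map"]
shards = sorted(set(weight_map.values()))
print(f"{len(weight_map)} tensors across {len(shards)} shard(s)")
```

Checking this mapping is a quick sanity test after an export: every parameter the model config expects should appear exactly once in the weight map.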
Before moving to the serving infrastructure, it is a good practice to verify that the model loads correctly from the newly created Safetensors files. You can initialize the model directly from the local directory.
from transformers import AutoModelForCausalLM

# Load the model directly from the Safetensors directory
production_model = AutoModelForCausalLM.from_pretrained(
    output_dir,
    local_files_only=True,
    device_map="auto",
)
By passing local_files_only=True, you prevent the library from attempting to contact the Hugging Face Hub, ensuring that your local memory-mapped loading path is functioning correctly. The device_map="auto" argument automatically distributes the loaded tensors across available hardware.
Exporting your fine-tuned model into a fast, secure format establishes a reliable foundation for deployment. With the tensors sitting on disk ready for rapid memory-mapped access, the model is prepared to be ingested by a high-throughput inference engine.