Alright, let's translate the theory of static Post-Training Quantization (PTQ) into practice. In this section, we'll walk through the steps of applying static PTQ to a pre-trained transformer model using popular libraries. The goal is to convert the model's weights and potentially activation computations to use lower-precision integers (like INT8), using a calibration dataset to determine the optimal quantization parameters. This process happens after the model has been fully trained.
We will primarily use the Hugging Face ecosystem: the transformers library to load our model and tokenizer, the datasets library to fetch our calibration data, and the optimum library, which provides tools for model optimization, including quantization, often leveraging backends like ONNX Runtime.
First, ensure you have the necessary libraries installed. You'll need transformers, datasets, optimum, and a backend supported by optimum for quantization, such as onnxruntime. You might also need accelerate for smoother model handling.
pip install transformers datasets optimum[onnxruntime] accelerate torch
Note: Ensure you have PyTorch installed as well, as it's often required by transformers and optimum.
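Because the optimum APIs evolve fairly quickly, it can help to record the exact versions you are running before starting. This optional check uses only the standard library:

from importlib.metadata import version

# Print installed versions of the key packages (optional sanity check)
for pkg in ("transformers", "datasets", "optimum", "onnxruntime", "torch"):
    print(pkg, version(pkg))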
We start by loading a standard pre-trained model. For this example, let's use a smaller, well-known model like distilbert-base-uncased to keep the process manageable.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_id = "distilbert-base-uncased"
# Load a model suitable for a task, e.g., sequence classification
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print("Original model loaded:")
print(model)
This loads the standard FP32 (32-bit floating-point) version of the model. Because distilbert-base-uncased is not fine-tuned for classification, transformers attaches a freshly initialized classification head (you will see a warning about this); the quantization workflow is unaffected, but for meaningful predictions you would start from a fine-tuned checkpoint such as distilbert-base-uncased-finetuned-sst-2-english. Our goal is to convert this FP32 model to a static INT8 representation.
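As an optional sanity check, you can confirm the baseline precision and parameter count that we will be shrinking:

import torch

# Inspect the baseline: parameter count and storage dtype of the FP32 model
num_params = sum(p.numel() for p in model.parameters())
param_dtypes = {p.dtype for p in model.parameters()}

print(f"Parameter count: {num_params / 1e6:.1f}M")
print(f"Parameter dtypes: {param_dtypes}")  # expected: {torch.float32}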
Static PTQ relies on a small, representative dataset, known as the calibration dataset. This data is used to observe the distribution of activations within the model during inference. These observed ranges (minimum and maximum values) are crucial for calculating the scaling factors and zero-points needed for mapping FP32 values to INT8.
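To make this concrete, here is a minimal sketch, separate from the optimum workflow, of how an observed activation range is turned into a scale and zero-point for asymmetric INT8 quantization (the activation values are made up for illustration):

import numpy as np

# Hypothetical activation values observed on calibration data
activations = np.array([-1.7, -0.3, 0.0, 0.8, 2.4], dtype=np.float32)

# Asymmetric quantization to the unsigned 8-bit range [0, 255]
observed_min, observed_max = float(activations.min()), float(activations.max())
scale = (observed_max - observed_min) / 255.0
zero_point = int(round(-observed_min / scale))

# Quantize, then dequantize to see the rounding error introduced
q = np.clip(np.round(activations / scale) + zero_point, 0, 255).astype(np.uint8)
dq = (q.astype(np.float32) - zero_point) * scale

print(f"scale={scale:.4f}, zero_point={zero_point}")
print("original   :", activations)
print("dequantized:", dq)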
Let's load a small subset of a dataset, for instance, the 'sst2' (Stanford Sentiment Treebank) dataset, often used for classification tasks. We only need a few hundred samples for calibration.
from datasets import load_dataset
# Load a dataset suitable for the model's task (e.g., sentiment analysis)
calibration_dataset_name = "sst2"
num_calibration_samples = 200 # A small number is often sufficient
# Load and select a subset
full_calibration_dataset = load_dataset(calibration_dataset_name, split="train")
calibration_indices = list(range(num_calibration_samples))
calibration_subset = full_calibration_dataset.select(calibration_indices)
# Preprocessing function: tokenize to fixed-length inputs
def preprocess_function(examples):
    # Adjust "sentence" to the actual text column name if different
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Apply preprocessing; drop the raw text/label columns so only model inputs remain
processed_calibration_dataset = calibration_subset.map(preprocess_function, batched=True, remove_columns=calibration_subset.column_names)
print(f"Prepared calibration dataset with {len(processed_calibration_dataset)} samples.")
# Display one processed sample's keys
print("Processed sample keys:", processed_calibration_dataset[0].keys())
The key is that this dataset should reflect the type of data the model will encounter during actual inference. We preprocess it using the model's tokenizer, ensuring the inputs match what the model expects.
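If you want to eyeball what actually goes into calibration, you can decode one processed sample back to text; this is purely an optional check:

# Optional: decode the first calibration sample to verify preprocessing
sample = processed_calibration_dataset[0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))
print("Tokens per sample (including padding):", len(sample["input_ids"]))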
Now, we use optimum to define our quantization configuration and prepare the quantizer. For the ONNX Runtime backend, the model is first exported to ONNX, and the quantizer then operates on that exported graph; Optimum leverages ONNX Runtime's quantization tools under the hood. Note that the exact API has shifted across optimum releases; the calls below follow the current ORTQuantizer workflow.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig

# Directories for the exported FP32 ONNX model and the quantized model
onnx_model_path = "distilbert_base_uncased_onnx"
quantized_model_path = "distilbert_base_uncased_quantized_onnx"

# 1. Export the model to ONNX and create the quantizer from the exported model
#    (older optimum releases used a from_transformers flag instead of export=True)
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
onnx_model.save_pretrained(onnx_model_path)
quantizer = ORTQuantizer.from_pretrained(onnx_model)

# 2. Define the quantization strategy: static INT8
#    (avx512_vnni targets recent x86 CPUs; arm64 and avx2 variants also exist)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=True, per_channel=False)

print("Quantization configuration created:")
print(qconfig)
Here, AutoQuantizationConfig.avx512_vnni(is_static=True) sets up a static INT8 configuration targeting a specific CPU instruction set (similar helpers such as arm64 and avx2 cover other hardware). ORTQuantizer is initialized from the exported ONNX model; the calibration dataset we prepared comes into play in the calibration step below.
With the ONNX model exported, calibration data ready, and configuration defined, we can now perform the quantization. We first run calibration with quantizer.fit, which feeds the calibration samples through the model and records activation ranges, and then call quantizer.quantize, which uses those ranges to compute the quantization parameters and generate the final quantized model.
# 3. Run calibration to collect min/max activation ranges
calibration_config = AutoCalibrationConfig.minmax(processed_calibration_dataset)
ranges = quantizer.fit(
    dataset=processed_calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)

# 4. Apply quantization using the calibrated ranges
quantizer.quantize(
    save_dir=quantized_model_path,
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)
print(f"Static PTQ complete. Quantized model saved to: {quantized_model_path}")
This process might take a few minutes depending on the model size and the number of calibration samples. It involves running inference over the calibration set to collect activation ranges, computing scales and zero-points from those ranges, and rewriting the ONNX graph with quantized weights and operators.
A primary benefit of quantization is model size reduction. Let's compare the size of the original PyTorch model's state dictionary with the quantized ONNX model file.
import os

# Helper: total size of a directory tree in bytes
def get_dir_size(path='.'):
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total += entry.stat().st_size
            elif entry.is_dir():
                total += get_dir_size(entry.path)
    return total

# Approximate size of the original PyTorch model's parameters (FP32, can vary slightly)
original_model_size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 * 1024)

# Size of the quantized ONNX model directory, converted to MB
quantized_model_size_mb = get_dir_size(quantized_model_path) / (1024 * 1024)
print(f"Approx. Original FP32 Model Size: {original_model_size_mb:.2f} MB")
print(f"Quantized INT8 Model Size (ONNX): {quantized_model_size_mb:.2f} MB")
# Calculate reduction
reduction = (1 - (quantized_model_size_mb / original_model_size_mb)) * 100
print(f"Size Reduction: {reduction:.2f}%")
Approximate size comparison between the original FP32 model and the statically quantized INT8 model (ONNX format). Sizes are indicative and may vary based on the exact model and saving format.
You should observe a significant reduction in model size, often close to 4x when moving from FP32 to INT8, as INT8 uses only a quarter of the bits per parameter.
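A back-of-the-envelope calculation makes the 4x figure concrete; assuming roughly 66 million parameters for DistilBERT:

# Rough arithmetic behind the ~4x reduction (parameter count is approximate)
num_params = 66_000_000
fp32_mb = num_params * 4 / (1024 * 1024)  # 4 bytes per FP32 parameter
int8_mb = num_params * 1 / (1024 * 1024)  # 1 byte per INT8 parameter
print(f"FP32: ~{fp32_mb:.0f} MB, INT8: ~{int8_mb:.0f} MB")

In practice the measured reduction is usually somewhat below 4x, because some operators may remain in higher precision and the quantized file also stores scales and zero-points.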
While detailed evaluation is covered in Chapter 6, you can load the quantized model using optimum and ONNX Runtime to perform a quick inference check.
# Requires onnxruntime installed
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import pipeline
# Load the quantized model; point at the quantized ONNX file explicitly
quantized_ort_model = ORTModelForSequenceClassification.from_pretrained(
    quantized_model_path, file_name="model_quantized.onnx"
)
# Create a pipeline for easy inference
classifier = pipeline("sentiment-analysis", model=quantized_ort_model, tokenizer=tokenizer)
# Test inference
text = "This movie was quite good, I enjoyed it."
result = classifier(text)
print("Inference result from quantized model:", result)
text_neg = "The plot was predictable and the acting was mediocre."
result_neg = classifier(text_neg)
print("Inference result from quantized model:", result_neg)
This confirms that the quantized model can be loaded and runs end to end. Keep in mind that, because the base checkpoint's classification head is untrained, the predicted labels here are arbitrary; with a fine-tuned checkpoint you would inspect the actual predictions. More generally, static PTQ can sometimes lead to a drop in accuracy, which needs careful evaluation for your specific use case.
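As a quick, informal check of the numerical drift introduced by quantization (a proper evaluation is the subject of Chapter 6), you can compare logits from the FP32 ONNX model exported earlier (onnx_model) with the INT8 model on a few sentences; since both come from the same exported graph, any gap is due to quantization alone. This is a minimal sketch:

import torch

# Compare FP32 ONNX logits with INT8 logits on a couple of sentences
sample_texts = [
    "This movie was quite good, I enjoyed it.",
    "The plot was predictable and the acting was mediocre.",
]
inputs = tokenizer(sample_texts, padding=True, truncation=True, return_tensors="pt")

fp32_logits = torch.as_tensor(onnx_model(**inputs).logits)
int8_logits = torch.as_tensor(quantized_ort_model(**inputs).logits)

# Large gaps here would suggest the calibration data or configuration needs revisiting
max_diff = (fp32_logits - int8_logits).abs().max().item()
print(f"Max |FP32 - INT8| logit difference: {max_diff:.4f}")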
In this hands-on section, we applied static Post-Training Quantization to a transformers model using the optimum library. We loaded a pre-trained model, prepared a calibration dataset, configured the quantization process for static INT8 conversion, executed the calibration and quantization using ONNX Runtime as the backend, and observed the resulting model size reduction. This practical workflow demonstrates how PTQ can be applied efficiently without needing access to the original training pipeline or requiring expensive retraining, making models smaller and potentially faster for inference. The next chapters will explore more advanced PTQ techniques that aim to mitigate potential accuracy loss and delve into Quantization-Aware Training (QAT).