In previous sections, we discussed strategies for optimizing large language models for deployment, focusing on techniques like quantization to reduce model size and potentially accelerate inference. Now, let's put that theory into practice. This hands-on exercise guides you through applying post-training quantization to a smaller transformer model and deploying it using a simple web server. While we use a manageable model size for this practice, the workflow illustrates the fundamental steps involved in deploying optimized LLMs.
Prerequisites:
Before starting, ensure you have Python installed along with the necessary libraries. You can install them using pip:
pip install torch transformers optimum[onnxruntime] fastapi uvicorn[standard] psutil
We'll use the Hugging Face transformers library to load a pre-trained model, optimum to handle the ONNX conversion and quantization process, torch as the backend, fastapi to create a simple web service, uvicorn to run the server, and psutil to check memory usage (as a proxy for model size).
First, let's load a standard pre-trained transformer model. We'll use distilbert-base-uncased-finetuned-sst-2-english, a smaller, faster DistilBERT model fine-tuned for sentiment classification, which is suitable for this demonstration. In a real-world LLMOps scenario, you would replace this with your specific large model checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import os
import psutil
# Define model name and task
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
task = "text-classification" # Optimum needs task for ONNX export
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_name)
# Function to get model size in memory (approximate)
def get_model_size_mb(model):
    """Estimate a PyTorch model's in-memory size in megabytes."""
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024**2)  # Convert bytes to megabytes
# Get size of the original FP32 model
model_fp32_size = get_model_size_mb(model_fp32)
print(f"Original FP32 model size: {model_fp32_size:.2f} MB")
# Save the tokenizer for later use with the quantized model
tokenizer.save_pretrained("./model_fp32")
model_fp32.save_pretrained("./model_fp32") # Save FP32 model if needed for direct comparison later
# Optional: Clean up memory if running in constrained environment
# del model_fp32
# torch.cuda.empty_cache() # If using GPU
This code loads the tokenizer and the standard floating-point precision (FP32) model. We also define a helper function to estimate the model's size in memory.
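Since psutil is installed as part of the prerequisites, you can also sanity-check the overall process footprint. The snippet below is an optional sketch, not required for the rest of the exercise: it reports the resident set size (RSS) of the Python process, which includes the interpreter and imported libraries in addition to the FP32 weights.
import os
import psutil

# Resident set size (RSS) of the current Python process, in megabytes.
# This is only a rough proxy for the model footprint: it also counts the
# interpreter, imported libraries, and any other objects in memory.
process = psutil.Process(os.getpid())
print(f"Process RSS after loading the FP32 model: {process.memory_info().rss / (1024**2):.2f} MB")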
Now, we'll use the optimum library, which simplifies the process of optimizing models, including quantization via ONNX Runtime. We'll convert the PyTorch model to ONNX format and then apply dynamic quantization to INT8.
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig
# Define output directories
onnx_fp32_path = "./onnx_fp32"
onnx_int8_path = "./onnx_int8"
os.makedirs(onnx_fp32_path, exist_ok=True)
os.makedirs(onnx_int8_path, exist_ok=True)
# 1. Export the model to ONNX FP32 format
model_fp32_onnx = ORTModelForSequenceClassification.from_pretrained(model_name, export=True, task=task)
model_fp32_onnx.save_pretrained(onnx_fp32_path)
tokenizer.save_pretrained(onnx_fp32_path) # Save tokenizer with ONNX model
print(f"FP32 ONNX model saved to: {onnx_fp32_path}")
# 2. Create a Quantizer from the FP32 ONNX model
quantizer = ORTQuantizer.from_pretrained(onnx_fp32_path, file_name="model.onnx")
# 3. Define the quantization configuration (dynamic INT8)
# AVX2 or VNNI instruction sets are often beneficial here if available
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False) # Dynamic quantization
# 4. Apply quantization
quantizer.quantize(save_dir=onnx_int8_path, quantization_config=qconfig)
print(f"INT8 ONNX model saved to: {onnx_int8_path}")
# Save tokenizer with the quantized model as well
tokenizer.save_pretrained(onnx_int8_path)
# Optional: Clean up intermediate ONNX model
# del model_fp32_onnx
# del quantizer
# torch.cuda.empty_cache() # If using GPU
This process involves four steps:
1. Exporting the PyTorch model to ONNX FP32 format. Loading it through ORTModelForSequenceClassification with export=True lets optimum handle this conversion.
2. Creating an ORTQuantizer object from the exported FP32 ONNX model.
3. Defining the quantization configuration; here, dynamic INT8 quantization targeting the AVX2 instruction set.
4. Calling the quantize method, which applies the configuration and saves the INT8 quantized model to the specified directory.
A primary benefit of quantization is model size reduction. Let's compare the disk footprint of the FP32 ONNX model and the INT8 quantized ONNX model.
import os
def get_dir_size_mb(path='.'):
    """Recursively compute the total size of a directory in megabytes."""
    total_bytes = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total_bytes += entry.stat().st_size
            elif entry.is_dir():
                # Recurse, converting the sub-directory size (MB) back to bytes
                total_bytes += get_dir_size_mb(entry.path) * (1024**2)
    return total_bytes / (1024**2)  # Convert bytes to megabytes
# Calculate sizes
fp32_onnx_size = get_dir_size_mb(onnx_fp32_path)
int8_onnx_size = get_dir_size_mb(onnx_int8_path)
print(f"FP32 ONNX model directory size: {fp32_onnx_size:.2f} MB")
print(f"INT8 ONNX model directory size: {int8_onnx_size:.2f} MB")
print(f"Size reduction: {(1 - int8_onnx_size / fp32_onnx_size) * 100:.2f}%")
# Data for Plotly chart
size_data = {
    "models": ["FP32 ONNX", "INT8 ONNX"],
    "sizes_mb": [round(fp32_onnx_size, 2), round(int8_onnx_size, 2)]
}
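The size_data dictionary above is intended for a bar chart comparing the two model sizes. If you want to render such a chart yourself, here is a minimal sketch using Plotly (plotly is not included in the pip command at the top and would need to be installed separately):
import plotly.graph_objects as go

# Simple bar chart comparing the FP32 and INT8 ONNX model sizes on disk
fig = go.Figure(data=[go.Bar(x=size_data["models"], y=size_data["sizes_mb"])])
fig.update_layout(title="ONNX model size on disk", yaxis_title="Size (MB)")
fig.show()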
Comparison of the approximate disk size for the FP32 and INT8 quantized ONNX models. Note: Exact sizes may vary slightly based on dependencies and file structure.
You should observe a significant reduction in size, often close to 4x when moving from FP32 to INT8, as each parameter now requires only 8 bits instead of 32.
Before deploying, let's quickly load the quantized model using optimum and run a sample inference to verify it works.
from optimum.onnxruntime import ORTModelForSequenceClassification
import time
# Load the quantized model
quantized_model = ORTModelForSequenceClassification.from_pretrained(onnx_int8_path)
quantized_tokenizer = AutoTokenizer.from_pretrained(onnx_int8_path)
# Sample text
text = "This movie was absolutely fantastic!"
inputs = quantized_tokenizer(text, return_tensors="pt") # PyTorch tensors expected by default
# Warm-up run (optional, improves timing accuracy)
_ = quantized_model(**inputs)
# Time the inference
start_time = time.time()
outputs = quantized_model(**inputs)
end_time = time.time()
# Process output
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=1).item()
prediction = quantized_model.config.id2label[predicted_class_id]
print(f"Input Text: '{text}'")
print(f"Predicted Class: {prediction} (ID: {predicted_class_id})")
print(f"Inference time (INT8 ONNX): {end_time - start_time:.4f} seconds")
# Optional: Compare with FP32 ONNX model inference time
# try:
# fp32_model_onnx = ORTModelForSequenceClassification.from_pretrained(onnx_fp32_path)
# fp32_tokenizer = AutoTokenizer.from_pretrained(onnx_fp32_path)
# inputs_fp32 = fp32_tokenizer(text, return_tensors="pt")
# _ = fp32_model_onnx(**inputs_fp32) # Warm-up
# start_time_fp32 = time.time()
# outputs_fp32 = fp32_model_onnx(**inputs_fp32)
# end_time_fp32 = time.time()
# print(f"Inference time (FP32 ONNX): {end_time_fp32 - start_time_fp32:.4f} seconds")
# except Exception as e:
# print(f"Could not run FP32 ONNX comparison: {e}")
# Clean up models from memory
# del quantized_model
# if 'fp32_model_onnx' in locals(): del fp32_model_onnx
# torch.cuda.empty_cache() # if using GPU
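A single timed call is noisy. If you want a more stable latency figure, the following optional sketch (reusing the quantized_model, quantized_tokenizer, and text objects defined above) averages the per-request latency over repeated runs:
import time

def average_latency_ms(model, tokenizer, sample_text, runs=50):
    """Average single-example latency in milliseconds over several runs."""
    enc = tokenizer(sample_text, return_tensors="pt")
    _ = model(**enc)  # warm-up call, excluded from timing
    start = time.time()
    for _ in range(runs):
        _ = model(**enc)
    return (time.time() - start) / runs * 1000

print(f"Average INT8 ONNX latency: {average_latency_ms(quantized_model, quantized_tokenizer, text):.2f} ms")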
While performance gains depend heavily on the hardware (CPU features like AVX2/VNNI, or GPU capabilities), quantization generally leads to faster inference due to reduced memory bandwidth requirements and potentially faster computation on integer units. The accuracy impact of post-training quantization should also be evaluated on a relevant dataset, although for many models and tasks, INT8 quantization maintains acceptable accuracy.
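As a quick illustration of such a check, the sketch below runs the INT8 model over a few hand-written, hand-labeled sentences and reports the fraction it gets right. These examples are made up purely for illustration; a real evaluation would use a proper labeled dataset such as the SST-2 validation split.
# Illustrative spot check only; replace eval_samples with a real labeled dataset.
eval_samples = [
    ("An absolutely wonderful, heartfelt film.", "POSITIVE"),
    ("The plot was dull and the acting was worse.", "NEGATIVE"),
    ("I would happily watch this again.", "POSITIVE"),
    ("A complete waste of two hours.", "NEGATIVE"),
]

correct = 0
for sample_text, label in eval_samples:
    enc = quantized_tokenizer(sample_text, return_tensors="pt")
    pred_id = torch.argmax(quantized_model(**enc).logits, dim=1).item()
    correct += int(quantized_model.config.id2label[pred_id] == label)

print(f"INT8 spot-check accuracy on {len(eval_samples)} examples: {correct / len(eval_samples):.2f}")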
Now, let's create a simple web service using FastAPI to serve our quantized INT8 model. This server will load the model and tokenizer, accept text input via HTTP requests, and return the classification prediction.
Create a file named serve_quantized.py:
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
import torch
import time
import os
# Define the path to the quantized model
MODEL_DIR = "./onnx_int8"
# Check if model directory exists
if not os.path.exists(MODEL_DIR):
    raise RuntimeError(f"Model directory not found: {MODEL_DIR}. Please run the quantization steps first.")
# Load the quantized model and tokenizer when the server starts
try:
    print(f"Loading quantized model from {MODEL_DIR}...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = ORTModelForSequenceClassification.from_pretrained(MODEL_DIR)
    # Perform a dummy inference to warm up the runtime session
    _ = model(**tokenizer("warmup", return_tensors="pt"))
    print("Model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")
    # Re-raise so the server fails fast instead of starting without a model
    raise RuntimeError("Failed to load the quantized model.") from e
# Initialize FastAPI app
app = FastAPI(title="Quantized Model Serving API")
# Define the request body structure
class InferenceRequest(BaseModel):
    text: str
# Define the response body structure
class InferenceResponse(BaseModel):
    prediction: str
    confidence: float
    latency_ms: float
@app.post("/predict", response_model=InferenceResponse)
async def predict(request: InferenceRequest):
"""
Performs inference on the input text using the quantized model.
"""
start_time = time.time()
# Tokenize the input text
inputs = tokenizer(request.text, return_tensors="pt", truncation=True, max_length=512)
# Perform inference
with torch.no_grad(): # Ensure gradients are not calculated
outputs = model(**inputs)
# Process the output
logits = outputs.logits
probabilities = torch.softmax(logits, dim=1)
predicted_class_id = torch.argmax(probabilities, dim=1).item()
confidence = probabilities[0, predicted_class_id].item()
prediction_label = model.config.id2label[predicted_class_id]
end_time = time.time()
latency_ms = (end_time - start_time) * 1000
return InferenceResponse(
prediction=prediction_label,
confidence=round(confidence, 4),
latency_ms=round(latency_ms, 2)
)
@app.get("/")
async def root():
return {"message": "Quantized Model Server is running. Use POST /predict to make predictions."}
# If running this script directly, start the Uvicorn server
if __name__ == "__main__":
    import uvicorn
    # Ensure the host is accessible if running inside Docker or a VM
    uvicorn.run(app, host="0.0.0.0", port=8000)
This script defines:
- Model loading: the quantized ORTModelForSequenceClassification and its tokenizer are loaded from the ./onnx_int8 directory when the server starts, followed by a warm-up inference.
- A prediction endpoint (/predict): accepts POST requests with JSON containing {"text": "your input text"} and returns the predicted label, a confidence score, and the request latency.
Now, run the FastAPI server from your terminal in the same directory where you ran the previous Python code and where the onnx_int8 directory exists:
python serve_quantized.py
Uvicorn will start the server, typically listening on http://127.0.0.1:8000.
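Alternatively, you can start the same app through the uvicorn command-line interface (assuming the file is named serve_quantized.py as above):
uvicorn serve_quantized:app --host 0.0.0.0 --port 8000
If you later add the --workers option to scale out, keep in mind that each worker process loads its own copy of the model.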
You can test the endpoint using curl from another terminal:
curl -X POST "http://127.0.0.1:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text": "This framework makes deployment so much easier!"}'
Or using Python with the requests library:
import requests
import json
url = "http://127.0.0.1:8000/predict"
payload = {"text": "Optimum and ONNX Runtime provide great optimizations."}
headers = {"Content-Type": "application/json"}
response = requests.post(url, data=json.dumps(payload), headers=headers)
if response.status_code == 200:
    print("Request successful!")
    print(response.json())
else:
    print(f"Request failed with status code: {response.status_code}")
    print(response.text)
payload = {"text": "I am not sure about this, it seems quite complicated."}
response = requests.post(url, data=json.dumps(payload), headers=headers)
if response.status_code == 200:
    print("\nRequest successful!")
    print(response.json())
else:
    print(f"\nRequest failed with status code: {response.status_code}")
    print(response.text)
You should receive JSON responses containing the sentiment classification (POSITIVE or NEGATIVE for this model), the confidence score, and the time taken for the prediction.
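For reference, each response has the following shape; the exact values will differ on your machine and inputs:
{"prediction": "POSITIVE", "confidence": 0.9997, "latency_ms": 18.42}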
In this practice, you successfully applied post-training dynamic quantization to a transformer model using Hugging Face optimum and ONNX Runtime. You observed the reduction in model size and deployed the optimized model using a simple FastAPI server.
This exercise demonstrates the core workflow:
1. Load a pre-trained model.
2. Apply an optimization technique (here, ONNX export followed by dynamic INT8 quantization).
3. Verify the optimized model's size, outputs, and latency.
4. Serve the optimized model behind a simple HTTP API.
While we used a relatively small model and dynamic quantization for simplicity, the principles extend to larger models and other optimization techniques discussed in this chapter. For production LLM deployment, you would typically:
- Evaluate the accuracy impact of quantization on a representative validation set before rollout.
- Consider static quantization with calibration data, or other compression techniques, where dynamic quantization is not sufficient.
- Serve the model with production-grade infrastructure (request batching, monitoring, and scaling) rather than a single-process development server.