Alright, let's put theory into practice. Having explored techniques like profiling and quantization conceptually, this section provides a hands-on exercise to apply these methods to a standard PyTorch model. Our goal is to identify performance characteristics using the profiler and then reduce the model's size and potentially accelerate its inference speed using post-training static quantization (PTQ). This exercise mirrors a common workflow when preparing models for deployment.
We assume you have a working PyTorch environment with torchvision installed.
First, we need to import the necessary libraries and load a pre-trained model. We'll use ResNet18 from torchvision as our example model. It's complex enough to show meaningful results but small enough to run quickly for this exercise. We also need some dummy input data.
import torch
import torchvision.models as models
import torch.quantization
import torch.profiler
import copy
import time
import os
import numpy as np
# Check for CUDA availability, fall back to CPU if not available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Load a pre-trained ResNet18 model
original_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # weights API replaces the deprecated pretrained=True
original_model.eval() # Set the model to evaluation mode
model_fp32 = copy.deepcopy(original_model).to(device)
# Create dummy input data matching the expected input shape for ResNet18
# (batch_size, channels, height, width)
dummy_input = torch.randn(1, 3, 224, 224).to(device)
# Function to save model and return size
def get_model_size(model, file_path="temp_model.pt"):
    torch.save(model.state_dict(), file_path)
    size = os.path.getsize(file_path) / (1024 * 1024)  # Size in MB
    os.remove(file_path)
    return size
Make sure to set the model to evaluation mode using model.eval(). This is important because it disables training-specific behavior such as Dropout and makes BatchNorm layers use their stored running statistics instead of batch statistics, which is essential for consistent inference and quantization.
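As a quick, optional check, sketched here using the model_fp32 and dummy_input defined above, you can confirm that evaluation mode produces deterministic outputs:

# Quick check: in eval mode, BatchNorm uses its stored running statistics, so
# repeated forward passes on the same input give identical outputs.
with torch.no_grad():
    out_a = model_fp32(dummy_input)
    out_b = model_fp32(dummy_input)
print("model_fp32.training:", model_fp32.training)             # False after .eval()
print("Deterministic outputs:", torch.allclose(out_a, out_b))  # Expected: True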
Before optimizing, let's establish a baseline. We'll use torch.profiler.profile to analyze the inference performance of the original FP32 model. The profiler records the execution time and memory consumption of different operations on both the CPU and GPU (if available).
# Profile inference for the FP32 model
print("Profiling FP32 model...")

# Request CUDA activity only when a GPU is actually in use
activities = [torch.profiler.ProfilerActivity.CPU]
if device.type == 'cuda':
    activities.append(torch.profiler.ProfilerActivity.CUDA)

with torch.profiler.profile(
    activities=activities,
    record_shapes=True,   # Optional: records input shapes
    profile_memory=True,  # Optional: profiles memory usage
    with_stack=True       # Optional: adds source code context
) as prof:
    with torch.profiler.record_function("model_inference"):  # Label the code block
        with torch.no_grad():
            for _ in range(10):  # Run multiple inferences for stable measurements
                model_fp32(dummy_input)

# Print the profiling results sorted by self-CPU time
print("FP32 Model Profiling Results (sorted by self CPU time):")
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

# Print the profiling results sorted by self-CUDA time (if applicable)
if device.type == 'cuda':
    print("\nFP32 Model Profiling Results (sorted by self CUDA time):")
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
# Get baseline inference time (average over several runs)
with torch.no_grad():
    model_fp32(dummy_input)  # Warm-up run
    if device.type == 'cuda':
        torch.cuda.synchronize()  # Make sure pending GPU work is finished before timing
    start_time = time.time()
    for _ in range(50):
        model_fp32(dummy_input)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # Wait for GPU work to finish before stopping the clock
    end_time = time.time()
fp32_inference_time = (end_time - start_time) / 50
print(f"\nFP32 Average Inference Time: {fp32_inference_time:.6f} seconds")
# Get baseline model size
fp32_model_size = get_model_size(model_fp32)
print(f"FP32 Model Size: {fp32_model_size:.2f} MB")
Examine the output tables from the profiler. Look for operations under the Name column that consume the most time (self_cpu_time_total or self_cuda_time_total). For convolutional networks like ResNet, you'll typically see aten::conv2d, aten::batch_norm, aten::relu, and aten::addmm (for linear layers) dominating the execution time. This analysis confirms where optimization efforts, such as quantization, might yield the most significant benefits.
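Two optional follow-ups, sketched below, can make the profiler output easier to act on: grouping the averages by the recorded input shapes, and exporting a timeline trace (the fp32_trace.json file name is arbitrary) that you can open in chrome://tracing or Perfetto.

# Group operator averages by input shape (works because record_shapes=True above)
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="self_cpu_time_total", row_limit=10))

# Export a timeline trace for visual inspection in chrome://tracing or Perfetto
prof.export_chrome_trace("fp32_trace.json")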
Now, let's apply PTQ to convert the FP32 model to a quantized INT8 version. Static quantization requires a calibration step using representative data to determine the quantization parameters (scale and zero-point) for activations.
Note: For PTQ on x86 CPUs, we typically use the 'fbgemm' backend; for ARM CPUs (common on mobile devices), 'qnnpack' is often preferred. If you are targeting CUDA, support for this eager-mode quantization flow is limited, and INT8 inference on GPUs usually relies on specific hardware capabilities (like Tensor Cores) and dedicated toolchains such as TensorRT. For simplicity, this example performs quantization on the CPU using 'fbgemm'. One practical detail: the plain torchvision ResNet18 cannot be statically quantized as-is in eager mode, because it has no QuantStub/DeQuantStub entry and exit points and uses a raw tensor addition for its residual connections. torchvision therefore ships a quantization-ready variant, torchvision.models.quantization.resnet18, with the same architecture and weights, which we use below.
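If you are unsure which quantized engines your PyTorch build includes, a quick check like the following (a side note, not part of the main workflow) lists them before you commit to a backend:

# List the quantized engines compiled into this PyTorch build
# (commonly 'fbgemm' and/or 'x86' on desktop CPUs, 'qnnpack' on ARM).
print(torch.backends.quantized.supported_engines)

# Select the engine that quantized kernels will use at inference time
torch.backends.quantized.engine = 'fbgemm'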
# --- Post-Training Static Quantization ---
print("\nStarting Post-Training Static Quantization...")

# Load the quantization-ready ResNet18 (same architecture and ImageNet weights,
# plus QuantStub/DeQuantStub and FloatFunctional ops for the residual additions).
# Quantization is performed on the CPU, so keep this copy there.
quantized_model = models.quantization.resnet18(
    weights=models.ResNet18_Weights.DEFAULT, quantize=False
)
quantized_model.eval()
quantized_model.cpu()
# 1. Fuse Modules: combine Conv-BN(-ReLU) sequences for better quantization accuracy
#    and performance. ResNet18 has a fixed, known structure, so we can list the
#    fusions explicitly: the stem (conv1-bn1-relu) and, inside every BasicBlock,
#    conv1-bn1-relu, conv2-bn2, and the downsample conv-bn where present.
modules_to_fuse = [['conv1', 'bn1', 'relu']]
for layer_name in ['layer1', 'layer2', 'layer3', 'layer4']:
    layer = getattr(quantized_model, layer_name)
    for block_idx, block in enumerate(layer):
        prefix = f"{layer_name}.{block_idx}"
        modules_to_fuse.append([f"{prefix}.conv1", f"{prefix}.bn1", f"{prefix}.relu"])
        modules_to_fuse.append([f"{prefix}.conv2", f"{prefix}.bn2"])
        if block.downsample is not None:
            modules_to_fuse.append([f"{prefix}.downsample.0", f"{prefix}.downsample.1"])
print(f"Modules to fuse: {len(modules_to_fuse)}")

# Apply fusion
torch.quantization.fuse_modules(quantized_model, modules_to_fuse, inplace=True)
print("Module fusion complete.")
# 2. Specify Quantization Configuration
# Use 'fbgemm' for x86 CPUs and 'qnnpack' for ARM CPUs; the runtime engine should match the qconfig.
torch.backends.quantized.engine = 'fbgemm'
quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
print(f"Quantization configuration set to: {quantized_model.qconfig}")

# 3. Prepare the Model for Calibration
# Inserts observers to collect activation statistics
torch.quantization.prepare(quantized_model, inplace=True)
print("Model prepared for calibration (observers inserted).")
# 4. Calibrate the Model
# Run inference on a small representative dataset (calibration data).
# Here we use random data for demonstration; in practice, use a subset of your validation set.
print("Running calibration...")
calibration_data = [torch.randn(1, 3, 224, 224) for _ in range(100)]  # Use ~100 samples
with torch.no_grad():
    for input_data in calibration_data:
        quantized_model(input_data)
print("Calibration complete.")
# 5. Convert the Model to Quantized Version
# Replaces modules with quantized counterparts and uses collected stats
quantized_model = torch.quantization.convert(quantized_model, inplace=True)
print("Model converted to quantized version (INT8).")
# Ensure the quantized model is in evaluation mode
quantized_model.eval()
To summarize the PTQ workflow: fuse compatible layers, prepare the model by inserting observers, calibrate with representative data, and finally convert to the quantized format.
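Before benchmarking, a quick sanity check, sketched below as an optional addition, compares the FP32 and INT8 outputs on the same input. Expect small numerical differences from quantization noise, but the top-1 prediction should usually match.

# Sanity check: compare FP32 and INT8 outputs on the same CPU input
# (original_model never left the CPU, so it can be called directly)
with torch.no_grad():
    fp32_out = original_model(dummy_input.cpu())
    int8_out = quantized_model(dummy_input.cpu())

max_abs_diff = (fp32_out - int8_out).abs().max().item()
same_top1 = (fp32_out.argmax(dim=1) == int8_out.argmax(dim=1)).all().item()
print(f"Max absolute logit difference: {max_abs_diff:.4f}")
print(f"Top-1 prediction matches: {same_top1}")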
Now, let's evaluate the performance of our INT8 quantized model on the CPU and compare it to the original FP32 model. We'll measure inference time and model size.
# Profile the INT8 quantized model on CPU
print("\nProfiling INT8 Quantized model (CPU)...")

# Ensure the dummy input is on CPU for the quantized model
dummy_input_cpu = dummy_input.cpu()

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],  # The quantized model runs on CPU here
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof_quant:
    with torch.profiler.record_function("quantized_model_inference"):
        with torch.no_grad():
            for _ in range(10):
                quantized_model(dummy_input_cpu)

print("INT8 Quantized Model Profiling Results (sorted by self CPU time):")
print(prof_quant.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
# Measure INT8 inference time
with torch.no_grad():
    quantized_model(dummy_input_cpu)  # Warm-up run
    start_time = time.time()
    for _ in range(50):
        quantized_model(dummy_input_cpu)
    end_time = time.time()
int8_inference_time = (end_time - start_time) / 50
print(f"\nINT8 Average Inference Time (CPU): {int8_inference_time:.6f} seconds")

# Measure INT8 model size
int8_model_size = get_model_size(quantized_model)
print(f"INT8 Model Size: {int8_model_size:.2f} MB")
# --- Comparison ---
print("\n--- Performance Comparison ---")
# Only compare CPU-to-CPU times directly; a GPU FP32 time vs. a CPU INT8 time is not a fair speedup
speedup_factor = fp32_inference_time / int8_inference_time if device.type == 'cpu' else float('nan')
size_reduction = fp32_model_size / int8_model_size

print(f"Device used for FP32 inference: {device}")
print(f"FP32 Average Inference Time: {fp32_inference_time:.6f} seconds")
print(f"INT8 Average Inference Time (CPU): {int8_inference_time:.6f} seconds")
if device.type == 'cpu':
    print(f"CPU Inference Speedup: {speedup_factor:.2f}x")
else:
    print("CPU Inference Speedup: N/A (FP32 ran on GPU)")

print(f"\nFP32 Model Size: {fp32_model_size:.2f} MB")
print(f"INT8 Model Size: {int8_model_size:.2f} MB")
print(f"Model Size Reduction: {size_reduction:.2f}x")
# Optional: Visualize the comparison
import json

chart_data = {
    "layout": {
        "title": "Model Performance Comparison",
        "barmode": "group",
        "xaxis": {"title": "Metric"},
        "yaxis": {"title": "Inference Time (s)"},
        "yaxis2": {
            "title": "Model Size (MB)",
            "overlaying": "y",
            "side": "right",
            "showgrid": False,
        },
        "font": {"family": "sans-serif"}
    },
    "data": [
        {
            "type": "bar",
            "name": "Inference Time (seconds)",
            "x": ["FP32", "INT8 (CPU)"],
            "y": [fp32_inference_time, int8_inference_time],
            "yaxis": "y1",
            "marker": {"color": "#4dabf7"}  # blue
        },
        {
            "type": "bar",
            "name": "Model Size (MB)",
            "x": ["FP32", "INT8 (CPU)"],
            "y": [fp32_model_size, int8_model_size],
            "yaxis": "y2",
            "marker": {"color": "#38d9a9"}  # teal
        }
    ]
}

print("\nPerformance Chart Data:")
print(f"```plotly\n{json.dumps(chart_data)}\n```")
Comparison of average inference time and model size between the original FP32 model and the INT8 quantized model. Note that direct speedup comparison is only meaningful if the FP32 model was also run on the CPU.
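If your FP32 baseline ran on the GPU, a sketch like the following (an optional extra, reusing original_model and dummy_input_cpu from above) times the FP32 model on the CPU so the speedup comparison is apples-to-apples:

# Optional: time the FP32 model on the CPU for a fair comparison with the INT8 model
model_fp32_cpu = copy.deepcopy(original_model).cpu().eval()
with torch.no_grad():
    model_fp32_cpu(dummy_input_cpu)  # Warm-up run
    start_time = time.time()
    for _ in range(50):
        model_fp32_cpu(dummy_input_cpu)
    end_time = time.time()
fp32_cpu_time = (end_time - start_time) / 50
print(f"FP32 Average Inference Time (CPU): {fp32_cpu_time:.6f} seconds")
print(f"CPU Inference Speedup (FP32 -> INT8): {fp32_cpu_time / int8_inference_time:.2f}x")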
This practical exercise demonstrated a standard workflow for optimizing a PyTorch model using profiling and post-training static quantization.
We used torch.profiler to identify the performance characteristics of the original FP32 model. This step is helpful for understanding where computation time is spent and for confirming that the layers targeted by quantization (like convolutions) are indeed significant contributors. We then fused modules, calibrated on representative data, converted the model to INT8, and compared its CPU inference time and on-disk size against the FP32 baseline.

This hands-on example provides a foundation for applying these optimization techniques. Remember that the specific steps (like the fusion list) and results can vary depending on the model architecture, the chosen quantization backend, and the hardware used for inference. Experimenting with different configurations and evaluating accuracy are important next steps in a real-world deployment scenario.
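As one concrete next step toward deployment, you will usually serialize the quantized model. One option, sketched here with an arbitrary file name, is to script it with TorchScript and save the resulting artifact:

# Illustrative: serialize the INT8 model with TorchScript.
# The quantization-ready torchvision ResNet18 is designed to be scriptable after conversion.
scripted_int8 = torch.jit.script(quantized_model)
scripted_int8.save("resnet18_int8_scripted.pt")

# The saved artifact can later be loaded without the original Python class definitions
loaded_int8 = torch.jit.load("resnet18_int8_scripted.pt")
with torch.no_grad():
    _ = loaded_int8(torch.randn(1, 3, 224, 224))

The scripted file can then be served from Python or C++ runtimes that ship with TorchScript support, without carrying the training code along.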