This hands-on exercise applies profiling and quantization methods to a standard PyTorch model. The primary objective is to characterize the model's performance with the profiler, then reduce the model's size and potentially accelerate its inference using post-training static quantization (PTQ). This mirrors a common workflow when preparing models for deployment. We assume you have a working PyTorch environment with torchvision installed.

## Setting Up the Environment and Model

First, we import the necessary libraries and load a pre-trained model. We'll use ResNet18 from torchvision as our example: it is complex enough to show meaningful results but small enough to run quickly for this exercise. We also need some dummy input data.

```python
import copy
import os
import time

import torch
import torch.profiler
import torch.quantization
import torchvision.models as models

# Check for CUDA availability, fall back to CPU if not available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load a pre-trained ResNet18 model
original_model = models.resnet18(pretrained=True)
original_model.eval()  # Set the model to evaluation mode

model_fp32 = copy.deepcopy(original_model).to(device)

# Create dummy input data matching the expected input shape for ResNet18
# (batch_size, channels, height, width)
dummy_input = torch.randn(1, 3, 224, 224).to(device)

# Helper: save the model's state_dict to disk and return its size in MB
def get_model_size(model, file_path="temp_model.pt"):
    torch.save(model.state_dict(), file_path)
    size = os.path.getsize(file_path) / (1024 * 1024)  # Size in MB
    os.remove(file_path)
    return size
```

Make sure to set the model to evaluation mode with `model.eval()`. This matters because it disables layers such as Dropout and makes BatchNorm layers use their running statistics instead of batch statistics, which is essential for consistent inference and for quantization.
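As a quick sanity check of the setup, a couple of forward passes confirm the expected output shape and that inference is deterministic in `eval()` mode (outputs should match, up to tiny numerical noise on GPU):

```python
# Quick sanity check: eval() disables Dropout and makes BatchNorm use its
# running statistics, so repeated forward passes on the same input should match.
with torch.no_grad():
    out_a = model_fp32(dummy_input)
    out_b = model_fp32(dummy_input)

print(out_a.shape)                   # expected: torch.Size([1, 1000])
print(torch.allclose(out_a, out_b))  # expected: True
```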
## Profiling the Original Floating-Point Model

Before optimizing, let's establish a baseline. We'll use `torch.profiler.profile` to analyze the inference performance of the original FP32 model. The profiler records the execution time and memory consumption of individual operations on both the CPU and the GPU (if available).

```python
# Profile inference for the FP32 model
print("Profiling FP32 model...")

# Only request CUDA activity when a GPU is actually available
activities = [torch.profiler.ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(torch.profiler.ProfilerActivity.CUDA)

with torch.profiler.profile(
    activities=activities,
    record_shapes=True,   # Optional: records input shapes
    profile_memory=True,  # Optional: profiles memory usage
    with_stack=True       # Optional: adds source code context
) as prof:
    with torch.profiler.record_function("model_inference"):  # Label the code block
        with torch.no_grad():
            for _ in range(10):  # Run multiple inferences for stable measurements
                model_fp32(dummy_input)

# Print the profiling results sorted by self-CPU time
print("FP32 Model Profiling Results (sorted by self CPU time):")
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

# Print the profiling results sorted by self-CUDA time (if applicable)
if device.type == "cuda":
    print("\nFP32 Model Profiling Results (sorted by self CUDA time):")
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))

# Get baseline inference time (average over several runs)
start_time = time.time()
with torch.no_grad():
    for _ in range(50):
        model_fp32(dummy_input)
if device.type == "cuda":
    torch.cuda.synchronize()  # Wait for queued GPU work before stopping the timer
end_time = time.time()
fp32_inference_time = (end_time - start_time) / 50
print(f"\nFP32 Average Inference Time: {fp32_inference_time:.6f} seconds")

# Get baseline model size
fp32_model_size = get_model_size(model_fp32)
print(f"FP32 Model Size: {fp32_model_size:.2f} MB")
```

Examine the output tables from the profiler. Look for the operations under the Name column that consume the most time (`self_cpu_time_total` or `self_cuda_time_total`). For convolutional networks like ResNet, you will typically see `aten::conv2d`, `aten::batch_norm`, `aten::relu`, and `aten::addmm` (for linear layers) dominating the execution time. This analysis confirms where optimization efforts, such as quantization, are likely to yield the most significant benefits.
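If you want more detail than the summary tables provide, the same profiling run can be exported as a trace for timeline inspection. This optional step uses the profiler object from the block above; the output file name is arbitrary:

```python
# Optional: export the recorded trace and open it in chrome://tracing
# (or the Perfetto UI) for an operator-by-operator timeline view.
prof.export_chrome_trace("fp32_resnet18_trace.json")
```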
## Applying Post-Training Static Quantization (PTQ)

Now, let's apply PTQ to convert the FP32 model to a quantized INT8 version. Static quantization requires a calibration step on representative data to determine the quantization parameters (scale and zero-point) for activations.

Note: For PTQ on CPU, we typically use the 'fbgemm' backend; for ARM CPUs (common on mobile devices), 'qnnpack' is often preferred. On CUDA, quantization support is more limited and usually relies on specific hardware capabilities (such as Tensor Cores) and dedicated backends or libraries like TensorRT. For simplicity, this example performs CPU quantization with 'fbgemm'.

The stock torchvision ResNet18 is not quantization-ready in eager mode: it has no QuantStub/DeQuantStub at its boundaries and uses plain floating-point additions for its residual connections, both of which break after conversion. We therefore build the quantized model from `torchvision.models.quantization.resnet18`, a quantization-ready variant of the same architecture that also provides a `fuse_model()` helper for the Conv-BN(-ReLU) fusions.

```python
# --- Post-Training Static Quantization ---
print("\nStarting Post-Training Static Quantization...")

# Use the quantization-ready ResNet18 (same architecture and ImageNet weights,
# plus QuantStub/DeQuantStub and quantization-friendly residual additions).
import torchvision.models.quantization as qmodels

quantized_model = qmodels.resnet18(pretrained=True, quantize=False)
quantized_model.eval()
quantized_model.cpu()  # Quantization is typically performed on CPU

# 1. Fuse modules: combine Conv-BN(-ReLU) sequences for better quantization
#    accuracy and performance. The quantizable ResNet knows its own fusion
#    patterns, so we use its built-in helper instead of a hand-written list.
quantized_model.fuse_model()
print("Module fusion complete.")

# 2. Specify the quantization configuration.
#    Use 'fbgemm' for x86 CPUs; use 'qnnpack' for ARM CPUs.
torch.backends.quantized.engine = "fbgemm"
quantized_model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
print(f"Quantization configuration set to: {quantized_model.qconfig}")

# 3. Prepare the model for calibration.
#    This inserts observers that collect activation statistics.
torch.quantization.prepare(quantized_model, inplace=True)
print("Model prepared for calibration (observers inserted).")

# 4. Calibrate the model.
#    Run inference on a small representative dataset (calibration data).
#    Random data is used here for demonstration; in practice, use a subset
#    of your validation set.
print("Running calibration...")
calibration_data = [torch.randn(1, 3, 224, 224) for _ in range(100)]  # ~100 samples
with torch.no_grad():
    for input_data in calibration_data:
        quantized_model(input_data)
print("Calibration complete.")
```
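Before converting, it can be instructive to peek at what the observers have collected. The short check below is a sketch that relies on `activation_post_process`, the attribute name PyTorch's eager-mode quantization uses for inserted observers; adjust it if your PyTorch version organizes things differently:

```python
# Illustrative: show the first inserted observer and the scale/zero-point it
# would derive from the calibration statistics it has gathered so far.
for name, module in quantized_model.named_modules():
    if hasattr(module, "activation_post_process"):
        observer = module.activation_post_process
        scale, zero_point = observer.calculate_qparams()
        print(f"{name}: {type(observer).__name__} -> scale={scale}, zero_point={zero_point}")
        break  # one example is enough for a quick check
```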
```python
# 5. Convert the model to its quantized version.
#    Replaces prepared modules with quantized counterparts using the collected stats.
quantized_model = torch.quantization.convert(quantized_model, inplace=True)
print("Model converted to quantized version (INT8).")

# Ensure the quantized model stays in evaluation mode
quantized_model.eval()
```

Let's visualize the simplified PTQ workflow:

```graphviz
digraph PTQ_Flow {
    rankdir=LR;
    node [shape=box, style=rounded, fontname="sans-serif", color="#adb5bd", fontcolor="#495057"];
    edge [fontname="sans-serif", color="#868e96"];

    FP32 [label="FP32 Model\n(Pre-trained)", fillcolor="#a5d8ff", style="rounded,filled"];
    Fused [label="Fused Model\n(Conv-BN-ReLU)", fillcolor="#a5d8ff", style="rounded,filled"];
    Prepared [label="Prepared Model\n(Observers Added)", fillcolor="#ffec99", style="rounded,filled"];
    Calibrated [label="Calibrated Model\n(Stats Collected)", fillcolor="#ffec99", style="rounded,filled"];
    INT8 [label="INT8 Model\n(Quantized)", fillcolor="#b2f2bb", style="rounded,filled"];
    Data [label="Calibration Data", shape=cylinder, style=filled, fillcolor="#e9ecef"];

    FP32 -> Fused [label=" Fuse Modules "];
    Fused -> Prepared [label=" Prepare "];
    Data -> Calibrated [label=" Run Inference "];
    Prepared -> Calibrated [label=" Calibrate "];
    Calibrated -> INT8 [label=" Convert "];
}
```

The process involves fusing compatible layers, preparing the model by inserting observers, calibrating with sample data, and finally converting to the quantized format.

## Evaluating the Quantized Model

Now, let's evaluate the performance of our INT8 quantized model on the CPU and compare it to the original FP32 model. We'll measure inference time and model size.
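Before timing anything, it is worth a quick look at whether conversion actually replaced the floating-point modules. The module names below come from the torchvision ResNet definition; the printed reprs should show quantized types (for example a fused `QuantizedConvReLU2d` and a `QuantizedLinear`) along with their scale and zero-point:

```python
# Illustrative: inspect two representative layers of the converted model.
print(quantized_model.conv1)  # stem convolution (fused with BN and ReLU)
print(quantized_model.fc)     # final classifier layer
```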
With the conversion confirmed, profile and time the INT8 model on the CPU:

```python
# Profile the INT8 quantized model on CPU
print("\nProfiling INT8 Quantized model (CPU)...")

# Ensure the dummy input is on CPU for the quantized model
dummy_input_cpu = dummy_input.cpu()

# The quantized engine ('fbgemm') was already selected before conversion.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],  # the quantized model runs on CPU here
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof_quant:
    with torch.profiler.record_function("quantized_model_inference"):
        with torch.no_grad():
            for _ in range(10):
                quantized_model(dummy_input_cpu)

print("INT8 Quantized Model Profiling Results (sorted by self CPU time):")
print(prof_quant.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))

# Measure INT8 inference time
start_time = time.time()
with torch.no_grad():
    for _ in range(50):
        quantized_model(dummy_input_cpu)
end_time = time.time()
int8_inference_time = (end_time - start_time) / 50
print(f"\nINT8 Average Inference Time (CPU): {int8_inference_time:.6f} seconds")

# Measure INT8 model size
int8_model_size = get_model_size(quantized_model)
print(f"INT8 Model Size: {int8_model_size:.2f} MB")

# --- Comparison ---
print("\n--- Performance Comparison ---")
# A direct speedup number is only meaningful if the FP32 baseline also ran on CPU.
speedup_factor = fp32_inference_time / int8_inference_time if device.type == "cpu" else float("nan")
size_reduction = fp32_model_size / int8_model_size

print(f"Device used for FP32 inference: {device}")
print(f"FP32 Average Inference Time: {fp32_inference_time:.6f} seconds")
print(f"INT8 Average Inference Time (CPU): {int8_inference_time:.6f} seconds")
if device.type == "cpu":
    print(f"CPU Inference Speedup: {speedup_factor:.2f}x")
else:
    print("CPU Inference Speedup: N/A (FP32 ran on GPU)")

print(f"\nFP32 Model Size: {fp32_model_size:.2f} MB")
print(f"INT8 Model Size: {int8_model_size:.2f} MB")
print(f"Model Size Reduction: {size_reduction:.2f}x")

# Optional: visualize the comparison (two y-axes: seconds on the left, MB on the right)
import json

chart_data = {
    "layout": {
        "title": "Model Performance Comparison",
        "barmode": "group",
        "xaxis": {"title": "Model"},
        "yaxis": {"title": "Inference Time (s)"},
        "yaxis2": {"title": "Model Size (MB)", "overlaying": "y", "side": "right", "showgrid": False},
        "font": {"family": "sans-serif"}
    },
    "data": [
        {
            "type": "bar",
            "name": "Inference Time (seconds)",
            "x": ["FP32", "INT8 (CPU)"],
            "y": [fp32_inference_time, int8_inference_time],
            "yaxis": "y",
            "marker": {"color": "#4dabf7"}  # blue
        },
        {
            "type": "bar",
            "name": "Model Size (MB)",
            "x": ["FP32", "INT8 (CPU)"],
            "y": [fp32_model_size, int8_model_size],
            "yaxis": "y2",
            "marker": {"color": "#38d9a9"}  # teal
        }
    ]
}

print("\nPerformance Chart Data:")
print(f"```plotly\n{json.dumps(chart_data)}\n```")
```

*Comparison of average inference time and model size between the original FP32 model and the INT8 quantized model. A direct speedup comparison is only meaningful if the FP32 model was also run on the CPU.*

## Discussion

This practical exercise demonstrated a standard workflow for optimizing a PyTorch model with profiling and post-training static quantization.
- **Profiling:** We used `torch.profiler` to identify the performance characteristics of the original FP32 model. This step is helpful for understanding where computation time is spent and for confirming that the layers targeted by quantization (such as convolutions) are indeed significant contributors.
- **Quantization:** We applied PTQ, which involved fusing modules, preparing the model with observers, calibrating with sample data, and converting the model to INT8.
- **Evaluation:** Comparing the INT8 model to the FP32 baseline typically shows:
  - **Reduced model size:** INT8 weights and activations require significantly less storage, usually around a 4x reduction.
  - **Faster CPU inference:** INT8 operations can be executed more efficiently on CPUs that support specialized instructions, leading to noticeable speedups. GPU speedups depend heavily on hardware support and the specific operations involved.
  - **Potential accuracy trade-off:** While PTQ aims to minimize accuracy loss, some degradation can occur. Evaluate the quantized model on your specific task and validation dataset to ensure it still meets requirements; if accuracy drops significantly, techniques like Quantization-Aware Training (QAT), discussed earlier, may be necessary. A minimal accuracy-check sketch follows below.

This hands-on example provides a foundation for applying these optimization techniques. Remember that the specific steps (such as the fusion patterns) and the results can vary with the model architecture, the chosen quantization backend, and the hardware used for inference. Experimenting with different configurations and evaluating accuracy are important next steps in a deployment scenario.
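As a starting point for that accuracy check, here is a minimal sketch of a top-1 evaluation loop. It assumes a hypothetical validation `DataLoader` named `val_loader` that yields `(images, labels)` batches (for example, built from `torchvision.datasets.ImageFolder` over an ImageNet-style validation split); the actual calls are left commented out because the loader is not defined in this exercise.

```python
from torch.utils.data import DataLoader

def top1_accuracy(model, loader: DataLoader, max_batches: int = 50) -> float:
    """Compute top-1 accuracy over (up to) max_batches batches on CPU."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_idx, (images, labels) in enumerate(loader):
            if batch_idx >= max_batches:
                break
            outputs = model(images.cpu())
            predictions = outputs.argmax(dim=1)
            correct += (predictions == labels.cpu()).sum().item()
            total += labels.size(0)
    return correct / max(total, 1)

# 'val_loader' is a hypothetical validation DataLoader you would build for your dataset.
# fp32_acc = top1_accuracy(model_fp32.cpu(), val_loader)
# int8_acc = top1_accuracy(quantized_model, val_loader)
# print(f"FP32 top-1: {fp32_acc:.4f} | INT8 top-1: {int8_acc:.4f}")
```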