Understanding where your model spends its time and resources during execution is fundamental to optimization. Before applying techniques like quantization or pruning, you need to identify the performance bottlenecks. Is the CPU holding things back? Is the GPU underutilized? Are specific operations disproportionately slow? The PyTorch Profiler (torch.profiler) is the standard tool for answering these questions.
The profiler allows you to inspect the time and memory costs associated with different parts of your model's execution, encompassing both Python operations on the CPU and CUDA kernel executions on the GPU. It provides detailed insights that guide your optimization efforts, ensuring you focus on the areas yielding the greatest performance improvements for inference.
The torch.profiler API provides a comprehensive view of model execution by tracking several key metrics:

- Operator execution time on the CPU, including Python-side dispatch overhead.
- CUDA kernel execution time on the GPU.
- Memory transfers (e.g., cudaMemcpy), highlighting the time spent moving data between the host (CPU) and the device (GPU).

The most common way to use the profiler is via its context manager interface. You wrap the code segment you want to analyze within a with torch.profiler.profile(...) block.
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
# Load a pre-trained model (ensure it's in eval mode for inference profiling)
model = models.resnet18().cuda().eval()
inputs = torch.randn(16, 3, 224, 224).cuda() # Example input batch on GPU
# Basic profiling context
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):  # Optional label for the block
        model(inputs)
# Print aggregated statistics
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Export results for more detailed analysis
# prof.export_chrome_trace("resnet18_trace.json")
# prof.export_stacks("/tmp/profiler_stacks.txt", "self_cuda_time_total")
Let's break down the parameters used in profile():

- activities: A list specifying which activities to profile. Common choices are ProfilerActivity.CPU and ProfilerActivity.CUDA. Profiling CUDA activity is essential for understanding GPU performance.
- record_shapes: If True, records the input shapes for profiled operators. This is useful for diagnosing shape-related performance issues but adds some overhead.
- profile_memory: If True, enables memory profiling (allocations/deallocations). Significant overhead.
- with_stack: If True, records Python call stacks. Very useful for tracing operators back to source code but has substantial overhead.
- on_trace_ready: A callable (often torch.profiler.tensorboard_trace_handler) to handle exporting results, e.g., directly to TensorBoard.
- schedule: Controls profiling duration for long-running jobs. Uses torch.profiler.schedule(wait, warmup, active, repeat) to define phases: skip the initial wait steps, perform warmup steps (profiler active but results discarded), record active steps, and repeat this cycle repeat times. This is important for excluding initialization overhead and focusing on steady-state performance.

The record_function("label") context manager adds custom labels to the profiler output, making it easier to identify specific logical blocks within your code (like data preprocessing, the model forward pass, or post-processing). Both schedule and record_function are shown together in the sketch below.
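As a minimal sketch of how these two pieces fit together (the step counts and log directory are arbitrary choices for illustration, not recommendations), call prof.step() once per iteration so the profiler can advance through the wait/warmup/active phases:

import torch
import torchvision.models as models
from torch.profiler import ProfilerActivity, profile, record_function, schedule, tensorboard_trace_handler

model = models.resnet18().cuda().eval()
inputs = torch.randn(16, 3, 224, 224).cuda()

# Skip 1 step, warm up for 1 step, record 3 steps, and repeat the cycle twice.
prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=2)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=prof_schedule,
             on_trace_ready=tensorboard_trace_handler("./logs")) as prof:
    with torch.no_grad():
        for _ in range(10):  # enough iterations to cover the full schedule
            with record_function("model_inference"):  # custom label in the trace
                model(inputs)
            prof.step()  # signal the profiler that one step has completed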
The profiler object (prof in the example) provides several methods to analyze the collected data.

key_averages()

This method returns an aggregated summary of operator performance, averaged over the profiling window. Calling .table() on the result provides a formatted string output.
# Example Output Snippet from prof.key_averages().table(...)
------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CUDA % CUDA total # Calls
------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::convolution 0.14% 293.164us 17.21% 35.96ms 81.90% 35.91ms 20
aten::cudnn_convolution 0.00% 0.000us 0.00% 0.000us 81.72% 35.83ms 20
aten::addmm 0.06% 117.880us 1.12% 2.34ms 4.97% 2.18ms 1
aten::mm 0.00% 0.000us 0.00% 0.000us 4.96% 2.18ms 1
aten::add_ 0.13% 263.820us 0.51% 1.07ms 3.48% 1.53ms 21
aten::relu 0.12% 245.750us 0.28% 577.870us 1.91% 836.370us 16
aten::_native_batch_norm_legit_no_... 0.10% 215.530us 1.99% 4.15ms 1.85% 810.318us 20
aten::empty_strided 1.76% 3.68ms 1.94% 4.05ms 0.00% 0.000us 140
aten::max_pool2d_with_indices 0.04% 79.630us 0.38% 788.380us 0.79% 346.077us 1
aten::copy_ 0.06% 120.600us 0.06% 120.600us 0.00% 0.000us 2
------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 208.96ms
Self CUDA time total: 44.08ms
Each row corresponds to an operator (e.g., aten::convolution); aten is the namespace for PyTorch's native C++ operators. The "Self" columns count time spent in the operator itself, while the "total" columns also include the child operators it dispatches to (e.g., aten::convolution calls aten::cudnn_convolution).

You can sort the table (e.g., sort_by="cuda_time_total") and limit the rows (row_limit) to focus on the most expensive operations. Grouping by input shape (group_by_input_shape=True) or stack trace (group_by_stack_n) can provide further insights; a few variants are sketched below. High Self CUDA time for an operator like aten::convolution indicates that the underlying CUDA kernel for convolution is taking significant time, which is often expected but confirms where GPU time is spent. High Self CPU time might point to Python overhead or CPU-bound computations.
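For instance, here are a few ways to slice the same aggregated data. These calls assume the prof object from the earlier snippet, with record_shapes=True or with_stack=True enabled where noted:

# Top operators by total CUDA time, as shown above
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Separate rows per input shape (requires record_shapes=True)
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=10))

# Attribute time to Python call stacks (requires with_stack=True)
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cuda_time_total", row_limit=10))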
export_chrome_trace()

This method exports the detailed timeline data to a JSON file in a format compatible with Chrome's tracing tool (chrome://tracing) or the Perfetto UI (recommended). This visualization is invaluable for understanding the temporal dynamics of your model.

To view the trace, open Google Chrome, navigate to chrome://tracing, and click "Load", or use the Perfetto UI at ui.perfetto.dev.
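In code, the export-and-view workflow is short (the file name here is an arbitrary choice):

# After exiting the profiling context:
prof.export_chrome_trace("resnet18_trace.json")
# Then load resnet18_trace.json in ui.perfetto.dev or chrome://tracing to inspect the timeline.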
The trace view typically shows:

- CPU thread tracks, with Python-level operator calls and the points where CUDA kernels are launched.
- GPU stream tracks (e.g., Stream 7), showing kernel executions and memory transfers scheduled on the GPU.

Figure: Simplified visualization of the CPU launching GPU kernels. The Chrome trace provides a detailed timeline view, showing precise start/end times and potential gaps indicating idle periods.
By examining the trace, you can spot:

- Gaps on the GPU stream where the device sits idle, waiting for the CPU to launch work.
- Individual kernels with unusually long execution times.
- memcpy blocks showing time spent moving data.

Using torch.profiler.tensorboard_trace_handler provides an integrated experience within TensorBoard.
provides an integrated experience within TensorBoard.
import torch
import torchvision.models as models
from torch.profiler import profile, tensorboard_trace_handler
# Model and inputs setup (as before)
model = models.resnet18().cuda().eval()
inputs = torch.randn(16, 3, 224, 224).cuda()
log_dir = "./logs" # Directory for TensorBoard logs
with profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
             profile_memory=True,  # Optionally track memory
             on_trace_ready=tensorboard_trace_handler(log_dir)) as prof:
    model(inputs)
print(f"Profiler results saved to {log_dir}. Run: tensorboard --logdir {log_dir}")
Launching TensorBoard (tensorboard --logdir ./logs) and navigating to the "PyTorch Profiler" tab offers several interactive views:

- An operator view, analogous to key_averages, showing detailed statistics per operator. Allows filtering and searching.
- A memory view (available when profiling with profile_memory=True) that shows memory usage patterns and allocations per operator; a quick text-based alternative is sketched below.

Figure: Example distribution of total GPU time spent across different operators, derived from profiler data. Convolution often dominates in CNNs.
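As a text-based complement to the TensorBoard memory view, you can also sort the aggregated table by the memory columns. This assumes the prof object from the snippet above, which ran with profile_memory=True:

# Operators ranked by the GPU memory they allocate themselves
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

# The same view for host-side (CPU) allocations
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))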
The profiler output directly points towards optimization opportunities:

- CPU bottlenecks: long self-CPU times or significant Python overhead for certain operators in key_averages, together with low GPU utilization seen in the trace view (large gaps). Tuning the input pipeline (for example, DataLoader num_workers and pin_memory) is a common first remedy.
- Data transfer overhead: frequent or long cudaMemcpyDtoH (Device to Host) or cudaMemcpyHtoD (Host to Device) blocks visible in the trace view. Use pin_memory=True in the DataLoader and non_blocking=True for .to(device) calls to overlap transfers with computation; a sketch of this pattern follows the list.
- Memory pressure: high usage reported when profiling with profile_memory=True, or an OutOfMemoryError during execution. Use torch.no_grad() for inference, delete tensors that are no longer needed (del tensor), use checkpointing techniques (trading compute for memory), apply model optimization techniques like quantization or pruning (covered elsewhere in this chapter), or reduce the batch size.
- Slow operators: specific operators (e.g., aten::convolution) show very long execution times in key_averages or the trace view. These are the natural targets for the optimization techniques covered in the rest of this chapter.
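Here is a minimal sketch of that transfer-overlap pattern. The dataset is a placeholder used purely for illustration:

import torch
import torchvision.models as models
from torch.utils.data import DataLoader, TensorDataset

# Placeholder CPU-side dataset; in practice this would be your real data.
dataset = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=16, num_workers=4, pin_memory=True)

device = torch.device("cuda")
model = models.resnet18().to(device).eval()

with torch.no_grad():
    for images, _ in loader:
        # Pinned (page-locked) host memory plus non_blocking=True lets this copy
        # overlap with computation already queued on the GPU.
        images = images.to(device, non_blocking=True)
        outputs = model(images)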
Two additional habits make profiler data easier to act on. Wrap logical sections in with torch.profiler.record_function("my_label"): to add custom annotations to the profile, making it easier to correlate performance data with specific sections of your code (e.g., "data_preprocessing", "attention_block"). And pass the schedule argument to torch.profiler.profile to capture specific iterations after an initial warmup period, avoiding large trace files and focusing on steady-state behavior.

By systematically using the PyTorch Profiler, you gain the necessary visibility into your model's runtime behavior to make informed decisions about where and how to apply optimization techniques effectively, ultimately leading to faster and more efficient models ready for deployment.