Before performing detailed, cycle-accurate analysis of individual CPU functions or GPU kernels, it's essential to understand the system's overall behavior during the execution of your compiled ML workload. System-level profiling provides this macroscopic view, helping you identify the primary resource limitations: Are you bound by CPU processing, GPU computation, memory bandwidth, data transfer speeds between host and device, or a combination of these? Answering this question directs your subsequent optimization efforts more effectively.
Compiled ML models often involve intricate interactions between the host CPU (running the runtime, scheduling operations, potentially executing some operators), the accelerator (GPU, TPU, etc., executing the bulk of the computation), system memory, device memory, and the interconnect (like PCIe or NVLink) that links them. Optimizing a single kernel might yield little overall benefit if the true bottleneck lies in data preparation on the CPU or transfer times over the interconnect. System-level profiling tools are designed to capture the concurrent activity across these components, presenting a unified timeline of execution.
At the system level, you should focus on observing: how busy the CPU is with runtime overhead, operator scheduling, and any host-side preprocessing; how consistently the accelerator is kept occupied and where idle gaps appear; how much data moves between host and device memory and the bandwidth achieved over the interconnect; and where one component stalls waiting on another due to synchronization.
Several tools provide system-wide performance visibility. The choice often depends on the specific hardware and software environment.
Linux perf: A versatile and powerful tool integrated into the Linux kernel. perf can sample CPU performance counters, trace kernel events, profile specific processes, and provide insights into CPU cache behavior, branch prediction, and system calls. While primarily CPU-focused, its ability to trace kernel events (like scheduling and I/O) provides context for system-wide behavior. It is particularly useful for diagnosing CPU-bound scenarios or high kernel/driver overhead.
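As a minimal sketch of using perf against a Python-based inference loop, the snippet below attaches perf stat to the running process for a fixed window while the workload executes. The function name run_inference is a hypothetical stand-in for one step of your compiled model, and the example assumes perf is installed and the system's perf_event settings permit counter access.

```python
import os
import signal
import subprocess
import time

def profile_with_perf_stat(run_inference, duration_s=10):
    """Attach `perf stat` to this process while run_inference() executes repeatedly."""
    # Start perf stat against our own PID; it accumulates hardware counters
    # (cycles, instructions, cache misses, context switches) until interrupted.
    perf = subprocess.Popen([
        "perf", "stat",
        "-p", str(os.getpid()),
        "-e", "cycles,instructions,cache-misses,context-switches",
    ])
    deadline = time.time() + duration_s
    while time.time() < deadline:
        run_inference()  # hypothetical: one inference step of the compiled model
    # perf stat prints its counter summary when it receives SIGINT.
    perf.send_signal(signal.SIGINT)
    perf.wait()
```

A low ratio of instructions per cycle or a high cache-miss count in the printed summary points toward CPU-side inefficiency worth investigating with the deeper CPU profiling covered later.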
NVIDIA Nsight Systems (nsys): This is the standard tool for system-level performance analysis on NVIDIA GPU platforms. Its primary strength lies in generating a detailed timeline correlating CPU thread activity, CUDA API calls, CUDA kernel executions, memory transfers (memcpy, page migration), NVLink traffic, OS runtime events, and even DirectX or Vulkan calls if applicable. This unified view is invaluable for pinpointing interactions causing stalls or inefficiencies, such as CPU waits for GPU results, GPU waiting for CPU data, or PCIe bandwidth limitations.
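When capturing a trace of a Python workload (for example, launching it under nsys profile), it helps to annotate the phases of each step with NVTX ranges so they appear as named regions on the timeline next to the CUDA kernels and memory copies. The sketch below assumes a PyTorch workload; model and batch are hypothetical placeholders, and the preprocessing step is just a stand-in.

```python
import torch

def run_step(model, batch, device="cuda"):
    # NVTX ranges show up as labeled spans on the Nsight Systems timeline,
    # making it easy to correlate CPU-side phases with GPU kernels and copies.
    torch.cuda.nvtx.range_push("preprocess")
    inputs = batch.float()          # stand-in for host-side preprocessing
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("h2d_copy")
    inputs = inputs.to(device)      # host-to-device transfer
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    with torch.no_grad():
        outputs = model(inputs)     # GPU computation
    torch.cuda.nvtx.range_pop()
    return outputs
```

With these annotations in place, a long "preprocess" or "h2d_copy" span relative to "forward" immediately signals a CPU or transfer bottleneck rather than a slow kernel.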
AMD ROCm Developer Tools (Radeon GPU Profiler, rocprof): For AMD GPU platforms using ROCm, rocprof provides command-line profiling capabilities, collecting kernel execution times, API call traces (HIP), and performance counters. The Radeon GPU Profiler (RGP) provides a GUI for visualizing this data, offering timeline views similar to Nsight Systems, showing kernel dispatches, memory transfers (HSA signals), and host API calls. This helps identify bottlenecks in the CPU-GPU interaction flow on AMD hardware.
Intel VTune Profiler: While exceptionally powerful for deep CPU and Intel GPU microarchitectural analysis (covered later), VTune also offers platform-level analysis modes. It can capture CPU activity, integrated/discrete Intel GPU usage, memory bandwidth, and system events, presenting them on a timeline. It's especially useful for workloads running on Intel CPUs and GPUs, providing insights into thread scheduling, memory access patterns across the platform, and potential I/O bottlenecks.
General System Monitoring Utilities: Tools like htop, dstat, vmstat, and iostat provide real-time or logged snapshots of CPU load, memory usage, disk I/O, and network activity. While less detailed for correlating specific ML operations, they are useful for quick checks of overall system health and identifying gross resource saturation (e.g., running out of RAM, constant disk swapping).
The timeline view generated by tools like Nsight Systems or Radeon GPU Profiler is often the most informative output. When analyzing these timelines for compiled ML workloads, look for patterns such as:
Excessive data transfers: stretches where memory copy operations (cudaMemcpy, hipMemcpy) between host and device dominate the timeline. This suggests the interconnect (PCIe) bandwidth might be a limiter, or that data movement could be reduced or overlapped with computation (a sketch of such overlap appears below). Check the achieved bandwidth against the theoretical maximum.

Synchronization stalls: frequent or long-running synchronization calls (such as cudaStreamSynchronize) that stall the CPU thread waiting for GPU results. While sometimes necessary, frequent or lengthy stalls indicate potential pipeline bubbles.

A simplified timeline illustrating potential bottlenecks. Long durations for Preprocess (p2), HtoD Transfer (t1), or Wait (p4) compared to Kernel 1 (k1) suggest CPU, interconnect, or synchronization bottlenecks, respectively. Tools like Nsight Systems provide much more detailed versions of such timelines.
System-level profiling is the crucial first diagnostic step. It provides the context needed to understand where the most significant performance limitations lie within the complex interplay of hardware and software components executing your compiled model. The insights gained here will guide you towards using more specialized CPU and GPU kernel profilers, which we will discuss in the following sections, to analyze the root causes within specific compute-bound or memory-bound operations.