While system-level profilers provide a valuable overview of where time is spent across the CPU, GPU, and interconnect, they often treat GPU kernel execution as opaque blocks of time. When an ML compiler fuses multiple operations or applies complex loop transformations and generates a custom kernel, system-level tools might simply show one long-running kernel. To understand why that kernel is slow or inefficient, you need to look inside it using specialized GPU kernel profilers. These tools provide fine-grained performance data tied directly to the GPU's hardware architecture.
For NVIDIA GPUs, the primary tool is NVIDIA Nsight Compute. For AMD GPUs using the ROCm stack, the corresponding tool is ROCprof, often used in conjunction with the Radeon GPU Profiler (RGP) for visualization. These profilers allow you to collect detailed hardware performance counters and metrics for individual kernel launches.
Why Kernel-Level Profiling is Necessary
Optimized ML kernels, especially those generated by advanced compilers, push the limits of GPU hardware. Performance limitations can arise from various factors internal to the kernel's execution:
- Occupancy Limits: Is the GPU's compute capacity fully utilized? Low occupancy, meaning fewer active warps (NVIDIA) or wavefronts (AMD) per multiprocessor than theoretically possible, can limit the ability to hide memory latency and keep execution units busy. This might be caused by high register usage per thread, excessive shared memory (LDS on AMD) allocation, or thread block sizing choices made by the compiler (a worked occupancy calculation follows this list).
- Instruction Bottlenecks: Is the kernel bottlenecked by specific types of instructions? Profilers can break down the instruction mix, revealing if the kernel is spending most of its time on floating-point math (FP32, FP16, TF32), integer operations, memory access instructions, or control flow (branches). Targeting specialized units like Tensor Cores (NVIDIA) or Matrix Cores (AMD) is important, and profilers can verify if these units are being used effectively.
- Memory Hierarchy Inefficiencies: Is the kernel efficiently using the GPU's caches (L1, L2) and accessing DRAM? High L1/L2 miss rates combined with high DRAM bandwidth usage indicate memory-bound behavior. Kernel profilers provide detailed metrics on cache hit rates, memory throughput requested versus achieved, and the latency incurred by memory operations.
- Execution Stalls: Are warps/wavefronts frequently stalled, waiting for resources? Profilers can pinpoint the primary reasons for stalls, such as waiting for data from memory (memory latency), waiting for results from previous instructions (data dependencies), synchronization barriers, or suboptimal instruction scheduling.
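To make the occupancy arithmetic concrete, here is a minimal sketch. The per-SM limits are illustrative, roughly Ampere-like numbers; real values vary by architecture, so consult your GPU's documentation or the CUDA occupancy calculator.

```python
# Sketch: how per-thread registers and per-block shared memory cap
# theoretical occupancy. Per-SM limits below are illustrative only.

REGS_PER_SM = 65536        # 32-bit registers in the SM register file
SMEM_PER_SM = 102400       # shared memory per SM, in bytes
MAX_WARPS_PER_SM = 48      # hardware warp-slot limit
WARP_SIZE = 32

def theoretical_occupancy(regs_per_thread: int, smem_per_block: int,
                          threads_per_block: int) -> float:
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceil division
    # Register limit: how many warps' worth of registers fit in the SM.
    warps_by_regs = REGS_PER_SM // (regs_per_thread * WARP_SIZE)
    # Shared-memory limit: how many whole blocks fit in the SM.
    blocks_by_smem = SMEM_PER_SM // smem_per_block if smem_per_block else 10**9
    active = min(MAX_WARPS_PER_SM, warps_by_regs,
                 blocks_by_smem * warps_per_block)
    # Only whole blocks can be resident, so round down to a block multiple.
    active = (active // warps_per_block) * warps_per_block
    return active / MAX_WARPS_PER_SM

# A fused kernel needing 128 registers/thread is register-limited:
print(theoretical_occupancy(regs_per_thread=128, smem_per_block=16384,
                            threads_per_block=256))  # -> ~0.33
```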
NVIDIA Nsight Compute
Nsight Compute focuses on analyzing individual CUDA kernel launches. It collects a wide range of performance counters from the GPU hardware and presents them through different "sections," each focusing on a specific aspect of performance.
Key Concepts and Metrics:
- Sections: Organized views of performance data, such as "GPU Speed of Light" (comparing achieved performance against theoretical maximums), "Occupancy," "Memory Workload Analysis," "Scheduler Statistics," and "Warp State Statistics."
- Source Correlation: Nsight Compute attempts to correlate performance metrics back to the source code (CUDA C++, PTX assembly, or potentially even higher-level IRs if debug information is available). This helps pinpoint which lines of code are responsible for performance issues.
- Occupancy: Reports Theoretical Occupancy (maximum possible based on hardware limits) and Achieved Occupancy (actual average active warps per SM during the kernel execution). Low achieved occupancy relative to theoretical often signals resource limitations (registers, shared memory).
- Memory Throughput: Provides detailed breakdowns of data requested and transferred between different levels of the memory hierarchy (L1, L2, DRAM). Comparing requested vs. achieved bandwidth helps identify memory bottlenecks. For example, consider two hypothetical kernels: Kernel A shows a moderate L1 hit rate but high DRAM bandwidth utilization, suggesting it is memory-bound at the DRAM level, while Kernel B has a high L1 hit rate and low DRAM usage, indicating better cache locality and a likely compute-bound profile (a toy classifier after this list makes this comparison concrete).
- Instruction Statistics: Breaks down the mix of executed instructions (FP32, FP64, INT, memory, control flow, Tensor Core) and calculates metrics like Instructions Per Clock (IPC). Low IPC can indicate frequent stalls.
- Stall Reasons: Quantifies the reasons why warps were stalled (e.g., `Wait` on memory dependencies, `Instruction Fetch`, `Execution Dependency`, `Barrier`). This is vital for understanding latency bottlenecks.
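To make the Kernel A / Kernel B comparison above concrete, here is a toy classifier over such metrics. The thresholds are arbitrary illustrations for reasoning about profiler output, not rules built into Nsight Compute:

```python
# Sketch: a crude first-pass diagnosis from three profiler metrics,
# expressed as fractions of peak. Thresholds are illustrative only.

def diagnose(l1_hit_rate: float, dram_util: float, sm_util: float) -> str:
    if dram_util > 0.7 and l1_hit_rate < 0.8:
        return "likely DRAM-bound; improve locality so more traffic hits cache"
    if sm_util > 0.7:
        return "likely compute-bound; check instruction mix / Tensor Core use"
    return "neither unit saturated; suspect latency or occupancy limits"

# Kernel A: moderate L1 hits, heavy DRAM traffic -> memory-bound at DRAM.
print(diagnose(l1_hit_rate=0.55, dram_util=0.85, sm_util=0.40))
# Kernel B: high L1 hits, little DRAM traffic -> compute units dominate.
print(diagnose(l1_hit_rate=0.92, dram_util=0.20, sm_util=0.80))
```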
Workflow:
Nsight Compute can be used via its command-line interface (`ncu`) or a graphical user interface.
- Profile: Launch your compiled ML application under `ncu` (e.g., `ncu --set full -o profile_report ./my_ml_app`) or attach the GUI to the running process.
- Collect: Nsight Compute intercepts kernel launches, reruns them if necessary to collect different sets of counters (due to hardware limitations on simultaneous counter collection), and gathers the data.
- Analyze: Examine the generated report file (`.ncu-rep`) in the GUI or analyze the command-line output. Start with high-level sections like "GPU Speed of Light" and "Occupancy," then use detailed sections like "Memory Workload Analysis" or "Scheduler Statistics" to investigate the specific bottlenecks identified. Use source correlation to map issues back to code regions. A scripted variant of this workflow is sketched below.
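The sketch below runs `ncu` in CSV mode over a hypothetical `./my_ml_app` and pulls two headline metrics into Python. The metric and column names follow recent Nsight Compute releases and can differ by GPU generation, so verify them with `ncu --query-metrics` on your machine.

```python
# Sketch: scripting ncu and extracting a few metrics per kernel launch.
import csv
import io
import subprocess

METRICS = ",".join([
    "sm__warps_active.avg.pct_of_peak_sustained_active",   # achieved occupancy
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",  # DRAM pressure
])

out = subprocess.run(
    ["ncu", "--csv", "--metrics", METRICS, "./my_ml_app"],  # hypothetical app
    capture_output=True, text=True, check=True,
).stdout

# Skip any ==PROF== launcher chatter printed before the CSV header.
lines = out.splitlines()
start = next(i for i, line in enumerate(lines) if line.startswith('"ID"'))
for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
    print(row["Kernel Name"], row["Metric Name"], row["Metric Value"])
```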
AMD ROCprof and Radeon GPU Profiler (RGP)
For AMD GPUs running the ROCm software stack, `rocprof` is the command-line tool for collecting kernel performance counters. The results are often visualized and analyzed using the Radeon GPU Profiler (RGP), although `rocprof` can also output raw data (e.g., CSV).
Key Concepts and Metrics:
- Counters: ROCprof relies on collecting specific hardware performance counters and derived metrics available on the AMD GPU architecture (e.g., `FETCH_SIZE` and `WRITE_SIZE` for memory traffic, `VALUUtilization` and `SALUBusy` for compute units, `L2CacheHit` for cache performance).
- Wavefronts: The fundamental unit of execution on AMD GPUs, analogous to NVIDIA's warps. Occupancy metrics relate to the number of active wavefronts per Compute Unit (CU).
- LDS (Local Data Share): On-chip memory analogous to NVIDIA's shared memory. High LDS usage can limit wavefront occupancy.
- VGPRs/SGPRs: Vector and Scalar General Purpose Registers. High usage can limit occupancy (quantified in the sketch after this list).
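As on NVIDIA, these resource limits can be turned into a back-of-the-envelope occupancy estimate. The sketch below uses GCN-era numbers (4 SIMDs per CU, 10 wavefront slots per SIMD, 256 VGPRs per lane, 64 KB of LDS per CU); RDNA and CDNA parts differ, so substitute your architecture's limits.

```python
# Sketch: wavefront occupancy on a GCN-era Compute Unit. Limits below
# are illustrative; consult your GPU's ISA documentation for real values.

SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 10
VGPRS_PER_LANE = 256      # VGPR budget per lane on each SIMD
LDS_PER_CU = 65536        # bytes of Local Data Share per CU

def waves_per_cu(vgprs_per_thread: int, lds_per_workgroup: int,
                 waves_per_workgroup: int) -> int:
    # VGPR limit: waves per SIMD, scaled up to the whole CU.
    waves_by_vgpr = (VGPRS_PER_LANE // vgprs_per_thread) * SIMDS_PER_CU
    # LDS limit: whole workgroups that fit, times waves per workgroup.
    groups_by_lds = (LDS_PER_CU // lds_per_workgroup
                     if lds_per_workgroup else 10**9)
    waves_by_lds = groups_by_lds * waves_per_workgroup
    return min(MAX_WAVES_PER_SIMD * SIMDS_PER_CU, waves_by_vgpr, waves_by_lds)

# 128 VGPRs/thread caps the CU at 8 of its 40 wave slots (20% occupancy):
print(waves_per_cu(vgprs_per_thread=128, lds_per_workgroup=32768,
                   waves_per_workgroup=4))  # -> 8
```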
Core Metrics (often derived from counters):
- Occupancy: Similar to NVIDIA, measures the ratio of active wavefronts to the maximum possible per CU, limited by resources like VGPRs, SGPRs, and LDS.
- Compute Utilization: Metrics like `VALUUtilization` show how busy the main vector execution units are.
- Memory Bandwidth: Counters track data movement between levels of the memory hierarchy (Vector Memory, L1, L2, HBM). Metrics derived include achieved bandwidth.
- Cache Performance: Counters like `L2CacheHit` directly measure cache efficiency (see the derivation sketch after this list).
- Instruction Mix: Counters can provide insights into the mix of vector ALU, scalar ALU, memory, and branch instructions.
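As a sketch of how such metrics fall out of raw counters, the snippet below derives achieved HBM bandwidth and an L2 hit fraction. It assumes `FETCH_SIZE`/`WRITE_SIZE` are reported in kilobytes and kernel durations in nanoseconds, matching classic ROCprof's documented metrics; double-check names and units for your ROCm version.

```python
# Sketch: deriving the metrics above from ROCprof counter values.
# Units assumed: FETCH_SIZE / WRITE_SIZE in KB, duration in ns.

def achieved_hbm_bandwidth_gbs(fetch_kb: float, write_kb: float,
                               duration_ns: float) -> float:
    bytes_moved = (fetch_kb + write_kb) * 1024
    return bytes_moved / duration_ns  # bytes per ns == GB/s

def l2_hit_fraction(l2_cache_hit_pct: float) -> float:
    # L2CacheHit is already reported as a percentage.
    return l2_cache_hit_pct / 100.0

# 2.56 GB moved in 4 ms -> 640.0 GB/s achieved bandwidth:
print(achieved_hbm_bandwidth_gbs(fetch_kb=2.0e6, write_kb=0.5e6,
                                 duration_ns=4.0e6))
```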
Workflow:
- Identify Counters: Determine which hardware counters are needed to investigate a potential bottleneck (e.g., memory counters if you suspect memory-bound behavior, VALU counters for compute limits). You can list available counters with `rocprof --list-basic` or `rocprof --list-derived`.
- Profile: Run your application under `rocprof`, specifying the counters or metrics to collect. A typical invocation might look like `rocprof --stats -o results.csv ./my_ml_app`. For more detailed trace analysis suitable for RGP, you might use flags to generate an `.sqtt` file.
- Analyze:
- CSV/Text Output: Analyze the generated `results.csv` or console output, calculating relevant metrics (e.g., cache hit rates, bandwidth) and comparing kernel performance (a parsing sketch follows this list).
- RGP Visualization: Load the generated trace file (e.g., `.sqtt`, `.rgp`) into the Radeon GPU Profiler. RGP provides a timeline view, kernel duration analysis, wavefront occupancy charts, and detailed hardware counter views correlated with kernel execution phases. This offers a powerful way to visualize execution flow and bottlenecks within a kernel.
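For the CSV path, a few lines of scripting go a long way. This sketch assumes the classic rocprof column names `KernelName` and `DurationNs`; newer rocprofv2/v3 releases use different output formats.

```python
# Sketch: rank kernels by total GPU time from a classic rocprof CSV.
import csv
from collections import defaultdict

totals: dict[str, float] = defaultdict(float)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["KernelName"]] += float(row["DurationNs"])

# Print the ten most expensive kernels, slowest first.
for name, ns in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{ns / 1e6:10.3f} ms  {name}")
```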
Interpreting Profiler Data Effectively
Collecting data is only the first step. Interpreting it correctly requires context:
- Know Your Hardware: Understand the theoretical peak performance (FLOPS, memory bandwidth) and architectural limits (cache sizes, register file size, maximum occupancy) of your target GPU, and compare achieved metrics against those peaks. A kernel achieving 80% of peak DRAM bandwidth is likely memory-bound; a roofline-style version of this check is sketched after this list.
- Identify the Limiter: Use the profiler data to determine the primary bottleneck. Is occupancy low due to registers? Is memory bandwidth saturated? Are execution units underutilized? Are there significant stalls due to specific dependencies?
- Correlate Back: Whenever possible, use the source correlation features (Nsight Compute) or analyze the kernel's behavior in RGP alongside the compiler's intermediate representation (e.g., MLIR, LLVM IR) or generated assembly (PTX/GCN). This helps connect low-level issues (like high register pressure) to specific high-level operations or compiler decisions (like aggressive fusion leading to large kernels).
- Iterate: Performance analysis is an iterative process. Use the insights gained from profiling to guide further optimization efforts. This might involve adjusting compiler flags (e.g., controlling fusion aggressiveness, enabling/disabling specific optimizations), modifying the ML model structure slightly, or even implementing custom kernels. After making changes, re-profile to measure the impact.
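A minimal sketch of the limiter check, using illustrative peak numbers that you should replace with your GPU's datasheet values:

```python
# Sketch: roofline-style limiter identification from achieved rates.
# Peaks are illustrative placeholders, not any specific GPU's numbers.

PEAK_TFLOPS = 300.0    # e.g., low-precision matrix throughput
PEAK_DRAM_GBS = 2000.0 # e.g., HBM bandwidth

def limiter(achieved_tflops: float, achieved_gbs: float) -> str:
    compute_frac = achieved_tflops / PEAK_TFLOPS
    memory_frac = achieved_gbs / PEAK_DRAM_GBS
    if max(compute_frac, memory_frac) < 0.5:
        return "latency/occupancy-limited: neither roof is close"
    return ("memory-bound: reduce data movement"
            if memory_frac > compute_frac
            else "compute-bound: improve instruction mix")

# 40 TFLOPS at 1700 GB/s sits near the memory roof -> memory-bound:
print(limiter(achieved_tflops=40.0, achieved_gbs=1700.0))
```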
By using Nsight Compute and ROCprof/RGP, you gain the necessary visibility into the execution of compiled ML kernels on the GPU. This detailed analysis is indispensable for diagnosing stubborn performance issues and ensuring that the sophisticated optimizations performed by the compiler are translating into real-world speedups on the target hardware.