Theory is essential, but performance optimization is fundamentally an empirical science. Let's put the concepts from this chapter into practice by profiling a simple, optimized machine learning model component. We'll simulate a scenario where a matrix multiplication (GEMM), a fundamental operation in many ML models, has been compiled and optimized by an ML compiler targeting an NVIDIA GPU. Our goal is to use NVIDIA Nsight Compute to analyze its performance characteristics.
Assume we have a compiled executable, `gemm_optimized`, which performs a matrix multiplication C = A×B, where A, B, and C are large matrices. The ML compiler has applied optimizations such as tiling, shared memory usage, and instruction scheduling to generate an efficient CUDA kernel.
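The tiling transformation mentioned above can be illustrated in plain Python/NumPy. This is a conceptual sketch of the idea, not the compiler-generated CUDA code; the tile size of 32 is an arbitrary choice for illustration:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Conceptual tiled GEMM: C = A @ B, computed tile by tile.

    On a GPU, each tile of A and B would be staged in shared memory
    so a thread block can reuse it many times before moving on.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Accumulate the contribution of one K-tile into the C-tile.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(96, 64).astype(np.float32)
B = np.random.rand(64, 80).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

The payoff of tiling is data reuse: each element of a tile is read once from slow memory but participates in many multiply-accumulates, which is exactly the property the compiler exploits with shared memory on the GPU.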
Our target hardware is an NVIDIA GPU (e.g., an Ampere or Hopper architecture GPU). The primary tool we'll use is NVIDIA Nsight Compute (`ncu`), a detailed kernel profiler.

Nsight Compute can be used from the command line or via its GUI. For automated analysis or scripting, the command line is often preferred. To capture a detailed profile, we execute our compiled program under `ncu`:
```bash
# Ensure the CUDA toolkit binaries are in your PATH
# --set full: collect a comprehensive set of metrics (can be time-consuming)
# -o profile_report: save the report to 'profile_report.ncu-rep'
# ./gemm_optimized: the executable to profile
ncu --set full -o profile_report ./gemm_optimized
```
This command runs `gemm_optimized`, gathers detailed performance data for every CUDA kernel launched by the application, and saves it to `profile_report.ncu-rep`. For quicker analysis focused on specific aspects, you can use predefined metric sets (e.g., `--set roofline`), individual report sections (e.g., `--section MemoryWorkloadAnalysis`), or specific metrics via `--metrics`.
You can open the `profile_report.ncu-rep` file in the Nsight Compute GUI (`ncu-ui`, formerly `nv-nsight-cu`) or inspect it from the command line (e.g., `ncu --import profile_report.ncu-rep --page details`). Let's focus on the key areas typically examined in the GUI or a detailed CLI report.
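Reports can also be post-processed in scripts: `ncu` can emit CSV output (e.g., with `--csv` when importing a report), which a few lines of Python can digest. The sample below is a made-up two-row excerpt for illustration; real Nsight Compute output has more columns, and metric names vary by version, so adapt the field names to your report:

```python
import csv
import io

# Hypothetical CSV excerpt from an ncu report export (illustrative only).
sample = """Kernel Name,Metric Name,Metric Value
gemm_kernel,dram__throughput.avg.pct_of_peak_sustained_elapsed,78.4
gemm_kernel,sm__throughput.avg.pct_of_peak_sustained_elapsed,41.2
"""

metrics = {}
for row in csv.DictReader(io.StringIO(sample)):
    metrics[row["Metric Name"]] = float(row["Metric Value"])

dram = metrics["dram__throughput.avg.pct_of_peak_sustained_elapsed"]
sm = metrics["sm__throughput.avg.pct_of_peak_sustained_elapsed"]

# A crude first-pass classification from the two Speed-of-Light numbers.
print("likely memory-bound" if dram > sm else "likely compute-bound")  # → likely memory-bound
```

Scripting like this is handy for regression tracking: run the same kernel across compiler versions and flag any run where a key throughput metric drops.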
The report will list all CUDA kernels launched. Identify the primary kernel responsible for the matrix multiplication. It might have a suggestive name like `gemm_kernel` or `matmul_core`, or a mangled name derived from the compiler's internal representation. Focus your analysis on this kernel, especially if it consumes the majority of the GPU execution time.
- **GPU Speed of Light (SOL) Throughput:** Reports achieved compute (SM) and memory throughput as a percentage of the hardware's theoretical peak, giving a quick read on whether the kernel is compute-bound or memory-bound. A roofline chart visualizes the same idea.

  *Figure: a simplified roofline chart showing two hypothetical kernels. Kernel A (red) operates below both the memory and compute roofs, suggesting potential inefficiencies. Kernel B (blue) is closer to the compute roof, indicating it is likely compute-bound.*
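A kernel's roofline position can also be estimated from first principles: for C = A×B with an M×K matrix A and a K×N matrix B, the kernel performs 2·M·N·K FLOPs and, at a minimum, moves each of the three matrices through DRAM once. The peak numbers below are placeholders; substitute the specs of your actual GPU:

```python
def roofline_bound(M, N, K, bytes_per_elem=4,
                   peak_flops=19.5e12,   # placeholder FP32 peak, FLOP/s
                   peak_bw=1.55e12):     # placeholder DRAM bandwidth, B/s
    flops = 2 * M * N * K                          # one multiply + one add per term
    traffic = bytes_per_elem * (M*K + K*N + M*N)   # ideal case: each matrix moved once
    intensity = flops / traffic                    # arithmetic intensity, FLOP/byte
    ridge = peak_flops / peak_bw                   # roofline ridge point
    return "compute-bound" if intensity > ridge else "memory-bound"

print(roofline_bound(4096, 4096, 4096))  # large square GEMM, high reuse → compute-bound
print(roofline_bound(4096, 1, 4096))     # matrix-vector product, little reuse → memory-bound
```

This back-of-the-envelope number is a useful sanity check against the SOL section: if the kernel *should* be compute-bound by this estimate but Nsight Compute shows DRAM throughput near peak, the generated code is likely missing reuse (e.g., poor tiling).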
- **Occupancy:** Reports Achieved Occupancy and identifies what limits it: Blocks per SM, Registers per Thread, or Shared Memory per Block.
- **Instruction Stats:** Shows Issue Slot Utilization (how many instruction issue slots were used) and Executed Instructions per Clock (IPC). Low values can indicate stalls from instruction dependencies, expensive operations (such as `sqrt` or other transcendental functions), or insufficient parallelism exposed by the compiler's scheduling. High control-flow divergence (different threads in a warp taking different paths) can significantly degrade performance.
- **Memory Workload Analysis:** Metrics such as Memory Throughput show achieved bandwidth compared to peak. Look at the breakdown of memory operations (Global, Local, Shared).

  *Figure: comparison of achieved versus peak theoretical bandwidth at different memory levels. High utilization at the DRAM level suggests the kernel is likely memory-bound.*
- **Source / Assembly Correlation (optional but powerful):** When the kernel is built with line information (e.g., nvcc's `-lineinfo`), Nsight Compute can correlate metrics such as stalls and memory transactions with individual CUDA source lines and SASS instructions, pinpointing exactly which statements are expensive.
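The occupancy limiters listed above can be reasoned about numerically. The sketch below uses illustrative per-SM resource limits (roughly Ampere-class); real limits depend on the architecture, and Nsight Compute's Occupancy section computes this for you from the actual hardware:

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  max_threads=2048, max_regs=65536,
                  max_smem=102400, max_blocks=32):
    """Estimate resident blocks per SM and name the limiting resource.

    The per-SM limits above are illustrative placeholders; substitute
    the values for your GPU architecture.
    """
    limits = {
        "threads": max_threads // threads_per_block,
        "registers": max_regs // (regs_per_thread * threads_per_block),
        "shared memory": max_smem // smem_per_block if smem_per_block else max_blocks,
        "block slots": max_blocks,
    }
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter

blocks, limiter = blocks_per_sm(threads_per_block=256,
                                regs_per_thread=64,
                                smem_per_block=16384)
print(blocks, "blocks/SM, limited by", limiter)  # → 4 blocks/SM, limited by registers
```

In this hypothetical configuration, only 4 of a possible 8 thread blocks fit per SM because of register pressure (4 × 256 / 2048 = 50% occupancy), which is exactly the kind of diagnosis the Occupancy section surfaces.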
Profiling is rarely a one-shot process. Based on the analysis, you typically change something, such as compiler flags, tiling or fusion choices, or the kernel launch configuration, and then re-profile to confirm the change actually helped.
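This measure-adjust-remeasure loop can be mimicked at the script level by sweeping a tuning knob and keeping the best result. The toy sketch below sweeps a block size over a NumPy stand-in for the kernel; in practice you would regenerate the kernel and re-run `ncu` at each step instead, and the winning value is machine-dependent:

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Row/K-blocked matmul used as a stand-in for a tunable kernel."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            C[i:i+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)

timings = {}
for tile in (16, 32, 64, 128):          # the hypothetical tuning knob
    t0 = time.perf_counter()
    C = blocked_matmul(A, B, tile)
    timings[tile] = time.perf_counter() - t0
    assert np.allclose(C, A @ B, atol=1e-2)  # never trade correctness for speed

best_tile = min(timings, key=timings.get)
print("fastest tile size:", best_tile)   # result varies by machine
```

The structure is what matters: every candidate is verified for correctness before its timing is trusted, mirroring how a real tuning loop should validate compiler output before accepting a "faster" configuration.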
This practical exercise demonstrates how profiling tools bridge the gap between high-level ML models and low-level hardware execution. By systematically analyzing performance metrics provided by tools like Nsight Compute, you can diagnose bottlenecks introduced or unresolved by the compiler and runtime, guiding further optimization efforts to achieve maximum performance for your ML workloads. Remember to consult the specific documentation for your chosen profiler and hardware for detailed metric definitions and advanced features.
© 2025 ApX Machine Learning