Okay, let's transition from the theoretical aspects of code generation to the practical analysis of the output. Understanding the low-level code produced by the compiler for your GPU kernels is essential for verifying optimization effectiveness and diagnosing performance bottlenecks that might not be obvious from source code or even high-level profiling. This practical exercise guides you through analyzing generated GPU assembly, using NVIDIA's PTX (Parallel Thread Execution) as the primary example, though the principles apply to AMD's GCN/RDNA ISA as well.
To follow along, you'll need access to tools capable of generating and inspecting GPU assembly. This typically involves:

- For NVIDIA GPUs: nvcc (the compiler driver) and nvdisasm (the disassembler).
- For AMD GPUs: rocobjdump.

You can often instruct the ML compiler framework or nvcc directly to output the intermediate PTX assembly or the final machine code (SASS for NVIDIA, ISA for AMD). For nvcc, flags like nvcc my_kernel.cu -ptx -o kernel.ptx, or nvcc my_kernel.cu -cubin -o kernel.cubin followed by nvdisasm kernel.cubin, are common. ML frameworks often have configuration options or environment variables to dump generated code.
Let's assume you've used your ML compiler toolchain to compile a simple C=A×B matrix multiplication kernel, targeting an NVIDIA GPU with CUDA capabilities. The compiler might have applied optimizations like tiling to leverage shared memory, and potentially targeted Tensor Cores if available and enabled. You would typically obtain a .ptx file, or use nvdisasm on a compiled .cubin or executable to view the SASS. We will focus primarily on PTX for this analysis, as it is a stable intermediate assembly language.
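Before reading the assembly, it helps to keep a concrete source-level picture in mind. The following CUDA kernel is a hypothetical sketch of the kind of tiled matrix multiplication such a toolchain might generate code for; the kernel name, the TILE size of 32, and the assumption that dimensions are multiples of the tile size (so bounds checks are omitted) are illustrative choices, not the actual compiled source.

#define TILE 32  // 32x32 tiles: each __shared__ array below occupies 4096 bytes

// Hypothetical tiled kernel: C (M x N) = A (M x N) * B (N x N), row-major,
// launched with a (TILE, TILE) block and an (N/TILE, M/TILE) grid.
__global__ void matmul_ker(float* C, float* A, float* B, int M, int N) {
    __shared__ float As[TILE][TILE];   // corresponds to __shared_mem_A in the PTX below
    __shared__ float Bs[TILE][TILE];   // corresponds to __shared_mem_B in the PTX below

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // accumulator kept in a register

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Global -> shared tile loads (ld.global / st.shared in the PTX)
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();               // bar.sync 0

        // Inner product over the tile (the fma.rn.f32 loop)
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // bar.sync before the next tile load
    }
    C[row * N + col] = acc;            // st.global to C
}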
PTX provides a virtual instruction set and register architecture. Analyzing it helps understand the compiler's high-level strategy before the final mapping to specific hardware microarchitectures (SASS).
Consider a hypothetical, simplified PTX snippet for a tiled MatMul kernel:
.version 7.5
.target sm_80 // Target architecture (e.g., Ampere)
.address_size 64
.visible .entry _Z10matmul_kerPfS_S_ii(
.param .u64 .ptr .align 8 .global _Z10matmul_kerPfS_S_ii_param_0, // Pointer C
.param .u64 .ptr .align 8 .global _Z10matmul_kerPfS_S_ii_param_1, // Pointer A
.param .u64 .ptr .align 8 .global _Z10matmul_kerPfS_S_ii_param_2, // Pointer B
.param .u32 _Z10matmul_kerPfS_S_ii_param_3, // Dim M
.param .u32 _Z10matmul_kerPfS_S_ii_param_4 // Dim N
)
{
.reg .pred %p<3>;
.reg .b32 %r<65>; // Example: 64 general-purpose 32-bit registers declared
.reg .f32 %f<33>; // Example: 32 floating-point 32-bit registers declared
.reg .b64 %rd<10>;
.shared .align 16 .b8 __shared_mem_A[4096]; // Shared memory for tile A
.shared .align 16 .b8 __shared_mem_B[4096]; // Shared memory for tile B
// --- Kernel Body Starts ---
ld.param.u64 %rd1, [_Z10matmul_kerPfS_S_ii_param_0]; // Load C ptr
ld.param.u64 %rd2, [_Z10matmul_kerPfS_S_ii_param_1]; // Load A ptr
// ... calculate thread/block indices using %tid, %ctaid ...
// --- Example: Load Tile from Global to Shared Memory ---
mov.u32 %r1, %tid.x;
// ... calculations for global memory addresses based on %r1, %ctaid, etc...
ld.global.cs.f32 %f1, [%rd_A_global_addr]; // Coalesced global load for A
st.shared.f32 [%shared_mem_A_addr], %f1; // Store to shared memory A
// ... similar load for tile B ...
// Synchronization barrier
bar.sync 0;
// --- Example: Compute using Shared Memory Tiles ---
// Assume registers %f10-%f17 hold accumulators for C tile
// Loop over tile dimension K
.L_K_Loop:
ld.shared.f32 %f2, [%shared_mem_A_addr_k];
ld.shared.f32 %f3, [%shared_mem_B_addr_k];
fma.rn.f32 %f10, %f2, %f3, %f10; // Fused multiply-add
// ... more FMAs for the output tile ...
// ... increment shared memory pointers, loop counter ...
setp.lt.u32 %p1, %r_k_counter, %r_K_dim;
@%p1 bra .L_K_Loop;
// --- Synchronization before next tile load (if applicable) ---
bar.sync 0;
// --- Store results from registers to Global Memory C ---
// ... calculate global memory address for C ...
st.global.cs.f32 [%rd_C_global_addr], %f10;
ret;
}
Key Areas for Analysis:

Resource Usage:

- .reg .b32 %r<65>; .reg .f32 %f<33>;: Count the declared registers. High register usage per thread (e.g., more than 64 or 128, depending on the architecture) directly impacts occupancy, the ratio of active warps to the maximum possible warps per multiprocessor (SM). Low occupancy can be tolerated when each thread has enough instruction-level parallelism to hide latency, but it often leads to underutilization of the SM's compute resources.
- .shared .align 16 .b8 __shared_mem_A[4096];: Note the amount of shared memory allocated. Like registers, shared memory is a limited per-SM resource. The combined register and shared memory requirement per block determines the maximum number of blocks that can run concurrently on an SM.
Instructions:

- ld.global, st.global: These access the main GPU device memory (DRAM). Look for patterns. Are they marked .cs (cache streaming) or .cg (cache global)? Are accesses likely to be coalesced (threads in a warp accessing contiguous memory locations)? Uncoalesced access drastically reduces effective bandwidth, and frequent global access suggests poor data reuse.
- ld.shared, st.shared: Accesses to the fast, on-chip shared memory. Their presence usually indicates a tiling strategy implemented by the compiler to improve data reuse.
- fma.rn.f32 (FP32 fused multiply-add), mul.f32, add.f32: Standard floating-point arithmetic. The ratio of compute instructions to memory instructions gives an idea of the kernel's arithmetic intensity.
- mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32: This (or a similar variant) indicates the use of Tensor Cores (on NVIDIA Volta and later). Its presence confirms the compiler successfully mapped the operation onto these specialized units, which offer significant speedups for matrix operations. The shape (m16n8k8), data types (f16, f32), and layouts (row, col) are important details; a source-level sketch that produces such instructions appears after this list.
- bar.sync: Synchronization barriers. Essential for correctness when threads within a block share data via shared memory. Overused or poorly placed barriers can serialize execution and create stalls.
- bra, setp, @%p1: Control flow instructions. Complex branching, especially if divergent within a warp (threads in the same warp taking different paths), can significantly degrade performance. Look for loops (a bra back to a label) and conditional execution (an @%p predicate guard).
Inferred Optimizations:

- The sequence ld.global -> st.shared -> bar.sync -> a loop of ld.shared plus compute -> bar.sync strongly suggests tiling.
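As a source-level illustration of the Tensor Core path mentioned above, the sketch below uses the CUDA WMMA API to multiply a single 16x16x16 fragment per warp. Compiling it for an sm_70 or later target and dumping the PTX should reveal wmma/mma.sync instructions in place of scalar fma.rn.f32. The kernel name, fragment shape, and single-tile scope are illustrative assumptions, not output related to the matmul example above.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) computes C16x16 += A16x16 * B16x16 on Tensor Cores.
// Launch with a single warp per block for this minimal sketch.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // maps to mma/wmma.mma.sync
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}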
Static analysis of the assembly provides hypotheses about performance:
- High Register/Shared Memory Usage: May lead to low occupancy. While not always bad, it can be a bottleneck if compute units are starved. You can use vendor documentation or online calculators to estimate theoretical occupancy from these per-block resources and the target SM's limits.
As a rough illustration, the host-side sketch below estimates theoretical occupancy from registers per thread and shared memory per block for a hypothetical SM. Every limit and resource figure in it is an assumed value (loosely modeled on the PTX example above), and actual occupancy also depends on launch configuration and runtime behavior.
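// Back-of-the-envelope occupancy estimate. All per-SM limits below are
// assumed values for a hypothetical Ampere-class SM; the kernel figures are
// also assumptions, loosely based on the PTX example above.
#include <stdio.h>

int main(void) {
    // Assumed per-SM hardware limits
    const int regs_per_sm       = 65536;    // 32-bit registers
    const int smem_per_sm       = 102400;   // bytes of shared memory
    const int max_warps_per_sm  = 48;
    const int max_blocks_per_sm = 32;

    // Assumed kernel resource usage (e.g., from the compiler's resource report)
    const int regs_per_thread   = 64;
    const int smem_per_block    = 8192;     // __shared_mem_A + __shared_mem_B
    const int threads_per_block = 256;
    const int warps_per_block   = threads_per_block / 32;

    // Blocks per SM permitted by each limiting resource
    int by_regs  = regs_per_sm / (regs_per_thread * threads_per_block);  // 4
    int by_smem  = smem_per_sm / smem_per_block;                         // 12
    int by_warps = max_warps_per_sm / warps_per_block;                   // 6

    int blocks = by_regs;
    if (by_smem  < blocks) blocks = by_smem;
    if (by_warps < blocks) blocks = by_warps;
    if (max_blocks_per_sm < blocks) blocks = max_blocks_per_sm;

    double occupancy = (double)(blocks * warps_per_block) / max_warps_per_sm;
    printf("Resident blocks/SM: %d, theoretical occupancy: %.0f%%\n",
           blocks, occupancy * 100.0);   // here: 4 blocks, ~67% (register-limited)
    return 0;
}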
- Dominance of Global Memory Access: Suggests the kernel is memory-bandwidth bound. Check for coalescing and potential for better caching or tiling.
- Absence of Expected Specialized Instructions: If you expect Tensor Core usage (mma.sync) for a MatMul on compatible hardware but see only fma.f32, the compiler might not have recognized the pattern or met the alignment/dimension constraints required for Tensor Core code generation.
- Significant Branching: Divergent branches, especially inside loops, can be costly.
While PTX offers valuable insights, the actual machine code (SASS for NVIDIA, ISA for AMD) reveals the final instruction scheduling, register allocation choices, and potential hardware-specific bottlenecks such as shared memory bank conflicts. Tools like nvdisasm (for SASS) or rocobjdump (for GCN/RDNA ISA) allow inspection at this level. The analysis principles remain similar: examine resource usage, instruction mix, memory access patterns, and control flow, but map them to the specific hardware architecture's capabilities and limitations. Analyzing AMD's ISA additionally involves understanding the split between scalar and vector instructions and wavefront execution.
Analyzing generated GPU kernel assembly is a powerful technique for advanced compiler developers and performance engineers. It moves beyond abstractions to reveal precisely how the compiler translated high-level operations and applied optimizations. By examining register usage, shared memory allocation, instruction sequences, memory access patterns, and control flow in PTX or SASS/ISA, you can gain deep insights into the effectiveness of the code generation process, verify the use of specialized hardware units, and form concrete hypotheses about performance bottlenecks that can then be confirmed using profiling tools (as discussed in Chapter 9). This low-level visibility is often necessary to achieve maximum performance from complex ML workloads on heterogeneous hardware.