Okay, let's transition from the theoretical aspects of code generation to the practical analysis of the output. Understanding the low-level code produced by the compiler for your GPU kernels is essential for verifying optimization effectiveness and diagnosing performance bottlenecks that might not be obvious from source code or even high-level profiling. This practical exercise guides you through analyzing generated GPU assembly, using NVIDIA's PTX (Parallel Thread Execution) as the primary example, though the principles apply to AMD's GCN/RDNA ISA as well.
To follow along, you'll need access to tools capable of generating and inspecting GPU assembly. This typically involves:

- For NVIDIA GPUs: nvcc (the compiler driver) and nvdisasm (the disassembler).
- For AMD GPUs: rocobjdump.

You can often instruct the ML compiler framework or nvcc directly to output the intermediate PTX assembly or the final machine code (SASS for NVIDIA, ISA for AMD). For nvcc, flags like nvcc my_kernel.cu -ptx -o kernel.ptx, or nvcc my_kernel.cu -cubin -o kernel.cubin followed by nvdisasm kernel.cubin, are common. ML frameworks often have configuration options or environment variables to dump generated code.
Let's assume you've used your ML compiler toolchain to compile a simple C=A×B matrix multiplication kernel, targeting an NVIDIA GPU with CUDA capabilities. The compiler might have applied optimizations like tiling to leverage shared memory, and potentially targeted Tensor Cores if available and enabled. You would typically obtain a .ptx file, or use nvdisasm on a compiled .cubin or executable to view the SASS. We will focus primarily on PTX for this analysis, as it is a stable intermediate assembly language.
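Before reading the assembly, it helps to keep a concrete source-level picture in mind. The following CUDA kernel is a hypothetical sketch of the kind of tiled matrix multiplication such a toolchain might generate code for; the kernel name, the TILE size of 32, and the assumption that dimensions are multiples of the tile size (so bounds checks are omitted) are illustrative choices, not the actual compiled source.

#define TILE 32  // 32x32 tiles: each __shared__ array below occupies 4096 bytes

// Hypothetical tiled kernel: C (M x N) = A (M x N) * B (N x N), row-major,
// launched with a (TILE, TILE) block and an (N/TILE, M/TILE) grid.
__global__ void matmul_ker(float* C, float* A, float* B, int M, int N) {
    __shared__ float As[TILE][TILE];   // corresponds to __shared_mem_A in the PTX below
    __shared__ float Bs[TILE][TILE];   // corresponds to __shared_mem_B in the PTX below

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // accumulator kept in a register

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Global -> shared tile loads (ld.global / st.shared in the PTX)
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();               // bar.sync 0

        // Inner product over the tile (the fma.rn.f32 loop)
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // bar.sync before the next tile load
    }
    C[row * N + col] = acc;            // st.global to C
}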
PTX provides a virtual instruction set and register architecture. Analyzing it helps understand the compiler's high-level strategy before the final mapping to specific hardware microarchitectures (SASS).
Consider a hypothetical, simplified PTX snippet for a tiled MatMul kernel:
.version 7.5
.target sm_80 // Target architecture (e.g., Ampere)
.address_size 64
.visible .entry _Z10matmul_kerPfS_S_ii(
.param .u64 .ptr .align 8 .global _Z10matmul_kerPfS_S_ii_param_0, // Pointer C
.param .u64 .ptr .align 8 .global _Z10matmul_kerPfS_S_ii_param_1, // Pointer A
.param .u64 .ptr .align 8 .global _Z10matmul_kerPfS_S_ii_param_2, // Pointer B
.param .u32 _Z10matmul_kerPfS_S_ii_param_3, // Dim M
.param .u32 _Z10matmul_kerPfS_S_ii_param_4 // Dim N
)
{
.reg .pred %p<3>;
.reg .b32 %r<65>; // Example: 64 general-purpose 32-bit registers declared
.reg .f32 %f<33>; // Example: 32 floating-point 32-bit registers declared
.reg .b64 %rd<10>;
.shared .align 16 .b8 __shared_mem_A[4096]; // Shared memory for tile A
.shared .align 16 .b8 __shared_mem_B[4096]; // Shared memory for tile B
// --- Kernel Body Starts ---
ld.param.u64 %rd1, [_Z10matmul_kerPfS_S_ii_param_0]; // Load C ptr
ld.param.u64 %rd2, [_Z10matmul_kerPfS_S_ii_param_1]; // Load A ptr
// ... calculate thread/block indices using %tid, %ctaid ...
// --- Example: Load Tile from Global to Shared Memory ---
mov.u32 %r1, %tid.x;
// ... calculations for global memory addresses based on %r1, %ctaid, etc...
ld.global.cs.f32 %f1, [%rd_A_global_addr]; // Coalesced global load for A
st.shared.f32 [%shared_mem_A_addr], %f1; // Store to shared memory A
// ... similar load for tile B ...
// Synchronization barrier
bar.sync 0;
// --- Example: Compute using Shared Memory Tiles ---
// Assume registers %f10-%f17 hold accumulators for C tile
// Loop over tile dimension K
.L_K_Loop:
ld.shared.f32 %f2, [%shared_mem_A_addr_k];
ld.shared.f32 %f3, [%shared_mem_B_addr_k];
fma.rn.f32 %f10, %f2, %f3, %f10; // Fused multiply-add
// ... more FMAs for the output tile ...
// ... increment shared memory pointers, loop counter ...
setp.lt.u32 %p1, %r_k_counter, %r_K_dim;
@%p1 bra .L_K_Loop;
// --- Synchronization before next tile load (if applicable) ---
bar.sync 0;
// --- Store results from registers to Global Memory C ---
// ... calculate global memory address for C ...
st.global.cs.f32 [%rd_C_global_addr], %f10;
ret;
}
Key Areas for Analysis:

Resource Usage:

- .reg .b32 %r<65>; .reg .f32 %f<33>;: Count the declared registers. High register usage per thread (e.g., more than 64 or 128, depending on the architecture) directly impacts occupancy, the ratio of active warps to the maximum possible warps per multiprocessor (SM). Low occupancy can be tolerated when each thread has enough instruction-level parallelism to hide latency, but it often leads to underutilization of the SM's compute resources.
- .shared .align 16 .b8 __shared_mem_A[4096];: Note the amount of shared memory allocated. Like registers, shared memory is a limited per-SM resource. The combined register and shared memory requirement per block determines the maximum number of blocks that can run concurrently on an SM.
Instructions:

- ld.global, st.global: These access the main GPU device memory (DRAM). Look for patterns. Are they marked .cs (cache streaming) or .cg (cache global)? Are accesses likely to be coalesced (threads in a warp accessing contiguous memory locations)? Uncoalesced access drastically reduces effective bandwidth, and frequent global access suggests poor data reuse.
- ld.shared, st.shared: Accesses to the fast, on-chip shared memory. Their presence usually indicates a tiling strategy implemented by the compiler to improve data reuse.
- fma.rn.f32 (FP32 fused multiply-add), mul.f32, add.f32: Standard floating-point arithmetic. The ratio of compute instructions to memory instructions gives an idea of the kernel's arithmetic intensity.
- mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32: This (or a similar variant) indicates the use of Tensor Cores (on NVIDIA Volta and later). Its presence confirms the compiler successfully mapped the operation onto these specialized units, which offer significant speedups for matrix operations. The shape (m16n8k8), data types (f16, f32), and layouts (row, col) are important details; a source-level sketch that produces such instructions appears after this list.
- bar.sync: Synchronization barriers. Essential for correctness when threads within a block share data via shared memory. Overused or poorly placed barriers can serialize execution and create stalls.
- bra, setp, @%p1: Control flow instructions. Complex branching, especially if divergent within a warp (threads in the same warp taking different paths), can significantly degrade performance. Look for loops (a bra back to a label) and conditional execution (an @%p predicate guard).
Inferred Optimizations:

- The sequence ld.global -> st.shared -> bar.sync -> a loop of ld.shared plus compute -> bar.sync strongly suggests tiling.
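As a source-level illustration of the Tensor Core path mentioned above, the sketch below uses the CUDA WMMA API to multiply a single 16x16x16 fragment per warp. Compiling it for an sm_70 or later target and dumping the PTX should reveal wmma/mma.sync instructions in place of scalar fma.rn.f32. The kernel name, fragment shape, and single-tile scope are illustrative assumptions, not output related to the matmul example above.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) computes C16x16 += A16x16 * B16x16 on Tensor Cores.
// Launch with a single warp per block for this minimal sketch.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // maps to mma/wmma.mma.sync
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}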
Static analysis of the assembly provides hypotheses about performance:
- High Register/Shared Memory Usage: May lead to low occupancy. While not always bad, it can be a bottleneck if compute units are starved. You can use vendor documentation or online calculators to estimate theoretical occupancy from these per-block resources and the target SM's limits.
As a rough illustration, the host-side sketch below estimates theoretical occupancy from registers per thread and shared memory per block for a hypothetical SM. Every limit and resource figure in it is an assumed value (loosely modeled on the PTX example above), and actual occupancy also depends on launch configuration and runtime behavior.
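// Back-of-the-envelope occupancy estimate. All per-SM limits below are
// assumed values for a hypothetical Ampere-class SM; the kernel figures are
// also assumptions, loosely based on the PTX example above.
#include <stdio.h>

int main(void) {
    // Assumed per-SM hardware limits
    const int regs_per_sm       = 65536;    // 32-bit registers
    const int smem_per_sm       = 102400;   // bytes of shared memory
    const int max_warps_per_sm  = 48;
    const int max_blocks_per_sm = 32;

    // Assumed kernel resource usage (e.g., from the compiler's resource report)
    const int regs_per_thread   = 64;
    const int smem_per_block    = 8192;     // __shared_mem_A + __shared_mem_B
    const int threads_per_block = 256;
    const int warps_per_block   = threads_per_block / 32;

    // Blocks per SM permitted by each limiting resource
    int by_regs  = regs_per_sm / (regs_per_thread * threads_per_block);  // 4
    int by_smem  = smem_per_sm / smem_per_block;                         // 12
    int by_warps = max_warps_per_sm / warps_per_block;                   // 6

    int blocks = by_regs;
    if (by_smem  < blocks) blocks = by_smem;
    if (by_warps < blocks) blocks = by_warps;
    if (max_blocks_per_sm < blocks) blocks = max_blocks_per_sm;

    double occupancy = (double)(blocks * warps_per_block) / max_warps_per_sm;
    printf("Resident blocks/SM: %d, theoretical occupancy: %.0f%%\n",
           blocks, occupancy * 100.0);   // here: 4 blocks, ~67% (register-limited)
    return 0;
}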
- Dominance of Global Memory Access: Suggests the kernel is memory-bandwidth bound. Check for coalescing and potential for better caching or tiling.
- Absence of Expected Specialized Instructions: If you expect Tensor Core usage (mma.sync) for a MatMul on compatible hardware but see only fma.f32, the compiler might not have recognized the pattern or met the alignment/dimension constraints required for Tensor Core code generation.
- Significant Branching: Divergent branches, especially inside loops, can be costly.
While PTX offers valuable insights, the actual machine code (SASS for NVIDIA, ISA for AMD) reveals the final instruction scheduling, register allocation choices, and potential hardware-specific bottlenecks such as shared memory bank conflicts. Tools like nvdisasm (for SASS) or rocobjdump (for GCN/RDNA ISA) allow inspection at this level. The analysis principles remain similar: examine resource usage, instruction mix, memory access patterns, and control flow, but map them to the specific hardware architecture's capabilities and limitations. Analyzing AMD's ISA additionally involves understanding the split between scalar and vector instructions and wavefront execution.
Analyzing generated GPU kernel assembly is a powerful technique for advanced compiler developers and performance engineers. It moves beyond abstractions to reveal precisely how the compiler translated high-level operations and applied optimizations. By examining register usage, shared memory allocation, instruction sequences, memory access patterns, and control flow in PTX or SASS/ISA, you can gain deep insights into the effectiveness of the code generation process, verify the use of specialized hardware units, and form concrete hypotheses about performance bottlenecks that can then be confirmed using profiling tools (as discussed in Chapter 9). This low-level visibility is often necessary to achieve maximum performance from complex ML workloads on heterogeneous hardware.