High-performance deep learning kernels are rarely limited by the arithmetic capability of the processor. Instead, they are bounded by the speed at which data moves from memory to the compute units. When compiling neural network graphs to hardware, the primary engineering challenge is masking memory latency. A standard NVIDIA A100 GPU can perform floating-point operations at a rate of hundreds of teraFLOPS, yet its global memory bandwidth is limited to roughly 1.5-2 TB/s. Without explicit management of the memory hierarchy, the compute units (Tensor Cores) stall waiting for operands to arrive.

Hardware-aware code generation requires mapping the abstract concept of a "loop tile" or "tensor slice" to specific physical memory spaces. The GPU memory hierarchy is a pyramid of increasing bandwidth and decreasing capacity. The compiler must determine which data resides in global memory, which subsets form the working set in shared memory, and which values sit in registers for immediate computation.

The GPU Memory Model

The GPU memory architecture exposes three distinct address spaces that a compiler must explicitly manage. Unlike a CPU, where cache management is largely transparent to the software (hardware-managed), GPUs rely on the compiler to stage data explicitly into the faster, on-chip memory tiers.

Global Memory: The largest and slowest tier. It resides in off-chip high-bandwidth memory (HBM). It is persistent across kernel launches and accessible by all threads.

L2 Cache: A hardware-managed unified cache shared by all Streaming Multiprocessors (SMs). It buffers accesses to global memory but is not directly addressable by the programmer.

Shared Memory (SRAM): A fast, low-latency scratchpad located physically on the SM and shared among the threads of a single thread block. This is the critical tier for data reuse in matrix multiplication and convolution.

Registers: The fastest memory, private to a single thread. Operands must be in registers to be used by the ALUs or Tensor Cores.

The following diagram illustrates the relationship between the thread hierarchy and the memory hierarchy, showing visibility and access scopes.

[Diagram: the GPU thread and memory hierarchy. At grid scope, Global Memory (HBM, ~2 TB/s, high latency) exchanges data with the hardware-managed L2 Cache. Within each Streaming Multiprocessor (block scope), loads and stores move data from L2 into Shared Memory (SRAM, ~19 TB/s, low latency) and from Shared Memory into per-thread Registers, from which the Tensor Cores / ALUs consume operands.]

The hierarchy dictates visibility and speed. Data flows from Global Memory through the L2 cache into Shared Memory, where it is accessible to a thread block, before being loaded into Registers for private thread execution.
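As a concrete reference point, the sketch below (a minimal CUDA kernel; the name, tile size, and launch configuration are illustrative, not taken from any particular compiler) shows how the programmer-managed tiers appear in source code: global memory arrives through pointer parameters, the shared-memory tile is declared explicitly with __shared__, and local scalars are placed in registers by the compiler.

```cuda
// Minimal sketch: the three programmer-visible tiers in a CUDA C++ kernel.
// Assumes the kernel is launched with blockDim.x == 256 (illustrative only).
__global__ void scale_by_two(const float* __restrict__ in,   // global memory (HBM), off-chip
                             float* __restrict__ out,        // global memory (HBM), off-chip
                             int n) {
    __shared__ float tile[256];                        // shared memory: on-chip, per thread block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // scalars like idx live in registers

    if (idx < n)
        tile[threadIdx.x] = in[idx];                   // stage global data into the on-chip tile
    __syncthreads();                                   // every thread in the block can now read any staged element

    if (idx < n) {
        float x = 2.0f * tile[threadIdx.x];            // shared -> register, arithmetic on registers
        out[idx] = x;                                  // register -> global
    }
}
```

Only the global and shared placements are explicit in the source; the compiler's register allocator decides which scalars actually stay in registers, which is the pressure discussed later in this section.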
Global Memory Coalescing

The first step in hierarchy mapping is moving data from global memory onto the chip. The efficiency of this movement depends on memory coalescing.

Global memory is accessed via transactions of 32, 64, or 128 bytes. When a warp (a group of 32 threads) executes a load instruction, the hardware examines the memory addresses requested by each thread. If the addresses are contiguous and aligned, the hardware coalesces these requests into the minimum number of transactions.

For example, if thread $i$ accesses address $\text{ptr} + i \times 4$ (loading a 32-bit float), the 32 threads of the warp touch a contiguous 128-byte block. This results in a single memory transaction, achieving 100% bus utilization.

However, if the access pattern is strided, for instance when reading a column of a row-major matrix, the addresses are spread out: thread 0 accesses address $X$, thread 1 accesses $X + \text{stride}$, and so on. Even though each thread needs only 4 bytes, the hardware may fetch a full 32-byte segment per thread to satisfy the request. This leads to "bandwidth waste," where the effective throughput is a fraction of the theoretical peak.

Compilers typically handle this by performing vectorized loads. Instead of each thread loading a single scalar, the compiler emits instructions (such as LD.128 in PTX) in which a thread loads a float4 vector. This increases the data transferred per instruction and helps amortize the cost of each memory request. A short sketch of such a vectorized load appears after the tiling discussion below.

Shared Memory Tiling

Once data is fetched from global memory, it must be stored where it can be reused efficiently. This is the role of shared memory. In deep learning operations such as matrix multiplication ($C = A \times B$), every element of $A$ and $B$ is accessed multiple times. Fetching them from global memory repeatedly would saturate bandwidth immediately.

The compiler implements Tiling (or blocking) by loading a sub-matrix (tile) of $A$ and $B$ into shared memory. The threads then compute the partial product using these fast-access tiles before moving on to the next tile.

The chart below highlights the stark difference in bandwidth and latency between these layers, emphasizing why the tiling buffer is necessary.

[Chart: bandwidth vs. latency across the memory hierarchy (log scale). Bandwidth: Global Memory ~1,555 GB/s, L2 Cache ~3,000 GB/s, Shared Memory ~19,000 GB/s. Latency: Global Memory ~300 cycles, L2 Cache ~100 cycles, Shared Memory ~20 cycles.]

Shared memory offers an order-of-magnitude improvement in bandwidth over global memory. The compiler's goal is to maximize the ratio of compute operations performed per byte loaded into this tier.
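Returning to coalescing for a moment: the hedged sketch below (the kernel name and layout are illustrative) has each thread load one float4, so a warp reads 512 contiguous bytes through a handful of wide, coalesced transactions.

```cuda
// Hedged sketch of a vectorized, coalesced copy. Assumes the buffers are
// 16-byte aligned and n_vec4 counts float4 elements; names are illustrative.
__global__ void copy_vec4(const float4* __restrict__ src,
                          float4* __restrict__ dst,
                          int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        // Each thread issues a single 128-bit load and store; consecutive threads
        // touch consecutive 16-byte chunks, so the warp's requests coalesce.
        dst[i] = src[i];
    }
}
```

A strided variant of the same loop would scatter the warp's addresses and multiply the number of transactions, which is exactly the bandwidth waste described above.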
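To make the tiling scheme concrete, here is a minimal sketch of a shared-memory tiled kernel for $C = A \times B$. It assumes TILE x TILE thread blocks and matrix dimensions divisible by TILE; a production compiler would layer vectorized loads, Tensor Core MMA instructions, and swizzled shared-memory layouts on top of this skeleton.

```cuda
#define TILE 32  // illustrative tile size; one block computes a TILE x TILE tile of C

// Hedged sketch of shared-memory tiling for C = A * B (row-major, M x K times K x N).
// Assumes M, N, K are multiples of TILE and blockDim = (TILE, TILE) for brevity.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // accumulator lives in a register

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Cooperative load: each thread fetches one element of each tile.
        // Along threadIdx.x the addresses are consecutive, so the loads coalesce.
        As[threadIdx.y][threadIdx.x] = A[row * K + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();               // tiles must be complete before use

        // Each element loaded from global memory is reused TILE times here.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // don't overwrite tiles while others still read them
    }
    C[row * N + col] = acc;
}
```

With TILE = 32, each element brought in from global memory is read 32 times out of shared memory, raising the kernel's arithmetic intensity by the same factor.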
To optimize this staging pipeline, modern compilers (such as those using MLIR's NVGPU dialect) use asynchronous copy instructions (e.g., cp.async on NVIDIA Ampere and Hopper). These instructions let the GPU move data from global memory directly into shared memory without using the register file as an intermediary. This reduces register pressure and lets the compute units keep processing the current tile while the next tile is loaded in the background (Software Pipelining).

Register Allocation and Pressure

The final stage of mapping is the register file (RF). This is where the active accumulators for the matrix multiplication result ($C_{\text{tile}}$) reside.

Registers are the scarcest resource on the GPU. A single Streaming Multiprocessor has a large register file (e.g., 256 KB), but this space is partitioned among all active threads. If a kernel generated by the compiler requires too many registers per thread, the scheduler must reduce the number of resident warps to fit them; this demand on the register file is known as Register Pressure.

High register pressure leads to low Occupancy, the ratio of active warps to the maximum number of warps the SM can support. If occupancy drops too low, there are not enough active threads to hide the latency of memory operations: when a warp stalls waiting for a global memory load, the scheduler needs another ready warp to switch to immediately.

When mapping IR to registers, the compiler performs Liveness Analysis to determine which variables can share the same physical register. If the register requirement still exceeds the hardware limit, the compiler is forced to "spill" registers to Local Memory. Despite the name, Local Memory resides physically in global memory (off-chip), so spilling carries a severe performance penalty.

Double Buffering

To maintain continuous execution, compilers implement Double Buffering (or multi-stage buffering); a sketch follows at the end of this section.

Load Phase: The compiler issues instructions to load tile $i+1$ from global memory into shared memory.

Compute Phase: While that load is in flight, threads read tile $i$ (already resident in shared memory) into registers and perform the math.

Synchronization: A barrier (__syncthreads()) ensures tile $i+1$ is fully loaded before the loop advances.

By overlapping the memory latency of the next tile with the arithmetic work on the current tile, the compiler keeps the Tensor Cores from going idle. This requires precise control over the instruction schedule, often necessitating low-level intermediates such as LLVM IR or direct PTX generation rather than high-level source code.
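To tie the asynchronous-copy and double-buffering ideas together, here is a hedged sketch that extends the tiled kernel above using CUDA's pipeline primitives (__pipeline_memcpy_async and friends, which lower to cp.async on Ampere-class hardware). It keeps the same assumptions as before (dimensions divisible by TILE, TILE x TILE blocks) and is an illustration rather than a tuned kernel; a real compiler would typically emit deeper multi-stage pipelines and Tensor Core MMA instructions.

```cuda
#include <cuda_pipeline.h>   // __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior

#define TILE 32

// Hedged sketch of double buffering with asynchronous copies: while the tiles for the
// next step are copied into one shared-memory buffer, the math runs on the other.
__global__ void matmul_double_buffered(const float* A, const float* B, float* C,
                                       int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];   // two buffers: one being filled, one being consumed
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Prologue: start the asynchronous copy of tile 0 into buffer 0.
    __pipeline_memcpy_async(&As[0][threadIdx.y][threadIdx.x],
                            &A[row * K + threadIdx.x], sizeof(float));
    __pipeline_memcpy_async(&Bs[0][threadIdx.y][threadIdx.x],
                            &B[threadIdx.y * N + col], sizeof(float));
    __pipeline_commit();

    for (int k0 = 0; k0 < K; k0 += TILE) {
        int cur = (k0 / TILE) % 2;        // buffer holding tile i
        int nxt = 1 - cur;                // buffer that will hold tile i+1

        // Load phase: kick off the copy of the *next* tile before computing.
        if (k0 + TILE < K) {
            __pipeline_memcpy_async(&As[nxt][threadIdx.y][threadIdx.x],
                                    &A[row * K + (k0 + TILE + threadIdx.x)], sizeof(float));
            __pipeline_memcpy_async(&Bs[nxt][threadIdx.y][threadIdx.x],
                                    &B[(k0 + TILE + threadIdx.y) * N + col], sizeof(float));
        }
        __pipeline_commit();

        // Wait for every committed batch except the most recent one (the in-flight
        // next tile), then synchronize so the whole block sees the current tile.
        __pipeline_wait_prior(1);
        __syncthreads();

        // Compute phase: math on tile i while tile i+1 is still in flight.
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();                  // don't refill 'cur' while others still read it
    }
    C[row * N + col] = acc;
}
```

The two buffers alternate roles each iteration, mirroring the Load, Compute, and Synchronization phases above; deeper software pipelines simply extend the buffer count and the wait depth.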