Optimizing memory throughput involves more than just moving data into the fastest available storage. Once data resides in Shared Memory (SMem), the way threads access that data determines whether the kernel achieves peak bandwidth or stalls due to serialization. Shared memory in NVIDIA GPU architectures is divided into equally sized memory modules called banks, which can be accessed simultaneously. When multiple threads in a single warp attempt to read or write to different addresses within the same bank, a bank conflict occurs, forcing the hardware to replay the memory request sequentially.

### The Architecture of Shared Memory Banks

Shared memory is organized into 32 successive banks, corresponding to the 32 threads in a warp. Each bank has a bandwidth of 32 bits (4 bytes) per clock cycle. The mapping of an address to a bank is determined by its 32-bit word index modulo 32: for the standard 32-bit word size, successive words are assigned to successive banks.

If a warp executes a load instruction, the hardware inspects the addresses requested by the active threads. The optimal case is a conflict-free access, where every thread references a unique bank (multiple threads reading the same word are served by a broadcast and do not conflict). This allows the memory controller to service all 32 requests in a single transaction.

However, deep learning workloads often involve matrix multiplications and convolutions that access multidimensional arrays. Depending on the stride and the tensor shape, these access patterns can inadvertently align to the same bank.

The bank index for a given 32-bit word is calculated as:

$$ \text{Bank Index} = \left( \frac{\text{Byte Address}}{4} \right) \pmod{32} $$

When multiple threads map to the same bank index, the hardware splits the request into $n$ separate transactions, where $n$ is the maximum number of distinct words requested from any single bank. An $n$-way bank conflict therefore reduces the effective shared memory bandwidth by a factor of $n$.

Figure: Bank access patterns and conflict visualization. The left side demonstrates a linear access pattern where each thread maps to a unique bank (conflict-free); the right side shows a strided access pattern (stride 2) causing 2-way bank conflicts, as two threads map to the same bank simultaneously.
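To make the mapping concrete, here is a small illustrative sketch (not part of the original text; the `bank_of` helper and the chosen strides are hypothetical) that applies the bank-index formula above to a few access strides. It compiles with `nvcc` as plain host code.

```cuda
#include <cstdio>

// Illustrative helper: bank index of the 32-bit word at a given byte offset,
// following Bank Index = (Byte Address / 4) mod 32.
__host__ __device__ inline int bank_of(size_t byte_offset) {
    return static_cast<int>((byte_offset / 4) % 32);
}

int main() {
    // Thread k of a warp reads element k * stride of a float array in SMem.
    // Stride 1  -> banks 0, 1, 2, ..., 31 (conflict-free).
    // Stride 2  -> banks 0, 2, 4, ...     (2-way conflict: lanes 0 and 16 both hit bank 0).
    // Stride 32 -> every lane hits bank 0 (32-way conflict, 32 serialized transactions).
    int strides[] = {1, 2, 32};
    for (int stride : strides) {
        printf("stride %2d:", stride);
        for (int lane = 0; lane < 32; ++lane)
            printf(" %d", bank_of(lane * stride * sizeof(float)));
        printf("\n");
    }
    return 0;
}
```

The stride-32 row, where every lane reports bank 0, is exactly the column-access pattern analyzed in the next section.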
### Identifying Conflicts in Tensor Layouts

In tensor processing, data is typically stored in row-major format. Accessing a tile of data row by row usually results in unit-stride accesses, which are conflict-free. Accessing a column within a tile, however, often leads to strided access.

Consider a $32 \times 32$ tile of float elements stored in shared memory. The stride between two vertically adjacent elements is 32 words. If a warp attempts to load a column (e.g., thread $i$ loads $A[i][0]$), every thread accesses an address separated by exactly 32 words. Since $32i \pmod{32} = 0$ for every $i$, all 32 threads in the warp map to Bank 0. This causes a 32-way bank conflict, serializing the execution and reducing throughput to 1/32 of the peak capability.

This scenario is common in GEMM (General Matrix Multiply) operations where one matrix is transposed, and when loading tiles for Tensor Core consumption.

### Optimization Strategy: Padding

The simplest method to resolve bank conflicts is padding. By adding a "dummy" column to the allocation, the physical stride changes while the logical access pattern remains the same. For a $32 \times 32$ tile, allocating it as $32 \times 33$ changes the stride from 32 to 33.

With a stride of 33:

- Thread 0 accesses index 0 $\to$ Bank 0.
- Thread 1 accesses index 33 $\to$ Bank 1 ($33 \pmod{32} = 1$).
- Thread $k$ accesses index $33k \to$ Bank $k$ (since $33k \equiv k \pmod{32}$).

Since $\text{gcd}(33, 32) = 1$, the accesses cycle through all 32 banks, eliminating conflicts entirely. While effective, padding increases shared memory usage, which is a scarce resource. In highly optimized kernels where occupancy is limited by shared memory capacity, wasting roughly 3% of the space (one extra column per 32) can be significant.
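As a concrete illustration, the sketch below applies padding inside a transpose-style kernel, assuming a $32 \times 32$ thread block; the kernel name and signature are illustrative rather than taken from any particular library. Removing the `+ 1` in the shared-memory declaration reproduces the 32-way conflict described above.

```cuda
#define TILE_DIM 32

// Illustrative kernel: transpose a TILE_DIM x TILE_DIM block through shared
// memory, launched with dim3 block(TILE_DIM, TILE_DIM). The "+ 1" pad changes
// the row stride from 32 to 33 words, so the column read in the second phase
// touches 32 different banks. With a plain [TILE_DIM][TILE_DIM] tile, that
// read would be a 32-way bank conflict.
__global__ void transpose_padded(float* __restrict__ out,
                                 const float* __restrict__ in,
                                 int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // padded column

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // unit-stride row write

    __syncthreads();

    // Transposed store: each warp now reads a *column* of the shared tile.
    int tx = blockIdx.y * TILE_DIM + threadIdx.x;
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];
}
```

The only change relative to the conflicting version is the extra column in the declaration; the kernel's index arithmetic is untouched.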
### Optimization Strategy: Swizzling

To avoid the memory overhead of padding, modern compiler stacks such as TVM and Triton employ address swizzling (often called XOR swizzling). Swizzling permutes the storage location of elements without changing the allocation size.

The idea is to modify the column index based on the row index using a bitwise XOR operation. A common swizzling pattern for a 2D tile is:

$$ \text{col}_{\text{phys}} = \text{col}_{\text{logical}} \oplus \text{row}_{\text{logical}} $$

Or, for wider tiles:

$$ \text{col}_{\text{phys}} = \text{col}_{\text{logical}} \oplus \left( \frac{\text{row}_{\text{logical}}}{\text{TileWidth}} \right) $$

By XORing the row bits into the column address, the data for each row is permuted across the banks. When threads read down a column (incrementing the row index), the physical column index changes, distributing the accesses across different banks.

This technique allows the compiler to maintain dense storage while ensuring that both row-major and column-major access patterns have reduced or zero bank conflicts. (A code sketch of this mapping closes the section.)

Figure: Throughput comparison of different memory access strategies (effective shared memory bandwidth, normalized, per access strategy). Linear unit-stride, padded, and XOR-swizzled accesses reach full bandwidth, while a 32-way strided conflict drops to roughly 1/32 of peak. The visualization highlights the severe penalty of n-way conflicts and the restoration of performance using padding or swizzling techniques.

### Vectorized Loads and Bank Width

Advanced compilation strategies must also account for vectorized instructions, such as LDS.128 (loading 128 bits, or 4 floats, at once). While the basic bank width is 32 bits, the hardware can gang banks together to service wider loads.

However, vectorization tightens the stride and alignment requirements. If each thread loads 128 bits, it consumes bandwidth from 4 consecutive banks. For a warp to execute this without conflict, the stride between threads' accesses must effectively skip 128 bits (or align perfectly with the super-bank structure). Compilers modeling this behavior in their cost models must ensure that vectorization does not reintroduce conflicts that outweigh the benefit of issuing fewer instructions.

For instance, MLIR's vector distribution passes often include rewrites that check whether the innermost dimension of a tensor in shared memory is contiguous and whether its size is a multiple of the vector width. If these conditions are met, the compiler emits vector loads; otherwise, it falls back to scalar loads to avoid unaligned bank access penalties.
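To sketch those checks in kernel form (this assumes nothing about MLIR's actual rewrites; the kernel name, `COLS`, and the 32-thread launch configuration are illustrative), the snippet below takes the 128-bit `float4` path only when the contiguous innermost dimension is a multiple of the vector width, and otherwise falls back to scalar loads.

```cuda
constexpr int COLS = 128;   // innermost (contiguous) dimension of the staged tile

// Illustrative kernel, launched with 32 threads per block and one block per row:
// stage one row in shared memory, then copy it out with 128-bit float4 accesses
// when the row length divides evenly by the vector width.
__global__ void export_row(float* __restrict__ out, const float* __restrict__ in) {
    // alignas(16) makes the float4 reinterpret_cast below legal.
    __shared__ alignas(16) float row_buf[COLS];

    int row  = blockIdx.x;
    int lane = threadIdx.x;                      // 0..31

    // Stage the row with unit-stride (conflict-free) scalar stores.
    for (int c = lane; c < COLS; c += 32)
        row_buf[c] = in[row * COLS + c];
    __syncthreads();

    if (COLS % 4 == 0) {
        // Vector path: each access reads 128 bits of shared memory (LDS.128);
        // consecutive lanes read adjacent 16-byte chunks of the same row.
        const float4* src = reinterpret_cast<const float4*>(row_buf);
        // out from cudaMalloc is 256-byte aligned; COLS % 4 == 0 keeps each row 16-byte aligned.
        float4* dst = reinterpret_cast<float4*>(out + row * COLS);
        for (int c = lane; c < COLS / 4; c += 32)
            dst[c] = src[c];
    } else {
        // Scalar fallback when the innermost dimension is not a multiple of
        // the vector width, avoiding misaligned vector accesses.
        for (int c = lane; c < COLS; c += 32)
            out[row * COLS + c] = row_buf[c];
    }
}
```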
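Finally, here is the sketch promised in the swizzling discussion: the same transpose as the padded version earlier, but with a dense $32 \times 32$ tile and the $\text{col} \oplus \text{row}$ mapping applied on both the store and the load. The helper and kernel names are illustrative, and this is a sketch of the technique rather than the exact mapping Triton or TVM emit.

```cuda
#define TILE 32

// XOR swizzle from the formula above: col_phys = col_logical XOR row_logical.
// The tile stays a dense 32 x 32 allocation (no padding column).
__device__ __forceinline__ int swizzle(int row, int col) {
    return col ^ row;
}

__global__ void transpose_swizzled(float* __restrict__ out,
                                   const float* __restrict__ in,
                                   int width, int height) {
    __shared__ float tile[TILE][TILE];   // dense: no extra column needed

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][swizzle(threadIdx.y, threadIdx.x)] = in[y * width + x];

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        // Logical column read: the row index varies across the warp, but the
        // swizzled physical column (threadIdx.y ^ threadIdx.x) differs for
        // every lane, so the 32 accesses land in 32 distinct banks.
        out[ty * height + tx] = tile[threadIdx.x][swizzle(threadIdx.x, threadIdx.y)];
}
```

Compared with padding, the allocation stays at exactly $32 \times 32$ floats, which matters when shared memory capacity is what limits occupancy.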