Memory is physically one-dimensional. While deep learning frameworks present tensors as multi-dimensional objects with shapes like $(N, C, H, W)$, the hardware sees a flat sequence of bytes. The mapping between the logical tensor indices and the physical memory addresses determines the memory layout. A compiler's choice of memory layout is one of the most significant decisions in graph-level optimization, as it directly dictates data locality, cache utilization, and the ability to use specialized hardware instructions.

A mismatch between the tensor layout and the hardware intrinsic requirements results in scattered memory access patterns. This leads to high cache miss rates and prevents the use of high-throughput instructions like Tensor Cores or AVX-512. Layout transformation is the compiler pass responsible for rewriting the graph to ensure that data is stored in memory in the order that the compute engine expects to consume it.

## The Impact of Strides and Locality

The efficiency of an operation depends on the stride of its memory access. Consider a standard 2D convolution. The operation involves a dot product between the input channels and the filter weights.

If we store data in the $NCHW$ format (Batch, Channel, Height, Width), the inner dimension is Width. Values that are adjacent spatially in the image (width-wise) are adjacent in memory. However, values at the same spatial position across different channels are separated by $H \times W$ elements.

Contrast this with the $NHWC$ format (Batch, Height, Width, Channel). Here, the inner dimension is the Channel. Values corresponding to the same pixel across all channels are stored contiguously. Since modern convolution implementations often reduce over the channel dimension (accumulating inputs across depth), $NHWC$ allows the hardware to load a dense vector of channel data in a single transaction. This significantly improves memory coalescing on GPUs and enables vectorization on CPUs.

We can define the memory address $A$ for a 4D tensor index $(n, c, h, w)$ in $NCHW$ layout as:

$$A_{NCHW}(n, c, h, w) = n \cdot (C H W) + c \cdot (H W) + h \cdot W + w$$

For $NHWC$, the mapping changes to:

$$A_{NHWC}(n, c, h, w) = n \cdot (H W C) + h \cdot (W C) + w \cdot C + c$$

When the compiler lowers a graph to a specific target, it queries the backend for the preferred layout. For NVIDIA GPUs using Tensor Cores (via cuDNN or CUTLASS), the preferred layout is almost exclusively $NHWC$ (often referred to as channels-last). For CPUs relying on SIMD instructions, the optimal layout is often a blocked format that creates chunks of data sized to fit exactly into vector registers.
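The two address formulas can be checked directly. The sketch below is a minimal NumPy illustration (the shape is arbitrary): it computes the flat offset of an element under both layouts and shows that stepping across channels jumps $H \times W$ elements in $NCHW$ but only one element in $NHWC$, which matches the strides NumPy itself reports.

```python
import numpy as np

# Arbitrary example shape: batch, channels, height, width.
N, C, H, W = 1, 64, 56, 56

def offset_nchw(n, c, h, w):
    # A_NCHW(n, c, h, w) = n*(C*H*W) + c*(H*W) + h*W + w
    return n * (C * H * W) + c * (H * W) + h * W + w

def offset_nhwc(n, c, h, w):
    # A_NHWC(n, c, h, w) = n*(H*W*C) + h*(W*C) + w*C + c
    return n * (H * W * C) + h * (W * C) + w * C + c

# Distance (in elements) between channel c and channel c+1 at a fixed pixel.
n, c, h, w = 0, 0, 10, 10
print(offset_nchw(n, c + 1, h, w) - offset_nchw(n, c, h, w))  # H*W = 3136
print(offset_nhwc(n, c + 1, h, w) - offset_nhwc(n, c, h, w))  # 1 (contiguous)

# The same layout difference shows up in ndarray strides (reported in bytes).
x_nchw = np.zeros((N, C, H, W), dtype=np.float32)
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))
print(x_nchw.strides)  # (802816, 12544, 224, 4): channel stride = H*W*4 bytes
print(x_nhwc.strides)  # (802816, 14336, 256, 4): channel stride = 4 bytes
```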
## Blocked Layouts and Vectorization

For CPU targets, standard layouts like $NCHW$ or $NHWC$ are often insufficient for maximizing arithmetic intensity. To fully utilize SIMD units (such as AVX-512 or ARM Neon), the compiler often employs layout blocking (also known as tiling or packing).

Blocking involves splitting a dimension into an outer dimension and an inner dimension of a fixed size $k$. For example, we can transform the Channel dimension $C$ into $C_{out} \times k$. This changes a 4D tensor $NCHW$ into a 5D tensor $NCHWc$, where the lowercase $c$ represents a block of channels (e.g., 8 or 16) stored contiguously.

If we choose a block size of 16, the layout becomes $NCHW16c$. The memory address calculation becomes:

$$A_{NCHW16c}(n, c_{out}, h, w, c_{in}) = n \cdot (C_{out} H W \cdot 16) + c_{out} \cdot (H W \cdot 16) + h \cdot (W \cdot 16) + w \cdot 16 + c_{in}$$

This structure ensures that when the CPU processes a spatial pixel, it loads exactly 16 channels into a 512-bit vector register in a single instruction. This eliminates the need for gather/scatter instructions and ensures the data is perfectly aligned for the FMA (Fused Multiply-Add) units.

The following chart illustrates the performance implications of layout choices on memory throughput for a generic convolution operation.

[Figure: bar chart "Memory Throughput Efficiency by Layout Strategy" showing effective bandwidth of roughly 450 GB/s for NCHW (strided access overhead), 780 GB/s for NHWC (coalesced access), and 820 GB/s for NCHW16c (vector aligned).]

Comparison of effective memory bandwidth utilization for different tensor layouts. Blocked and channels-last layouts significantly outperform the planar NCHW format on modern dense architectures by minimizing cache line evictions.
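To make the blocked format concrete, the following sketch packs an $NCHW$ tensor into $NCHW16c$ with plain NumPy. It is a minimal illustration with arbitrary shapes, assuming $C$ is divisible by the block size; it verifies that the 16 channels of one block at a given pixel end up as a single contiguous, unit-stride run.

```python
import numpy as np

BLOCK = 16  # inner channel block size: 16 x float32 = one 512-bit register

def pack_nchw_to_nchw16c(x):
    """Reorder an NCHW tensor into NCHW16c (assumes C % BLOCK == 0)."""
    n, c, h, w = x.shape
    assert c % BLOCK == 0, "channel count must be divisible by the block size"
    # (N, C, H, W) -> (N, C_out, BLOCK, H, W) -> (N, C_out, H, W, BLOCK)
    x = x.reshape(n, c // BLOCK, BLOCK, h, w)
    return np.ascontiguousarray(x.transpose(0, 1, 3, 4, 2))

x = np.arange(1 * 64 * 8 * 8, dtype=np.float32).reshape(1, 64, 8, 8)
packed = pack_nchw_to_nchw16c(x)
print(packed.shape)  # (1, 4, 8, 8, 16)

# The 16 channels of block 0 at pixel (h=3, w=5) are now one contiguous run...
block = packed[0, 0, 3, 5, :]
# ...whereas in NCHW the same 16 values were H*W = 64 elements apart.
assert np.array_equal(block, x[0, 0:BLOCK, 3, 5])
print(packed.strides[-1])  # 4 bytes: unit stride over the inner channel block
```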
## Layout Propagation

Changing the layout of a single operator is trivial; handling the implications for the entire graph is complex. If a convolution is converted from $NCHW$ to $NHWC$, its input must be transposed. If the subsequent operation (e.g., a Batch Normalization or ReLU) expects $NCHW$, the output must be transposed back.

Inserting explicit Transpose operations before and after every node defeats the purpose of optimization. Memory bandwidth is expensive, and transposing large tensors is a memory-bound operation that consumes significant time and energy. To solve this, compilers implement a pass called Layout Propagation.

In this pass, the compiler assigns a preferred layout to specific "anchor" operators (usually Convolutions or Matrix Multiplications) based on the hardware target. It then traverses the graph to propagate this layout requirement to neighboring operators. Element-wise operations like ReLU, Add, or Sigmoid are layout-agnostic; they can operate on $NHWC$ data just as easily as $NCHW$ data without changing their mathematical definition.

The compiler pushes the layout transformation through these agnostic nodes until it hits a boundary where the layout must change (e.g., a Reshape op or an external output). This allows the compiler to transform entire subgraphs to the target layout, reducing the number of necessary transposes to the absolute minimum at the graph boundaries.

The diagram below depicts the transformation of a subgraph during the layout propagation pass.

[Figure: two subgraphs side by side. Original graph (NCHW): Input (NCHW) → Conv2D (NCHW) → ReLU. Optimized graph (NHWC): Input (NCHW) → Layout Transform NCHW→NHWC → Conv2D (NHWC) → ReLU (in-place, layout propagated) → Layout Transform NHWC→NCHW.]

Visualization of Layout Propagation. The compiler identifies that the Conv2D requires NHWC. Instead of wrapping only the Conv2D in transposes, the compiler propagates the NHWC layout through the layout-agnostic ReLU, moving the expensive back-transformation to the end of the sequence.

## Implementation Constraints

Implementing layout transformation requires strict handling of operator attributes. When the layout changes, the compiler must update not only the input tensors but also the static attributes of the operator.

- **Weight Transformation:** The weights of a convolution layer are constant tensors. When switching from $NCHW$ to $NHWC$, the weights must be permanently permuted into the matching layout at compile time (offline). If the compiler fails to do this offline, the runtime will incur a heavy penalty transposing weights on every inference pass.
- **Attribute Mapping:** Operations often have attributes like stride, padding, or dilation defined along specific axes. If the layout rotates, these attributes must be re-indexed. For example, a stride of [2, 2] usually applies to $(H, W)$. If the layout changes such that the spatial dimensions move from indices 2 and 3 to indices 1 and 2, the stride attribute vector must be permuted to match. A sketch of both rewrites follows at the end of this section.

The goal is to ensure that the semantic meaning of the operation remains identical while the physical execution adapts to the hardware's strengths. By automating this process, the compiler decouples the model definition from the hardware implementation, allowing the same high-level Keras or PyTorch code to run efficiently on both a mobile CPU (using $NCHW4c$) and a data center GPU (using $NHWC$).
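As a closing illustration of the weight transformation and attribute mapping described above, the sketch below rewrites a Conv2D node from NCHW to NHWC conventions at compile time. It is a minimal sketch, not the API of any particular compiler: the OIHW/OHWI weight orders, the `attrs` dictionary, and the per-axis 4-element stride convention (as used by TensorFlow-style attributes) are illustrative assumptions.

```python
import numpy as np

# Axis permutation taking NCHW to NHWC: source axis i moves to position PERM[i].
PERM = (0, 2, 3, 1)  # N -> 0, C -> 3, H -> 1, W -> 2

def convert_conv_to_nhwc(weights_oihw, attrs):
    """Offline rewrite of a hypothetical Conv2D node from NCHW to NHWC.

    `weights_oihw` and `attrs` stand in for the constant tensor and static
    attributes a compiler IR would carry; they are not a real library API.
    """
    # Weight transformation: permute the constant tensor once at compile time,
    # (O, I, H, W) -> (O, H, W, I), so the channel reduction axis becomes the
    # innermost, contiguous dimension. No transpose remains for the runtime.
    weights_ohwi = np.ascontiguousarray(weights_oihw.transpose(0, 2, 3, 1))

    # Attribute mapping: attributes defined per data-layout axis (one entry
    # per axis) must follow their axes to the new positions.
    new_attrs = dict(attrs)
    for key in ("strides", "dilations"):
        if key in attrs and len(attrs[key]) == 4:
            permuted = [None] * 4
            for src_axis, dst_axis in enumerate(PERM):
                permuted[dst_axis] = attrs[key][src_axis]
            new_attrs[key] = permuted
    new_attrs["data_layout"] = "NHWC"
    new_attrs["kernel_layout"] = "OHWI"
    return weights_ohwi, new_attrs

# Example: 64 filters over 32 input channels, 3x3 kernel, stride 2 over (H, W).
w = np.zeros((64, 32, 3, 3), dtype=np.float32)
attrs = {"strides": [1, 1, 2, 2], "dilations": [1, 1, 1, 1],
         "data_layout": "NCHW", "kernel_layout": "OIHW"}
w2, attrs2 = convert_conv_to_nhwc(w, attrs)
print(w2.shape)           # (64, 3, 3, 32)
print(attrs2["strides"])  # [1, 2, 2, 1]: each stride entry followed its axis
```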