Processing high-dimensional tensors on hardware requires mapping logical dimensions to a linear physical memory address space. While frameworks often default to specific layouts for user convenience or historical reasons, the chosen memory layout significantly influences the performance of convolution and matrix multiplication kernels. A compiler optimizes execution by rewriting the graph to use memory layouts that align with the hardware's memory hierarchy and vector instruction sets.

## The Disconnect Between Logical and Physical Layouts

A 4D tensor representing a batch of images typically has dimensions denoted as $N$ (Batch), $C$ (Channel), $H$ (Height), and $W$ (Width). Mathematically, accessing an element at $(n, c, h, w)$ is an abstract, layout-independent index operation. Physically, that element resides at a specific offset in a 1D array of RAM.

The stride configuration determines this offset. Two dominant layouts exist in deep learning:

- NCHW (Channel-First): Data is stored planar-style. All pixels for the first channel are stored contiguously, followed by all pixels for the second channel. This is the default in PyTorch.
- NHWC (Channel-Last): Data is stored interleaved. The values for all channels of a specific pixel are stored together. This is the default in TensorFlow and many backend inference engines.

The address calculation differs for each. For a tensor with shape $(N, C, H, W)$, the linear address of an element is:

NCHW addressing:
$$ \text{Offset} = n \cdot (C \cdot H \cdot W) + c \cdot (H \cdot W) + h \cdot W + w $$

NHWC addressing:
$$ \text{Offset} = n \cdot (H \cdot W \cdot C) + h \cdot (W \cdot C) + w \cdot C + c $$

The choice of layout dictates the distance in memory between elements that a calculation needs together. If a convolution kernel sums across all channels at a specific pixel location, NHWC places those values adjacent in memory. In NCHW, those values are separated by a stride of $H \times W$ elements, potentially causing cache thrashing.

## Visualization of Memory Strides

To understand why compilers switch layouts, consider how the data sits in linear memory. The following diagram visualizes the memory arrangement of a simplified $1 \times 3 \times 2 \times 2$ tensor (1 image, 3 channels, $2 \times 2$ spatial resolution).

```dot
digraph G {
  rankdir=TB;
  node [shape=record, style=filled, fontname="Helvetica", fontsize=10, color="#dee2e6"];
  edge [fontname="Helvetica", fontsize=9, color="#868e96"];

  subgraph cluster_0 {
    label = "NCHW Layout (Planar)";
    fontname="Helvetica";
    style=dashed;
    color="#adb5bd";
    nchw_mem [label="<f0> R(0,0)|<f1> R(0,1)|<f2> R(1,0)|<f3> R(1,1)|<f4> G(0,0)|<f5> G(0,1)|<f6> G(1,0)|<f7> G(1,1)|<f8> B(0,0)|<f9> B(0,1)|<f10> B(1,0)|<f11> B(1,1)", fillcolor="#eebefa"];
  }

  subgraph cluster_1 {
    label = "NHWC Layout (Interleaved)";
    fontname="Helvetica";
    style=dashed;
    color="#adb5bd";
    nhwc_mem [label="<f0> R(0,0)|<f1> G(0,0)|<f2> B(0,0)|<f3> R(0,1)|<f4> G(0,1)|<f5> B(0,1)|<f6> R(1,0)|<f7> G(1,0)|<f8> B(1,0)|<f9> R(1,1)|<f10> G(1,1)|<f11> B(1,1)", fillcolor="#a5d8ff"];
  }

  caption [label="Note: In NCHW, channels (R,G,B) are separated by spatial dimensions. In NHWC, channels for pixel (0,0) are adjacent.", shape=plaintext, fillcolor=none, color=none];
  nhwc_mem -> caption [style=invis];
}
```

Comparison of linear memory placement for NCHW versus NHWC formats.

In the NCHW example, accessing Red, Green, and Blue for pixel $(0,0)$ involves jumping over all other spatial positions of each channel plane.
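To make the two addressing formulas concrete, here is a minimal sketch in plain Python (written for this discussion, with `offset_nchw` and `offset_nhwc` as illustrative helper names, not taken from any framework) that evaluates them for the $1 \times 3 \times 2 \times 2$ example and prints where the three channel values of pixel $(0,0)$ land:

```python
# Evaluate the two offset formulas for the 1 x 3 x 2 x 2 example tensor above.
N, C, H, W = 1, 3, 2, 2

def offset_nchw(n, c, h, w):
    # Offset = n*(C*H*W) + c*(H*W) + h*W + w
    return n * (C * H * W) + c * (H * W) + h * W + w

def offset_nhwc(n, c, h, w):
    # Offset = n*(H*W*C) + h*(W*C) + w*C + c
    return n * (H * W * C) + h * (W * C) + w * C + c

# Linear addresses of the R, G, B values for pixel (0, 0):
print([offset_nchw(0, c, 0, 0) for c in range(C)])  # [0, 4, 8] -> strided by H*W
print([offset_nhwc(0, c, 0, 0) for c in range(C)])  # [0, 1, 2] -> contiguous
```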
In NHWC, by contrast, R, G, and B for pixel $(0,0)$ are immediate neighbors.

## Hardware Alignment and Vectorization

The primary motivation for layout transformation is hardware efficiency. Modern CPUs and GPUs rely heavily on SIMD (Single Instruction, Multiple Data) instructions and tensor cores to perform arithmetic. These units function most efficiently when loading contiguous blocks of data.

- Spatial Locality: When a convolution slides over an image, it performs a dot product between the weights and the input channels. If the layout is NHWC, the inner loop iterates over $C$. Since these values are contiguous, the CPU can fetch 8 or 16 floats with a single vector load. If the layout is NCHW, the same channel values are scattered across memory, forcing strided or gathered loads that are significantly slower.
- Tensor Cores: Dedicated matrix acceleration units (like NVIDIA Tensor Cores) often require data in specific blocked formats (e.g., matrices divided into small tiles) to perform matrix multiplication. The compiler must ensure the input tensors are shaped correctly before feeding them into these accelerators.

## The Layout Transformation Pass

When an ML compiler analyzes a computation graph, it looks for the convolution and pooling operators that dominate execution time. If the target hardware (e.g., an ARM CPU or an NVIDIA GPU) prefers a specific layout, the compiler initiates a transformation pass.

This process involves more than transposing a tensor. It requires rewriting the operators themselves. If the original graph contains a Conv2D operator configured for NCHW, the compiler replaces it with a Conv2D variant optimized for NHWC.

However, simply swapping the operator is insufficient, because the input data coming from the user or previous layers might still be in the original format. The compiler must insert layout transformation nodes, often called Transpose or Permute, to align the data.

Consider a subgraph where a convolution is followed by a ReLU activation:

- Original: Input (NCHW) $\rightarrow$ Conv2D (NCHW) $\rightarrow$ ReLU $\rightarrow$ Output
- Transformed: Input (NCHW) $\rightarrow$ Transpose(to NHWC) $\rightarrow$ Conv2D (NHWC) $\rightarrow$ ReLU $\rightarrow$ Transpose(to NCHW) $\rightarrow$ Output

The compiler also performs layout propagation. If multiple convolutions appear in sequence, it is inefficient to transpose the data back and forth between every layer. The compiler propagates the preferred layout through element-wise operations (like ReLU, which is layout-agnostic) to minimize the number of transpose operations.

## Blocked Layouts and Packing

Beyond the standard NCHW and NHWC formats, compilers often utilize blocked layouts (also known as tiled or packed layouts) for specific backend targets. These are hybrid formats designed to fit data exactly into the vector registers of a specific CPU architecture (such as AVX-512) or into matrix units.

A common blocked layout is NCHWc, written NCHW{k}c for a block size $k$ (for example, NCHW16c). The channel dimension $C$ is split into an outer dimension of size $C/k$ and an inner block of size $k$, producing a 5D layout $N \times C/k \times H \times W \times k$. For example, if $k=16$, the layout groups 16 channels together. This ensures that the innermost dimension is always a multiple of the vector length, removing the need for complex boundary checks in the innermost loop of the kernel.

Using blocked layouts requires the compiler to completely rewrite the indexing logic of the consumer kernels.
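To illustrate what that repacking looks like, here is a minimal NumPy sketch (a hand-written illustration, not any compiler's actual packing routine; `pack_nchw_to_nchwc` is a hypothetical helper name) that converts an NCHW tensor into the NCHW16c blocked layout:

```python
import numpy as np

def pack_nchw_to_nchwc(x, k=16):
    """Repack an NCHW tensor into the blocked NCHW{k}c layout.

    Sketch only: assumes C is already a multiple of k; real compilers
    typically pad the channel dimension when it is not.
    """
    n, c, h, w = x.shape
    assert c % k == 0, "channel count must be a multiple of the block size"
    # (N, C, H, W) -> (N, C/k, k, H, W) -> (N, C/k, H, W, k)
    x = x.reshape(n, c // k, k, h, w)
    return np.ascontiguousarray(x.transpose(0, 1, 3, 4, 2))

x = np.random.rand(1, 64, 56, 56).astype(np.float32)
packed = pack_nchw_to_nchwc(x, k=16)
print(packed.shape)  # (1, 4, 56, 56, 16): the innermost 16 channels are contiguous
```

With this arrangement, the innermost loop of a kernel always processes exactly 16 contiguous channel values, which for FP32 fills one 512-bit vector register.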
While this rewriting increases the complexity of code generation, the performance gains on CPUs are substantial, often yielding 2x to 3x speedups over standard NCHW execution.

## Evaluating the Cost Model

Layout transformation is not always beneficial. The Transpose operations inserted by the compiler consume memory bandwidth, so the compiler uses a cost model to weigh their overhead against the expected kernel speedup.

If a graph consists of a single lightweight convolution followed by memory-heavy operations, the cost of reordering the data might exceed the speedup gained from the optimized convolution kernel. Modern ML compilers use heuristics or auto-tuning results to decide whether the layout conversion is globally beneficial for the specific model and hardware pair.
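As a rough illustration of how such a decision could be encoded, the sketch below uses an assumed bandwidth figure, invented timing estimates, and a hypothetical helper name (none of which come from the text above) to compare transpose overhead against the kernel-time saving:

```python
def should_convert_layout(time_nchw_ms, time_nhwc_ms, tensor_bytes,
                          bandwidth_gb_s=50.0, num_transposes=2):
    """Toy heuristic: convert layout only if the kernel-time saving
    outweighs the bandwidth cost of the inserted Transpose nodes.
    All inputs are estimates (e.g., from profiling or auto-tuning);
    the default bandwidth is an assumed value, not a measurement."""
    # Each Transpose reads and writes the whole tensor once.
    bytes_per_ms = bandwidth_gb_s * 1e6          # GB/s -> bytes per millisecond
    transpose_ms = num_transposes * 2 * tensor_bytes / bytes_per_ms
    return (time_nchw_ms - time_nhwc_ms) > transpose_ms

feature_map_bytes = 4 * 1 * 64 * 56 * 56         # FP32, N=1, C=64, H=W=56

# A heavy convolution easily amortizes two layout conversions:
print(should_convert_layout(12.0, 5.0, feature_map_bytes))    # True
# A lightweight one may not:
print(should_convert_layout(0.35, 0.32, feature_map_bytes))   # False
```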