Memory is physically one-dimensional. While deep learning frameworks present tensors as multi-dimensional objects with shapes like $(N, C, H, W)$, the hardware sees a flat sequence of bytes. The mapping between the logical tensor indices and the physical memory addresses determines the memory layout. A compiler's choice of memory layout is one of the most significant decisions in graph-level optimization, as it directly dictates data locality, cache utilization, and the ability to use specialized hardware instructions.
A mismatch between the tensor layout and the hardware intrinsic requirements results in scattered memory access patterns. This leads to high cache miss rates and prevents the use of high-throughput hardware paths such as Tensor Cores or AVX-512 vector instructions. Layout transformation is the compiler pass responsible for rewriting the graph to ensure that data is stored in memory in the order that the compute engine expects to consume it.
The efficiency of an operation depends on the stride of its memory access. Consider a standard 2D convolution. The operation involves a dot product between the input channels and the filter weights.
If we store data in the NCHW format (Batch, Channel, Height, Width), the inner dimension is Width. Values that are adjacent spatially in the image (width-wise) are adjacent in memory. However, values at the same spatial position across different channels are separated by $H \times W$ elements.
Contrast this with the NHWC format (Batch, Height, Width, Channel). Here, the inner dimension is the Channel. Values corresponding to the same pixel across all channels are stored contiguously. Since modern convolution implementations often reduce over the channel dimension (accumulating inputs across depth), NHWC allows the hardware to load a dense vector of channel data in a single transaction. This significantly improves memory coalescing on GPUs and enables vectorization on CPUs.
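One quick way to see this difference is to inspect the byte strides of the same tensor stored in each order. The NumPy sketch below uses an illustrative $1 \times 64 \times 56 \times 56$ float32 tensor:

```python
import numpy as np

N, C, H, W = 1, 64, 56, 56

# NCHW: channels are the second axis, so two neighboring channels of the same
# pixel sit H*W*4 bytes apart (float32).
x_nchw = np.zeros((N, C, H, W), dtype=np.float32)
print(x_nchw.strides)   # (802816, 12544, 224, 4): channel stride = 56*56*4 bytes

# NHWC: channels are innermost, so all 64 channel values of a pixel are
# contiguous and can be loaded as one dense vector.
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))
print(x_nhwc.strides)   # (802816, 14336, 256, 4): channel stride = 4 bytes
```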
We can define the memory address of a 4D tensor element $(n, c, h, w)$ in the NCHW layout as:

$$\text{addr}(n, c, h, w) = n \cdot (C \cdot H \cdot W) + c \cdot (H \cdot W) + h \cdot W + w$$
For NHWC, the mapping changes to:

$$\text{addr}(n, h, w, c) = n \cdot (H \cdot W \cdot C) + h \cdot (W \cdot C) + w \cdot C + c$$
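Both mappings reduce to simple index arithmetic. The following is a minimal sketch in plain Python; the function names are ours, not part of any framework:

```python
def addr_nchw(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in a row-major NCHW tensor."""
    return ((n * C + c) * H + h) * W + w

def addr_nhwc(n, c, h, w, C, H, W):
    """Flat offset of the same logical element when stored as NHWC."""
    return ((n * H + h) * W + w) * C + c

# Two channel-neighbors of one pixel: H*W apart in NCHW, adjacent in NHWC.
C, H, W = 64, 56, 56
print(addr_nchw(0, 1, 3, 5, C, H, W) - addr_nchw(0, 0, 3, 5, C, H, W))  # 3136 (= 56*56)
print(addr_nhwc(0, 1, 3, 5, C, H, W) - addr_nhwc(0, 0, 3, 5, C, H, W))  # 1
```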
When the compiler lowers a graph to a specific target, it queries the backend for the preferred layout. For NVIDIA GPUs using Tensor Cores (via cuDNN or Cutlass), the preferred layout is almost exclusively NHWC (often referred to as channels-last). For CPUs relying on SIMD instructions, the optimal layout is often a blocked format such as NCHWc, which creates chunks of data sized to fit exactly into vector registers.
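Conceptually, the backend query is a lookup keyed on the target's capabilities. The mapping below is a simplified, hypothetical sketch; real compilers derive this from their kernel libraries and tuning infrastructure rather than a flat table:

```python
# Hypothetical mapping from target features to the layout the backend prefers
# for convolution.
PREFERRED_CONV_LAYOUT = {
    "cuda_tensor_cores": "NHWC",     # channels-last for cuDNN / Cutlass kernels
    "x86_avx512": "NCHW16c",         # 16 fp32 channels fill a 512-bit register
    "arm_neon": "NCHW4c",            # 4 fp32 channels fill a 128-bit register
}

def preferred_layout(target_feature: str) -> str:
    # Fall back to the framework-default NCHW when nothing better is known.
    return PREFERRED_CONV_LAYOUT.get(target_feature, "NCHW")

print(preferred_layout("cuda_tensor_cores"))   # NHWC
```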
For CPU targets, standard layouts like NCHW or NHWC are often insufficient for maximizing arithmetic intensity. To fully utilize SIMD units (such as AVX-512 or ARM Neon), the compiler often employs layout blocking (also known as tiling or packing).
Blocking involves splitting a dimension into an outer dimension and an inner dimension of a fixed size $c$. For example, we can transform the Channel dimension $C$ into $C/c$ outer blocks of $c$ channels each. This changes a 4D NCHW tensor into a 5D NCHWc tensor, where the lowercase $c$ represents a block of channels (e.g., 8 or 16) stored contiguously.
If we choose a block size of 16, the layout becomes NCHW16c. The memory address calculation becomes:

$$\text{addr}(n, c_o, h, w, c_i) = n \cdot (C \cdot H \cdot W) + c_o \cdot (H \cdot W \cdot 16) + h \cdot (W \cdot 16) + w \cdot 16 + c_i$$

where $c_o = \lfloor c / 16 \rfloor$ and $c_i = c \bmod 16$.
This structure ensures that when the CPU processes a spatial pixel, it loads exactly 16 channels into a 512-bit vector register in a single instruction. This eliminates the need for gather/scatter instructions and ensures the data is perfectly aligned for the FMA (Fused Multiply-Add) units.
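The packing itself is just a reshape followed by a transpose. Here is a NumPy sketch (assuming, for simplicity, that the channel count is a multiple of the block size; real compilers pad when it is not):

```python
import numpy as np

def pack_nchw_to_nchw16c(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Rearrange an NCHW tensor into the blocked NCHW{block}c layout."""
    n, c, h, w = x.shape
    assert c % block == 0, "channel padding would be needed otherwise"
    # Split C into (C_outer, block), then move the block to the innermost axis
    # so each group of `block` channels is contiguous in memory.
    x = x.reshape(n, c // block, block, h, w)
    return np.ascontiguousarray(x.transpose(0, 1, 3, 4, 2))

x = np.random.rand(1, 64, 56, 56).astype(np.float32)
packed = pack_nchw_to_nchw16c(x)
print(packed.shape)    # (1, 4, 56, 56, 16)
print(packed.strides)  # innermost stride is 4 bytes: one 16-channel block spans a 64-byte cache line
```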
The following chart illustrates the performance implications of layout choices on memory throughput for a generic convolution operation.
Comparison of effective memory bandwidth utilization for different tensor layouts. Blocked and channels-last layouts significantly outperform the planar NCHW format on modern dense architectures by minimizing cache line evictions.
Changing the layout of a single operator is trivial; handling the implications for the entire graph is complex. If a convolution is converted from NCHW to NHWC, its input must be transposed. If the subsequent operation (e.g., a Batch Normalization or ReLU) expects NCHW, the output must be transposed back.
Inserting explicit Transpose operations before and after every node defeats the purpose of optimization. Memory bandwidth is expensive, and transposing large tensors is a memory-bound operation that consumes significant time and energy. To solve this, compilers implement a pass called Layout Propagation.
In this pass, the compiler assigns a preferred layout to specific "anchor" operators (usually Convolutions or Matrix Multiplications) based on the hardware target. It then traverses the graph to propagate this layout requirement to neighboring operators. Element-wise operations like ReLU, Add, or Sigmoid are layout-agnostic; they can operate on NHWC data just as easily as NCHW data without changing their mathematical definition.
The compiler pushes the layout transformation through these agnostic nodes until it hits a boundary where the layout must change (e.g., a Reshape op or an external output). This allows the compiler to transform entire subgraphs to the target layout, reducing the number of necessary transposes to the absolute minimum at the graph boundaries.
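The core of the pass can be sketched as a single forward walk over a topologically ordered graph: anchor operators pin a layout, layout-agnostic operators inherit the layout of their producer, and a transpose is recorded only where the requirement changes. The toy IR below (a plain list of operator names forming a chain) and the helper names are hypothetical, not any particular compiler's API:

```python
# Hypothetical toy IR: a topologically ordered chain of operator names.
LAYOUT_AGNOSTIC = {"relu", "add", "sigmoid", "bias_add"}
ANCHOR_LAYOUT = {"conv2d": "NHWC"}           # target-preferred layouts for anchor ops

def propagate_layouts(ops, graph_layout="NCHW"):
    """Assign a layout to every op and record where transposes must be inserted."""
    layouts, transposes = {}, []
    current = graph_layout
    for op in ops:
        if op in ANCHOR_LAYOUT:              # anchor: pin the hardware-preferred layout
            wanted = ANCHOR_LAYOUT[op]
            if wanted != current:
                transposes.append((current, wanted, f"before {op}"))
                current = wanted
        elif op not in LAYOUT_AGNOSTIC:      # layout-sensitive boundary (e.g. reshape)
            if current != graph_layout:
                transposes.append((current, graph_layout, f"before {op}"))
                current = graph_layout
        layouts[op] = current                # agnostic ops simply inherit `current`
    return layouts, transposes

layouts, transposes = propagate_layouts(["conv2d", "bias_add", "relu", "reshape"])
print(layouts)     # {'conv2d': 'NHWC', 'bias_add': 'NHWC', 'relu': 'NHWC', 'reshape': 'NCHW'}
print(transposes)  # [('NCHW', 'NHWC', 'before conv2d'), ('NHWC', 'NCHW', 'before reshape')]
```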
The diagram below depicts the transformation of a subgraph during the layout propagation pass.
Visualization of Layout Propagation. The compiler identifies that the Conv2D requires NHWC. Instead of wrapping only the Conv2D in transposes, the compiler propagates the NHWC layout through the layout-agnostic ReLU, moving the expensive back-transformation to the end of the sequence.
Implementing layout transformation requires strict handling of the operator attributes. When the layout changes, the compiler must update not only the input tensors but also the static attributes of the operator.
Convolutions carry attributes such as stride, padding, or dilation that are defined along specific axes. If the layout rotates, these attributes must be re-indexed. For example, a stride of [2, 2] usually applies to the spatial dimensions H and W. If the layout changes such that the spatial dimensions move from indices 2 and 3 to indices 1 and 2, the stride attribute vector must be permuted to match, as the sketch at the end of this section shows.

The goal is to ensure that the semantic meaning of the operation remains identical while the physical execution adapts to the hardware's strengths. By automating this process, the compiler decouples the model definition from the hardware implementation, allowing the same high-level Keras or PyTorch code to run efficiently on both a mobile CPU (using a blocked NCHWc layout) and a data center GPU (using NHWC).
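As referenced above, a minimal sketch of this attribute re-indexing for the NCHW to NHWC case might look as follows; the permutation table and helper are illustrative, not a specific framework's API:

```python
# Permutation taking NCHW axis positions to NHWC positions:
# the value at NCHW index i moves to NHWC index NCHW_TO_NHWC[i].
NCHW_TO_NHWC = [0, 3, 1, 2]    # N stays at 0, C moves to 3, H to 1, W to 2

def permute_axis_attr(values, perm=NCHW_TO_NHWC):
    """Reorder a per-axis attribute vector (one entry per tensor axis)."""
    out = [None] * len(values)
    for src, dst in enumerate(perm):
        out[dst] = values[src]
    return out

# A stride of 2 over H and W, expressed per-axis in NCHW order:
print(permute_axis_attr([1, 1, 2, 2]))   # [1, 2, 2, 1], now in N, H, W, C order

# Scalar axis attributes (e.g. the channel axis of a BatchNorm) are remapped directly:
print(NCHW_TO_NHWC[1])                   # channel axis 1 in NCHW becomes 3 in NHWC
```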