As we shift focus from optimizing the high-level computation graph (Chapter 3) to the individual tensor operations, we encounter the core computational kernels that often dominate execution time in ML workloads. Operations specified mathematically, such as matrix multiplication or convolution, need a concrete implementation for execution. Very often, this implementation takes the form of nested loops iterating over the dimensions of the input and output tensors. Understanding how to represent these fundamental operations as loop nests is the first step towards applying powerful low-level compiler optimizations.
Consider the standard matrix multiplication operation $C = A \times B$, where $A$ is an $M \times K$ matrix, $B$ is a $K \times N$ matrix, and the resulting matrix $C$ is $M \times N$. Each element $C_{ij}$ of the output matrix is computed as the dot product of the $i$-th row of $A$ and the $j$-th column of $B$:

$$C_{ij} = \sum_{k=0}^{K-1} A_{ik} B_{kj}$$

This mathematical definition translates directly into a triply nested loop structure. Assuming standard row-major storage for the matrices in a C-like language, the computation can be expressed as:
// Assuming C is initialized to zero
// A is M x K, B is K x N, C is M x N
for (int i = 0; i < M; ++i) {             // Loop over rows of C (and A)
    for (int j = 0; j < N; ++j) {         // Loop over columns of C (and B)
        for (int k = 0; k < K; ++k) {     // Loop over the reduction dimension
            C[i * N + j] += A[i * K + k] * B[k * N + j];
        }
    }
}
Here:

- The outermost loop iterates over the rows of $C$ (index $i$).
- The middle loop iterates over the columns of $C$ (index $j$).
- The innermost loop iterates over the reduction dimension (index $k$), performing the multiply-accumulate operations.

The order of these loops ($i$, $j$, $k$) is just one possibility. Other permutations ($i$, $k$, $j$; $j$, $i$, $k$; etc.) compute the same result but can have vastly different performance characteristics due to memory access patterns and cache behavior. This loop nest representation, capturing the iteration space ($0 \le i < M$, $0 \le j < N$, $0 \le k < K$) and the memory accesses within the loop body, becomes the primary target for the optimization techniques discussed later in this chapter, such as polyhedral modeling.
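To make the effect of loop order concrete, here is a sketch of the same computation in the ($i$, $k$, $j$) order, assuming float elements and the same row-major, flattened indexing as above. With this permutation, the innermost loop walks both $B$ and $C$ with unit stride, which is typically far friendlier to caches than the column-wise accesses to $B$ in the ($i$, $j$, $k$) version:

// Same result as before, computed in (i, k, j) order.
// Assumes float elements and row-major storage, as above.
for (int i = 0; i < M; ++i) {
    for (int k = 0; k < K; ++k) {
        float a_ik = A[i * K + k];        // invariant across the inner loop
        for (int j = 0; j < N; ++j) {     // unit stride over both B and C
            C[i * N + j] += a_ik * B[k * N + j];
        }
    }
}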
Convolution operations, central to convolutional neural networks (CNNs), also translate naturally into complex loop nests. A typical 2D convolution applies a filter (kernel) across an input feature map to produce an output feature map. Consider a simplified 2D convolution without batching or multiple channels for illustration:
Input $I$ of size $H \times W$, kernel $K$ of size $R \times S$, output $O$ of size $(H - R + 1) \times (W - S + 1)$.
The computation for an output element $O_{xy}$ involves a sum over the kernel dimensions:

$$O_{xy} = \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} I_{x+r,\,y+s} \, K_{rs}$$
This translates into a structure with at least four nested loops:
// Simplified 2D Convolution (Output O initialized to zero)
// Input I: H x W, Kernel K: R x S, Output O: P x Q where P = H-R+1, Q = W-S+1
for (int p = 0; p < P; ++p) {               // Output height
    for (int q = 0; q < Q; ++q) {           // Output width
        for (int r = 0; r < R; ++r) {       // Kernel height
            for (int s = 0; s < S; ++s) {   // Kernel width
                O[p * Q + q] += I[(p + r) * W + (q + s)] * K[r * S + s];
            }
        }
    }
}
Real-world convolutions in ML models add loops for batch size ($N$) and input/output channels ($C_{in}$, $C_{out}$), resulting in six or even seven nested loops. The exact indexing expressions also depend significantly on the data layout used (e.g., NCHW vs. NHWC). NCHW (Batch, Channels, Height, Width) is common in frameworks like PyTorch, while NHWC is often preferred by TensorFlow and hardware accelerators for performance reasons. The choice of layout directly impacts the indices calculated within the innermost loops and affects memory access patterns.
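As a sketch of what the full nest looks like, the version below adds batch and channel loops, assuming NCHW layout and the flattened indexing convention used above; the shapes and loop names here are illustrative, not taken from any particular framework:

// Sketch: direct convolution with batch and channels, NCHW layout.
// Illustrative shapes: I is N x Cin x H x W, K is Cout x Cin x R x S,
// O is N x Cout x P x Q, with P = H-R+1 and Q = W-S+1 as before.
for (int n = 0; n < N; ++n) {                         // batch
    for (int co = 0; co < Cout; ++co) {               // output channels
        for (int p = 0; p < P; ++p) {                 // output height
            for (int q = 0; q < Q; ++q) {             // output width
                for (int ci = 0; ci < Cin; ++ci) {    // input channels
                    for (int r = 0; r < R; ++r) {     // kernel height
                        for (int s = 0; s < S; ++s) { // kernel width
                            O[((n * Cout + co) * P + p) * Q + q] +=
                                I[((n * Cin + ci) * H + (p + r)) * W + (q + s)] *
                                K[((co * Cin + ci) * R + r) * S + s];
                        }
                    }
                }
            }
        }
    }
}

Switching to NHWC leaves the loop structure unchanged but alters every index expression; for instance, the input access would become I[((n * H + (p + r)) * W + (q + s)) * Cin + ci], which is precisely why the layout choice changes the stride pattern of the innermost loops.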
Representing tensor computations as these explicit loop nests provides a concrete structure that compilers can analyze and manipulate. The loops define the iteration space, and the array accesses within the loop body (C[i * N + j], A[i * K + k], etc.) define the data dependencies and memory access patterns.
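As a preview of the formal view developed in the next sections, the matrix multiplication nest above can be summarized by an iteration domain and affine access functions (the notation here is one common convention, not the only one):

$$\mathcal{D} = \{ (i, j, k) \in \mathbb{Z}^3 \mid 0 \le i < M,\; 0 \le j < N,\; 0 \le k < K \}$$

with access functions $f_C(i, j, k) = (i, j)$, $f_A(i, j, k) = (i, k)$, and $f_B(i, j, k) = (k, j)$. Any two iterations that agree on $(i, j)$ but differ in $k$ touch the same element of $C$, so they are related by a dependence, and it is exactly these dependences that constrain which loop reorderings are legal.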
Figure: data dependencies for computing the first column ($j = 0$) of a matrix multiplication with $M = 2$, $K = 2$, $N = 1$. Note the reuse of input elements from matrix $B$.
This explicit, structured representation is the essential starting point for techniques like polyhedral modeling, loop tiling, fusion, and vectorization. These methods analyze the dependencies and iteration spaces defined by the loop nests to transform them into equivalent, but significantly more efficient, sequences of operations optimized for specific hardware targets. The next sections will examine how polyhedral modeling provides a formal framework for these powerful transformations.