Optimizing memory access patterns through tiling and vectorization ensures that individual processor cores operate efficiently. However, modern hardware achieves high performance primarily through massive concurrency. A single CPU may possess dozens of cores, while a GPU contains thousands of CUDA cores or streaming processors. To utilize these resources, an ML compiler must map the distinct iterations of a loop nest to parallel execution units.

The process of parallelization transforms a sequential loop into a structure where multiple iterations execute simultaneously. This requires the compiler to identify which loops carry dependencies and which are independent. If the calculation of iteration $i$ depends on the result of iteration $i-1$, the loop is sequential and cannot be parallelized without algorithmic changes. Fortunately, many tensor operations, such as element-wise addition or the outer loops of matrix multiplication, are "embarrassingly parallel," meaning they possess no inter-iteration dependencies.

### CPU Parallelism: The Fork-Join Model

On Central Processing Units (CPUs), parallelization is typically managed at a coarse granularity. The overhead of spawning and managing threads is significant relative to the cost of a single addition or multiplication. Therefore, compilers usually target the outermost loops for CPU parallelization.

The standard execution model employed is the fork-join pattern. The main thread executes serial code until it encounters a parallel region (a loop). At this point, it "forks" a team of threads, dividing the loop iterations among them. Once the threads complete their assigned chunks, they "join" back into the main thread.

For instance, a standard matrix multiplication loop nest:

$$
\begin{aligned}
&\text{for } i \text{ in } 0..M: \\
&\quad \text{for } j \text{ in } 0..N: \\
&\quad\quad \text{for } k \text{ in } 0..K: \\
&\quad\quad\quad C[i, j] \mathrel{{+}{=}} A[i, k] \times B[k, j]
\end{aligned}
$$

A compiler targeting a multicore CPU will likely parallelize the loop over $i$. If $M = 1024$ and the hardware has 4 cores, the compiler partitions the range so that Thread 0 handles $0..255$, Thread 1 handles $256..511$, and so on. This strategy works best when the amount of work inside the parallel loop is substantial enough to outweigh the synchronization costs.

One common challenge here is false sharing. This occurs when multiple threads write to distinct variables that happen to reside on the same cache line. Although the logic is correct, the cache coherency protocol forces the cores to invalidate each other's cache lines repeatedly, destroying performance. Compilers mitigate this by ensuring that parallel tasks operate on memory regions that are sufficiently spaced apart or by using thread-local accumulation buffers.
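In sketch form, the fork-join partition of the $i$ loop might look like the following. This is a minimal illustration using `std::thread` with a fixed row-block split; the function names, the four-thread default, and the private register accumulator are assumptions of the sketch, not the output of any particular compiler runtime. Keeping each thread on a contiguous block of rows, with its own accumulator, also sidesteps the false-sharing hazard described above.

```cuda
// Illustrative fork-join parallelization of the outer i-loop of C = A * B.
// The function names and the fixed four-way split are assumptions for this sketch.
#include <algorithm>
#include <thread>
#include <vector>

void matmul_rows(const float* A, const float* B, float* C,
                 int M, int N, int K, int i_begin, int i_end) {
  // Each worker owns a contiguous block of rows of C, so writes never overlap.
  for (int i = i_begin; i < i_end; ++i)
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;                        // thread-local accumulator
      for (int k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
}

void parallel_matmul(const float* A, const float* B, float* C,
                     int M, int N, int K, int num_threads = 4) {
  std::vector<std::thread> workers;
  int chunk = (M + num_threads - 1) / num_threads;  // rows per thread
  for (int t = 0; t < num_threads; ++t) {           // fork
    int begin = t * chunk;
    int end   = std::min(M, begin + chunk);
    if (begin < end)
      workers.emplace_back(matmul_rows, A, B, C, M, N, K, begin, end);
  }
  for (auto& w : workers) w.join();                 // join
}
```

A production runtime would reuse a persistent thread pool rather than spawning threads on every call, since thread creation is exactly the overhead that makes coarse granularity necessary in the first place.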
### GPU Parallelism: Hierarchical Thread Binding

Graphics Processing Units (GPUs) utilize a different execution model based on Single Instruction, Multiple Threads (SIMT). Unlike CPUs, GPUs are designed to handle thousands of lightweight threads with minimal context-switching overhead. However, these threads are not flat; they are organized hierarchically into blocks (or thread groups) and grids.

In an ML compiler, this hierarchy is addressed through a technique called thread binding. Instead of relying on a runtime scheduler to distribute work, the compiler explicitly rewrites the loop variables to correspond to hardware identifiers. When a loop is marked for GPU execution, the compiler performs the following transformations:

1. **Splitting:** The loop range is split into outer and inner components based on the hardware limits (e.g., max threads per block).
2. **Binding:** The loop iterators are replaced by intrinsic hardware variables, such as `blockIdx.x` and `threadIdx.x` in CUDA terminology.

For example, to map a loop of size $N$ to a GPU, the compiler might transform the index calculation. If we assign 128 threads per block, the linear index $i$ is reconstructed as:

$$i = \text{blockIdx.x} \times 128 + \text{threadIdx.x}$$

This binding process dictates exactly which data element each thread processes. The following diagram illustrates how a logical iteration space is decomposed and mapped to the GPU hierarchy.

```dot
digraph G {
  rankdir=TB;
  node [shape=box, style=filled, fontname="Helvetica", fontsize=10];
  edge [fontname="Helvetica", fontsize=9, color="#868e96"];

  subgraph cluster_logical {
    label = "Logical Iteration Space (Example: 0 to 1023)";
    style = dashed;
    color = "#adb5bd";
    fontcolor = "#495057";
    logical_loop [label="Global Loop Range\n[0...1023]", fillcolor="#e9ecef", color="#dee2e6"];
  }

  subgraph cluster_hardware {
    label = "Hardware Mapping";
    style = dashed;
    color = "#adb5bd";
    fontcolor = "#495057";
    block_0 [label="Block 0\n(indices 0-127)", fillcolor="#a5d8ff", color="#74c0fc"];
    block_1 [label="Block 1\n(indices 128-255)", fillcolor="#a5d8ff", color="#74c0fc"];
    block_n [label="Block ...", fillcolor="#a5d8ff", color="#74c0fc"];
    thread_0 [label="Thread 0", fillcolor="#ffc9c9", color="#ffa8a8"];
    thread_127 [label="Thread 127", fillcolor="#ffc9c9", color="#ffa8a8"];
  }

  logical_loop -> block_0 [label=" Split by Block Dim "];
  logical_loop -> block_1;
  logical_loop -> block_n;
  block_0 -> thread_0 [label=" map to threadIdx.x "];
  block_0 -> thread_127;
}
```

The mapping process splits a global loop range into independent blocks, which are further subdivided into threads. This explicit binding allows the code generator to emit efficient GPU kernels.

### Loop Collapsing and Fusion

A single loop often does not contain enough iterations to saturate a modern GPU, which may require tens of thousands of active threads to hide memory latency effectively. If a tensor operation involves iterating over a small dimension (e.g., a batch size of 32) followed by a feature dimension of 64, neither loop alone provides sufficient parallelism.

To address this, compilers employ loop collapsing (or flattening). This technique merges nested loops into a single linear loop. The compiler calculates a new upper bound based on the product of the original bounds ($32 \times 64 = 2048$) and generates code to reconstruct the original 2D indices from the single linear ID using modulo and division operations. This creates a larger, unified iteration space that can be easily mapped to the flat thread hierarchy of the hardware.
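Putting the binding and collapsing steps together, a minimal CUDA sketch of the generated code for the batch-by-feature example above might look like this. The kernel name, the bias-add body, and the 128-thread block size are illustrative assumptions, not the output of a specific compiler.

```cuda
#include <cuda_runtime.h>

// Collapsed (batch, feature) loop bound to the CUDA thread hierarchy.
// The 32 x 64 shape and the bias-add computation are assumptions for this sketch.
__global__ void bias_add_collapsed(const float* x, const float* bias,
                                   float* out, int batch, int features) {
  // Thread binding: reconstruct the linear index from hardware identifiers,
  // i.e. i = blockIdx.x * 128 + threadIdx.x for a 128-thread block.
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  int total = batch * features;     // collapsed upper bound (32 * 64 = 2048)
  if (i < total) {                  // guard: the grid may overshoot the range
    int b = i / features;           // recover the original 2D indices
    int f = i % features;
    out[b * features + f] = x[b * features + f] + bias[f];
  }
}

// Host-side launch: ceil-divide the collapsed range by the block size.
void launch_bias_add(const float* x, const float* bias, float* out,
                     int batch, int features) {
  int total   = batch * features;
  int threads = 128;
  int blocks  = (total + threads - 1) / threads;
  bias_add_collapsed<<<blocks, threads>>>(x, bias, out, batch, features);
}
```

With $2048$ collapsed iterations and 128 threads per block, the launch produces 16 blocks, matching the split-and-bind decomposition shown in the diagram above.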
### Synchronization and Reductions

Parallelization becomes complex when threads must communicate. A common scenario in machine learning is the reduction operation, such as summing values in a Softmax layer or computing the norm of a vector. In these cases, threads cannot simply run independently; they must combine their results.

A standard strategy for parallel reduction is the tree reduction approach. Threads typically perform local reductions in registers, then synchronize at the block level using shared memory and barriers (e.g., `__syncthreads()` in CUDA), and finally write partial results to global memory.

If the compiler fails to insert the correct synchronization barriers, race conditions occur, leading to non-deterministic numerical errors. Conversely, excessive synchronization forces threads to sit idle, reducing the parallelism gain. Advanced compilers analyze the data dependency graph to insert barriers only where strictly necessary, often employing "warp shuffles" on GPUs to exchange data between threads without the latency of shared memory access.

### Parallelization Trade-offs

Blindly parallelizing every loop does not guarantee improved performance. There are distinct trade-offs that a cost model must evaluate:

- **Granularity:** Parallelizing inner loops on a CPU often leads to overhead that exceeds the computation time.
- **Memory bandwidth:** If a kernel is memory-bound (waiting on data from RAM), adding more threads will not speed up execution and may increase cache contention.
- **Resource limits:** GPUs have limits on the number of registers available per block. Using too many threads or too many local variables can spill registers to global memory, drastically reducing speed.

Effective auto-tuning searches for these "sweet spot" configurations, determining how many threads to launch and how to map specific loop axes so as to balance arithmetic intensity against memory bandwidth limitations.
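The tree reduction described in the Synchronization section brings several of these trade-offs together: it confines shared memory and barriers to a single block, switches to warp shuffles for the final steps, and exposes a block size that an auto-tuner would search over. The sketch below follows that register, shared memory, warp shuffle progression; the 256-thread block size and the kernel name are assumptions of the sketch rather than any particular compiler's output.

```cuda
#include <cuda_runtime.h>

// Block-level tree reduction: each block sums a slice of `in` and writes one
// partial result. A second pass (or an atomic add) combines the partials.
// The 256-thread block size is an assumption baked into the shared array.
__global__ void block_sum(const float* in, float* partial, int n) {
  __shared__ float smem[256];

  // 1. Local reduction in registers: each thread strides over the input.
  float acc = 0.0f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x)
    acc += in[i];
  smem[threadIdx.x] = acc;
  __syncthreads();                      // barrier: shared memory is now consistent

  // 2. Tree reduction in shared memory, halving the active threads each step.
  for (int stride = blockDim.x / 2; stride > 32; stride /= 2) {
    if (threadIdx.x < stride)
      smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();                    // required before reading the other half
  }

  // 3. Final steps within a single warp use shuffles instead of shared memory.
  if (threadIdx.x < 32) {
    float v = smem[threadIdx.x] + smem[threadIdx.x + 32];
    for (int offset = 16; offset > 0; offset /= 2)
      v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0)
      partial[blockIdx.x] = v;          // one partial sum per block
  }
}
```

A second launch (or a global atomic add) then combines the per-block partials. The compiler's job is to prove that every read of `smem` is separated by a barrier from the writes it depends on, and to drop to shuffle instructions once only a single warp remains active.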