Once we represent tensor operations and their data dependencies within the polyhedral model, we gain the ability to systematically transform the execution order of the loop iterations. This reordering, known as scheduling, is guided by mathematical transformations applied to the iteration vectors. The primary goals are to improve data locality, which enhances cache utilization, and to expose parallelism that multi-core CPUs or GPUs can exploit.
A schedule assigns a logical execution timestamp (often multi-dimensional) to each dynamic instance of a statement within the loop nest. For a statement instance Si executed at iteration vector x, its schedule is denoted T(Si, x). A fundamental requirement is that the schedule must respect all data dependencies. If statement instance (Si, x) must execute before (Sj, y) due to a data dependency (written (Si, x) → (Sj, y)), then the schedule must ensure their logical timestamps reflect this order:

T(Si, x) ≺ T(Sj, y)

Here, ≺ denotes lexicographic order. Finding a schedule T that satisfies this for all dependencies while optimizing for locality or parallelism is the core task. Within the polyhedral model, schedules are typically restricted to affine functions of the loop iterators x and any symbolic constants (parameters) appearing in the loop bounds or accesses.
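As a small worked check of this condition, consider the diagonal dependence (i, j) → (i+1, j+1) that reappears later in this section, under the identity schedule T(S, x) = x:

```latex
% Identity schedule: T(S,(i,j)) = (i, j).
% Dependence: (S,(i,j)) \to (S,(i+1, j+1)).
T(S,(i,j)) = (i,\, j) \;\prec\; (i+1,\, j+1) = T(S,(i+1, j+1))
% This holds lexicographically (already i < i+1 in the first
% component), so the identity schedule is legal. By contrast, a
% schedule that reverses both loops, T(S,(i,j)) = (-i, -j), gives
% (-i, -j) \succ (-(i+1), -(j+1)), violating the dependence, and
% is therefore illegal.
```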
Two fundamental and widely used affine transformations for scheduling are tiling and skewing.
Tiling, also known as blocking, is arguably the most significant transformation for performance on modern memory hierarchies. It partitions the iteration space into smaller, regular blocks (tiles) that are executed atomically.
Motivation: Executing iterations within a tile brings data reuse to the forefront. If a tile accesses a small portion of the overall data, that data is more likely to fit into faster levels of the memory hierarchy (registers, L1/L2 cache). This enhances temporal locality (reusing the same data element) and spatial locality (accessing nearby data elements). Furthermore, tiles themselves can often be executed in parallel if dependencies permit, exposing coarse-grained parallelism. Consider matrix multiplication C=A×B. Tiling breaks down the large multiplication into smaller matrix multiplications operating on blocks of A, B, and C that fit better in cache.
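As an illustration, here is a minimal C sketch of tiled matrix multiplication. The matrix size N, the tile size T, and the assumption that T divides N evenly are all choices made for this example:

```c
#define N 1024
#define T 64   /* tile size; assumed to divide N evenly */

/* Blocked matrix multiplication C = A * B. Each (ii, jj, kk) triple
   multiplies one T x T block of A by one T x T block of B and
   accumulates into a T x T block of C; these three blocks are small
   enough to stay resident in cache while they are reused. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                /* Point loops over one tile. */
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++) {
                        float acc = C[i][j];
                        for (int k = kk; k < kk + T; k++)
                            acc += A[i][k] * B[k][j];
                        C[i][j] = acc;
                    }
}
```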
Polyhedral Tiling: In the polyhedral framework, tiling is achieved by transforming the original iteration vector x into a higher-dimensional vector that separates tile coordinates from intra-tile coordinates. For a 2D loop nest with iterators (i, j) and a square tile size B, a common tiling transformation maps (i, j) to (ii, jj, i′, j′), where:

- ii = ⌊i/B⌋ and jj = ⌊j/B⌋ are the tile coordinates, identifying which tile the iteration belongs to, and
- i′ = i mod B and j′ = j mod B are the intra-tile coordinates, identifying the iteration's position within that tile.
The resulting schedule might look like T(i,j)=(ii,jj,i′,j′). This transforms the original 2-deep loop nest into a 4-deep loop nest: two outer loops iterating over the tiles (ii,jj) and two inner loops iterating over the points within a tile (i′,j′).
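A C sketch of this structural change is shown below. The placeholder statement, the bounds N and M, and the use of tile origins (ii, jj stepping by B, equivalent to tile indices scaled by B) are assumptions of this example:

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Original 2-deep nest. */
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        A[i][j] = i + j;               /* placeholder statement */

/* Tiled 4-deep nest: (ii, jj) step through tile origins, and
   (ip, jp) visit the points inside the current tile. The MIN
   guards handle tile sizes that do not divide N or M evenly. */
for (int ii = 0; ii < N; ii += B)
    for (int jj = 0; jj < M; jj += B)
        for (int ip = ii; ip < MIN(ii + B, N); ip++)
            for (int jp = jj; jp < MIN(jj + B, M); jp++)
                A[ip][jp] = ip + jp;   /* same placeholder statement */
```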
Partitioning of a 4×4 iteration space into 2×2 tiles. Dependencies like D1 are contained within a single tile (provided the tile size is large enough), while dependencies like D2 cross tile boundaries (tile-carried).
Legality: Tiling is only legal if the chosen schedule (including the tile and point loops) respects all original data dependencies. Dependencies can be classified as:

- Intra-tile dependencies, satisfied entirely within a single tile (like D1 above), which impose no ordering constraints between tiles, and
- Inter-tile (tile-carried) dependencies, which cross tile boundaries (like D2) and constrain the order in which tiles may execute.
For tiling to enable parallelism across tiles (e.g., executing the ii loop in parallel), there must be no dependencies carried by that loop dimension in the schedule. Algorithms like the Pluto scheduler are designed to find multi-dimensional affine schedules that enable legal and efficient tiling, often maximizing the number of dimensions that can be legally parallelized.
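Continuing the tiled sketch above, if dependence analysis proves that no dependence is carried by the ii tile dimension, that loop can be annotated for parallel execution. The pragma shown is standard OpenMP; the statement and the MIN macro are carried over from the previous sketch:

```c
/* Legal only when no dependence is carried by the ii tile dimension:
   each band of tiles along ii can then run on a different thread. */
#pragma omp parallel for
for (int ii = 0; ii < N; ii += B)
    for (int jj = 0; jj < M; jj += B)
        for (int ip = ii; ip < MIN(ii + B, N); ip++)
            for (int jp = jj; jp < MIN(jj + B, M); jp++)
                A[ip][jp] = ip + jp;   /* placeholder statement */
```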
Loop skewing is another affine transformation, often used to enable other transformations like tiling or vectorization when dependencies would otherwise prevent them.
Motivation: Consider a loop nest with a dependency structure that forms a "wavefront" pattern, common in stencil computations or some dynamic programming algorithms. For instance, a dependency (i,j)→(i+1,j+1). Direct tiling might be illegal because executing tiles purely column-wise or row-wise would violate the diagonal dependency. Skewing changes the shape of the iteration space relative to the dependencies.
Mechanism: Skewing transforms the iterators such that the relative order of dependent iterations changes. A common skewing transformation for a 2D loop nest is:

i′ = i
j′ = j + α⋅i

where α is the skewing factor. This maps the original rectangular iteration space into a parallelogram. The crucial effect is on the dependency vectors: if the original dependency vector was d = (di, dj), the new dependency vector d′ in the (i′, j′) space becomes d′ = (di, dj + α⋅di).
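In C, skewing with α = 1 might look like the following. The statement body (a read of A[i-1][j-1], which induces exactly the d = (1, 1) dependence discussed above) and the bounds N and M are assumptions of this sketch:

```c
/* Original nest: the statement reads A[i-1][j-1], so d = (1, 1). */
for (int i = 1; i < N; i++)
    for (int j = 1; j < M; j++)
        A[i][j] = A[i - 1][j - 1] + 1.0f;

/* Skewed nest: ip = i, jp = i + j (skew factor alpha = 1).
   The original iterators are recovered inside the body. The set of
   iterations executed is unchanged, but the iteration space is now a
   parallelogram in (ip, jp), and d = (1, 1) becomes d' = (1, 2). */
for (int ip = 1; ip < N; ip++)
    for (int jp = ip + 1; jp < ip + M; jp++) {
        int i = ip, j = jp - ip;
        A[i][j] = A[i - 1][j - 1] + 1.0f;
    }
```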
Effect of skewing (i′ = i, j′ = i + j) on a 2D iteration space. The original dependency d = (1, 1) is transformed into d′ = (1, 2). This change can enable tiling along the i′ dimension.
Enabling Tiling: By changing the dependency vectors, skewing can make dependencies lexicographically positive along dimensions that were previously problematic. For the dependency d = (1, 1), skewing with α = 1 yields d′ = (1, 2). If we now tile the skewed space (i′, j′), the dependency is always carried across tiles in the i′ dimension (its i′ component is 1 > 0), so executing tiles in lexicographic order is legal, where simple tiling of the original space might not have been. The resulting execution proceeds along skewed wavefronts.
Skewing and tiling are powerful independently, but they are often combined. A common strategy is "skew-then-tile", where skewing first adjusts dependencies, and then tiling enhances locality and parallelism in the skewed space.
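A sketch of skew-then-tile applied to the same example follows. The tile size B and the loop bounds are illustrative assumptions; the max()/min() guards clamp each rectangular tile to the parallelogram-shaped skewed iteration space:

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* B x B tiles over the skewed space (ip, jp) = (i, i + j).
   The skewed dependence d' = (1, 2) is carried by the ip tile
   dimension, so executing tiles in lexicographic (it, jt) order
   respects it. jp ranges over [2, N + M - 2] across the whole space. */
for (int it = 1; it < N; it += B)
    for (int jt = 2; jt < N + M - 1; jt += B)
        for (int ip = it; ip < MIN(it + B, N); ip++)
            for (int jp = MAX(jt, ip + 1); jp < MIN(jt + B, ip + M); jp++) {
                int i = ip, j = jp - ip;   /* recover original iterators */
                A[i][j] = A[i - 1][j - 1] + 1.0f;
            }
```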
Finding the optimal combination of skewing, tiling, and other affine transformations (such as permutation or scaling) is complex; it depends heavily on the specific dependency patterns and the target hardware architecture. This is where automatic scheduling tools based on polyhedral libraries like isl (integer set library) come into play. These tools analyze the polyhedral representation (domains, dependencies, access functions) and employ algorithms (e.g., the Pluto algorithm or Feautrier's algorithm) to search for legal affine schedules T that optimize specific objectives, such as maximizing the number of parallel outer loops or minimizing data movement, incorporating tiling and skewing as needed.
Once a suitable schedule T (representing the transformed loop structure, potentially including tiling and skewing parameters) is determined, the next step, discussed in code generation, is to synthesize the actual optimized loop code that implements this schedule.