Automating the optimization of deep learning operators begins with a formal definition of the optimization boundaries. Compiler theory views a program not as a static artifact but as a single point within a multidimensional domain of functional equivalents. This domain is the search space. It contains every possible version of a kernel that produces the correct mathematical result but differs in its execution strategy.
Constructing this space requires parametrizing the transformation primitives discussed in previous chapters. Instead of manually writing a specific schedule with fixed tiling sizes (e.g., a hard-coded $32 \times 32$ tile), we expose these values as variables. The compiler then treats code generation as a constraint satisfaction and minimization problem where the objective function is execution time.
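As a minimal sketch of this framing (the knob names and domains below are illustrative, not any particular compiler's API), each candidate schedule is one point in the Cartesian product of the knob domains:

```python
import itertools

# Illustrative knob domains for a tiled matmul schedule; real domains are
# derived from the transformation primitives discussed below.
search_space = {
    "tile_i": [8, 16, 32, 64],
    "tile_j": [8, 16, 32, 64],
    "order": ["ijk", "ikj", "jik", "jki", "kij", "kji"],
}

# Each point in the Cartesian product is one candidate schedule; the
# tuner's job is to find the point that minimizes measured runtime.
candidates = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
print(len(candidates))  # 4 * 4 * 6 = 96 candidates
```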
The foundation of the search space lies in the transformation primitives applied to the Intermediate Representation (IR). For a standard tensor operation, the primary axes of variation are loop tiling (splitting), ordering, and hardware binding.
Consider a simple dense matrix multiplication loop nest computing $C[i, j] \mathrel{+}= A[i, k] \cdot B[k, j]$. To define the search space, we replace concrete integer constants with symbolic variables.
When applying loop tiling, a loop of extent $N$ is split into outer and inner loops $i_{\text{outer}}$ and $i_{\text{inner}}$. The split factor $t$ defines the size of the inner loop. The set of valid values for $t$ usually consists of the factors of $N$. However, in scenarios where loop padding is allowed, $t$ can be any integer up to $N$.
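A small helper, assuming the convention above, enumerates the legal split factors for one loop:

```python
def split_factors(extent, allow_padding=False):
    """Valid inner-loop sizes t for splitting a loop of the given extent."""
    if allow_padding:
        # Padding lets the tail iteration be masked, so any size works.
        return list(range(1, extent + 1))
    # Without padding, t must divide the extent exactly.
    return [t for t in range(1, extent + 1) if extent % t == 0]

print(split_factors(12))                           # [1, 2, 3, 4, 6, 12]
print(len(split_factors(12, allow_padding=True)))  # 12
```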
If we apply multi-level tiling (e.g., to map to the thread block, warp, and thread levels of a GPU), we introduce multiple split factors. For a loop of length $N$ split into $k$ levels, the search space for this dimension becomes the set of factor tuples $(t_1, t_2, \dots, t_k)$ such that:

$$\prod_{i=1}^{k} t_i = N$$
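A short enumeration, written as plain Python with no compiler dependencies, makes the size of this sub-space concrete:

```python
def k_level_splits(extent, k):
    """All ordered tuples (t1, ..., tk) with t1 * ... * tk == extent."""
    if k == 1:
        return [(extent,)]
    splits = []
    for t in range(1, extent + 1):
        if extent % t == 0:
            for rest in k_level_splits(extent // t, k - 1):
                splits.append((t,) + rest)
    return splits

# A loop of length 64 mapped onto three GPU levels (block, warp, thread):
print(len(k_level_splits(64, 3)))  # 28 factorizations
```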
Once loops are split, the order in which they nest determines data locality. For a nest of depth $n$, there are $n!$ possible permutations. While a simple 3-loop nest has only $3! = 6$ permutations, a tiled matrix multiplication with three levels of tiling results in 9 loops (an outer, middle, and inner loop for each of the axes $i$, $j$, and $k$). This yields $9! = 362{,}880$ theoretical orderings.
We prune this space immediately using dependency analysis. Only permutations that preserve data dependencies are included in the valid search space.
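As a sketch of this pruning, the filter below uses a simple structural predicate (each axis's outer split must enclose its inner split) standing in for the full dependence check a real compiler performs:

```python
from itertools import permutations

# After one level of tiling, each axis contributes an outer and inner loop.
loops = ["i_o", "i_i", "j_o", "j_i", "k_o", "k_i"]

def is_valid(order):
    # Structural rule: an axis's outer loop must enclose its inner loop.
    # A real compiler would also verify data dependencies here.
    return all(order.index(a + "_o") < order.index(a + "_i")
               for a in ("i", "j", "k"))

valid = [p for p in permutations(loops) if is_valid(p)]
print(f"{len(valid)} of 720 permutations survive")  # 90 of 720
```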
We can visualize the construction of a search space as a decision tree. Each node represents a transformation choice (e.g., "Split axis $i$") and the branches represent the parameter values. A path from the root to a leaf represents a fully defined candidate schedule.
A hierarchical view of how transformation decisions branch out to form the search space. Each level adds a constraint, narrowing the possibilities until a concrete schedule is formed.
Not all combinations of parameters yield executable code. The "valid" search space is a subset of the total Cartesian product of parameters, restricted by hardware limits.
The product of the thread block dimensions (the extents bound to threadIdx.x, threadIdx.y, and threadIdx.z) cannot exceed the maximum threads per block (typically 1024). A schedule that tiles dimensions $x$ and $y$ with factors 64 and 32 respectively would require $64 \times 32 = 2048$ threads, making it invalid.

The search mechanism must efficiently filter these invalid configurations. In frameworks like Ansor (AutoTVM v2), this is often handled by sketch generation rules that only propose structurally valid schedules, followed by a random sampling step that verifies resource constraints.
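Checking this particular constraint reduces to a simple predicate over a candidate's thread-tile factors (a minimal sketch; the constant reflects the common CUDA limit):

```python
MAX_THREADS_PER_BLOCK = 1024  # typical CUDA limit

def is_launchable(tile_x, tile_y, tile_z=1):
    """Reject configurations whose thread-block shape exceeds the cap."""
    return tile_x * tile_y * tile_z <= MAX_THREADS_PER_BLOCK

print(is_launchable(64, 32))  # False: 64 * 32 = 2048 threads
print(is_launchable(32, 32))  # True:  32 * 32 = 1024 threads
```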
There are two primary approaches to defining the bounds of the search space in modern compilers: template-based specification and automatic generation.
In the template-based approach, the engineer writes a generic schedule "template" and marks specific values as tunable knobs. For example:
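The sketch below follows the pattern of TVM's AutoTVM matmul tuning template (trimmed; the task name is illustrative):

```python
from tvm import te, autotvm

@autotvm.template("example/matmul")
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M),
                   lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
                   name="C")

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    cfg = autotvm.get_config()
    # Tunable knobs: the tuner explores these; the structure stays fixed.
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    # Note: no unroll knob is defined, so unrolling will never be explored.

    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, k, yi, xi)
    return s, [A, B, C]
```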
The search space is explicitly bounded by the user's definition. The compiler optimizes strictly within the parameters provided. While this ensures the structure of the optimization is known (e.g., we are definitely using tiling and caching), it relies on the engineer's intuition to include the best optimization strategies in the template. If the user forgets to include a knob for loop unrolling, the auto-tuner will never attempt it.
Newer systems generate the search space automatically from the computational graph. The compiler analyzes the mathematical properties of the operators and applies a set of derivation rules to construct the space.
This approach expands the search space significantly. It removes the human bias in template design but increases the complexity of the search. The algorithm must traverse a much larger domain, necessitating the cost models we will examine later in this chapter.
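Conceptually, the construction composes derivation rules, each of which expands a partial schedule "sketch" into variants. The toy sketch below illustrates the mechanism only; it is not Ansor's actual API:

```python
# Each rule inspects a partial schedule sketch and returns expanded
# variants; composing rules enumerates the structural search space.
def rule_multi_level_tile(sketch):
    return [sketch + ["multi_level_tile(i)", "multi_level_tile(j)"]]

def rule_cache_write(sketch):
    # Branch: either add a shared-memory staging stage or don't.
    return [sketch, sketch + ["cache_write(C, shared)"]]

sketches = [[]]
for rule in (rule_multi_level_tile, rule_cache_write):
    sketches = [variant for s in sketches for variant in rule(s)]

for s in sketches:
    print(s)
# Parameter sampling then fills in concrete split factors per sketch.
```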
The size of the search space grows exponentially with the complexity of the operator and the depth of the hardware hierarchy. For a Conv2D operator targeting a GPU, we must decide on:

- Split factors for the batch, output-channel, and spatial axes, often at multiple memory levels
- Split factors for the reduction axes (input channels and the kernel window)
- The ordering of the resulting loops
- Which intermediate tensors to cache in shared memory or registers
- Unrolling and vectorization factors
Even with aggressive pruning of invalid configurations, the space can easily contain billions of candidates.
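A back-of-envelope product over the choices at each decision point shows how quickly the count compounds (the choice counts below are deliberately small, illustrative numbers):

```python
# Illustrative choice counts per decision; real spaces use multi-level
# splits per axis, which push the product far higher.
choices = {
    "tile_n": 4, "tile_f": 10, "tile_y": 8, "tile_x": 8,  # output axes
    "tile_rc": 6, "tile_ry": 3, "tile_rx": 3,             # reduction axes
    "ordering": 20,                                       # after pruning
    "unroll": 4, "vectorize": 3, "use_shared_mem": 2,
}

space_size = 1
for n in choices.values():
    space_size *= n
print(f"{space_size:,}")  # 66,355,200 even in this small example
```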
The volume of candidate schedules increases exponentially as we add levels to the memory hierarchy (tiling levels). This chart assumes a moderate number of split factors and reordering options at each level.
Effectively navigating this space requires more than brute force. The goal is to define a space broad enough to contain the highest-performance candidates while keeping it constrained enough to remain searchable. This trade-off is the central challenge of auto-tuning. By formalizing optimization decisions as a parametric domain, we transition from manual heuristic tuning to automated, data-driven exploration.