Automating the optimization of deep learning operators begins with a formal definition of the optimization boundaries. Compiler theory views a program not as a static artifact but as a single point within a multidimensional domain of functional equivalents. This domain is the search space. It contains every possible version of a kernel that produces the correct mathematical result but differs in its execution strategy.

Constructing this space requires parametrizing the transformation primitives discussed in previous chapters. Instead of manually writing a specific schedule with fixed tiling sizes (e.g., $32 \times 32$), we expose these values as variables. The compiler then treats code generation as a constraint satisfaction and minimization problem whose objective function is execution time.

## Parametrization of Loop Transformations

The foundation of the search space lies in the transformation primitives applied to the Intermediate Representation (IR). For a standard tensor operation, the primary axes of variation are loop tiling (splitting), loop ordering, and hardware binding.

Consider a simple dense matrix multiplication loop nest $C[i, j] += A[i, k] \times B[k, j]$. To define the search space, we replace concrete integer constants with symbolic variables.

### Split Factors

When applying loop tiling, a loop of extent $N$ is split into an outer loop of extent $N_{outer}$ and an inner loop of extent $N_{inner}$. The split factor $f$ defines the size of the inner loop, so $N_{inner} = f$ and $N_{outer} = \lceil N / f \rceil$. The set of valid values for $f$ usually consists of the factors of $N$; however, in scenarios where loop padding is allowed, $f$ can be any integer up to $N$.

If we apply multi-level tiling (e.g., to map to the thread block, warp, and thread levels of a GPU), we introduce multiple split factors. For a loop of extent $L$ split into $k$ levels, the search space for this dimension becomes the set of ordered factor tuples $(f_1, \ldots, f_k)$ such that:

$$ \prod_{i=1}^{k} f_i = L $$
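To make this concrete, the sketch below enumerates the ordered factor tuples for a single loop dimension; each tuple is one coordinate of the search space along that dimension. This is plain illustrative Python, not any framework's API:

```python
def split_candidates(extent: int, levels: int) -> list[tuple[int, ...]]:
    """Enumerate every ordered tuple (f_1, ..., f_k) of positive integers
    whose product equals `extent` -- one candidate per multi-level tiling
    of a loop of that extent (padding disallowed)."""
    if levels == 1:
        return [(extent,)]
    candidates = []
    for f in range(1, extent + 1):
        if extent % f == 0:  # without padding, each factor must divide the extent
            for rest in split_candidates(extent // f, levels - 1):
                candidates.append((f, *rest))
    return candidates

# A loop of extent 64 split across 3 tiling levels yields 28 ordered
# factorizations; every additional axis multiplies the space again.
print(len(split_candidates(64, 3)))  # 28
```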
### Loop Permutations

Once loops are split, the order in which they nest determines data locality. For a nest of depth $d$, there are $d!$ possible permutations. While a simple 3-loop nest has only 6 permutations, a tiled matrix multiplication with three levels of tiling results in 9 loops (an outer, middle, and inner loop for each of the axes $i$, $j$, and $k$). This yields $9! = 362,880$ theoretical orderings.

We prune this space immediately using dependency analysis: only permutations that preserve data dependencies are included in the valid search space.

### Composition and Tree Structures

We can visualize the construction of a search space as a decision tree. Each node represents a transformation choice (e.g., "Split axis $i$") and the branches represent the parameter values. A path from the root to a leaf represents a fully defined candidate schedule.

[Figure: decision tree from the initial loop nest $(i, j, k)$ through split-factor and reorder choices down to vectorize/unroll annotations, ending in candidate schedules A and B.]

*A hierarchical view of how transformation decisions branch out to form the search space. Each level adds a constraint, narrowing the possibilities until a concrete schedule is formed.*

## Hardware Constraints and Validity

Not all combinations of parameters yield executable code. The "valid" search space is a subset of the total Cartesian product of parameters, restricted by hardware limits.

- **Resource Limits:** On a GPU, the product of the loop extents bound to `threadIdx.x`, `threadIdx.y`, and `threadIdx.z` cannot exceed the maximum threads per block (typically 1024). A schedule that tiles dimensions $i$ and $j$ with factors 64 and 32, respectively, would require $64 \times 32 = 2048$ threads, making it invalid.
- **Shared Memory:** Tiling sizes determine the size of the memory blocks promoted to shared memory. If the calculated footprint exceeds the available shared memory per Streaming Multiprocessor (SM), the kernel will fail to launch or suffer reduced occupancy.
- **Vectorization Constraints:** Vectorizing a loop requires the memory access pattern to be contiguous and the loop extent to be a multiple of the vector length. Schedules that attempt to vectorize non-contiguous dimensions are discarded.

The search mechanism must filter these invalid configurations efficiently. In frameworks like Ansor (AutoTVM v2), this is often handled by sketch generation rules that only propose structurally valid schedules, followed by a random sampling step that verifies resource constraints.

## Template-Based vs. Generation-Based Spaces

There are two primary approaches to defining the bounds of the search space in modern compilers.

### Template-Based (AutoTVM)

In this approach, the engineer writes a generic schedule "template" and marks specific values as tunable knobs, for example `cfg.define_split("tile_k", k, num_outputs=2)`. The search space is explicitly bounded by the user's definition, and the compiler optimizes strictly within the parameters provided. While this ensures the structure of the optimization is known (e.g., we are definitely using tiling and caching), it relies on the engineer's intuition to include the best optimization strategies in the template. If the user forgets to include a knob for loop unrolling, the auto-tuner will never attempt it.
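The following condensed sketch shows the pattern, modeled on TVM's AutoTVM matmul tutorial; the task name `example/matmul` is our own, and exact APIs may differ across TVM releases. Note that only the two `define_split` knobs are exposed, so the tuner can never explore anything the template does not mention:

```python
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul")  # hypothetical task name
def matmul(N, L, M, dtype="float32"):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    i, j = s[C].op.axis
    (k,) = s[C].op.reduce_axis

    # Tunable knobs: the tuner, not the engineer, chooses the split factors.
    cfg = autotvm.get_config()
    cfg.define_split("tile_i", i, num_outputs=2)
    cfg.define_split("tile_j", j, num_outputs=2)

    # Apply whatever factors the current configuration selects.
    io, ii = cfg["tile_i"].apply(s, C, i)
    jo, ji = cfg["tile_j"].apply(s, C, j)
    s[C].reorder(io, jo, k, ii, ji)
    return s, [A, B, C]
```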
### Generation-Based (Ansor/MetaSchedule)

Newer systems generate the search space automatically from the computational graph. The compiler analyzes the mathematical properties of the operators and applies a set of derivation rules to construct the space:

1. **Sketch Generation:** The system creates high-level structural sketches. One sketch might represent "always tile the reduction axis," while another represents "inline this element-wise operation."
2. **Annotation:** Once a sketch is selected, the system randomly annotates specific details, such as tile sizes or unroll factors.

This approach expands the search space significantly. It removes the human bias in template design but increases the complexity of the search: the algorithm must traverse a much larger domain, necessitating the cost models we will examine later in this chapter.

## The Combinatorial Explosion

The size of the search space grows exponentially with the complexity of the operator and the depth of the hardware hierarchy. For a Conv2D operator targeting a GPU, we must decide on:

- Tiling factors for Batch ($N$), Height ($H$), Width ($W$), Channel ($C$), Kernel ($K$), and Filter ($F$).
- Mapping these tiles to Block X/Y/Z and Thread X/Y/Z.
- Cache read/write locations (Global to Shared, Shared to Register).
- Unrolling factors and vectorization lengths.

Even with aggressive pruning of invalid configurations, the space can easily contain billions of candidates.

[Figure: bar chart, "Search Space Size vs. Optimization Depth (Log Scale)" — estimated valid schedules per number of tiling levels, growing from 120 at one level to roughly $2.5 \times 10^{10}$ at five.]

*The volume of candidate schedules increases exponentially as we add levels to the memory hierarchy (tiling levels); the chart uses a logarithmic scale and assumes a moderate number of split factors and reordering options at each level.*

Effectively navigating this space requires more than brute force. We must define the space to encompass the highest-performing candidates while keeping it sparse enough to be searchable. This trade-off is the central challenge of auto-tuning. By formalizing optimization decisions as a parametric domain, we transition from manual heuristic tuning to automated, data-driven exploration.
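As a closing illustration of that data-driven loop, the toy sketch below samples random candidates from a small knob space and filters out those that violate the GPU limits discussed above; every knob value, limit, and footprint formula here is an illustrative assumption rather than any framework's real behavior:

```python
import random

MAX_THREADS_PER_BLOCK = 1024   # typical CUDA limit
SHARED_MEM_BYTES = 48 * 1024   # typical per-block shared memory budget

def random_candidate(rng: random.Random) -> dict:
    """Sample one point from a toy tiling space (hypothetical knobs)."""
    return {
        "tile_i": rng.choice([4, 8, 16, 32, 64]),
        "tile_j": rng.choice([4, 8, 16, 32, 64]),
        "tile_k": rng.choice([4, 8, 16, 32]),
        "unroll": rng.choice([0, 1]),
    }

def is_valid(c: dict) -> bool:
    """Reject candidates that exceed the assumed hardware resource limits."""
    threads = c["tile_i"] * c["tile_j"]  # one thread per output element
    # Shared-memory footprint of the A and B tiles (float32 = 4 bytes).
    smem = (c["tile_i"] + c["tile_j"]) * c["tile_k"] * 4
    return threads <= MAX_THREADS_PER_BLOCK and smem <= SHARED_MEM_BYTES

rng = random.Random(0)
valid = [c for c in (random_candidate(rng) for _ in range(1000)) if is_valid(c)]
print(f"{len(valid)} / 1000 sampled candidates survive the resource filter")
```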