Defining the search space is the foundational step in automated compiler optimization. Before a tuner can find the best schedule, it must understand the boundaries of valid configurations. The search space represents the complete set of feasible program transformations for a given operator on a specific hardware target. If the space is too small, the tuner might miss the most efficient optimizations. If the space is too large, the search becomes computationally expensive and may fail to converge on a solution within a reasonable time.

## The Anatomy of a Search Space

At the intermediate representation level, a computation is defined by nested loops over data arrays. The search space consists of tunable parameters that alter how these loops are executed without changing the numerical output. These parameters generally fall into three categories: tiling factors, execution ordering, and hardware bindings.

Consider a simple vector addition that iterates over $N$ elements. The baseline implementation is a single serial loop. To optimize it, we might apply loop tiling (splitting the loop into smaller chunks) to fit data into the CPU cache or to map work onto GPU threads.

The decision of how to split the loop creates the first dimension of our search space. If $N = 1024$, valid tile sizes include 1, 2, 4, ..., 512, 1024. If we split the loop into two levels (outer and inner), the relationship between the original index and the split indices is:

$$ i = i_{outer} \times \text{tile\_size} + i_{inner} $$

Here, the tile size is a variable in our search space, and the compiler must define the range of valid values it can take. In a simple integer grid search, the space for this single decision contains all divisors of 1024.
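For a perfect split with no padding, enumerating this space is straightforward. The sketch below is a minimal illustration in plain Python; `split_candidates` is a hypothetical helper, not part of any compiler framework.

```python
def split_candidates(extent):
    """Enumerate valid (outer, inner) pairs for a two-level loop split.

    With i = i_outer * tile_size + i_inner and no loop padding, only
    tile sizes that divide the loop extent produce a correct split.
    """
    return [(extent // tile, tile)
            for tile in range(1, extent + 1)
            if extent % tile == 0]

# A loop of 1024 iterations yields 11 candidates:
# tile_size in {1, 2, 4, ..., 512, 1024}.
print(split_candidates(1024)[:4])  # [(1024, 1), (512, 2), (256, 4), (128, 8)]
```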
## Composition of Optimization Choices

Real-world operators like matrix multiplication or 2D convolution involve multiple nested loops. The search space expands combinatorially as we add more axes and optimization primitives.

For a matrix multiplication involving dimensions $M$, $N$, and $K$, the compiler must decide tiling factors for all three dimensions simultaneously. Furthermore, the order in which these loops are nested affects memory locality, which introduces "reordering" into the search space.

The following diagram illustrates how a single operator creates a branching tree of decisions, resulting in a specific configuration (or "candidate") at the leaf nodes.

```dot
digraph G {
    rankdir=TB;
    bgcolor="#ffffff";
    node [style=filled, shape=box, fontname="Arial", fontsize=10, color="#dee2e6", fillcolor="#f8f9fa"];
    edge [color="#adb5bd"];

    Start  [label="Operator (e.g., MatMul)", fillcolor="#e7f5ff", color="#74c0fc"];
    Split  [label="Split Configs\n(Tile Sizes)", fillcolor="#eebefa", color="#da77f2"];
    Order  [label="Loop Reordering\n(Permutations)", fillcolor="#ffc9c9", color="#ff8787"];
    Bind   [label="Hardware Binding\n(Thread/Block)", fillcolor="#b2f2bb", color="#69db7c"];
    Result [label="Generated Schedule", fillcolor="#fff3bf", color="#fcc419"];

    Start -> Split;
    Split -> Order [label=" defines blocks"];
    Order -> Bind  [label=" maps to HW"];
    Bind  -> Result;

    subgraph cluster_0 {
        label="Search Space Variables";
        fontname="Arial";
        fontsize=10;
        color="#dee2e6";
        style="dashed";
        Split; Order; Bind;
    }
}
```

*Structure of an optimization search space, showing the sequence of decisions required to form a valid schedule.*

## Constraints and Validity

Not every combination of parameters produces valid or efficient code. Defining the search space therefore also involves enforcing constraints that prune invalid candidates immediately, which prevents the auto-tuner from wasting cycles evaluating broken code.

### Hard Constraints

Hard constraints are dictated by mathematical correctness or hardware limits.

- Divisibility: When splitting a loop of size $N$ into factors $A$ and $B$, the product $A \times B$ must typically equal $N$ (unless loop padding is enabled).
- Resource limits: On a GPU, the maximum number of threads per block is fixed (e.g., 1024 on many NVIDIA architectures). A schedule that proposes a thread block size of 2048 is invalid and must be excluded from the space.
- Memory capacity: Tile sizes that require shared memory allocations exceeding the hardware's L1 scratchpad limit will cause runtime failures.

### Soft Constraints

Soft constraints are heuristics used to guide the search toward high-performance regions. Violating them does not break the code, but it almost certainly results in poor performance.

- Vector length: On a CPU with AVX-512, vectorizing with a length that is not a multiple of the vector register width (e.g., 16 floats) is usually suboptimal.
- Unrolling limits: Excessive loop unrolling can bloat instruction cache usage, so compilers often cap the maximum unroll factor in the search space definition.

## Parameterizing Hardware Bindings

In modern ML compilers like TVM or Halide, the search space explicitly includes the mapping of logical loops to physical hardware units. This is distinct from standard CPU loop optimization.

For a GPU target, the search space includes variables that determine which loop level maps to `blockIdx.x` (the CUDA block index) and which maps to `threadIdx.x` (the CUDA thread index).

Suppose we have a tiled loop nest with an outer loop $i$ and an inner loop $j$. The search space then contains categorical variables representing the binding policy:

- Option A: Bind $i$ to a block and $j$ to a thread.
- Option B: Bind $i$ to a thread and keep $j$ in a local register (sequential).

This decision radically changes the memory access pattern: Option A exploits parallelism across streaming multiprocessors, while Option B might be used for thread-local accumulation. The search space definition must enumerate these structural possibilities, as the two sketches below illustrate.
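To make the binding variables concrete, here is a minimal sketch of Option A using TVM's legacy `te` schedule API (essentially the canonical GPU vector-add example). The split factor of 64 is an arbitrary placeholder for a value the auto-tuner would choose.

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
# Split the single loop into (outer, inner), then bind outer to the CUDA
# block index and inner to the thread index (Option A above). The factor
# is one of the tunable variables in the search space.
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

fadd = tvm.build(s, [A, B, C], target="cuda")
```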
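Combining the hard constraints above with the binding choice, a tuner can prune invalid candidates before compiling anything. The filter below is a hypothetical sketch: the candidate dictionary layout and the resource limits are illustrative assumptions, not a real framework's interface.

```python
MAX_THREADS_PER_BLOCK = 1024       # hard limit on many NVIDIA GPUs
MAX_SHARED_MEM_BYTES = 48 * 1024   # illustrative per-block scratchpad budget

def is_valid(candidate, extent):
    """Return True if a (tile, binding) candidate satisfies the hard constraints."""
    outer, inner = candidate["tile_outer"], candidate["tile_inner"]
    # Divisibility: the two-level split must cover the loop exactly
    # (assuming loop padding is disabled).
    if outer * inner != extent:
        return False
    # Resource limit: an axis bound to threadIdx.x cannot exceed the
    # maximum number of threads per block.
    if candidate["thread_axis"] == "inner" and inner > MAX_THREADS_PER_BLOCK:
        return False
    # Memory capacity: tiles staged in shared memory must fit the budget.
    if candidate["shared_mem_bytes"] > MAX_SHARED_MEM_BYTES:
        return False
    return True

# Pruning the space before any measurement:
# candidates = [c for c in candidates if is_valid(c, 1024)]
```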
## The Combinatorial Explosion

The size of the search space grows exponentially with the complexity of the operator. This phenomenon makes exhaustive search impossible for anything but trivial kernels.

Consider a 2D convolution layer. The compiler might tune:

- Tile sizes for the height, width, input channel, and output channel axes (4 variables).
- Split factors for each tile (e.g., splitting into 2 or 3 levels).
- Loop ordering (permutations of 4-6 axes).
- Vectorization lengths.
- Unrolling factors.

Even with modest ranges for each variable, the total number of permutations can reach billions.

| Optimization Variables | Candidate Schedules |
| --- | ---: |
| Simple Tiling | 12 |
| + Reordering | 144 |
| + Unrolling | 576 |
| + Vectorization | 4,608 |
| + Thread Binding | 368,640 |

*Estimate of search space expansion as additional optimization axes are introduced for a standard matrix operation.*

## Templates and Sketching

To manage this complexity, modern approaches often use "templates" or "sketches" rather than raw, unstructured search spaces.

A template defines the skeletal structure of the schedule manually but leaves the specific integer values as variables. For example, an engineer might write a schedule template that says: "Always tile the output channels and parallelize the outer loop, but let the auto-tuner decide the tile size $X$ and the unroll factor $Y$."

This hybrid approach restricts the search space to a region known to be generally efficient, relying on the auto-tuner only for the hardware-specific tuning of numerical parameters. It significantly reduces search time while still achieving high performance.

In the next section, we will explore how cost models analyze these candidates to predict performance without running every single one on the actual hardware.