Finding the optimal schedule for a loop nest is fundamentally different from training the weights of a neural network. In model training, the optimization surface is continuous and differentiable, allowing gradients to guide the optimizer toward a minimum. In compiler optimization, the search space is discrete, non-differentiable, and often discontinuous. A small change in a parameter, such as increasing a tile size by one, can cause a cache conflict that dramatically drops performance, while a slightly larger change might align perfectly with cache lines and double the throughput.

Because we cannot compute gradients with respect to schedule parameters (such as `tile_x` or `unroll_factor`), we rely on black-box optimization algorithms. These algorithms sample configurations from the search space defined in the previous section, evaluating their quality using either real hardware measurements or a cost model. This section examines the progression from simple stochastic methods to the sophisticated evolutionary strategies used in modern auto-schedulers like Ansor.

### The Baseline: Random Search

Random search serves as the standard baseline for evaluating the effectiveness of any auto-tuning system. It operates by uniformly sampling valid configurations from the defined search space. While seemingly primitive, random search has distinct properties that make it relevant:

- **Unbiased exploration:** It cannot get stuck in local optima because it has no memory of previous states.
- **Parallelism:** Every sample is independent, allowing massive parallelization across distributed workers.
- **Search space coverage:** It provides a statistical approximation of how performance is distributed across the entire space.

In high-dimensional spaces, random search often outperforms grid search because grid search wastes resources exploring dimensions that have little impact on the objective function. For compiler schedules, however, random search is inefficient. The high-performance region of a schedule space is typically extremely narrow: often fewer than 0.01% of valid schedules achieve within 90% of peak performance. Relying on probability alone to land in this "good" region is computationally expensive.

### Simulated Annealing

To improve upon random sampling, we need a method that explores the space broadly but eventually converges on high-performing regions. Simulated Annealing (SA) is a probabilistic technique capable of escaping local optima, a frequent problem in the rugged optimization landscape of loop schedules.

SA maintains a current state $s$ and iteratively proposes a neighbor state $s'$. If $s'$ performs better, it is accepted. If $s'$ performs worse, it is accepted with a probability $P$ given by the Boltzmann distribution:

$$ P(\text{accept}) = \exp\left( \frac{E(s) - E(s')}{T} \right) $$

Here, $E(s)$ is the energy (the inverse of performance) and $T$ is the temperature. The temperature starts high, allowing the algorithm to accept poor configurations and explore the space broadly. As the search progresses, $T$ decays according to a cooling schedule. When $T$ approaches zero, the algorithm behaves like a greedy hill-climber, accepting only improvements.

In the context of TVM or MLIR, a "neighbor" is defined by mutating one or more specific knobs of the current schedule, for example, changing the vectorization factor from 4 to 8 or swapping the order of two loops.
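To make the acceptance rule concrete, here is a minimal sketch of SA over a dictionary of schedule knobs. The knob names and values are illustrative, and `measure_latency` is a hypothetical callback standing in for a real evaluation path (a hardware run or a cost-model query):

```python
import math
import random

# Hypothetical knob space: each knob has a discrete set of legal values.
KNOBS = {
    "tile_x": [1, 2, 4, 8, 16, 32],
    "tile_y": [1, 2, 4, 8, 16, 32],
    "unroll_factor": [0, 2, 4, 8],
    "vectorize": [1, 4, 8],
}

def random_config():
    """Uniformly sample a valid configuration (the random-search baseline)."""
    return {knob: random.choice(values) for knob, values in KNOBS.items()}

def neighbor(config):
    """Propose a neighbor by re-sampling a single knob."""
    new = dict(config)
    knob = random.choice(list(KNOBS))
    new[knob] = random.choice(KNOBS[knob])
    return new

def simulated_annealing(measure_latency, steps=500, t_start=1.0, decay=0.99):
    """measure_latency(config) -> latency; lower is better (the 'energy' E)."""
    current = random_config()
    e_current = measure_latency(current)
    best, e_best = current, e_current
    temperature = t_start
    for _ in range(steps):
        candidate = neighbor(current)
        e_candidate = measure_latency(candidate)
        # Accept improvements outright; accept regressions with
        # probability exp((E(s) - E(s')) / T), the Boltzmann rule above.
        delta = e_current - e_candidate
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
        temperature *= decay  # geometric cooling schedule
    return best, e_best
```

Note that `random_config` alone, called in a loop while tracking the best result, is exactly the random-search baseline from the previous subsection; SA differs only in how the next sample is chosen.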
### Evolutionary and Genetic Algorithms

Modern deep learning compilers, including Ansor (TVM's auto-scheduler), rely heavily on evolutionary algorithms. Unlike SA, which tracks a single state, evolutionary algorithms maintain a population of candidate schedules. This approach is particularly effective for tensor programs because strong schedules often require a combination of specific traits (e.g., a particular tiling structure combined with a particular vectorization strategy) that can be evolved independently and then recombined.

The evolutionary process in a compiler auto-tuner typically follows these steps:

1. **Initialization:** A population is seeded, often with a mix of random valid schedules and heuristics-based schedules.
2. **Selection:** The fittest individuals (those with the lowest predicted latency) are selected to breed the next generation.
3. **Crossover:** Attributes from two parent schedules are combined. For a loop nest, this might mean taking the tiling structure from Parent A and the thread-binding strategy from Parent B.
4. **Mutation:** Random alterations are applied to maintain genetic diversity. This prevents the population from converging prematurely to a sub-optimal local peak.

[Figure: the evolutionary cycle. Current population → selection (probabilistic ranking) → genetic operations (mutation & crossover) → offspring, scored by a cost model or hardware runner (the fitness function).] *The workflow of an evolutionary search strategy in auto-scheduling. The population evolves through mutation and crossover, guided by a fitness function derived from a cost model or hardware execution.*

### Mutation Operators in Code Generation

Implementing mutation for code generation is more constrained than in general optimization because the resulting schedule must remain legal. Invalid mutations can break data dependencies or violate hardware constraints (such as using more shared memory than is available).

Common mutation operators include the following (two of them are sketched after this subsection):

- **Tile size mutation:** Multiplies or divides a tile size by a factor (e.g., changing a block size of 32 to 64).
- **Reorder mutation:** Randomly swaps two adjacent loops in the loop nest.
- **Annotation mutation:** Changes pragmas such as unroll depth or vectorize width.
- **Parallelization mutation:** Changes the binding of loops to GPU threads (`blockIdx`/`threadIdx`).

When performing crossover, the compiler must ensure that the combined traits do not conflict. For instance, if Parent A tiles the reduction axis and Parent B does not, merging them blindly could produce malformed IR.
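The sketch below illustrates two of these operators plus a legality-checked crossover over the same hypothetical knob dictionary as before. Configurations here additionally carry a hypothetical `loop_order` list for the reorder operator to act on, and `is_legal` is a stand-in for the dependence and resource checks a real compiler performs:

```python
import random

MAX_SHARED_MEM_BYTES = 48 * 1024  # hypothetical per-block shared-memory budget

def is_legal(config):
    """Stand-in legality check: reject configurations whose fp32 tile
    footprint (tile_x * tile_y elements, 4 bytes each) exceeds the budget."""
    return config["tile_x"] * config["tile_y"] * 4 <= MAX_SHARED_MEM_BYTES

def mutate_tile_size(config):
    """Tile-size mutation: double or halve one tile dimension."""
    new = dict(config)
    knob = random.choice(["tile_x", "tile_y"])
    new[knob] = new[knob] * 2 if random.random() < 0.5 else max(1, new[knob] // 2)
    return new

def mutate_reorder(config):
    """Reorder mutation: swap two adjacent loops in the hypothetical loop order."""
    new = dict(config)
    order = list(new["loop_order"])
    i = random.randrange(len(order) - 1)
    order[i], order[i + 1] = order[i + 1], order[i]
    new["loop_order"] = order
    return new

def crossover(parent_a, parent_b):
    """Combine tiling traits from parent A with the remaining traits of
    parent B, then repair the child if the combination is illegal."""
    child = dict(parent_b)
    child["tile_x"] = parent_a["tile_x"]
    child["tile_y"] = parent_a["tile_y"]
    # Shrink tiles until the merged configuration fits in shared memory.
    while not is_legal(child):
        knob = random.choice(["tile_x", "tile_y"])
        child[knob] = max(1, child[knob] // 2)
    return child
```

Guarding crossover with a repair step like this reflects the point above: merging parents blindly can produce malformed schedules, so offspring must be validated or repaired before entering the population.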
### Exploration vs. Exploitation

The core challenge in designing these search algorithms is balancing exploration (finding new, potentially better basins of attraction) against exploitation (refining the best solution found so far):

- **Random search** is 100% exploration, 0% exploitation.
- **Greedy search** is nearly 0% exploration, 100% exploitation.
- **Evolutionary algorithms** provide a tunable balance: the mutation rate controls exploration, while the selection pressure controls exploitation.

In practice, auto-schedulers often use a multi-level approach. They might start with a high-exploration phase (such as random search or high-temperature SA) to seed the cost model with diverse data points. Once the cost model becomes reasonably accurate, the system switches to an evolutionary strategy to refine the best candidates aggressively.

The following chart illustrates typical convergence behavior for these algorithms when tuning a matrix multiplication kernel on a GPU.

[Figure: "Search Algorithm Convergence Comparison", plotting throughput (TFLOPS) against search iterations for random search, simulated annealing, and evolutionary search.] *Comparison of throughput convergence over time. Random search struggles to improve from the baseline. Simulated annealing improves in steps but can plateau. Evolutionary search climbs consistently by combining the best traits of high-performing candidates.*

### From Search to Cost Models

While evolutionary algorithms are powerful, they require thousands of evaluations to converge. Compiling and running thousands of kernels on a GPU is slow, taking minutes or even hours for a single operator. This latency is unacceptable for just-in-time (JIT) compilation scenarios.

To solve this, modern frameworks do not execute every candidate generated by the evolutionary search. Instead, they query a statistical cost model. The search algorithm generates a candidate, the cost model predicts its performance, and only the most promising candidates are compiled and measured on hardware; the measurements are then used to fine-tune the model. This interaction between the search algorithm and the cost model lets the compiler traverse the exponential optimization space efficiently.
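Putting the pieces together, here is a schematic of that interaction, reusing the helpers sketched earlier (`random_config`, `crossover`, `mutate_tile_size`, `is_legal`). The `cost_model` object with `predict` and `update` methods and the `measure_on_hardware` callback are hypothetical placeholders, not any particular framework's API:

```python
import random

def evolutionary_search(cost_model, measure_on_hardware,
                        population_size=128, generations=10, top_k=8):
    """Cost-model-guided evolutionary search (schematic).

    cost_model.predict(configs) -> predicted latencies (lower is better)
    cost_model.update(pairs)    -> refine the model with measured (latency, config)
    measure_on_hardware(config) -> measured latency on the target device
    """
    population = [random_config() for _ in range(population_size)]
    best, best_latency = None, float("inf")

    for _ in range(generations):
        # Cheap surrogate evaluation: rank the whole population by prediction.
        scores = cost_model.predict(population)
        ranked = [cfg for _, cfg in sorted(zip(scores, population),
                                           key=lambda pair: pair[0])]

        # Expensive ground truth: measure only the most promising candidates
        # and feed the results back to fine-tune the cost model.
        measured = [(measure_on_hardware(cfg), cfg) for cfg in ranked[:top_k]]
        cost_model.update(measured)
        for latency, cfg in measured:
            if latency < best_latency:
                best, best_latency = cfg, latency

        # Breed the next generation from the top quartile of parents.
        parents = ranked[: population_size // 4]
        next_population = []
        for _ in range(population_size):
            child = crossover(random.choice(parents), random.choice(parents))
            if random.random() < 0.3:      # mutation rate: the exploration knob
                child = mutate_tile_size(child)
            next_population.append(child if is_legal(child) else random_config())
        population = next_population

    return best, best_latency
```

The key design choice is the asymmetry between the two evaluation paths: the model scores the entire population every generation, while only a small top-k batch ever reaches hardware, which is what keeps the search tractable.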