Defining a search space limits the boundaries of optimization, but it does not tell the compiler which configuration to choose. The number of valid combinations in a typical loop nest often runs into the millions. If the compiler were to compile and benchmark every single variation on real hardware, a single matrix multiplication kernel could take weeks to optimize. To make this process feasible, machine learning compilers rely on efficient search algorithms.

These algorithms navigate the search space to identify high-performance schedules within a reasonable time budget. The goal is to optimize an objective function, usually maximizing throughput (FLOPS) or minimizing latency, by iteratively selecting candidates that are likely to perform better than the current best.

## The Combinatorial Challenge

Consider a simple matrix multiplication operation involving three loops. To optimize it, we might apply loop tiling, vectorization, and unrolling. If we split each loop into three levels of tiling, and we have 10 candidate sizes for each tile, the number of combinations grows exponentially.

Mathematically, if we have $k$ tunable parameters $p_1, p_2, \dots, p_k$, and each parameter $p_i$ has $|D_i|$ valid choices, the size of the search space $S$ is:

$$ |S| = \prod_{i=1}^{k} |D_i| $$

Even for moderate loop nests, $|S|$ can easily exceed $10^9$. Since measuring a single configuration on hardware takes a non-trivial amount of time (compilation + execution + overhead), we need a strategy that finds the global optimum without visiting every point in $S$.

## Exploration versus Exploitation

Search algorithms in compilers generally balance two competing objectives:

- **Exploration:** investigating unknown regions of the search space to avoid getting stuck in local optima.
- **Exploitation:** refining the search around known high-performing configurations to squeeze out the last bit of performance.

A naive approach is **random search**, where the compiler samples configurations from $S$ uniformly at random. While simple to implement and parallelize, it ignores the structure of the problem: it does not learn from previous failures or successes. Consequently, modern compilers rarely rely solely on random search. Instead, they use evolutionary algorithms or model-based methods.

## Evolutionary Search Algorithms

Evolutionary algorithms (EA) treat the search process as a form of biological evolution. A "population" consists of a set of loop schedules. In each iteration (generation), the algorithm selects the best-performing schedules and applies genetic operators to create new candidates.

### Mutation and Crossover

In the context of loop optimization, **mutation** involves randomly altering one parameter of a schedule, for example changing a tile size from 32 to 64, or changing the loop unrolling factor.

**Crossover** combines features from two high-performing parent schedules. If Parent A has excellent memory tiling but poor vectorization, and Parent B has poor tiling but excellent vectorization, a crossover might produce a child schedule that inherits the tiling from A and the vectorization from B.

Over many generations, the population drifts toward higher-performing regions of the search space. This method is effective because good schedules often cluster together: if a tile size of 32 works well, a tile size of 16 or 64 is likely better than a random value like 7.
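To make these operators concrete, the sketch below shows how an evolutionary tuner might represent schedules and apply mutation and crossover. It is a minimal illustration: the parameter names and the toy `fitness` function are hypothetical, and in a real auto-tuner fitness would come from hardware measurements or a cost model.

```python
import random

# Hypothetical tunable parameters for a tiled matrix multiplication schedule.
PARAM_SPACE = {
    "tile_i":    [8, 16, 32, 64, 128],
    "tile_j":    [8, 16, 32, 64, 128],
    "tile_k":    [4, 8, 16, 32],
    "unroll":    [0, 2, 4, 8],
    "vectorize": [1, 4, 8, 16],
}

def random_schedule(rng):
    """Sample one schedule uniformly at random (the random-search baseline)."""
    return {name: rng.choice(choices) for name, choices in PARAM_SPACE.items()}

def mutate(schedule, rng):
    """Randomly alter one parameter, e.g. bump a tile size from 32 to 64."""
    child = dict(schedule)
    name = rng.choice(list(PARAM_SPACE))
    child[name] = rng.choice(PARAM_SPACE[name])
    return child

def crossover(parent_a, parent_b, rng):
    """Inherit each parameter from one of the two parents at random."""
    return {name: rng.choice([parent_a[name], parent_b[name]]) for name in PARAM_SPACE}

def evolve(fitness, generations=20, population_size=32, elite=8, seed=0):
    """Keep the fittest schedules each generation and refill with their offspring."""
    rng = random.Random(seed)
    population = [random_schedule(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # higher fitness = better
        parents = population[:elite]
        children = []
        while len(children) < population_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)

# Toy fitness: reward schedules whose tiles fit a made-up cache-friendly shape.
best = evolve(lambda s: -abs(s["tile_i"] - 64) - abs(s["tile_j"] * s["tile_k"] - 256))
print(best)
```

In practice the fitness evaluation is the expensive part: each call corresponds to compiling and benchmarking a schedule, which is exactly the cost that the model-based methods below try to reduce.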
## Model-Based Search with Cost Models

The most advanced auto-tuning systems, such as those found in TVM or Halide, employ a model-based approach. This method uses the statistical cost models discussed in the previous section to guide the search. Instead of running every candidate on hardware, the compiler uses a lightweight machine learning model (the cost model) to predict the performance of thousands of candidates.

The workflow proceeds in rounds:

1. **Candidate generation:** The search policy proposes a large batch of potential schedules.
2. **Prediction:** The cost model predicts the runtime for this batch. This is fast because it happens entirely in software.
3. **Selection:** The system selects the top $k$ candidates with the best predicted scores.
4. **Measurement:** These chosen few are compiled and run on the actual hardware device to get the ground-truth execution time.
5. **Feedback:** The measured data (schedule features + actual time) is added to the training dataset.
6. **Update:** The cost model is retrained or updated with the new data to improve its accuracy for the next round.

```dot
digraph G {
    rankdir=TB;
    bgcolor="transparent";
    node [style=filled, shape=box, fontname="Arial", fontsize=12, color="#ced4da"];
    edge [color="#868e96"];

    start     [label="Start Tuning", fillcolor="#a5d8ff"];
    generate  [label="Generate Candidates", fillcolor="#ffffff"];
    predict   [label="Predict with Cost Model", fillcolor="#ffffff"];
    select    [label="Select Top Candidates", fillcolor="#b2f2bb"];
    benchmark [label="Run on Hardware", fillcolor="#ffc9c9"];
    update    [label="Update Cost Model", fillcolor="#bac8ff"];
    check     [label="Criteria Met?", shape=diamond, fillcolor="#e9ecef"];
    end       [label="Output Best Schedule", fillcolor="#a5d8ff"];

    start -> generate;
    generate -> predict;
    predict -> select;
    select -> benchmark;
    benchmark -> update;
    update -> check;
    check -> generate [label="No"];
    check -> end [label="Yes"];
}
```

*The cycle of model-based optimization allows the compiler to learn the performance characteristics of the hardware during the tuning process.*

This approach is significantly more sample-efficient than random search. The cost model quickly learns to reject bad configurations (such as tile sizes that cause cache thrashing), allowing the physical benchmarking time to be spent only on promising candidates.

### Simulated Annealing and XGBoost

To navigate the space effectively, compilers often combine a global search strategy with a local one. A common implementation uses simulated annealing to explore the space defined by the cost model: the predicted cost acts as an energy function, and the algorithm tries to settle into its lowest point (minimum latency).

Simultaneously, gradient-boosted trees (such as XGBoost) are frequently used as the cost model itself, because they handle the discrete, non-linear nature of compiler parameters well. The search algorithm generates thousands of candidates, the XGBoost model ranks them, and only the top few percent are sent to the hardware runner.

## Convergence and Stopping Criteria

An auto-tuning session cannot run indefinitely, so we must define stopping criteria that determine when the search has done enough. Common criteria include:

- **Number of trials:** Stop after measuring $N$ configurations on hardware (e.g., 1000 trials).
- **Time budget:** Stop after the tuning session has run for a specific duration (e.g., 30 minutes).
- **Early stopping:** Stop if performance has not improved after a certain number of consecutive iterations.
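The sketch below ties the round-based workflow and the early-stopping criterion together in plain Python. It is illustrative only: `measure_on_hardware` is a hypothetical stand-in for compiling and benchmarking a schedule, and XGBoost is used as the cost model simply because it is a common choice (any regressor with `fit`/`predict` would work).

```python
import numpy as np
from xgboost import XGBRegressor  # stands in for the learned cost model

def measure_on_hardware(candidate):
    """Hypothetical stand-in for compiling and running a schedule on the device."""
    tile_x, tile_y, unroll = candidate
    # Fake latency surface with a sweet spot near (64, 32); a real tuner measures.
    return abs(tile_x - 64) + abs(tile_y - 32) + 0.1 * unroll + np.random.rand()

def propose_candidates(n, rng):
    """Candidate generation: sample a large batch of schedules from the space."""
    tiles = [1, 2, 4, 8, 16, 32, 64, 128]
    return [(rng.choice(tiles), rng.choice(tiles), rng.integers(0, 8)) for _ in range(n)]

rng = np.random.default_rng(0)
train_x, train_y = [], []                     # measured (features, latency) pairs
best, best_latency, stale_rounds = None, float("inf"), 0

for _ in range(50):
    batch = propose_candidates(1000, rng)     # 1. candidate generation
    if train_x:                               # 2. prediction (skipped in round 0)
        model = XGBRegressor(n_estimators=50)
        model.fit(np.array(train_x), np.array(train_y))
        predicted = model.predict(np.array(batch))
        batch = [c for _, c in sorted(zip(predicted, batch), key=lambda p: p[0])]
    improved = False
    for candidate in batch[:8]:               # 3. select the top-k candidates
        latency = measure_on_hardware(candidate)   # 4. measurement
        train_x.append(candidate)             # 5. feedback into the dataset
        train_y.append(latency)
        if latency < best_latency:
            best, best_latency, improved = candidate, latency, True
    # 6. update: the model is simply retrained from the full dataset next round
    stale_rounds = 0 if improved else stale_rounds + 1
    if stale_rounds >= 5:                     # early-stopping criterion
        break

print("best schedule:", best, "measured latency:", best_latency)
```

Production systems follow the same loop but extract much richer features from the generated low-level code and replace the uniform random proposal step with evolutionary search or simulated annealing guided by the cost model.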
The chart below illustrates how different search strategies might converge over time. Notice how model-based (guided) search typically achieves higher performance with fewer hardware trials compared to random search.

```json
{
  "layout": {
    "title": "Convergence of Search Strategies",
    "xaxis": {"title": "Number of Hardware Trials"},
    "yaxis": {"title": "Throughput (GFLOPS)"},
    "template": "simple_white", "width": 600, "height": 400,
    "legend": {"x": 0.7, "y": 0.1}
  },
  "data": [
    {"x": [0, 100, 200, 300, 400, 500, 600, 700, 800],
     "y": [10, 15, 18, 20, 22, 23, 24, 24, 24.5],
     "type": "scatter", "mode": "lines+markers", "name": "Random Search",
     "line": {"color": "#adb5bd"}},
    {"x": [0, 100, 200, 300, 400, 500, 600, 700, 800],
     "y": [10, 25, 40, 55, 65, 70, 72, 73, 73.5],
     "type": "scatter", "mode": "lines+markers", "name": "Model-Based Search",
     "line": {"color": "#4c6ef5"}}
  ]
}
```

*Performance improvements over successive trials. The guided search uses historical data to find optimal parameters faster.*

## Transfer Learning in Search

A challenge with auto-tuning is that it usually starts from scratch for every new model or operator. To mitigate this, modern research focuses on transfer learning. If the compiler has already tuned a matrix multiplication of size $1024 \times 1024$, it should use that knowledge when tuning a multiplication of size $1024 \times 512$.

In this scenario, the search algorithm initializes the cost model with weights pre-trained on similar workloads. This gives the search a "warm start," reducing the time required to find a good schedule. Instead of exploring the entire space blindly, the search focuses immediately on regions that were profitable for similar tensor shapes in the past.

## Practical Implementation

When configuring an automated search in frameworks like TVM or MLIR, you are essentially configuring the parameters of these algorithms. You define the specific hardware target (which dictates the instruction set and memory limits), the number of trials, and the complexity of the search policy.

By understanding the mechanics of these search algorithms, you can better diagnose tuning issues. If the performance plateaus too early, the search might be stuck in a local optimum (too much exploitation). If the performance is erratic and slow to improve, the search space might be too large or the cost model inaccurate (too much exploration without guidance). Optimizing the optimizer is often the final step in deploying high-performance ML models.
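As a concrete, simplified example, the sketch below shows roughly how these knobs come together with TVM's template-based AutoTVM tuner. The template name, shapes, trial counts, and log file are illustrative choices, and exact APIs vary between TVM releases (and between AutoTVM and the newer auto-scheduler), so treat this as a sketch rather than a reference.

```python
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul")           # illustrative template name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    (k,) = s[C].op.reduce_axis

    # Declare the search space: how to split each loop is left to the tuner.
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, k, yi, xi)
    return s, [A, B, C]

# The task binds the template to concrete shapes and a hardware target.
task = autotvm.task.create("example/matmul", args=(512, 512, 512, "float32"), target="llvm")

# How each candidate is measured: local compilation plus repeated on-device runs.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=3, timeout=10),
)

# XGBTuner combines an XGBoost cost model with simulated-annealing exploration.
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(
    n_trial=1000,                 # stop after 1000 hardware measurements...
    early_stopping=200,           # ...or after 200 trials with no improvement
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("matmul_tuning.log")],
)
```

The best configuration recorded in the log can then be applied when compiling the operator, for example via `autotvm.apply_history_best("matmul_tuning.log")`.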