Evaluating thousands of candidate schedules on physical hardware creates a significant bottleneck in the compilation pipeline. While a genetic algorithm or simulated annealing solver can generate valid configurations rapidly, actually executing a kernel on a GPU or TPU takes milliseconds to seconds per trial. When the search space contains billions of possibilities, hardware-in-the-loop verification becomes computationally prohibitive.

To bridge the gap between the massive search space and limited hardware availability, auto-tuners employ statistical cost models. A cost model acts as a high-throughput surrogate for the hardware, predicting the performance of a specific schedule without executing it. This allows the search algorithm to query the performance of thousands of candidates per second, filtering down to only the most promising configurations for actual hardware benchmarking.

## The Role of the Surrogate

Formally, we define the optimization problem as finding a schedule $s$ from the space of valid schedules $S$ that minimizes the execution time on a hardware target $H$. The true cost function, $Cost(s, H)$, is expensive to evaluate. The cost model provides an approximation function $\hat{f}(s)$ such that:

$$ \hat{f}(s) \approx Cost(s, H) $$

The search algorithm interacts primarily with $\hat{f}(s)$. The workflow follows an active learning approach. The tuner starts with no knowledge of the hardware's response to the specific operator. It randomly selects a small batch of schedules, measures them on the device, and uses the resulting latency data to train the initial cost model.

As tuning progresses, the model predicts the performance of the next batch of candidates. The top candidates are measured on hardware, and these ground-truth measurements are added to the training set to update the model. This feedback loop continuously refines the model's accuracy in the regions of the search space that matter most.

```dot
digraph G {
  rankdir=TB;
  node [shape=box, style=filled, fillcolor="#e9ecef", fontname="Helvetica", fontsize=10];
  edge [fontname="Helvetica", fontsize=9];
  subgraph cluster_0 {
    label = "Host Compiler";
    style=dashed;
    color="#adb5bd";
    SearchPolicy [label="Search Algorithm\n(Genetic/Simulated Annealing)", fillcolor="#d0bfff"];
    CostModel [label="Statistical Cost Model\n(XGBoost/LightGBM/MLP)", fillcolor="#a5d8ff"];
    FeatureExtract [label="Feature Extraction\n(AST -> Vectors)", fillcolor="#bac8ff"];
  }
  Hardware [label="Hardware Measurement\n(GPU/Accelerator)", fillcolor="#ffc9c9"];
  Dataset [label="Training Dataset\n{(Schedule, Latency)}", shape=cylinder, fillcolor="#ffe066"];
  SearchPolicy -> FeatureExtract [label="Generates Candidates"];
  FeatureExtract -> CostModel [label="Input Vectors"];
  CostModel -> SearchPolicy [label="Predicted Score"];
  SearchPolicy -> Hardware [label="Top-k Candidates"];
  Hardware -> Dataset [label="Execution Time"];
  Dataset -> CostModel [label="Retraining (Gradient Update)"];
}
```

*The active learning cycle in auto-tuning. The cost model serves as a filter, allowing the search policy to explore millions of candidates while only running the most promising ones on physical hardware.*

## Feature Engineering for Programs

Machine learning models require numerical input, but a loop schedule is a structured syntactic object. Converting an Abstract Syntax Tree (AST) or a specific loop configuration into a feature vector is a critical step in building an effective cost model.

### Statistical Feature Extraction

In systems like AutoTVM, feature extraction relies on expert-defined rules. The compiler traverses the low-level intermediate representation (IR) of the loop nest and extracts statistical counters. Common features include:

- **Memory Access Patterns:** The number of contiguous versus strided memory accesses. This helps the model predict cache behavior and memory bandwidth utilization.
- **Arithmetic Intensity:** The ratio of floating-point operations to memory bytes transferred.
- **Loop Structure:** The depth of the loop nest, the extent of each loop, and the unroll factors.
- **Vectorization:** The length of vector instructions and the utilization of SIMD lanes.

These features are flattened into a fixed-length vector and fed into gradient-boosted decision trees (GBDT), such as XGBoost. Trees are particularly effective here because they capture the non-linear interactions between features (e.g., tiling size vs. cache size) without requiring data normalization.
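To make this concrete, the sketch below assembles hand-crafted counters into a fixed-length vector and fits an XGBoost regressor on measured latencies. The schedule dictionary, the `extract_features` helper, and the numbers are illustrative stand-ins, not AutoTVM's actual extraction pass.

```python
# A minimal sketch of statistical feature extraction feeding a GBDT cost model.
# The schedule representation and feature set are simplified illustrations.
import numpy as np
import xgboost as xgb

def extract_features(schedule: dict) -> np.ndarray:
    """Flatten a schedule description into a fixed-length feature vector."""
    return np.array([
        schedule["contiguous_accesses"],                              # memory access pattern
        schedule["strided_accesses"],
        schedule["flops"] / max(schedule["bytes_transferred"], 1),    # arithmetic intensity
        schedule["loop_depth"],                                       # loop structure
        schedule["unroll_factor"],
        schedule["vector_length"],                                    # SIMD utilization
    ], dtype=np.float32)

# Hypothetical measured data: (schedule, latency in ms) pairs from the device.
measured = [
    ({"contiguous_accesses": 900, "strided_accesses": 40, "flops": 2e9,
      "bytes_transferred": 4e7, "loop_depth": 4, "unroll_factor": 8,
      "vector_length": 8}, 1.9),
    ({"contiguous_accesses": 200, "strided_accesses": 700, "flops": 2e9,
      "bytes_transferred": 9e7, "loop_depth": 4, "unroll_factor": 1,
      "vector_length": 1}, 7.5),
]

X = np.stack([extract_features(s) for s, _ in measured])
y = np.array([latency for _, latency in measured])

# Train a small GBDT regressor; trees need no feature normalization.
model = xgb.XGBRegressor(n_estimators=50, max_depth=4)
model.fit(X, y)

# Score an unmeasured candidate (lower predicted latency is better).
candidate = {"contiguous_accesses": 850, "strided_accesses": 60, "flops": 2e9,
             "bytes_transferred": 4.2e7, "loop_depth": 4, "unroll_factor": 4,
             "vector_length": 8}
print(model.predict(extract_features(candidate)[None, :]))
```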
### Deep Learning Based Extraction

Newer systems such as Ansor (TVM's auto-scheduler) use deep learning to learn the feature representation automatically. Instead of hand-crafted counters, these systems treat the program as a graph or a sequence.

For example, a schedule can be encoded as a tree in which each node represents a loop or a statement. A Tree-LSTM (Long Short-Term Memory) network or a Graph Neural Network (GNN) processes this structure to produce an embedding vector. Such an embedding captures complex dependencies, such as the distance between a definition and its use, which would be lost in simple statistical counters.

## Ranking over Regression

While the goal is to minimize latency, the cost model does not need to predict the exact execution time in milliseconds. It is more important that the model correctly ranks the relative performance of two schedules: if Schedule A is faster than Schedule B on hardware, the cost model should rank A above B, even if the absolute predicted values are inaccurate.

Consequently, modern cost models often optimize a pairwise ranking loss rather than a standard Mean Squared Error (MSE) regression loss. Given two schedules $s_i$ and $s_j$ with measured latencies $y_i$ and $y_j$, and predicted scores $\hat{y}_i$ and $\hat{y}_j$ (oriented so that a higher score indicates a faster schedule), a RankNet-style loss minimizes the probability of incorrect ordering. Using the sigmoid function, the predicted probability that $s_i$ is faster than $s_j$ is:

$$ P_{ij} = \frac{1}{1 + e^{-(\hat{y}_i - \hat{y}_j)}} $$

The loss is then the cross-entropy between this predicted probability and the ordering actually observed on hardware. This approach makes the model resilient to noise in hardware measurements and focuses its learning capacity on distinguishing good schedules from bad ones, rather than on learning the precise physics of the GPU.
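To ground this, here is a minimal NumPy sketch of such a pairwise loss over a batch of candidates. The `pairwise_rank_loss` helper and the example scores and latencies are illustrative stand-ins, not any specific tuner's implementation.

```python
# A minimal NumPy sketch of a RankNet-style pairwise ranking loss.
# Scores are oriented so that a higher score means "predicted faster".
import numpy as np

def pairwise_rank_loss(scores: np.ndarray, latencies: np.ndarray) -> float:
    """Cross-entropy over all ordered pairs (i, j) where s_i measured faster."""
    loss, num_pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if latencies[i] < latencies[j]:        # ground truth: s_i is faster
                p_ij = 1.0 / (1.0 + np.exp(-(scores[i] - scores[j])))
                loss += -np.log(p_ij + 1e-12)      # target probability is 1
                num_pairs += 1
    return loss / max(num_pairs, 1)

# Measured latencies (ms) and two candidate score assignments.
latencies = np.array([1.2, 3.5, 0.8])
good_scores = np.array([2.0, -1.0, 3.0])   # ordering matches the hardware
bad_scores = np.array([-1.0, 2.0, 0.5])    # ordering is inverted

print(pairwise_rank_loss(good_scores, latencies))  # small loss
print(pairwise_rank_loss(bad_scores, latencies))   # large loss
```

Because only score differences enter the loss, absolute calibration is irrelevant; a noisy latency measurement affects training only if it flips the observed order of a pair.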
Cost Model Guided", "xaxis": { "title": "Number of Hardware Measurement Trials", "showgrid": true, "gridcolor": "#dee2e6" }, "yaxis": { "title": "Throughput (GFLOPS)", "showgrid": true, "gridcolor": "#dee2e6" }, "plot_bgcolor": "white", "paper_bgcolor": "white", "width": 700, "height": 400, "legend": { "x": 0.7, "y": 0.1 } }, "data": [ { "x": [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500], "y": [100, 150, 180, 200, 210, 215, 220, 225, 230, 232, 235], "type": "scatter", "mode": "lines+markers", "name": "Random Search", "line": {"color": "#adb5bd", "dash": "dash"} }, { "x": [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500], "y": [100, 400, 650, 800, 880, 920, 940, 950, 955, 960, 962], "type": "scatter", "mode": "lines+markers", "name": "XGBoost Cost Model", "line": {"color": "#339af0", "width": 3} } ] }Comparison of tuning convergence. The cost model (blue) rapidly identifies high-performance regions, achieving near-peak throughput within 300 trials, whereas random search (gray) improves slowly.Transfer Learning in Cost ModelsOne significant limitation of standard cost models is that they are typically task-specific. A model trained to optimize a Conv2d operator with shape $224 \times 224$ usually cannot predict the performance of a MatMul or even a Conv2d with a different shape. This requires the auto-tuner to start from scratch for every new subgraph in a neural network.To address this, advanced compilers implement transfer learning. By training on a massive dataset of diverse operators and schedules across various hardware targets, a base model learns general principles of code efficiency, such as the penalty of unaligned memory access or the benefit of register tiling. When tuning a new operator, the system starts with this pre-trained model and fine-tunes it with a few logical samples. This significantly reduces the "cold start" period, allowing the tuner to find optimal schedules with far fewer hardware measurements.