The search space for loop optimizations is large: a single matrix multiplication operator can have billions of valid schedule configurations once tile sizes, vectorization factors, and loop reordering permutations are accounted for. If benchmarking a single configuration on physical hardware takes even a fraction of a second, evaluating the entire space is computationally infeasible.

To navigate this massive set of possibilities efficiently, we cannot rely on hardware measurements for every candidate. Instead, we use a cost model. A cost model serves as a mathematical proxy for the hardware: it predicts the performance of a specific loop schedule without actually running the code. This allows the auto-tuner to score millions of configurations per second, filtering out poor candidates and identifying promising ones for actual verification.

The Role of the Cost Model

The fundamental goal of the cost model is to approximate the execution time of a generated kernel. If we define a schedule configuration as $S$ and the actual hardware execution time as $T(S)$, the cost model $f$ attempts to minimize the prediction error, so that

$$f(S) \approx T(S)$$

In a typical auto-tuning pipeline, the search algorithm proposes a batch of candidate schedules. The cost model evaluates these candidates and assigns a score or predicted latency to each. Only the candidates with the best predicted scores are sent to the hardware for benchmarking. This dramatically reduces tuning time by filtering out inefficient schedules early in the process.

There are two primary categories of cost models used in machine learning compilers: analytical models and learned models.

Analytical Cost Models

Early compilers relied heavily on analytical modeling. These models use predefined formulas based on hardware specifications: an analytical model counts operations and estimates latency from per-instruction costs.

For example, a simple analytical model for a CPU might calculate cost as:

$$Cost = (N_{flops} \times C_{flop}) + (N_{load} \times C_{mem}) + (N_{miss} \times C_{penalty})$$

where $C_{flop}$, $C_{mem}$, and $C_{penalty}$ are the costs in clock cycles of a floating-point operation, a memory load, and a cache miss, respectively.

While analytical models are extremely fast to evaluate, they often fail to capture the complexity of modern accelerators. Modern GPUs and TPUs exhibit behaviors such as out-of-order execution, bank conflicts, and instruction pipelining that are difficult to express in a static formula. Consequently, a schedule that looks efficient analytically might perform poorly in practice due to hardware-specific bottlenecks like shared memory bank conflicts.

Learned Cost Models

Modern frameworks like TVM and Halide have moved toward learned cost models. Instead of relying on a fixed formula, these systems use machine learning to predict the performance of other machine learning workloads. This approach treats performance prediction as a regression or ranking problem.

A learned cost model typically uses a lightweight statistical method, such as XGBoost (gradient-boosted trees) or a small multi-layer perceptron (MLP). The workflow operates as follows:

1. Feature Extraction: The compiler extracts features from the loop schedule, such as the number of memory accesses, loop nesting depth, vectorization width, and data reuse patterns.
2. Training: The model trains on pairs of $(Schedule, ExecutionTime)$.
3. Inference: The model predicts scores for new, unseen schedules.
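To make this three-step workflow concrete, here is a minimal sketch of a learned cost model in Python. It is illustrative rather than drawn from any particular compiler: the feature set, the random_schedule generator, and the fake_measure stand-in for hardware measurement are all hypothetical, and it assumes the numpy and xgboost packages are available.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Hypothetical feature extraction: each schedule is summarized by a small
# numeric vector. A real compiler derives these from the loop-nest AST.
def make_features(schedule):
    return np.array([
        schedule["tile_i"] * schedule["tile_j"],  # tile footprint
        schedule["vector_width"],                 # vectorization factor
        schedule["unroll"],                       # unrolling extent
        schedule["loads_per_iter"],               # memory traffic proxy
    ], dtype=np.float32)

def random_schedule():
    return {
        "tile_i": int(rng.choice([8, 16, 32, 64])),
        "tile_j": int(rng.choice([8, 16, 32, 64])),
        "vector_width": int(rng.choice([1, 4, 8])),
        "unroll": int(rng.choice([1, 2, 4])),
        "loads_per_iter": int(rng.integers(2, 32)),
    }

# Stand-in for hardware measurement: in a real tuner this is a compile-and-run
# step on the target device; a synthetic latency keeps the sketch self-contained.
def fake_measure(schedule):
    f = make_features(schedule)
    return float(f[3] / f[1] + 0.001 * f[0] + rng.normal(0, 0.05))

# 1. Feature extraction and 2. training on (schedule, execution time) pairs.
measured = [random_schedule() for _ in range(500)]
X_train = np.stack([make_features(s) for s in measured])
y_train = np.array([fake_measure(s) for s in measured])

model = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

# 3. Inference: score unseen candidates and keep only the top-k for real
# benchmarking.
candidates = [random_schedule() for _ in range(10_000)]
scores = model.predict(np.stack([make_features(s) for s in candidates]))
top_k = [candidates[i] for i in np.argsort(scores)[:16]]  # lowest predicted latency
```

Real tuners extract far richer features from the loop-nest representation, but the shape of the loop is the same: train on measured pairs, predict on unseen candidates, and benchmark only the best few.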
This approach is effective because it adapts to the specific hardware. If you tune a model on an NVIDIA A100, the cost model learns the performance characteristics of the A100. If you switch to a mobile ARM CPU, the cost model retrains and learns the distinct characteristics of that architecture.

Feature Extraction for Loop Schedules

To train a model, the abstract syntax tree (AST) of the loop nest must be converted into a feature vector. Common features extracted by ML compilers include:

- Memory Access Patterns: Is the memory access contiguous (friendly to vectorization) or strided (prone to cache misses)?
- Arithmetic Intensity: The ratio of compute operations to memory operations.
- Loop Structure: The sizes of the loops and the extent of unrolling.
- Parallelism: How many threads are utilized and how the workload is distributed among them.

The diagram below illustrates the flow of data from the search space through the cost model and into the hardware measurement phase.

[Figure: Search Space (valid schedules) → Sampler (select candidates) → Cost Model (XGBoost/MLP) → Filter Top-K Candidates → Hardware Measurement → Training Data (Schedule, Time) → retrain Cost Model.]

The iterative workflow of a learned cost model, showing how hardware measurements refine predictions over time.

The Exploration-Exploitation Trade-off

A critical challenge in auto-tuning is the balance between exploration and exploitation.

- Exploitation: Choosing schedules that the cost model predicts are fast. This relies on the model's current knowledge to find the best performance.
- Exploration: Choosing schedules where the cost model is uncertain. This helps the model learn about parts of the search space it has not seen yet.

If the tuner only exploits, it might get stuck in a local minimum because the cost model is inaccurate in unexplored regions. If it only explores, it wastes time measuring random, likely poor, schedules.

To manage this, auto-tuners usually employ an epsilon-greedy strategy or an Upper Confidence Bound (UCB) approach; a minimal sketch of an epsilon-greedy policy appears in code after the next subsection. In the early rounds of tuning, the system randomly samples the search space to build a diverse dataset. As the cost model becomes more accurate (lower training error), the system shifts toward selecting the top-predicted candidates to converge on the optimal schedule.

Metric: Rank vs. Absolute Latency

Interestingly, a cost model does not need to predict the exact execution time perfectly to be useful. It only needs to rank the schedules correctly: if Schedule A is faster than Schedule B on hardware, the cost model must predict $f(A) < f(B)$.

Because relative ordering matters more than absolute precision, many auto-tuners use ranking loss functions (such as pairwise ranking loss) during training rather than standard mean squared error (MSE). This ensures the model focuses on distinguishing good schedules from bad ones, rather than trying to predict the exact number of nanoseconds a kernel will take.
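To illustrate what a pairwise ranking objective looks like, the sketch below implements a generic pairwise hinge loss in NumPy. It is not the exact loss used by TVM, Halide, or any other specific tuner, and the margin value is an arbitrary illustrative choice.

```python
import numpy as np

def pairwise_hinge_loss(predicted, measured, margin=0.1):
    """Pairwise ranking loss over a batch of schedules.

    predicted: cost-model scores (lower = predicted faster), shape (n,)
    measured:  true latencies from hardware, shape (n,)

    For every pair where schedule i is truly faster than schedule j, the
    model is penalized unless it scores i lower than j by at least `margin`.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    measured = np.asarray(measured, dtype=np.float64)

    # Boolean matrix: entry (i, j) is True when i is truly faster than j.
    truly_faster = measured[:, None] < measured[None, :]

    # Hinge penalty on the predicted score difference for those pairs.
    score_gap = predicted[:, None] - predicted[None, :]  # want this negative
    penalty = np.maximum(0.0, margin + score_gap)

    n_pairs = truly_faster.sum()
    return (penalty * truly_faster).sum() / max(n_pairs, 1)

# Example: the model orders most schedules correctly, so the loss is small.
measured = np.array([1.0, 2.0, 3.0, 4.0])    # ms, ground truth
predicted = np.array([0.9, 2.2, 2.1, 4.5])   # one inversion (indices 1 and 2)
print(pairwise_hinge_loss(predicted, measured))
```

Gradient-boosted tree libraries such as XGBoost also expose built-in pairwise ranking objectives, so the same model family used earlier can be trained to rank rather than regress.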
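Returning to the exploration-exploitation trade-off discussed above, the sketch below shows the kind of epsilon-greedy selection policy referenced there: most of each measurement batch exploits the cost model's predictions, while a fraction is sampled at random to keep gathering training data. The batch size, epsilon value, and decay schedule are illustrative assumptions rather than values prescribed by any framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_batch(candidates, predicted_latency, batch_size=16, epsilon=0.2):
    """Pick schedules to benchmark: mostly exploit the model, sometimes explore.

    candidates:        list of schedule configurations
    predicted_latency: cost-model scores for each candidate (lower = better)
    epsilon:           fraction of the batch chosen at random
    """
    n_explore = int(round(epsilon * batch_size))
    n_exploit = batch_size - n_explore

    # Exploit: the candidates the cost model currently believes are fastest.
    best_first = np.argsort(predicted_latency)
    chosen = list(best_first[:n_exploit])

    # Explore: random candidates outside the exploited set, to gather
    # training data in regions where the model may be inaccurate.
    remaining = best_first[n_exploit:]
    chosen.extend(rng.choice(remaining, size=n_explore, replace=False))

    return [candidates[i] for i in chosen]

# Example usage with dummy scores for 1,000 hypothetical candidates.
dummy_scores = rng.random(1000)
batch = select_batch(list(range(1000)), dummy_scores)

# Typically epsilon starts high (more exploration) and decays as the cost
# model's ranking accuracy improves, e.g. epsilon = max(0.05, 0.5 * 0.9 ** round_index).
```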
The following chart demonstrates how a cost model improves over tuning trials. Initially, the correlation between predicted rank and actual hardware rank is low. As more data is collected, the model aligns closer to reality.

[Figure: Cost Model Accuracy Over Tuning Epochs — ranking correlation (0 to 1) versus tuning epochs, rising from roughly 0.15 at epoch 1 to about 0.91 by epoch 10.]

Improvement in ranking correlation as the auto-tuner collects more hardware measurements.

Hardware-Specific Notes

When configuring a cost model for a specific target, different factors influence the model's complexity:

- CPUs: Performance is often dominated by cache locality and vectorization. Cost models for CPUs focus heavily on loop tiling sizes and their relationship to L1/L2 cache sizes.
- GPUs: Parallelism and memory-latency hiding are critical. The cost model must learn to favor schedules that map well to thread blocks and warps while minimizing global memory round-trips.
- Edge Accelerators: These devices often have limited instruction sets or specialized memory buffers. A cost model here must respect strict instruction constraints, and it typically penalizes any schedule that spills data to slower system RAM.

By effectively using a cost model, we decouple the search for efficiency from the physical limitations of benchmarking. This allows the compiler to evaluate billions of potential programs, eventually generating code that rivals or exceeds hand-tuned libraries like cuDNN or MKL.