When working with frameworks like PyTorch or TensorFlow, the development experience feels immediate and interactive. You define a tensor, perform an addition, and print the result. This design choice, often referred to as eager execution, prioritizes developer productivity and debugging ease. However, this convenience obscures a significant disconnect between how Python programs describe operations and how hardware accelerators execute them. This disconnect is known as the framework-hardware gap.
The gap exists because modern deep learning accelerators, such as GPUs and TPUs, rely on massive parallelism and high throughput to deliver performance. In contrast, the Python interpreter running on the CPU is fundamentally serial and dynamic. Bridging these two environments introduces overhead that can severely limit the performance of a machine learning model, regardless of how powerful the underlying hardware might be.
In an interpreted environment, the framework executes operations one by one. When the interpreter encounters a line of code like `c = a + b`, it performs a series of checks before any calculation occurs. It must verify the data types, check tensor shapes for compatibility, allocate memory for the result, and finally select the appropriate implementation (kernel) for the hardware backend.
This process is called dynamic dispatch. While the overhead of dispatching a single operation is measured in microseconds, a deep neural network might involve hundreds of thousands of such operations. If the actual mathematical computation on the GPU takes less time than the CPU spends preparing and dispatching the instruction, the accelerator sits idle waiting for work. This creates a scenario where the program is "CPU-bound," meaning the expensive GPU resources are underutilized.
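To make the effect concrete, here is a minimal timing sketch in PyTorch (the tensor size and iteration count are arbitrary choices for illustration; a CUDA GPU is used if one is available, otherwise the same pattern runs on the CPU):

```python
# Rough sketch of measuring per-operator dispatch overhead in PyTorch.
# Numbers are illustrative, not a benchmark.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 64, device=device)

def run(n_ops):
    y = x
    for _ in range(n_ops):
        y = y + 1.0               # each iteration is a separate dispatch + kernel launch
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return y

run(10)  # warm-up so one-time costs (allocator, context init) are excluded

start = time.perf_counter()
run(10_000)
elapsed = time.perf_counter() - start
# With a 64x64 tensor the arithmetic is trivial; most of the elapsed time is
# the CPU-side work of type/shape checks, allocation, and kernel launches.
print(f"{elapsed / 10_000 * 1e6:.1f} µs per operation (mostly host overhead)")
```

Because each addition does almost no arithmetic, nearly all of the measured per-operation time is host-side work: type and shape checks, output allocation, and the launch itself.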
The following diagram illustrates the flow of eager execution, highlighting the latency introduced by the host CPU between GPU kernel executions.
Sequence of operations in eager execution showing how CPU dispatch overhead creates idle gaps in GPU utilization.
The second major component of the gap involves how data moves through memory. In a standard framework implementation, every operation is treated as a distinct function call. Consider a typical layer in a neural network that performs a convolution, adds a bias, and applies a ReLU activation function:
$y = \mathrm{ReLU}(\mathrm{Conv}(x, W) + b)$
Executed eagerly, the framework handles this as three separate steps, each materializing its result in memory (see the sketch below):

1. Compute the convolution of x and W and write the intermediate tensor to device memory.
2. Read that intermediate back, add the bias b, and write a second intermediate.
3. Read the second intermediate back, apply ReLU, and write the final output y.
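A minimal PyTorch sketch of these steps, with shapes chosen purely for illustration, makes the intermediate tensors explicit:

```python
# The three eager steps written as separate functional ops. Each line returns
# a new tensor, so the intermediate results round-trip through device memory.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)      # input activation (illustrative shape)
W = torch.randn(16, 3, 3, 3)       # convolution weights
b = torch.randn(16)                # bias

t1 = F.conv2d(x, W, padding=1)     # step 1: convolution, intermediate written to memory
t2 = t1 + b.view(1, -1, 1, 1)      # step 2: intermediate read back, bias added, written again
y = torch.relu(t2)                 # step 3: read back once more for the activation
```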
This pattern is inefficient because accessing global memory (VRAM) is far slower than performing arithmetic. Each intermediate tensor is written out to main memory only to be read back by the next step, so the data is repeatedly moved between the chip's compute units and VRAM.
Hardware accelerators perform best when the ratio of arithmetic operations to memory accesses (arithmetic intensity) is high. By treating operators as granular, isolated units, frameworks artificially lower this intensity. The memory bandwidth becomes the bottleneck, capping the effective throughput long before the compute cores are saturated.
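As a rough, illustrative calculation (assuming float32 data, the shapes from the sketch above, and every operand travelling to and from VRAM), the bias add and ReLU together perform about two floating-point operations per element while moving sixteen bytes per element:

```python
# Back-of-the-envelope arithmetic intensity for the bias + ReLU steps alone.
# Illustrative only: assumes float32 and no caching between operators.
n = 1 * 16 * 32 * 32                    # elements in the intermediate tensor
bytes_per_elem = 4                      # float32

flops = 2 * n                           # one add and one max(0, x) per element
# Unfused: the add reads and writes the tensor, then the ReLU does so again.
bytes_moved = 4 * n * bytes_per_elem

intensity = flops / bytes_moved
print(f"~{intensity:.3f} FLOPs per byte")   # far below what GPUs need to stay compute-bound
```

An intensity on the order of 0.1 FLOPs per byte leaves the compute units waiting on memory; fusing the steps so intermediates stay in registers or on-chip memory raises it.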
The chart below compares the time breakdown of a naive execution versus an optimized approach where overhead and memory access are minimized.
Comparison of time spent in computation versus memory access and dispatch overhead. Optimized execution significantly reduces non-compute tasks.
Every time a framework instructs the GPU to run a function, it initiates a "kernel launch." This launch tells the GPU scheduler which code to run, with what parameters, and how to map threads to data.
For large matrix multiplications (like those in Large Language Models), the computation takes long enough that the launch cost is negligible. However, modern network architectures often contain many small, element-wise operations (activations, normalizations, reshapes). For these small operators, the time taken to configure and launch the kernel can exceed the time taken to execute it.
If a model requires launching 50 distinct kernels to process a single input image, and each launch incurs a fixed overhead, the latency accumulates. This is particularly problematic for inference workloads where low latency is required.
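The effect is easy to observe with a sketch like the one below (assuming PyTorch and a CUDA-capable GPU; sizes and iteration counts are arbitrary): a large matrix multiply amortizes its launch cost over a great deal of arithmetic, while a tiny ReLU is essentially all launch overhead.

```python
# Sketch comparing a large matrix multiply (launch cost negligible) with a
# tiny element-wise op (launch cost dominant). Assumes a CUDA-capable GPU.
import time
import torch

assert torch.cuda.is_available()

def timed(fn, iters=100):
    fn()                              # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()          # include the time the GPU spends finishing the work
    return (time.perf_counter() - start) / iters

big_a = torch.randn(4096, 4096, device="cuda")
big_b = torch.randn(4096, 4096, device="cuda")
small = torch.randn(256, device="cuda")

matmul_time = timed(lambda: big_a @ big_b)        # one launch, lots of math
elemwise_time = timed(lambda: torch.relu(small))  # one launch, almost no math

print(f"4096x4096 matmul: {matmul_time * 1e6:.0f} µs per call")
print(f"256-element ReLU: {elemwise_time * 1e6:.0f} µs per call (≈ pure launch overhead)")
```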
To resolve these inefficiencies, we cannot rely solely on faster Python interpreters or faster hardware. The solution lies in restructuring the execution strategy itself.
The ML compiler stack exists primarily to automate these optimizations, translating the high-level mathematical intent of frameworks into the rigid, highly efficient instruction streams required by hardware.
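As one concrete example of this idea in practice, PyTorch 2.x exposes a compiler entry point, torch.compile, which traces a function into a graph and can fuse element-wise steps such as the bias add and ReLU from earlier rather than dispatching them one by one. The sketch below shows only the mechanics, with illustrative shapes:

```python
# Minimal torch.compile sketch (PyTorch 2.x). The compiler captures the graph
# on the first call and can fuse the bias add and ReLU into fewer kernels.
import torch
import torch.nn.functional as F

def conv_bias_relu(x, W, b):
    return torch.relu(F.conv2d(x, W, padding=1) + b.view(1, -1, 1, 1))

compiled = torch.compile(conv_bias_relu)   # capture and optimize the graph once

x = torch.randn(1, 3, 32, 32)
W = torch.randn(16, 3, 3, 3)
b = torch.randn(16)
y = compiled(x, W, b)                      # first call compiles; later calls reuse the optimized code
```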