Deploying trained machine learning models efficiently presents distinct challenges compared to the training phase. A common issue is the gap between the performance a model achieves during development and the performance it delivers in production. This chapter establishes the foundation for understanding this performance gap and introduces the role of specialized compilers and runtimes in closing it.
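To make the gap concrete, here is a minimal, illustrative micro-benchmark (not from the course material): the same matrix multiplication executed as interpreted Python loops versus NumPy's call into an optimized BLAS kernel. The difference between naive execution and a tuned kernel is the kind of gap ML compilers and runtimes work to close automatically.

```python
import time
import numpy as np

def matmul_python(a, b):
    """Naive triple-loop matrix multiply over nested Python lists."""
    n, k, m = len(a), len(a[0]), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return np.array(out)

n = 128  # small size chosen so the naive version finishes quickly
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

t0 = time.perf_counter()
c_naive = matmul_python(a.tolist(), b.tolist())
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
c_blas = a @ b  # dispatches to an optimized BLAS kernel
t_blas = time.perf_counter() - t0

print(f"naive: {t_naive * 1e3:.1f} ms, BLAS: {t_blas * 1e3:.3f} ms")
assert np.allclose(c_naive, c_blas)  # same math, very different cost
```

The absolute numbers depend on the machine, but the ordering does not: the hand-written loops are typically orders of magnitude slower than the optimized kernel, even though both compute the identical result.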
We will examine the components of typical ML execution stacks and pinpoint common performance bottlenecks related to compute, memory, and latency. You will gain an overview of the diverse hardware targets used for ML acceleration, including CPUs, GPUs, and custom silicon, and see how their characteristics shape optimization strategies. We conclude by explaining why general-purpose compilation techniques are often insufficient for complex ML workloads, motivating the need for the advanced, domain-specific optimizations covered throughout this course.
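One back-of-envelope tool for classifying the bottlenecks mentioned above is arithmetic intensity: floating-point operations per byte moved between memory and the compute units. The sketch below (illustrative numbers and function names, not from the course) contrasts a matrix multiply with an elementwise add, assuming each operand is read or written exactly once.

```python
def arithmetic_intensity_matmul(n, dtype_bytes=4):
    # n x n matmul: 2*n^3 FLOPs; reads two n x n matrices, writes one.
    flops = 2 * n**3
    bytes_moved = 3 * n * n * dtype_bytes
    return flops / bytes_moved

def arithmetic_intensity_elementwise_add(n, dtype_bytes=4):
    # n-element add: n FLOPs; reads two vectors, writes one.
    flops = n
    bytes_moved = 3 * n * dtype_bytes
    return flops / bytes_moved

# For n = 1024 and float32: matmul performs roughly 170 FLOPs per byte,
# the elementwise add fewer than 0.1 FLOPs per byte.
print(arithmetic_intensity_matmul(1024))
print(arithmetic_intensity_elementwise_add(1024))
```

High-intensity operators like matmul tend to be compute-bound, while low-intensity ones like elementwise adds are memory-bound; a single network mixes both, which is one reason generic compilation strategies struggle and fusion-style optimizations pay off.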
1.1 The ML Model Deployment Gap
1.2 Overview of ML Compiler and Runtime Stacks
1.3 Performance Bottlenecks in ML Inference
1.4 Hardware Landscape for ML Acceleration
1.5 The Need for Specialized Optimizations
© 2025 ApX Machine Learning