Selecting instances for inference and serving presents a distinct optimization challenge. While model training is often a temporary, high-cost batch job, inference is typically a 24/7 service. Here, low latency and cost-per-prediction are the primary metrics. Incorrect instance selection inflates operational costs continuously, rather than just wasting budget on a single job.
The primary goal of inference serving is to return a prediction as quickly and cheaply as possible. This leads to a different set of hardware considerations. A powerful, multi-GPU instance that excels at training might be wasteful and slow for serving single, real-time requests.
For many models, especially those outside the large language model (LLM) or generative image space, CPU instances are the most practical and cost-effective choice for serving. This includes traditional machine learning models from libraries like Scikit-learn and XGBoost, as well as smaller deep learning models.
CPUs excel at handling individual, low-latency requests. Since each inference request is often processed independently, the massive parallelism of a GPU can go underutilized, making it an expensive, idle resource.
Choose a CPU instance when:

- The model is a traditional machine learning model (Scikit-learn, XGBoost) or a small deep learning model.
- Requests arrive individually and must be answered with low latency, so a GPU's parallelism would sit idle.
- Traffic is low to moderate, making a GPU's higher hourly cost hard to justify.
Cloud providers offer a spectrum of CPU-based virtual machines. General-purpose instances (like the AWS m5 series or GCP e2 series) provide a balanced mix of CPU and memory and are a good starting point. If profiling reveals your model is CPU-bound, compute-optimized instances (like the AWS c5 series or GCP c2 series) provide faster cores and a higher ratio of compute to memory.
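Before committing to an instance family, it helps to profile single-request latency on a candidate machine. The sketch below uses a hypothetical Scikit-learn model as a stand-in; substitute your own model and request shape.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical model standing in for your production model.
X, y = make_classification(n_samples=5_000, n_features=50, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Measure single-request latency, the metric that matters for real-time serving.
latencies = []
for _ in range(500):
    request = np.random.rand(1, 50)  # one incoming request
    start = time.perf_counter()
    model.predict(request)
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

latencies.sort()
print(f"p50 latency: {latencies[len(latencies) // 2]:.2f} ms")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)]:.2f} ms")
```

If the p95 latency on a general-purpose instance already meets your service target, there is usually no reason to pay for more specialized hardware.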
When your application serves a large, complex deep learning model or handles a high volume of concurrent requests, a GPU becomes necessary. The key to using a GPU cost-effectively for inference is to maximize its utilization. A GPU that processes one request at a time is inefficient. The goal is to batch multiple incoming requests together and process them simultaneously, leveraging the GPU's parallel architecture.
This strategy shifts the optimization target from per-request latency to overall throughput. While the latency of any single request may increase slightly because it waits for a batch to fill, the number of inferences per second rises dramatically, and the cost per inference falls accordingly.
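To make the batching idea concrete, here is a minimal sketch of server-side micro-batching with asyncio: requests accumulate in a queue until either a maximum batch size or a short wait deadline is reached, then run as one model call. The batch size, wait time, and dummy model are assumptions you would tune and replace for a real workload.

```python
import asyncio

import numpy as np

MAX_BATCH_SIZE = 16  # assumed; tune for your model and GPU memory
MAX_WAIT_MS = 5      # assumed; bounds the extra latency each request may incur


async def batching_worker(queue, model_fn):
    """Collect requests into a batch, run one inference call, fan results out."""
    while True:
        requests = [await queue.get()]  # block until the first request arrives
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(requests) < MAX_BATCH_SIZE:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                requests.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        batch = np.stack([r["input"] for r in requests])  # one batched array
        outputs = model_fn(batch)                         # a single model call
        for request, output in zip(requests, outputs):
            request["future"].set_result(output)          # hand back each result


async def predict(queue, x):
    """Called by each request handler: enqueue the input and await its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put({"input": x, "future": future})
    return await future


async def main():
    queue = asyncio.Queue()
    dummy_model = lambda batch: batch.sum(axis=1)  # stand-in for real inference
    worker = asyncio.create_task(batching_worker(queue, dummy_model))
    results = await asyncio.gather(*(predict(queue, np.random.rand(8)) for _ in range(50)))
    print(f"Served {len(results)} requests in batches of up to {MAX_BATCH_SIZE}")
    worker.cancel()


asyncio.run(main())
```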
Cloud providers now offer GPUs designed specifically for inference, which are more economical than the top-tier training GPUs:
- NVIDIA T4 GPUs (AWS g4dn instances, GCP n1-standard with T4).
- NVIDIA A10G GPUs (AWS g5 instances).

The decision to use a GPU depends on your ability to implement a batching strategy. This can be done within your application logic or by using a dedicated model serving framework like NVIDIA's Triton Inference Server, which handles request batching automatically.
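If you delegate batching to Triton, your application only sends individual requests and the server merges them. The sketch below uses the tritonclient Python package against a hypothetical model named resnet50; the server address, model name, tensor names, and shapes are assumptions that must match your actual deployment's configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# The server address, model name, and tensor names below are assumptions;
# they must match your Triton deployment's model configuration.
client = httpclient.InferenceServerClient(url="localhost:8000")

image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Triton's dynamic batcher can merge this request with others arriving at the
# same time before the model runs, without any batching logic on the client.
response = client.infer(model_name="resnet50", inputs=[infer_input])
print(response.as_numpy("output__0").shape)
```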
A decision flow for selecting an inference instance. The process involves evaluating model size, traffic, and latency requirements to determine the most suitable hardware.
For organizations operating at a very large scale, cloud providers offer custom-designed Application-Specific Integrated Circuits (ASICs) built for one purpose: efficient, low-cost model inference.
AWS's Inferentia-based inf1 and inf2 instances, for example, are often significantly cheaper for inference at scale than their GPU counterparts.

The trade-off for using these ASICs is flexibility. They support a specific set of model architectures and operations, and the required compilation step adds engineering overhead. However, for stable, high-volume workloads, the cost savings can be substantial.
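To give a feel for that compilation step, the sketch below assumes the torch-neuronx API from the AWS Neuron SDK targeting inf2 instances; the model is a hypothetical stand-in, and exact function names can vary between SDK versions, so treat this as an outline rather than a reference.

```python
import torch
import torch_neuronx  # AWS Neuron SDK, assumed to be installed on a Neuron-ready image

# A hypothetical, already-trained PyTorch model and a representative input shape.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
).eval()
example_input = torch.rand(1, 128)

# Ahead-of-time compilation for the Inferentia accelerator. This is the extra
# engineering step that CPU or GPU serving does not require.
neuron_model = torch_neuronx.trace(model, example_input)

# The compiled artifact is saved here and later loaded with torch.jit.load by
# the serving process running on the inf2 instance.
torch.jit.save(neuron_model, "model_neuron.pt")
```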
Choosing an instance type is only the first step. To manage costs effectively, you must match your provisioned capacity to the actual demand.
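A useful back-of-the-envelope check is to derive the replica count from measured per-instance throughput and expected peak traffic, leaving headroom for bursts. The numbers in the sketch below are illustrative assumptions, not benchmarks.

```python
import math

# Illustrative assumptions; replace with your own load-test measurements.
peak_requests_per_second = 1200   # expected peak traffic
per_replica_throughput = 180      # requests/s one instance sustains in load tests
target_utilization = 0.6          # headroom so latency stays stable during bursts

replicas = math.ceil(
    peak_requests_per_second / (per_replica_throughput * target_utilization)
)
print(f"Provision at least {replicas} replicas at peak")  # 12 with these numbers
```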
Model optimization techniques such as quantization (using INT8 instead of FP32 precision) and pruning, which are covered in Chapter 5, can dramatically reduce model size and accelerate inference speed, allowing you to use smaller, cheaper instances.

The final decision is a balance between performance, cost, and engineering effort. A simple CPU-based deployment is easy to manage, while a highly optimized ASIC-based solution requires more specialized work but offers the lowest cost at scale.
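As a small illustration of the quantization point above, the following sketch applies PyTorch's dynamic INT8 quantization to a hypothetical model; real size and speed gains depend on the architecture and the serving hardware.

```python
import torch

# A hypothetical FP32 model standing in for your production network.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Convert the Linear layers to dynamic INT8 quantization: weights are stored in
# INT8 ahead of time and activations are quantized on the fly at inference.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.rand(1, 512)
print(model_int8(x).shape)  # same interface, smaller and typically faster on CPU
```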
| Instance Category | Best For | Main Consideration | Example Cloud Offerings |
|---|---|---|---|
| CPU (General Purpose) | Low-to-medium traffic, latency-sensitive apps, traditional ML models. | Simplicity and low idle cost. | AWS m series, GCP e2/n2, Azure DSv4 |
| GPU (Inference-Optimized) | High-throughput, large deep learning models, batchable requests. | Throughput and performance. | AWS g4dn/g5 (T4/A10G), GCP n1 with T4 or G2 with L4 |
| Specialized ASIC | Very high-volume, stable workloads for maximum cost efficiency. | Lowest cost-per-inference at scale. | AWS inf series (Inferentia), GCP TPU instances |
| Serverless Compute | Infrequent or highly unpredictable traffic. | Pay-per-use, no idle cost. | AWS Lambda, Google Cloud Run, Azure Functions |