When deploying a machine learning model, the primary architectural decision revolves around a fundamental tension: minimizing latency versus maximizing throughput. These two goals are often at odds, and the optimal balance is dictated entirely by your application's requirements. An architecture designed for instantaneous fraud detection looks very different from one built for overnight processing of user-uploaded videos. Understanding this trade-off is the first step in engineering a successful and cost-effective inference service.
Before we analyze architectural patterns, it is important to precisely define our performance metrics. In the context of model serving, these terms have specific meanings that drive engineering decisions.
Latency is the time taken to process a single inference request. It is typically measured from the moment the request is received by the service to the moment a response is sent back. For user-facing applications, this is often governed by a strict Service Level Objective (SLO), such as a 99th percentile latency (p99) of less than 100 milliseconds. This means 99 out of 100 requests must complete faster than this threshold. We must also consider cold start latency, the additional delay incurred for the very first request that requires loading a model into memory or warming up a new compute instance.
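To make the percentile definition concrete, the short sketch below computes a p99 latency from synthetic per-request measurements and checks it against a 100 ms SLO. The latency distribution and the threshold are illustrative values, not measurements from a real service.

```python
import numpy as np

# Synthetic per-request latencies in milliseconds, standing in for load-test data.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.4, size=10_000)

# p99 latency: 99% of requests completed faster than this value.
p99 = np.percentile(latencies_ms, 99)

slo_ms = 100.0  # illustrative service level objective
status = "within" if p99 < slo_ms else "violating"
print(f"p99 latency: {p99:.1f} ms ({status} the {slo_ms:.0f} ms SLO)")
```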
Throughput is the total number of inference requests a service can handle in a given period, usually measured in requests per second (RPS) or inferences per second. Throughput is a measure of system capacity and is directly related to cost-efficiency. A high-throughput system effectively utilizes its underlying hardware (like GPUs), processing more data for the same fixed cost.
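The link between throughput and cost can be made concrete with a small calculation. The replica throughput and hourly instance price below are assumptions chosen only to illustrate the arithmetic.

```python
# Assumed capacity and pricing for a single GPU-backed replica (illustrative numbers).
throughput_rps = 250            # sustained inferences per second under load
instance_cost_per_hour = 1.20   # hourly price of the instance, in dollars

# The instance's cost is fixed, so every extra inference it serves lowers the unit cost.
inferences_per_hour = throughput_rps * 3600
cost_per_million = instance_cost_per_hour / inferences_per_hour * 1_000_000
print(f"~${cost_per_million:.2f} per million inferences")
```

Doubling the sustainable throughput of the same instance halves the cost per inference, which is why batching and high utilization matter so much for large workloads.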
The core challenge is that techniques used to improve throughput, such as processing requests in large batches, inherently increase the latency for any individual request.
Your choice of architecture will fall somewhere on a spectrum between pure latency optimization and pure throughput optimization.
For applications where immediate responses are critical, such as interactive chatbots, real-time bidding, or driver-assist systems, latency is the dominant metric. The goal is to process each incoming request as quickly as possible, with minimal delay.
The typical architecture involves:

- Executing each request immediately on arrival, usually with a batch size of 1, so no time is spent waiting for other requests.
- Distributing incoming requests across multiple model replicas behind a load balancer, so that no single busy replica becomes a queue.
- Keeping models loaded in memory and instances warm to avoid cold start penalties.
The primary drawback of this approach is poor hardware utilization. A modern GPU is a massively parallel processor designed to handle large volumes of data simultaneously. Sending a single request with a batch size of 1 leaves most of its computational cores idle, leading to a high cost per inference.
A latency-optimized architecture distributes individual requests across multiple model replicas to ensure immediate processing.
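A minimal sketch of this pattern follows, using FastAPI purely as an example web framework and a dummy model in place of a real one; the endpoint name and model class are illustrative. The model is loaded once at startup to avoid per-request cold starts, and each request runs immediately with a batch size of 1. In production, several such replicas would sit behind a load balancer, as in the figure above.

```python
import numpy as np
from fastapi import FastAPI

app = FastAPI()

class DummyModel:
    """Stand-in for a real model; predict() takes a (batch, features) array."""
    def predict(self, x: np.ndarray) -> np.ndarray:
        return x.sum(axis=1)

# Loaded once when the process starts, not per request, to avoid cold starts.
model = DummyModel()

@app.post("/predict")
def predict(features: list[float]):
    # Run inference immediately with a batch size of 1: lowest possible latency,
    # but most of a GPU's parallel capacity would sit idle for this single item.
    batch = np.asarray([features], dtype=np.float32)
    return {"prediction": float(model.predict(batch)[0])}
```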
For offline tasks where latency is not a concern, the goal shifts to maximizing efficiency and minimizing total cost. Examples include generating embeddings for a document corpus, pre-computing user recommendations overnight, or analyzing a batch of medical images.
This architecture is characterized by:

- Collecting work from a queue or reading it directly from storage, rather than responding to individual callers.
- Grouping inputs into very large batches sized to saturate the accelerator, maximizing the work done per unit of hardware time.
- Running as a scheduled or queue-driven job, where the metrics that matter are total completion time and cost, not per-request latency.
The obvious trade-off is extremely high latency. A request arriving just after a batch has started processing must wait for the entire current batch to finish and a new batch to fill before it is handled.
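The shape of such a pipeline is sketched below for the embedding example: the corpus is swept in large, fixed-size batches, and `embed_batch` is a stand-in for a real model call. The corpus size, batch size, and embedding dimension are illustrative.

```python
import numpy as np

corpus = [f"document {i}" for i in range(100_000)]  # illustrative offline dataset
BATCH_SIZE = 1024  # chosen to keep the accelerator saturated, not to bound latency

def embed_batch(texts: list[str]) -> np.ndarray:
    # Stand-in for a real embedding model (e.g. a transformer encoder) call.
    return np.random.rand(len(texts), 768).astype(np.float32)

# Process the whole corpus in large batches. The "latency" of any single document
# is the runtime of the entire job, which is irrelevant for an offline workload.
embeddings = []
for start in range(0, len(corpus), BATCH_SIZE):
    embeddings.append(embed_batch(corpus[start:start + BATCH_SIZE]))

embeddings = np.concatenate(embeddings, axis=0)
print(embeddings.shape)  # (100000, 768)
```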
Most modern services require a practical balance: they need to be responsive but also cost-effective. This is where dynamic batching comes into play. It provides a middle ground that improves throughput with only a minimal, controlled increase in latency.
The mechanism is simple yet effective:

1. Incoming requests are placed in a short queue instead of being executed immediately.
2. The server waits for a small, configurable window (typically a few milliseconds), or until a maximum batch size is reached.
3. All requests collected in that window are combined into a single batch and run through the model in one forward pass.
4. The batched output is split apart and each result is returned to its original caller.
Dynamic batching groups requests that arrive close together in time, increasing GPU utilization with a slight latency trade-off.
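The sketch below implements this mechanism with a background thread and a shared queue. The two knobs, `MAX_BATCH_SIZE` and `MAX_WAIT_MS`, and the `model_predict` stand-in are illustrative; production servers such as Triton provide an equivalent dynamic batcher natively, so you would normally configure it rather than write your own.

```python
import queue
import threading
import time
import numpy as np

MAX_BATCH_SIZE = 32  # illustrative cap on how many requests share one forward pass
MAX_WAIT_MS = 5      # illustrative window to wait for more requests to arrive

# Each entry pairs an input array with a one-slot queue for returning its result.
request_queue: "queue.Queue[tuple[np.ndarray, queue.Queue]]" = queue.Queue()

def model_predict(batch: np.ndarray) -> np.ndarray:
    # Stand-in for a real batched model call on the accelerator.
    return batch.sum(axis=1)

def batching_loop() -> None:
    while True:
        # Block until at least one request arrives, then open the wait window.
        first_input, first_reply = request_queue.get()
        inputs, replies = [first_input], [first_reply]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000

        # Collect any requests that arrive before the deadline, up to the max size.
        while len(inputs) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                x, reply = request_queue.get(timeout=remaining)
            except queue.Empty:
                break
            inputs.append(x)
            replies.append(reply)

        # One forward pass for the whole batch, then fan the results back out.
        outputs = model_predict(np.stack(inputs))
        for reply, out in zip(replies, outputs):
            reply.put(out)

threading.Thread(target=batching_loop, daemon=True).start()

def infer(x: np.ndarray) -> float:
    """Called by request handlers; blocks until the batched result is ready."""
    reply: queue.Queue = queue.Queue(maxsize=1)
    request_queue.put((x, reply))
    return float(reply.get())
```

A request handler calling `infer` waits at most roughly `MAX_WAIT_MS` plus one batched forward pass, which is the small, controlled latency cost paid for the higher throughput.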
This technique is a foundation of high-performance serving. By sacrificing a few milliseconds of latency, you can dramatically increase your service's throughput. For instance, you might turn a p99 latency of 15ms into 25ms, but in doing so, increase your server's capacity from 100 RPS to 400 RPS. This trade-off is almost always worthwhile for services that are not on the absolute bleeding edge of real-time requirements.
Choosing the right point on the latency-throughput spectrum sets the stage for all subsequent optimizations. Once you have an architectural pattern in mind, the next steps, which we will cover in the following sections, involve optimizing the model itself and using specialized serving software like Triton to implement these patterns effectively.