After a model has been trained and validated, it holds potential value. However, that potential is only realized when the model's predictions are made available to users or other software systems. This transition from a trained artifact to a working service is known as model deployment. It's the stage where your model graduates from the lab and starts performing its job in a production environment.
Choosing the right deployment strategy is a critical decision that depends entirely on the requirements of your application. The primary question to ask is: How and when does your application need predictions? The answer will guide you toward one of the common deployment patterns.
The simplest deployment strategy is batch prediction, also known as offline scoring. In this approach, the model processes a large collection, or "batch," of observations at once. This process typically runs on a fixed schedule, for example, once a day, and the predictions are stored in a database or file system for later use.
You should consider a batch prediction strategy when:

- Predictions are not needed the instant new data arrives and can be computed ahead of time.
- Input data is available in bulk, for example as a nightly export from a data warehouse.
- Consumers of the predictions, such as reports or dashboards, can tolerate a delay of minutes to hours.
A typical batch prediction workflow involves a scheduled job that reads input data from a source like a data warehouse, feeds it to the model, and writes the resulting predictions to a destination table.
A high-level view of a batch prediction system. A scheduled job processes data in bulk and stores the output.
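The sketch below shows one way such a scheduled job might look. It is a minimal illustration, assuming a scikit-learn classifier serialized with joblib and Parquet files for input and output; the paths, the `customer_id` column, and the churn use case are placeholders, not part of any specific system described above.

```python
# batch_score.py - a scheduled batch scoring job (illustrative sketch)
import joblib
import pandas as pd

# Placeholder locations for the model artifact and the warehouse tables.
MODEL_PATH = "models/churn_model.joblib"
INPUT_PATH = "warehouse/customers_latest.parquet"
OUTPUT_PATH = "warehouse/churn_predictions.parquet"

def main():
    # Load the trained model artifact produced by the training pipeline.
    model = joblib.load(MODEL_PATH)

    # Read the full batch of observations to score.
    batch = pd.read_parquet(INPUT_PATH)

    # Score every row at once, keeping the identifier column out of the features.
    feature_cols = [c for c in batch.columns if c != "customer_id"]
    batch["churn_probability"] = model.predict_proba(batch[feature_cols])[:, 1]

    # Persist the predictions so downstream systems can read them later.
    batch[["customer_id", "churn_probability"]].to_parquet(OUTPUT_PATH, index=False)

if __name__ == "__main__":
    main()
```

A scheduler would then trigger this script on the chosen cadence, for example a cron entry such as `0 2 * * * python batch_score.py` to run it every night at 02:00, or a workflow orchestrator performing the same role.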
In contrast to the batch approach, online prediction, or real-time serving, generates predictions as they are requested. The model is wrapped in an API and deployed as a persistent service, often behind a load balancer, ready to respond to incoming requests with very low latency.
This strategy is essential for applications that require immediate feedback. Common use cases include:

- Showing product recommendations while a user browses an e-commerce site.
- Scoring a transaction for fraud before a payment is approved.
- Ranking search results or feed content as each request arrives.
Online serving architectures are generally more complex than batch systems. They require scalable infrastructure to handle request traffic and must be monitored closely to ensure high availability and low latency. The model is typically exposed via a REST API endpoint, allowing other services to request predictions by sending a simple HTTP request with the input data.
An online prediction system responds to individual requests in real time.
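As a minimal sketch of such an endpoint, the example below wraps a model in a REST API using FastAPI; the model path, feature names, and churn use case are illustrative assumptions rather than a prescribed implementation.

```python
# serve.py - a minimal real-time prediction service (run with: uvicorn serve:app)
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup so each request only pays for inference.
model = joblib.load("models/churn_model.joblib")  # placeholder path

class PredictionRequest(BaseModel):
    # Illustrative feature names; a real service would mirror the training schema.
    tenure_months: float
    monthly_spend: float
    support_tickets: int

@app.post("/predict")
def predict(request: PredictionRequest):
    # Convert the single request into the tabular format the model expects.
    features = pd.DataFrame([request.dict()])
    probability = float(model.predict_proba(features)[0, 1])
    return {"churn_probability": probability}
```

Another service can then obtain a prediction with a single HTTP call, for example:

```
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"tenure_months": 12, "monthly_spend": 40.0, "support_tickets": 2}'
```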
The table below summarizes the main differences between these two primary strategies.
| Feature | Batch Prediction | Online Prediction |
|---|---|---|
| Data | Large, static datasets | Single or few data points |
| Trigger | Scheduled (e.g., hourly, daily) | On-demand (API call) |
| Latency | High (minutes to hours) | Low (milliseconds) |
| Throughput | High | Low to high (scalable) |
| Use Case | Non-interactive reporting | Interactive applications |
| Infrastructure | Simpler (e.g., cron job) | More complex (e.g., API server) |
In some cases, a hybrid approach might be used. For instance, an e-commerce site could use a batch job to pre-calculate recommendations for all users overnight, while an online model provides real-time adjustments based on a user's current session.
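As a rough sketch of that hybrid idea, the fragment below assumes the nightly batch job fills a lookup of ranked recommendations per user, and a lightweight online step reorders them based on the current session. Every identifier and value here is hypothetical, and the rule-based adjustment stands in for what would usually be a small real-time model.

```python
# Illustrative hybrid serving logic: precomputed batch results plus a real-time adjustment.

# Populated overnight by the batch job (user_id -> ranked product ids). Hypothetical data.
BATCH_RECOMMENDATIONS = {
    "user_123": ["prod_9", "prod_4", "prod_7", "prod_1"],
}

# Placeholder product catalog used for the session-based adjustment.
PRODUCT_CATEGORIES = {"prod_9": "shoes", "prod_4": "jackets", "prod_7": "shoes", "prod_1": "hats"}

def recommend(user_id: str, session_category: str) -> list:
    """Serve the precomputed list, boosted by what the user is browsing right now."""
    candidates = BATCH_RECOMMENDATIONS.get(user_id, [])

    # Online adjustment: move items matching the current session's category to the front.
    # A production system would typically call a lightweight real-time model here instead.
    boosted = [p for p in candidates if PRODUCT_CATEGORIES.get(p) == session_category]
    rest = [p for p in candidates if PRODUCT_CATEGORIES.get(p) != session_category]
    return boosted + rest

print(recommend("user_123", "shoes"))  # ['prod_9', 'prod_7', 'prod_4', 'prod_1']
```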
Understanding these patterns is fundamental to designing an effective MLOps workflow. The chosen strategy directly impacts how you will version, test, deploy, and monitor your model, which are topics we will cover throughout the rest of this course.