Okay, you've trained your model, and it looks promising on your test dataset. But how do you actually use it to make predictions on new data outside of your development environment? This is where deployment strategies come into play. There isn't just one way to deploy a model; the best approach depends heavily on how and when you need the predictions. Let's look at the most common ways models are put into practice.
Imagine you have a large amount of data that needs predictions, but you don't need those predictions instantly. Maybe you want to predict customer churn for all your customers once a week, or generate daily sales forecasts for all your stores. This scenario is perfect for batch prediction, also known as offline inference.
In batch prediction, the model processes data in large groups or "batches" on a predetermined schedule (e.g., hourly, daily, weekly).
Here’s a simplified view of how it works:
A typical workflow for batch prediction where data is processed periodically.
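In practice, a batch scoring job can be as small as a script that a scheduler (such as cron or an orchestrator) runs each night. The sketch below assumes a scikit-learn churn model saved with joblib; the file names and column names are hypothetical placeholders, not part of a specific project.

```python
import joblib
import pandas as pd

# Load the trained model from disk (the path is a placeholder).
model = joblib.load("churn_model.joblib")

# Read the full batch of records to score, e.g. a nightly export from the data warehouse.
customers = pd.read_csv("customers_2025-01-15.csv")

# Score every row in a single pass.
features = customers.drop(columns=["customer_id"])
customers["churn_probability"] = model.predict_proba(features)[:, 1]

# Write the scored batch where downstream systems (dashboards, CRM, etc.) can pick it up.
customers[["customer_id", "churn_probability"]].to_csv(
    "churn_scores_2025-01-15.csv", index=False
)
```

Because no user is waiting on the output, a job like this can run wherever compute is cheapest and take minutes or hours to finish.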
When is Batch Prediction suitable?
Batch prediction fits cases where predictions are not needed the instant new data arrives and where large volumes of records can be scored together on a schedule, as in the churn and sales forecasting examples above. Think of it like processing payroll: you don't calculate pay continuously; you run it in a batch every pay period.
Now, consider situations where you need a prediction right now, for example deciding whether a credit card transaction looks fraudulent while it is being authorized, or recommending a product while a user is browsing your site.
These use cases require online prediction, also known as real-time inference. Here, the model is typically running continuously as part of a service, waiting to receive prediction requests. When a request with new input data arrives (often just a single data point or a small number of points), the service quickly loads the model (if not already loaded), generates the prediction, and sends it back immediately.
This usually involves setting up a prediction service, often using a web framework like Flask or FastAPI (which we'll explore later in this course). Applications interact with this service through an Application Programming Interface (API).
A typical workflow for online prediction where predictions are served on demand via an API.
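To make this concrete, here is a minimal sketch of such a prediction service using Flask. It assumes the same hypothetical joblib model file and two made-up input features; we will build services like this properly later in the course.

```python
import joblib
from flask import Flask, request, jsonify

app = Flask("churn-service")

# Load the model once at startup so each request only pays the cost of inference.
model = joblib.load("churn_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body with the feature values for a single customer.
    customer = request.get_json()
    row = [[customer["tenure_months"], customer["monthly_charges"]]]
    churn_probability = float(model.predict_proba(row)[:, 1][0])
    return jsonify({"churn_probability": churn_probability})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9696)
```

An application would then send a POST request with the customer's features as JSON to `/predict` and receive the prediction in the response, typically within milliseconds.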
When is Online Prediction suitable?
Online prediction fits cases where each request must be answered within milliseconds to seconds, usually for one data point at a time, while a user or another system is waiting on the result. Think of this like a web search: you type a query and expect results almost instantly.
A third, less common but important strategy involves embedding the model directly within an application or device. Instead of calling a separate service, the model runs locally on the user's device (like a smartphone or an IoT sensor) or as part of a larger software application.
Examples include a photo-tagging or next-word-suggestion model running directly on a smartphone, and an anomaly-detection model running on an IoT sensor.
This approach is useful when network connectivity is unreliable, latency needs to be extremely low, or data privacy is a major concern (data doesn't need to leave the device). However, it often requires models that are specifically optimized for size and computational efficiency to run on resource-constrained devices.
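As one illustration (not a tool we cover in depth in this course), frameworks such as TensorFlow Lite convert a trained network into a compact format that a lightweight interpreter can run on a phone or microcontroller. The tiny model below is only a stand-in for a real trained one.

```python
import tensorflow as tf

# A toy network standing in for a real trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Convert to TensorFlow Lite, a format designed for mobile and edge devices.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables e.g. weight quantization
tflite_model = converter.convert()

# The resulting flat buffer ships inside the app and runs without any network calls.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```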
The choice between batch, online, or embedded deployment depends entirely on the requirements of your application: how quickly predictions are needed, how many records must be scored at a time, whether a reliable network connection is available, and what constraints you face around latency, privacy, and on-device resources.
Often, the nature of the problem itself dictates the strategy. In this course, we will primarily focus on the techniques needed for online prediction, as building prediction services is a common and foundational skill in MLOps. We'll learn how to save models, wrap them in a web service using Flask, and package the service for deployment. Understanding these different strategies provides context for why we build these services and how they fit into the larger picture of making machine learning useful.
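As a small preview of the first of those steps, saving a trained model is often just a matter of serializing it to a file that the prediction service loads at startup. The example below uses pickle with a deliberately trivial scikit-learn model standing in for the one you trained.

```python
import pickle
from sklearn.linear_model import LogisticRegression

# A deliberately trivial model standing in for the one trained earlier.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Persist the model so a separate prediction service can use it.
with open("model.bin", "wb") as f_out:
    pickle.dump(model, f_out)

# Inside the service, the model is loaded back the same way:
with open("model.bin", "rb") as f_in:
    loaded_model = pickle.load(f_in)

print(loaded_model.predict([[2.5]]))  # e.g. array([1])
```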