Okay, you've successfully trained a machine learning model and saved it using techniques from the previous section. Now what? A saved model file sitting on a disk isn't particularly useful on its own. To make predictions on new data, especially for applications or other services, you need a way to serve this model. This means making its prediction capabilities available over a network, typically through an Application Programming Interface (API).
Model serving is the process of taking a trained model and deploying it so that other systems can send data to it and receive predictions back. Think of it like making your model a callable function, but one that can be accessed remotely.
You could technically write a raw network server in Python to listen for incoming connections, parse requests, load the model, make predictions, and send responses. However, this involves handling many low-level details: managing network sockets, parsing HTTP requests, handling concurrent connections, serializing/deserializing data, and routing requests to the correct logic. This is tedious and error-prone.
This is where model serving frameworks, particularly web frameworks adapted for this purpose, come in. They provide structure and handle the boilerplate networking and request-handling code, letting you focus on the machine learning specific parts: loading your model and defining the prediction logic.
For many common scenarios, standard Python web frameworks are excellent tools for creating simple model serving APIs. They act as the bridge between the network (receiving requests) and your Python code (loading the model, making predictions). Two popular choices are:
Flask: A lightweight "microframework." It's known for its simplicity and minimal core, making it easy to get started. You add components (like database integration or authentication) as needed. Its simplicity is often ideal for straightforward prediction APIs where complex web application features aren't required.
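To make this concrete, here is a minimal sketch of a Flask prediction API. The DummyModel class stands in for a real model you would load from disk (for example with joblib), and the /predict route name and JSON schema are illustrative choices, not a fixed convention:

```python
from flask import Flask, jsonify, request


class DummyModel:
    """Stand-in for a trained model; in practice you would load one,
    e.g. model = joblib.load("model.joblib")."""

    def predict(self, features):
        # Toy logic: return the sum of each feature row.
        return [sum(row) for row in features]


app = Flask(__name__)
model = DummyModel()  # load the model once, at startup, not per request


@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[1.0, 2.0], [3.0, 4.0]]}
    payload = request.get_json()
    predictions = model.predict(payload["features"])
    return jsonify({"predictions": predictions})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client would then POST JSON to /predict and receive predictions back, without ever touching the model file directly.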
FastAPI: A modern, high-performance web framework. It leverages Python type hints for automatic data validation, serialization, and interactive API documentation (using the OpenAPI/Swagger standards). It's built on Starlette (for web handling) and Pydantic (for data validation), and offers asynchronous request handling (async/await) for potentially higher throughput than traditional synchronous frameworks like Flask under high load. Its built-in documentation generation is a significant advantage for API usability.
While Flask might be simpler for a very basic first API, FastAPI often provides a better development experience and performance characteristics for production-oriented services due to its type safety, async capabilities, and auto-documentation.
[Diagram: a typical request flow for a model serving API built with a web framework.]
Beyond general-purpose web frameworks, there are also specialized tools designed explicitly for machine learning model serving, especially at scale, such as TensorFlow Serving, TorchServe, and NVIDIA Triton Inference Server.
These specialized tools often provide features like automatic batching of requests, optimized hardware utilization (especially GPUs), built-in monitoring, and robust version management, which become increasingly important as deployment complexity grows.
For this course, we'll focus on using common Python web frameworks like Flask or FastAPI, as they provide a clear understanding of the fundamental concepts involved in creating a prediction API. They strike a good balance between simplicity and capability for many common use cases and serve as a solid foundation before moving to more specialized platforms if needed. The next section will guide you through building your first prediction API using one of these frameworks.
© 2025 ApX Machine Learning