Deploying a machine learning model into production is a significant milestone, but it is far from the final step. A deployed model is not a static piece of software; it is a dynamic system that interacts with a constantly changing environment, and its performance can and will degrade over time. The practice of systematically tracking a model's behavior and performance after deployment is known as monitoring. It is the essential process that tells you when your model is no longer working as intended.
Once a model is live, it begins to encounter data that it has never seen before. Over time, the nature of this new data can diverge from the data the model was trained on, causing its predictions to become less accurate. This degradation generally happens for two main reasons: data drift and concept drift.
Data drift occurs when the statistical properties of the input features change over time. The model itself might still be valid, but the data it receives no longer matches the patterns it learned during training. Imagine a model trained to predict real estate prices using features like square footage and number of bedrooms. If a sudden economic shift causes a surge in interest rates, the distribution of home prices and buyer behavior will change dramatically. The model, trained on data from a different economic climate, will struggle to make accurate predictions. The underlying data has "drifted."
Figure: The distribution of a feature during training compared with its distribution in live production. The shift indicates data drift, which can degrade model performance.
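As a minimal sketch of how such a shift can be detected, a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution with a recent window of live values. The synthetic home-price data and the significance threshold below are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, p_threshold=0.01):
    """Flag drift when the live distribution differs significantly
    from the training distribution (two-sample KS test)."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return {"ks_statistic": statistic, "p_value": p_value, "drifted": p_value < p_threshold}

# Illustrative data: home prices before and after an interest-rate shift.
rng = np.random.default_rng(42)
train_prices = rng.normal(loc=350_000, scale=60_000, size=5_000)
live_prices = rng.normal(loc=310_000, scale=80_000, size=1_000)

print(check_feature_drift(train_prices, live_prices))
```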
Concept drift is a more subtle but equally damaging issue. It happens when the relationship between the input features and the target variable changes. The statistical properties of the inputs might remain the same, but what they signify has changed.
For example, a model that predicts customer churn might learn that a lack of support tickets is a sign of a happy, non-churning customer. However, the company could launch a new, highly effective self-service help portal. Now, a lack of support tickets may simply mean the customer is solving problems on their own; it no longer reliably signals satisfaction or loyalty. The distribution of the feature "number of support tickets" might look unchanged, but its relationship with churn, the "concept" the model learned, has shifted.
In many cases, models suffer from a combination of both data and concept drift. This gradual decay in performance is often called model staleness. Monitoring is our primary tool for detecting it.
Effective monitoring involves tracking two distinct but related categories of metrics: the operational health of the system and the quality of the model's predictions.
Operational metrics are concerned with the health and stability of the software application that serves your model. They are similar to what you would monitor for any traditional web service: request latency, throughput, error rates, and resource usage.
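As a rough sketch, a thin wrapper around a hypothetical predict call can record the latency and error counts that feed a standard service dashboard. In practice these values would be exported to a metrics backend such as Prometheus or CloudWatch rather than kept in memory.

```python
import time

# Simple in-memory stores for illustration only; a real service would
# export these values to a metrics backend instead.
latencies_ms = []
error_count = 0

def timed_predict(model, features):
    """Call a (hypothetical) model object and record latency and errors."""
    global error_count
    start = time.perf_counter()
    try:
        return model.predict([features])[0]
    except Exception:
        error_count += 1
        raise
    finally:
        latencies_ms.append((time.perf_counter() - start) * 1000)
```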
Model quality metrics, by contrast, measure the quality and reliability of the model's outputs. They are specific to MLOps and are essential for maintaining trust in the system.
Prediction Accuracy: This is the most direct measure of performance. It involves comparing the model's predictions to the actual outcomes (the "ground truth"). For a classification model, you might track accuracy, precision, and recall. For a regression model, you would monitor metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). Acquiring ground truth can sometimes be delayed, making other proxy metrics important.
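When the ground truth does arrive, these metrics can be computed with standard scikit-learn functions. The small arrays below are placeholders standing in for predictions joined with observed outcomes.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Classification example: predictions joined with delayed ground truth.
y_true_cls = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy:", accuracy_score(y_true_cls, y_pred_cls))
print("precision:", precision_score(y_true_cls, y_pred_cls))
print("recall:", recall_score(y_true_cls, y_pred_cls))

# Regression example: MAE and RMSE on observed outcomes.
y_true_reg = [310_000, 452_000, 289_000, 510_000]
y_pred_reg = [298_000, 470_000, 305_000, 488_000]
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)
```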
Data and Prediction Distributions: When ground truth is not immediately available, you can monitor for drift by tracking the statistical distributions of your input features and the model's output predictions. For example, if your model suddenly starts predicting "fraud" 50% of the time instead of its usual 1%, it's a strong signal that something has gone wrong, even if you don't know the true outcomes yet.
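A lightweight version of this check is to compare the live positive-prediction rate against a baseline established during training. The 1% baseline fraud rate and the alert multiplier below are illustrative assumptions.

```python
def positive_rate(predictions):
    """Fraction of predictions flagged positive (e.g., 'fraud')."""
    return sum(predictions) / len(predictions)

# Baseline rate observed during training/validation, e.g. ~1% fraud.
BASELINE_RATE = 0.01
ALERT_FACTOR = 5  # alert if the live rate exceeds 5x the baseline

def check_prediction_drift(live_predictions):
    live_rate = positive_rate(live_predictions)
    return {
        "live_rate": live_rate,
        "baseline_rate": BASELINE_RATE,
        "drifted": live_rate > ALERT_FACTOR * BASELINE_RATE,
    }

# A batch where the model suddenly flags half the traffic as fraud.
recent_batch = [1] * 500 + [0] * 500
print(check_prediction_drift(recent_batch))
```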
Figure: A monitoring dashboard showing model accuracy degrading over time. An alert is triggered when performance drops below a predefined threshold, signaling the need for investigation.
A monitoring system is built on a foundation of logging, visualization, and alerting. For example, an alert might be configured to trigger if latency stays above 500 ms for more than five minutes or if accuracy drops below 90%.
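As a minimal sketch of such a rule, assuming rolling latency and accuracy figures are already being computed, the check itself is simple. In practice, rules like this usually live in an alerting tool such as Prometheus Alertmanager or Grafana rather than in application code.

```python
from dataclasses import dataclass

@dataclass
class HealthSnapshot:
    p95_latency_ms: float  # rolling 5-minute 95th-percentile latency
    accuracy: float        # accuracy on recently labeled predictions

# Illustrative thresholds matching the example above.
LATENCY_THRESHOLD_MS = 500
ACCURACY_THRESHOLD = 0.90

def alerts_for(snapshot: HealthSnapshot) -> list[str]:
    """Return human-readable alerts for any breached thresholds."""
    alerts = []
    if snapshot.p95_latency_ms > LATENCY_THRESHOLD_MS:
        alerts.append(f"High latency: {snapshot.p95_latency_ms:.0f} ms > {LATENCY_THRESHOLD_MS} ms")
    if snapshot.accuracy < ACCURACY_THRESHOLD:
        alerts.append(f"Low accuracy: {snapshot.accuracy:.2%} < {ACCURACY_THRESHOLD:.0%}")
    return alerts

print(alerts_for(HealthSnapshot(p95_latency_ms=640, accuracy=0.87)))
```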
Monitoring closes the loop on the initial development cycle. The insights it provides are not just for fixing broken systems; they are the primary trigger for model improvement. When monitoring detects significant drift or performance decay, it is a clear signal that the current model is stale. This information feeds directly into the next and final stage of the lifecycle: creating a feedback loop to retrain and redeploy an updated model.