Deployed models operate in dynamic, live environments. Their performance can degrade silently if left unobserved, making continuous oversight essential. Model monitoring is the practice of continuously tracking and evaluating a model's operational health and predictive quality in production. It acts as a necessary warning system that tells you when a model no longer reflects the environment it operates in, protecting applications and businesses from the consequences of incorrect predictions.
Before you even consider if a model's predictions are correct, you must confirm that the service hosting the model is running properly. This is known as operational monitoring, and it shares many practices with monitoring any standard software application. The goal is to answer basic but significant questions about the service's availability and responsiveness.
The primary metrics you should track include:
- Error rates, such as the share of requests that return HTTP status codes like 500 (Internal Server Error), which point to bugs in the code or infrastructure failures.
- Prediction latency, the time the service takes to return a prediction after receiving a request.

Prediction latency for a deployed model. The spike around 15:00 crossed the predefined alert threshold, signaling a performance issue that requires investigation.
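As a minimal sketch, the check below computes a p95 latency and an error rate from a hypothetical request log and prints an alert when either crosses a threshold. The log format, threshold values, and percentile choice are illustrative assumptions; in practice these signals usually come from your existing application monitoring stack rather than ad hoc scripts.

```python
import numpy as np

# Hypothetical request log: (latency in milliseconds, HTTP status code).
request_log = [(42, 200), (51, 200), (47, 500), (1200, 200), (38, 200)]

LATENCY_P95_THRESHOLD_MS = 500   # assumed alert threshold
ERROR_RATE_THRESHOLD = 0.01      # assumed alert threshold (1% of requests)

latencies = [latency for latency, _ in request_log]
server_errors = [status for _, status in request_log if status >= 500]

# Operational health signals: tail latency and server-side error rate.
latency_p95 = np.percentile(latencies, 95)
error_rate = len(server_errors) / len(request_log)

if latency_p95 > LATENCY_P95_THRESHOLD_MS:
    print(f"ALERT: p95 latency {latency_p95:.0f} ms exceeds threshold")
if error_rate > ERROR_RATE_THRESHOLD:
    print(f"ALERT: error rate {error_rate:.1%} exceeds threshold")
```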
A model can be perfectly healthy from an operational standpoint, responding quickly and without errors, yet still provide increasingly inaccurate predictions. This is a unique challenge in machine learning systems. Monitoring prediction quality involves tracking how well the model's predictions align with real outcomes. This is often complicated by two underlying problems: data drift and concept drift.
Data drift, also called input drift, occurs when the statistical properties of the data being fed to the model in production change from the data it was trained on. Models learn patterns from training data, so when the input data no longer resembles that training data, the learned patterns may no longer apply, and prediction accuracy will suffer.
For example, imagine a loan approval model trained on data from a stable economic period. If a recession begins, applicant data (income levels, employment status, credit inquiries) will change significantly. The model, unfamiliar with these new patterns, will likely perform poorly.
You can detect data drift by comparing the distribution of features in the live prediction requests against the distributions from the training dataset.
A comparison showing a shift in the age distribution between the training data and live production data. The model is now seeing a much younger population, which is an example of data drift.
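One common way to run such a comparison is a two-sample statistical test per feature. The sketch below applies SciPy's Kolmogorov-Smirnov test to an age feature; the synthetic data and the significance threshold are assumptions for illustration, and dedicated drift-detection tools apply the same idea across many features at once.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Stand-ins for real data: ages in the training set versus ages seen in
# recent production requests (a younger population, as in the figure).
training_ages = rng.normal(loc=45, scale=10, size=5_000)
production_ages = rng.normal(loc=32, scale=8, size=1_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value means the two
# samples are unlikely to come from the same distribution.
statistic, p_value = stats.ks_2samp(training_ages, production_ages)

DRIFT_P_VALUE = 0.01  # assumed significance threshold
if p_value < DRIFT_P_VALUE:
    print(f"Possible data drift in 'age' "
          f"(KS statistic={statistic:.3f}, p={p_value:.2e})")
```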
Concept drift is a more subtle problem where the statistical properties of the input data might stay the same, but the relationship between the inputs and the output target changes. The meaning of the data has shifted.
Consider a spam detection model. Spammers are constantly inventing new tactics. An email with certain keywords that was benign a year ago might now be a strong indicator of a new phishing campaign. The input features (the words in the email) haven't changed, but their relationship to the concept of "spam" has. Concept drift is the world changing in ways that make your model's learned rules obsolete.
To monitor for these issues, you track a different set of metrics: statistical comparisons between the feature distributions seen in live requests and those in the training data, and, once real outcomes become available, how closely the model's recent predictions match them.
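As a rough sketch of the second kind of check, the snippet below compares accuracy on recently labeled outcomes against the accuracy measured at validation time. The arrays, baseline value, and allowed drop are placeholders; in practice the main complication is the delay before ground-truth labels (for example, whether a loan actually defaulted) become available.

```python
import numpy as np

# Hypothetical recent predictions joined with their eventual true outcomes.
recent_predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
recent_outcomes = np.array([1, 0, 0, 1, 0, 0, 0, 1, 1, 0])

BASELINE_ACCURACY = 0.92   # accuracy measured at validation time (assumed)
ALLOWED_DROP = 0.05        # alert if live accuracy falls this far below it

# Live accuracy on the most recent window of labeled data.
live_accuracy = (recent_predictions == recent_outcomes).mean()

if live_accuracy < BASELINE_ACCURACY - ALLOWED_DROP:
    print(f"ALERT: live accuracy {live_accuracy:.2f} is well below "
          f"the validation baseline of {BASELINE_ACCURACY:.2f}")
```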
Monitoring is not a passive activity. When a monitor detects a problem, it should trigger a workflow. This establishes a continuous loop that keeps the model effective over time.
A diagram of the MLOps monitoring loop. Detection of an issue triggers a process of diagnosis, retraining, and redeployment to maintain model performance.
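A hedged sketch of what that trigger might look like in code is shown below. The four callables are placeholders for whatever drift checks, training pipeline, evaluation suite, and deployment tooling your team already uses; the point is only the shape of the loop.

```python
def run_monitoring_cycle(check_drift, retrain_model, evaluate_model, deploy_model):
    """One pass of the detect -> diagnose -> retrain -> redeploy loop.

    Each argument is a callable supplied by your own tooling (placeholders here).
    """
    report = check_drift()
    if not report["drift_detected"]:
        return  # nothing to do; keep monitoring

    # Diagnose: inspect which features or metrics triggered the alert.
    print("Drift detected in:", report["affected_features"])

    # Retrain on fresh data, validate, and only redeploy if quality is acceptable.
    candidate = retrain_model()
    if evaluate_model(candidate) >= report["required_score"]:
        deploy_model(candidate)
    else:
        print("Candidate model did not pass evaluation; keeping the current model")
```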
This loop connects monitoring directly back to the development and deployment stages of the ML lifecycle, embodying the core principles of MLOps. A decline in performance is not a failure but a signal that the system is working as intended and that it is time for the model to adapt.