Deploying an optimized model artifact directly into a production environment is a high-risk activity that can jeopardize system stability and business outcomes. While a standard software deployment might rely on a simple rolling update, models introduce statistical and performance uncertainties that demand a more controlled approach. A new model version, even one that performed well in offline validation, could exhibit higher latency, produce unexpected prediction distributions on live data, or negatively impact user-facing business metrics. Safe deployment strategies are therefore not optional; they are a fundamental component of production MLOps.
This section focuses on two primary progressive delivery techniques for machine learning models: A/B testing and canary deployments. We will examine how to implement these patterns using a service mesh to precisely control traffic flow, enabling you to de-risk model updates and validate their performance with live users before committing to a full rollout.
While both A/B testing and canary deployments involve running multiple model versions simultaneously, their goals are distinct.
A/B Testing is an online experimentation technique used to compare the performance of two or more model versions against specific business metrics. For example, you might test a new recommendation model (version B) against the current production model (version A) to see which one generates a higher click-through rate. Traffic is typically split for a predetermined duration, and the results are statistically analyzed to select a winner.
Canary Deployment is a risk-mitigation strategy focused on safely verifying the stability of a new model version. The new model (the "canary") initially receives a small fraction of production traffic (e.g., 1% or 5%). System-level metrics like latency and error rates are closely monitored. If the canary performs as expected, its traffic share is gradually increased until it handles 100% of requests. If issues arise, traffic is immediately routed back to the stable version, minimizing user impact.
In short, you use a canary to ensure a new model doesn't break things, and you use an A/B test to prove a new model is quantifiably better. These strategies can also be combined; a new model might first go through a short canary phase to verify its technical stability, followed by a longer A/B test to measure its business impact.
Managing the network traffic for these deployment patterns requires a sophisticated routing layer. Implementing this logic directly within your application code or model server is brittle and complex. A far more effective approach is to use a service mesh like Istio or Linkerd.
A service mesh operates as a dedicated infrastructure layer that intercepts and controls all network communication between your microservices. It decouples traffic management from your application, allowing you to define complex routing rules declaratively using Kubernetes custom resources.
For a canary deployment, the most common technique is weight-based routing. Using an Istio VirtualService, you can specify the percentage of traffic that should be directed to each model version.
Consider a scenario where we have two Kubernetes Deployments serving our model: model-v1 (the stable version) and model-v2 (the canary). The following VirtualService configuration directs 95% of traffic to v1 and 5% to v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-serving-vs
spec:
  hosts:
  - model-inference.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v1
      weight: 95   # stable version keeps the bulk of production traffic
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v2
      weight: 5    # canary receives a small, controlled fraction
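The v1 and v2 subsets referenced in the route rules are not defined by the VirtualService itself; Istio resolves them through a DestinationRule. The following is a minimal sketch, assuming the stable and canary Deployments label their pods with version: v1 and version: v2 respectively (the resource name model-inference-dr is illustrative).

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-inference-dr
spec:
  host: model-inference.production.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1   # selects pods of the stable Deployment (assumed label)
  - name: v2
    labels:
      version: v2   # selects pods of the canary Deployment (assumed label)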
To progress the canary rollout, you update the weight fields in this manifest, for instance to 80/20, then 50/50, and finally 0/100; as we'll see later, this process can be automated.
A typical traffic allocation during a four-phase canary rollout. The new version's traffic share increases as confidence in its stability grows.
For A/B tests or internal "dark launches," you may need more granular control. For example, you might want to route all requests from your internal QA team to the new model, while all other users continue to see the old one. This can be achieved with content-based routing, specifically by matching on an HTTP header.
The following VirtualService routes any request containing the header x-user-group: internal-testers to model-v2, while all other traffic goes to model-v1. This allows for targeted testing without affecting the general user population.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-serving-dark-launch
spec:
  hosts:
  - model-inference.production.svc.cluster.local
  http:
  - match:
    - headers:
        x-user-group:
          exact: internal-testers   # requests from the internal QA group
    route:
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v2
  - route:   # all other traffic stays on the stable version
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v1
This same pattern is invaluable for A/B testing. You can assign users to an experiment group (e.g., group-a or group-b) at the application layer, add a corresponding header, and use the service mesh to route them to the correct model version.
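As a minimal sketch of that setup, assuming the application tags each request with a hypothetical x-experiment-group header, the same header-matching mechanism splits the experiment between model versions:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-serving-ab-test
spec:
  hosts:
  - model-inference.production.svc.cluster.local
  http:
  - match:
    - headers:
        x-experiment-group:
          exact: group-b   # users assigned to the treatment group
    route:
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v2
  - route:   # group-a and any unassigned users see the control model
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v1

Because group assignment happens once at the application layer, each user consistently sees the same model version for the duration of the experiment, which keeps the comparison statistically clean.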
Manually adjusting traffic weights and monitoring dashboards is tedious and error-prone. The true power of this architecture is realized through automation. Progressive delivery controllers like Argo Rollouts or Flagger integrate with both the service mesh (Istio) and your monitoring system (e.g., Prometheus) to create an automated, closed-loop deployment process.
This process works as follows:
1. The controller deploys the new model version alongside the stable one; the new version initially receives no production traffic.
2. It updates the service mesh routing rules to shift a small percentage of requests to the new version.
3. It queries the monitoring system (e.g., Prometheus) and evaluates metrics such as error rate and latency against predefined thresholds.
4. If the checks pass, it increases the traffic weight and repeats the analysis until the new version serves all traffic; if any check fails, it routes all traffic back to the stable version and aborts the rollout.
This automated feedback loop ensures that a problematic model version is contained and rolled back automatically, often before a human operator even notices the issue.
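With Flagger, for example, this loop can be expressed declaratively. The following is a minimal sketch, assuming Flagger is installed with Istio as its mesh provider and Prometheus as its metrics backend, and that the model server runs in a Deployment named model-inference listening on port 8080 (names, weights, and thresholds are illustrative).

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: model-inference
  namespace: production
spec:
  targetRef:                       # the Deployment Flagger manages during rollouts
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference          # assumed Deployment name
  service:
    port: 80
    targetPort: 8080               # assumed container port of the model server
  analysis:
    interval: 1m                   # how often metrics are evaluated
    threshold: 5                   # failed checks allowed before automatic rollback
    maxWeight: 50                  # stop shifting once the canary reaches 50%
    stepWeight: 10                 # increase canary traffic in 10% increments
    metrics:
    - name: request-success-rate   # built-in metric backed by Prometheus
      thresholdRange:
        min: 99                    # minimum success rate, in percent
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500                   # maximum p99 latency, in milliseconds
      interval: 1m

On each interval, Flagger shifts traffic weights in the mesh, checks these metrics, and either continues the rollout, promotes the canary, or rolls it back, with no manual intervention required.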
An automated canary deployment workflow. The controller shifts traffic, analyzes metrics from a provider like Prometheus, and either promotes the new version or automatically rolls back on failure.
By integrating these safe deployment practices into your MLOps pipeline, you transform model deployment from a high-stakes, manual event into a routine, automated, and low-risk process. This allows your team to iterate on models more quickly and confidently, delivering improvements to users without compromising the stability and performance of your production services.