Deploying an optimized model artifact directly into a production environment is a high-risk activity that can jeopardize system stability and business outcomes. While a standard software deployment might rely on a simple rolling update, models introduce statistical and performance uncertainties that demand a more controlled approach. A new model version, even one that performed well in offline validation, could exhibit higher latency, produce unexpected prediction distributions on live data, or negatively impact user-facing business metrics. Safe deployment strategies are therefore not optional; they are a fundamental component of production MLOps.
This section focuses on two primary progressive delivery techniques for machine learning models: A/B testing and canary deployments. We will examine how to implement these patterns using a service mesh to precisely control traffic flow, enabling you to de-risk model updates and validate their performance with live users before committing to a full rollout.
While both A/B testing and canary deployments involve running multiple model versions simultaneously, their goals are distinct.
A/B Testing is an online experimentation technique used to compare the performance of two or more model versions against specific business metrics. For example, you might test a new recommendation model (version B) against the current production model (version A) to see which one generates a higher click-through rate. Traffic is typically split for a predetermined duration, and the results are statistically analyzed to select a winner.
Canary Deployment is a risk-mitigation strategy focused on safely verifying the stability of a new model version. The new model (the "canary") initially receives a small fraction of production traffic (e.g., 1% or 5%). System-level metrics like latency and error rates are closely monitored. If the canary performs as expected, its traffic share is gradually increased until it handles 100% of requests. If issues arise, traffic is immediately routed back to the stable version, minimizing user impact.
In short, you use a canary to ensure a new model doesn't break things, and you use an A/B test to prove a new model is quantifiably better. These strategies can also be combined; a new model might first go through a short canary phase to verify its technical stability, followed by a longer A/B test to measure its business impact.
Managing the network traffic for these deployment patterns requires a sophisticated routing layer. Implementing this logic directly within your application code or model server is brittle and complex. A far more effective approach is to use a service mesh like Istio or Linkerd.
A service mesh operates as a dedicated infrastructure layer that intercepts and controls all network communication between your microservices. It decouples traffic management from your application, allowing you to define complex routing rules declaratively using Kubernetes custom resources.
For a canary deployment, the most common technique is weight-based routing. Using an Istio VirtualService, you can specify the percentage of traffic that should be directed to each model version.
Consider a scenario where we have two Kubernetes Deployments serving our model: model-v1 (the stable version) and model-v2 (the canary). The following VirtualService configuration directs 95% of traffic to v1 and 5% to v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-serving-vs
spec:
  hosts:
  - model-inference.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v1
      weight: 95   # stable version keeps the bulk of production traffic
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v2
      weight: 5    # canary receives a small, controlled fraction
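The v1 and v2 subsets referenced in the route rules are not defined by the VirtualService itself; Istio resolves them through a DestinationRule. The following is a minimal sketch, assuming the stable and canary Deployments label their pods with version: v1 and version: v2 respectively (the resource name model-inference-dr is illustrative).

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-inference-dr
spec:
  host: model-inference.production.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1   # selects pods of the stable Deployment (assumed label)
  - name: v2
    labels:
      version: v2   # selects pods of the canary Deployment (assumed label)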
To progress the canary rollout, you update the weight fields in this manifest, for instance to 80/20, then 50/50, and finally 0/100; as we'll see later, this process can be automated.
A typical traffic allocation during a four-phase canary rollout. The new version's traffic share increases as confidence in its stability grows.
For A/B tests or internal "dark launches," you may need more granular control. For example, you might want to route all requests from your internal QA team to the new model, while all other users continue to see the old one. This can be achieved with content-based routing, specifically by matching on an HTTP header.
The following VirtualService routes any request containing the header x-user-group: internal-testers to model-v2, while all other traffic goes to model-v1. This allows for targeted testing without affecting the general user population.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-serving-dark-launch
spec:
  hosts:
  - model-inference.production.svc.cluster.local
  http:
  - match:
    - headers:
        x-user-group:
          exact: internal-testers   # requests from the internal QA group
    route:
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v2
  - route:   # all other traffic stays on the stable version
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v1
This same pattern is invaluable for A/B testing. You can assign users to an experiment group (e.g., group-a or group-b) at the application layer, add a corresponding header, and use the service mesh to route them to the correct model version.
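As a minimal sketch of that setup, assuming the application tags each request with a hypothetical x-experiment-group header, the same header-matching mechanism splits the experiment between model versions:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-serving-ab-test
spec:
  hosts:
  - model-inference.production.svc.cluster.local
  http:
  - match:
    - headers:
        x-experiment-group:
          exact: group-b   # users assigned to the treatment group
    route:
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v2
  - route:   # group-a and any unassigned users see the control model
    - destination:
        host: model-inference.production.svc.cluster.local
        subset: v1

Because group assignment happens once at the application layer, each user consistently sees the same model version for the duration of the experiment, which keeps the comparison statistically clean.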
Manually adjusting traffic weights and monitoring dashboards is tedious and error-prone. The true power of this architecture is realized through automation. Progressive delivery controllers like Argo Rollouts or Flagger integrate with both the service mesh (Istio) and your monitoring system (e.g., Prometheus) to create an automated, closed-loop deployment process.
This process works as follows:
1. The controller deploys the new model version alongside the stable one; the new version initially receives no production traffic.
2. It updates the service mesh routing rules to shift a small percentage of requests to the new version.
3. It queries the monitoring system (e.g., Prometheus) and evaluates metrics such as error rate and latency against predefined thresholds.
4. If the checks pass, it increases the traffic weight and repeats the analysis until the new version serves all traffic; if any check fails, it routes all traffic back to the stable version and aborts the rollout.
This automated feedback loop ensures that a problematic model version is contained and rolled back automatically, often before a human operator even notices the issue.
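With Flagger, for example, this loop can be expressed declaratively. The following is a minimal sketch, assuming Flagger is installed with Istio as its mesh provider and Prometheus as its metrics backend, and that the model server runs in a Deployment named model-inference listening on port 8080 (names, weights, and thresholds are illustrative).

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: model-inference
  namespace: production
spec:
  targetRef:                       # the Deployment Flagger manages during rollouts
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference          # assumed Deployment name
  service:
    port: 80
    targetPort: 8080               # assumed container port of the model server
  analysis:
    interval: 1m                   # how often metrics are evaluated
    threshold: 5                   # failed checks allowed before automatic rollback
    maxWeight: 50                  # stop shifting once the canary reaches 50%
    stepWeight: 10                 # increase canary traffic in 10% increments
    metrics:
    - name: request-success-rate   # built-in metric backed by Prometheus
      thresholdRange:
        min: 99                    # minimum success rate, in percent
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500                   # maximum p99 latency, in milliseconds
      interval: 1m

On each interval, Flagger shifts traffic weights in the mesh, checks these metrics, and either continues the rollout, promotes the canary, or rolls it back, with no manual intervention required.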
An automated canary deployment workflow. The controller shifts traffic, analyzes metrics from a provider like Prometheus, and either promotes the new version or automatically rolls back on failure.
By integrating these safe deployment practices into your MLOps pipeline, you transform model deployment from a high-stakes, manual event into a routine, automated, and low-risk process. This allows your team to iterate on models more quickly and confidently, delivering improvements to users without compromising the stability and performance of your production services.