Traditional machine learning evaluation often stops at predictive performance on unseen data drawn from the same distribution as the training data. Metrics like accuracy, AUC, or mean squared error tell us how well a model mimics correlations observed in the data. However, as outlined in the chapter introduction, this is often insufficient when deploying systems intended to support decisions or to understand the impact of actions. Relying solely on predictive metrics can lead to models that perform well on paper but fail when used to guide interventions, or that generalize poorly to slightly different settings. Evaluating models through a causal lens provides a more rigorous assessment of their suitability for real-world decision-making and intervention planning.
This section details how to move beyond standard evaluation by incorporating causal concepts. We will examine metrics and methodologies designed to assess models based on their understanding of underlying causal mechanisms.
Assessing Predictive Performance under Interventions
A primary goal in many ML applications is not just to predict what will happen under observed conditions, but what would happen if we actively intervened on the system. Standard evaluation doesn't measure this directly. Causal evaluation focuses on assessing a model's ability to predict outcomes under hypothetical interventions, often represented using Pearl's do-calculus notation, P(Y∣do(X=x)).
Consider a model predicting crop yield (Y) from fertilizer amount (X) and rainfall (W). A standard evaluation assesses P(Y∣X=x,W=w), the outcome under conditions as they were actually observed. A farmer, however, wants to know the yield if they set the fertilizer amount to x, irrespective of how rainfall would otherwise have influenced their fertilizer choice. This requires evaluating the model's prediction against the interventional distribution P(Y∣do(X=x)).
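To see the gap concretely, here is a minimal simulation sketch in Python. The structural equations, the coefficients, and the assumption that rainfall influences the fertilizer choice are all illustrative; the point is only that the observational slope and the interventional effect can differ sharply when the treatment is driven by other causes of the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy SCM (an assumption for illustration): rainfall W drives both the farmer's
# fertilizer choice X and the yield Y, so W confounds the X -> Y relationship.
W = rng.normal(0, 1, n)                      # rainfall
X = 0.8 * W + rng.normal(0, 1, n)            # fertilizer responds to rainfall
Y = 2.0 * X + 3.0 * W + rng.normal(0, 1, n)  # yield; true causal effect of X is 2.0

# Observational slope E[Y | X] mixes the causal effect with the rainfall pathway.
obs_slope = np.cov(X, Y)[0, 1] / np.var(X)

# Interventional slope: re-simulate with X set by do(X = x), cutting the W -> X edge.
X_do = rng.normal(0, 1, n)                   # fertilizer assigned independently of rainfall
Y_do = 2.0 * X_do + 3.0 * W + rng.normal(0, 1, n)
int_slope = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)

print(f"observational slope ≈ {obs_slope:.2f}")   # ≈ 3.5, inflated by the rainfall pathway
print(f"interventional slope ≈ {int_slope:.2f}")  # ≈ 2.0, the causal effect
```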
Methodology:
- Define the Causal Estimand: Clearly state the intervention effect you want the model to predict (e.g., Average Treatment Effect (ATE) = E[Y∣do(X=1)]−E[Y∣do(X=0)], or Conditional Average Treatment Effect (CATE) = E[Y∣do(X=1),Z=z]−E[Y∣do(X=0),Z=z]).
- Obtain Ground Truth (if possible):
- RCT Data: Randomized Controlled Trials (RCTs) or A/B tests directly estimate P(Y∣do(X=x)) by actually performing the intervention. This provides a gold standard for comparison.
- Causal Effect Estimation: If only observational data is available, use robust methods from previous chapters (e.g., Double Machine Learning, Causal Forests, IV, RDD) applied to the test set to estimate the target intervention effect. This serves as a benchmark, acknowledging it relies on untestable assumptions.
- Simulations: Use a known Structural Causal Model (SCM) to simulate data under observation and intervention, providing perfect ground truth for model evaluation in a controlled setting.
- Evaluate the Model: Compare the ML model's predictions under intervention scenarios to the ground truth or benchmark estimate. For instance, if the model predicts individual outcomes Ŷᵢ, compute the model-implied ATE (e.g., the average Ŷᵢ for units simulated under do(X=1) minus the average Ŷᵢ for units simulated under do(X=0)) and compare it to the benchmark ATE, as in the sketch following this list.
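A minimal sketch of this comparison, using a simulated SCM as ground truth (the simulation option above) and scikit-learn's RandomForestRegressor as a stand-in for the model under evaluation, might look as follows. All names and numbers are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical ground-truth SCM: confounder Z drives both treatment X and outcome Y.
# The true ATE of X on Y is 1.0 by construction.
Z = rng.normal(0, 1, n)
X = rng.binomial(1, 1 / (1 + np.exp(-1.5 * Z)))       # confounded treatment assignment
Y = 1.0 * X + 2.0 * Z + rng.normal(0, 1, n)

# Naive observational contrast, i.e. what a purely associational read-off gives.
naive_diff = Y[X == 1].mean() - Y[X == 0].mean()

# Model under evaluation: fit on observational data with the confounder as a feature.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=50, random_state=0)
model.fit(np.column_stack([X, Z]), Y)

# Model-implied ATE: predict every unit under do(X=1) and do(X=0), then average.
y1 = model.predict(np.column_stack([np.ones(n), Z]))
y0 = model.predict(np.column_stack([np.zeros(n), Z]))
model_ate = np.mean(y1 - y0)

print(f"true ATE (from the SCM):      1.00")
print(f"naive group difference:       {naive_diff:.2f}")  # badly inflated by Z
print(f"model-implied (adjusted) ATE: {model_ate:.2f}")   # should land much closer to 1.0
```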
Figure: Scatter plot comparing CATE estimates from a benchmark causal model (e.g., a causal forest fit on test data) against the CATE predictions of the machine learning model under evaluation. Points close to the diagonal indicate good alignment in predicting intervention effects.
Evaluating Counterfactual Fairness
Standard fairness metrics often assess associations between sensitive attributes (e.g., race, gender) and outcomes or errors. However, these associations can arise from various pathways, some considered fair (e.g., differences due to qualifications correlated with a sensitive attribute) and others unfair (e.g., direct discrimination). Counterfactual fairness asks: "Would the model's prediction for an individual change if their sensitive attribute had been different, but all other background factors remained the same?"
This requires reasoning about counterfactuals, evaluating quantities such as P(Ŷ_{A←a′} = Ŷ ∣ A=a, X=x): the probability that the model's prediction for an individual observed with A=a and features X=x would remain unchanged had their sensitive attribute been a′ instead. Here Ŷ is the model prediction, Ŷ_{A←a′} its counterfactual value under the attribute change, A the sensitive attribute, and X the other features. Evaluating this requires a causal model specifying the relationships between A, X, and the true outcome Y.
Figure: Simplified causal graph illustrating potential pathways influencing a model prediction (Ŷ). Counterfactual fairness aims to isolate the direct path from the sensitive attribute (A) to the prediction (Ŷ) that does not operate through legitimate intermediate factors (X).
Methodology:
- Posit a Causal Graph: Define the assumed causal relationships between the sensitive attribute A, other features X, the true outcome Y, and potentially unobserved factors U.
- Estimate Counterfactuals: Using the assumed causal model (often an SCM) and observational data, estimate the counterfactual prediction Ŷ_{A←a′} for individuals observed with A=a. This typically follows the abduction-action-prediction recipe and involves adjusting for confounders identified from the graph.
- Calculate Fairness Metrics: Compute metrics based on the difference between the observed prediction Ŷ and the estimated counterfactual prediction Ŷ_{A←a′}. Examples include the proportion of individuals for whom the prediction changes, or the average difference in prediction scores (see the sketch after this list).
- Sensitivity Analysis: Since the causal graph is an assumption, perform sensitivity analysis to assess how violations (e.g., unobserved confounding) might affect the fairness conclusions.
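The abduction-action-prediction recipe behind steps 2 and 3 can be sketched on a toy, fully assumed linear SCM. The structural equations, coefficients, and classifier below are hypothetical placeholders; in practice the SCM would come from the posited graph and estimated structural equations, and the sensitivity analysis of step 4 would probe how the flip rate changes as those assumptions are varied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5_000

# Assumed (hypothetical) linear SCM: sensitive attribute A affects a feature X,
# which together with A influences the true outcome Y used to train the model.
A = rng.binomial(1, 0.5, n)
U_X = rng.normal(0, 1, n)
X = 1.0 * A + U_X                                         # A -> X pathway
Y = (0.8 * X + 0.5 * A + rng.normal(0, 1, n) > 1).astype(int)

clf = LogisticRegression().fit(np.column_stack([A, X]), Y)

# Abduction: recover the exogenous noise U_X implied by the assumed SCM.
U_X_hat = X - 1.0 * A
# Action: flip the sensitive attribute.  Prediction: recompute X and the model output.
A_cf = 1 - A
X_cf = 1.0 * A_cf + U_X_hat

pred_factual = clf.predict(np.column_stack([A, X]))
pred_counterfactual = clf.predict(np.column_stack([A_cf, X_cf]))

flip_rate = np.mean(pred_factual != pred_counterfactual)
print(f"share of individuals whose prediction flips under A <- a': {flip_rate:.1%}")
```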
Assessing Transportability and Generalizability
ML models are often deployed in environments different from their training setting. Distribution shift is a common problem, but causal inference provides tools to analyze why performance degrades and when a model might be transportable. Causal graphs incorporating selection variables or representing differences between domains can help identify potential issues.
Example: A model predicting patient risk is trained in City A and deployed in City B.
- Covariate Shift: Patient demographics (X) differ across cities (P_A(X) ≠ P_B(X)), but the relationship P(Y∣X) remains the same. Standard domain adaptation techniques often suffice.
- Concept Drift: The causal mechanism itself changes (P_A(Y∣do(X=x)) ≠ P_B(Y∣do(X=x))). For example, treatment effectiveness differs due to unmeasured factors prevalent in City B. The model may not transport well for intervention planning.
- Selection Bias: The way data is sampled differs. For instance, City A's data comes from routine checkups, while City B's data includes more emergency room visits, biasing the sample. A causal graph with selection nodes can model this.
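A rough, simulation-based illustration of how one might distinguish the first two cases is sketched below. It is not a full transportability analysis: the "mechanism check" probes only P(Y∣X), which stands in for P(Y∣do(X)) here solely because the toy simulation has no hidden confounding, and all domains, coefficients, and settings are assumed for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

def make_domain(n, x_mean, slope):
    """Simulate one city; x_mean controls P(X), slope controls the X -> Y mechanism."""
    X = rng.normal(x_mean, 1, (n, 1))
    Y = slope * X[:, 0] + rng.normal(0, 1, n)
    return X, Y

XA, YA = make_domain(5_000, x_mean=0.0, slope=2.0)   # City A (training domain)
XB, YB = make_domain(5_000, x_mean=1.0, slope=2.0)   # City B: covariate shift only
# XB, YB = make_domain(5_000, x_mean=1.0, slope=0.5) # uncomment to add concept drift

# 1) Covariate-shift check: can a classifier tell the two domains apart from X alone?
X_all = np.vstack([XA, XB])
d_all = np.concatenate([np.zeros(len(XA), dtype=int), np.ones(len(XB), dtype=int)])
auc = cross_val_score(LogisticRegression(), X_all, d_all, cv=5, scoring="roc_auc").mean()
print(f"domain-classifier AUC (0.5 means P_A(X) = P_B(X)): {auc:.2f}")

# 2) Crude mechanism check: does a model of P(Y | X) fitted in A keep working in B?
outcome_model = LinearRegression().fit(XA, YA)
print(f"MSE in City A: {mean_squared_error(YA, outcome_model.predict(XA)):.2f}")
print(f"MSE in City B: {mean_squared_error(YB, outcome_model.predict(XB)):.2f}")
# Similar MSEs point to covariate shift alone; a jump in City B suggests the
# mechanism itself has changed, at least in the absence of hidden confounding.
```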
Figure: Causal graphs illustrating potential differences between a training domain (A) and a target domain (B). Evaluating transportability involves assessing differences in feature distributions (P(X)), causal mechanisms (P(Y∣do(X))), and selection processes (P(S∣X,Y)).
Methodology:
- Model Domain Differences: Use causal graphs (potentially augmented with selection diagrams or domain nodes) to represent assumptions about how the source and target domains differ.
- Identify Transportability Conditions: Apply causal transportability theory to determine if the causal effect or prediction target in the target domain is identifiable from the source domain data and potentially limited target domain data.
- Evaluate on Target Domain: If possible, collect labeled or unlabeled data from the target domain. Evaluate the model's predictive performance and, more significantly, its ability to predict intervention effects (estimated using appropriate methods) in the target setting.
- Domain Adaptation Informed by Causality: Use insights from the causal graph to guide domain adaptation techniques, focusing on adjusting for factors identified as sources of non-transportability.
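As one causally motivated tool for steps 3 and 4, when only unlabeled target data are available and the causal graph supports a pure covariate-shift assumption (P_A(Y∣X) = P_B(Y∣X)), source-domain errors can be importance-weighted by an estimated density ratio P_B(X)/P_A(X) to approximate target-domain risk. The sketch below is illustrative only: the simulated domains, the deliberately misspecified outcome model, and the logistic-regression density-ratio estimator are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
n = 5_000

# Source domain A (labeled) and target domain B (covariates only), with covariate
# shift in X.  The outcome model below is deliberately misspecified (linear fit to
# a quadratic mechanism) so that its error profile depends on X.
XA = rng.normal(0.0, 1, (n, 1))
YA = XA[:, 0] ** 2 + rng.normal(0, 1, n)
XB = rng.normal(1.0, 1, (n, 1))
YB = XB[:, 0] ** 2 + rng.normal(0, 1, n)              # target labels unseen in practice

model = LinearRegression().fit(XA, YA)

# Density-ratio weights w(x) ≈ P_B(x) / P_A(x) from a domain classifier:
# w(x) ∝ P(domain = B | x) / P(domain = A | x).
X_all = np.vstack([XA, XB])
d_all = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
dom = LogisticRegression().fit(X_all, d_all)
p_b = dom.predict_proba(XA)[:, 1]
w = p_b / (1 - p_b)
w *= n / w.sum()                                      # normalize weights to mean 1

sq_err_A = (YA - model.predict(XA)) ** 2
print(f"naive source-domain MSE:              {sq_err_A.mean():.2f}")
print(f"importance-weighted target estimate:  {np.average(sq_err_A, weights=w):.2f}")
print(f"actual target-domain MSE (oracle):    {mean_squared_error(YB, model.predict(XB)):.2f}")
# The weighted estimate is only valid under the covariate-shift assumption
# P_A(Y | X) = P_B(Y | X); the causal graph is what tells you whether that holds.
```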
Evaluating Models for Policy Decisions
When an ML model is used to derive a policy (e.g., deciding who receives a loan, a medical treatment, or a promotion), the evaluation should focus on the causal impact of implementing that policy. This often falls under the umbrella of off-policy evaluation (OPE), particularly relevant in reinforcement learning contexts but applicable more broadly.
Methodology:
- Define the Policy: Specify the rule derived from the ML model (e.g., treat if predicted CATE > threshold τ).
- Estimate Policy Value: Use observational data and methods like Inverse Propensity Scoring (IPS), Direct Method (using a separate outcome model), or Doubly Robust estimation (combining both) to estimate the expected outcome if the model-derived policy were deployed.
- Compare Policies: Evaluate the value of the ML-derived policy against alternative policies (e.g., treat everyone, treat no one, policy from a different model, existing heuristic policy).
- Check Assumptions: OPE methods rely on causal assumptions (e.g., sequential ignorability, positivity). Assess the plausibility of these assumptions and perform sensitivity analyses.
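A minimal sketch of steps 1-3 on simulated logged data is shown below. The simulated data, the threshold policy (with the CATE taken as known rather than model-estimated), and the simple, non-cross-fitted nuisance models are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
n = 20_000

# Simulated logged data (all values hypothetical): covariate Z, logged binary
# treatment X with a Z-dependent propensity, observed outcome Y.
Z = rng.normal(0, 1, n)
X = rng.binomial(1, 1 / (1 + np.exp(-Z)))
tau = 1.0 + Z                                     # heterogeneous treatment effect
Y = 0.5 * Z + tau * X + rng.normal(0, 1, n)

# Step 1 - the candidate policy: treat whenever the CATE is positive.  In practice
# tau would be the CATE predicted by the ML model; here it is known by construction.
pi = (tau > 0).astype(int)

# Nuisance models: logging propensity and per-arm outcome regressions.
Z2d = Z.reshape(-1, 1)
e_hat = LogisticRegression().fit(Z2d, X).predict_proba(Z2d)[:, 1]
mu1 = LinearRegression().fit(Z2d[X == 1], Y[X == 1])
mu0 = LinearRegression().fit(Z2d[X == 0], Y[X == 0])

# Step 2 - estimate the value of deploying pi.
mu_pi = np.where(pi == 1, mu1.predict(Z2d), mu0.predict(Z2d))     # direct method
mu_logged = np.where(X == 1, mu1.predict(Z2d), mu0.predict(Z2d))
p_logged = np.where(X == 1, e_hat, 1 - e_hat)                     # P(logged action | Z)
match = (pi == X) / p_logged
ips = np.mean(match * Y)                                          # inverse propensity scoring
dr = np.mean(mu_pi + match * (Y - mu_logged))                     # doubly robust

print(f"direct-method value of pi: {mu_pi.mean():.2f}")
print(f"IPS value of pi:           {ips:.2f}")
print(f"doubly robust value of pi: {dr:.2f}")

# Step 3 - compare against a baseline policy (treat everyone) under the same estimator.
print(f"direct-method value of treat-everyone: {mu1.predict(Z2d).mean():.2f}")
```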
Integrating Causal Evaluation into the Workflow
Moving beyond standard metrics requires integrating these causal evaluation techniques into the model development lifecycle:
- During Development: Use simulations based on hypothesized SCMs to test if models can recover known causal effects.
- During Testing/Validation: Compare model-predicted intervention effects against estimates from robust causal inference methods on hold-out data or against results from A/B tests. Evaluate counterfactual fairness using appropriate estimators and sensitivity checks. Assess transportability if deployment context differs from training.
- Post-Deployment: Continuously monitor model performance not just for predictive accuracy drift but also for potential changes in underlying causal mechanisms (Chapter 6, "Monitoring ML Systems for Causal Stability"). Use causal methods to analyze the real-world impact of decisions made based on the model.
By adopting these causal evaluation perspectives, you move toward building machine learning systems that are not merely mimics of predictive patterns but tools that offer reliable guidance for interventions and decision-making in complex, real-world environments.