Okay, let's put the theory into practice. The previous sections laid out the formal machinery: Structural Causal Models (SCMs), graphical representations like DAGs, the rules of do-calculus, and various identification strategies. Now, we'll work through applying this logic to determine if a desired causal effect can be estimated from observed data, even when standard adjustment criteria aren't sufficient. Remember, identification precedes estimation; it tells us what to estimate, assuming our causal model is correct.
Consider the causal structure represented by the following Directed Acyclic Graph (DAG). We have variables W, X, M, and Y, where X is the treatment, Y is the outcome, M is a mediator, and W is an observed covariate; the edges are W → X, X → M, M → Y, and W → Y. Crucially, assume there's an unobserved common cause U affecting both X and Y (U → X and U → Y).
Causal graph with observed variables W, X, M, Y and an unobserved confounder U.
Our goal is to identify the causal effect of X on Y, represented by the interventional distribution P(Y∣do(X=x)).
Analysis:
Backdoor Criterion: Can we find a set of observed variables that blocks all backdoor paths from X to Y? The backdoor paths are X ← W → Y and X ← U → Y. Conditioning on W blocks the first, but no observed set can block the second because U is unobserved. The backdoor criterion therefore fails in this graph.
Frontdoor Criterion: Can we find a set of observed variables M that intercepts all directed paths from X to Y, satisfies the required blocking conditions, and for which the effects P(M∣do(X)) and P(Y∣do(M)) are identifiable? Here the mediator M qualifies: it intercepts the only directed path X → M → Y, every backdoor path from X to M is blocked (each passes through the collider at Y), and every backdoor path from M to Y is blocked by conditioning on X.
Applying the frontdoor adjustment gives P(Y∣do(X=x)) = Σ_m P(m∣x) Σ_x' P(Y∣m, x') P(x'). This expression involves only probabilities estimable from observational data. Thus, the effect P(Y∣do(X=x)) is identifiable via the frontdoor criterion in this specific graph.
Takeaway: Even with an unobserved confounder U, careful application of criteria like the frontdoor adjustment (or systematically applying do-calculus) can lead to identification.
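To make the frontdoor expression above concrete, here is a minimal sketch (not part of the scenario's formal derivation) that simulates a system with this structure, keeps U unobserved, and evaluates the frontdoor formula from empirical frequencies. The variable names, parameter values, and the omission of W are illustrative assumptions; the point is only that the frontdoor estimate targets the interventional quantity while the naive conditional difference does not.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

# Simulate a frontdoor structure: U confounds X and Y, and X affects Y only through M.
# (W is omitted here for brevity; it is not needed to evaluate the frontdoor formula.)
U = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.2 + 0.6 * U)                # U -> X
M = rng.binomial(1, 0.1 + 0.7 * X)                # X -> M
Y = rng.binomial(1, 0.1 + 0.5 * M + 0.3 * U)      # M -> Y, U -> Y
df = pd.DataFrame({"X": X, "M": M, "Y": Y})       # U stays unobserved

def frontdoor_p_y1_do_x(data, x):
    """P(Y=1 | do(X=x)) = sum_m P(m|x) * sum_x' P(Y=1|m,x') P(x')."""
    p_x = data["X"].value_counts(normalize=True)
    p_m_given_x = data.loc[data["X"] == x, "M"].value_counts(normalize=True)
    total = 0.0
    for m, p_m in p_m_given_x.items():
        inner = 0.0
        for x_prime, p_xp in p_x.items():
            cell = data[(data["M"] == m) & (data["X"] == x_prime)]
            inner += cell["Y"].mean() * p_xp
        total += p_m * inner
    return total

frontdoor_ate = frontdoor_p_y1_do_x(df, 1) - frontdoor_p_y1_do_x(df, 0)
naive_diff = df.loc[df["X"] == 1, "Y"].mean() - df.loc[df["X"] == 0, "Y"].mean()
# By construction E[Y | do(X=x)] = 0.3 + 0.35x, so the true effect is 0.35;
# the frontdoor estimate lands near it, while the naive difference is biased by U.
print(f"Frontdoor estimate: {frontdoor_ate:.3f}   Naive difference: {naive_diff:.3f}")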
Consider a simplified system with potential feedback between X and Y, along with an observed covariate Z and an unobserved confounder U; the edges are Z → X, X → Y, Y → X, U → X, and U → Y. We might represent this using a cyclic graph, although interpretation requires care (often implying an underlying temporal process or equilibrium state).
Causal graph with feedback between X and Y, an observed covariate Z, and unobserved confounder U.
Can we identify P(Y∣do(X=x))?
Analysis:
Graph modified by the intervention do(X=x), removing incoming edges to X.
In this modified graph (the original graph with all edges into X removed), the only factor influencing Y apart from the fixed X=x is U. We need an expression for P(Y∣do(X=x)) in terms of the original observational distribution. The edge X → Y remains; the edge Y → X is gone; and the path X ← U → Y is severed at U → X by the intervention, but the link U → Y remains. Can conditioning on Z help? In the modified graph, Z is disconnected from Y, and in the original graph the backdoor paths from X to Y are X ← Y and X ← U → Y; Z lies on neither, so it does not block the path through U. No set of observed variables closes that path, and the standard adjustment criteria fail to identify P(Y∣do(X=x)) here.
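As a quick check on the claim that Z does not block the path through U, here is a small simulation (an illustrative sketch, not part of the scenario's formal analysis) of an acyclic simplification that keeps only Z → X, U → X, U → Y, and X → Y. The coefficients and noise model are assumptions; the point is that adjusting for the observed Z still leaves the estimate of the X → Y effect biased by the unobserved U.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

Z = rng.normal(size=n)
U = rng.normal(size=n)                          # unobserved confounder
X = 0.8 * Z + 1.0 * U + rng.normal(size=n)      # Z -> X, U -> X
Y = 0.5 * X + 1.0 * U + rng.normal(size=n)      # X -> Y (true effect 0.5), U -> Y

# OLS of Y on X and Z: the best adjustment available from observed data alone.
# The coefficient on X lands near 1.0 rather than 0.5, because Z cannot close
# the backdoor path through U.
design = np.column_stack([np.ones(n), X, Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"Coefficient on X after adjusting for Z: {coef[1]:.2f}  (true effect: 0.5)")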
Takeaway: Cycles, especially combined with unobserved confounding, often lead to non-identifiability using standard observational data. Advanced techniques or different data types (like interventional data or panel data, explored in later chapters) might be required. Sensitivity analysis becomes particularly important here to understand how assumptions about U might influence conclusions.
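In the spirit of the sensitivity analysis mentioned above, a toy linear version looks like the following sketch. It relies on standard omitted-variable reasoning for linear models (the bias in the X coefficient is roughly the product of U's effect on Y and U's partial association with X given Z); the observed coefficient and the grid of assumed values are illustrative, not estimates from real data.

import numpy as np

# Observed (biased) coefficient on X from a Z-adjusted regression,
# e.g. the simulation above; purely illustrative.
beta_obs = 1.0

# Assumed sensitivity parameters:
#   gamma: effect of U on Y;  delta: slope of U on X given Z.
gammas = np.array([0.0, 0.5, 1.0, 1.5])
deltas = np.array([0.0, 0.25, 0.5])

print("gamma (U->Y)  delta (U~X|Z)  bias-adjusted effect")
for gamma in gammas:
    for delta in deltas:
        adjusted = beta_obs - gamma * delta   # linear omitted-variable correction
        print(f"{gamma:12.2f}  {delta:13.2f}  {adjusted:20.2f}")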
While manual application of do-calculus is fundamental for understanding, software libraries can automate parts of this process for complex graphs. Tools like Python's DoWhy library allow you to define a causal graph (often in GML or DOT format) and specify a causal query (e.g., identify P(Y∣do(X=x))).
import numpy as np
import pandas as pd
import dowhy

# Illustrative synthetic data standing in for your observational dataset
# (replace with your own DataFrame containing columns W, X, M, Y).
rng = np.random.default_rng(0)
n = 5_000
W = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.3 + 0.4 * W)
M = rng.binomial(1, 0.2 + 0.6 * X)
Y = rng.binomial(1, 0.1 + 0.5 * M + 0.2 * W)
df = pd.DataFrame({"W": W, "X": X, "M": M, "Y": Y})

# Define the graph from Scenario 1 in DOT format (without U for simplicity here;
# how U is handled depends on library features).
causal_graph = """
digraph {
    W -> X;
    X -> M;
    M -> Y;
    W -> Y;
    // U [label="Unobserved"];
    // U -> X; U -> Y;
}
"""

# Initialize the CausalModel with the data, graph, treatment, and outcome
model = dowhy.CausalModel(
    data=df,
    treatment='X',
    outcome='Y',
    graph=causal_graph
)

# Attempt identification
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Print the resulting estimand(s)
print(identified_estimand)
Running such code (potentially needing adjustments for handling U explicitly if the library supports it) would attempt to apply identification rules automatically. For Scenario 1, it should ideally return the frontdoor estimand we derived. For Scenario 2, it would likely report non-identifiability given the cycle and implied confounding (if U were representable).
Caution: Automated tools are powerful aids but not substitutes for understanding. They rely on the correctness of the input graph and assumptions. Always critically evaluate the tool's output and understand why a particular estimand was returned or why identification failed. Your grasp of do-calculus and identification logic allows you to verify these results and troubleshoot when the tool struggles with complex or non-standard cases.
These exercises illustrate that identification is a critical reasoning step. Before fitting any machine learning model for causal effect estimation (as covered in Chapter 3), you must first determine if the effect is estimable from your data and assumptions, and what statistical quantity corresponds to the causal effect you seek.