Building on the foundation laid in prerequisite studies, we now formalize the framework of Structural Causal Models (SCMs). While you likely have a working understanding of SCMs and their associated Directed Acyclic Graphs (DAGs), a rigorous definition is essential for the advanced identification strategies and estimation techniques covered in this course. SCMs provide the mathematical machinery to represent causal assumptions explicitly, define interventions, and ultimately determine whether causal effects are identifiable from the available data.
Defining Structural Causal Models
An SCM provides a complete description of how a system's variables acquire their values. Formally, a Structural Causal Model M is a tuple M=⟨U,V,F,P(U)⟩, where:
- U is a set of exogenous variables. These are variables determined by factors outside the model. They represent stochastic background conditions, noise, or unmodeled influences. Think of them as the fundamental sources of randomness in the system.
- V is a set of endogenous variables. These are the variables whose values are determined by other variables within the model (both endogenous and exogenous). These are typically the variables of interest whose relationships we aim to understand, such as treatments, outcomes, and confounders.
- F is a set of structural equations, one for each endogenous variable Vi∈V. Each equation specifies the value of Vi as a function fi of other variables in the model, both endogenous (PAi⊆V∖{Vi}) and exogenous (Ui⊆U):
Vi=fi(PAi,Ui)
Here, PAi denotes the set of endogenous parents of Vi, meaning the variables in V that directly affect Vi according to the function fi. The Ui associated with Vi represents the exogenous influences specific to that variable's determination, capturing any unmodeled factors or intrinsic randomness.
- P(U) is a probability distribution over the exogenous variables U. This distribution captures the potential correlations and dependencies among the background factors influencing the system.
Crucially, each equation Vi=fi(PAi,Ui) represents an autonomous causal mechanism. It describes how Vi is determined based on its direct causes and specific exogenous factors, independent of the mechanisms governing other variables.
Key Assumptions and Their Implications
The SCM definition implicitly carries significant assumptions:
- Implicit Causal Sufficiency: The standard definition assumes that, for any pair of endogenous variables Vi,Vj, every common cause is either included in V or has its influence captured by the dependencies within P(U). If the Ui are not assumed independent, their dependency structure stands in for latent confounding variables. If we further assume the exogenous variables Ui are mutually independent (P(U)=∏P(Ui)), then the model asserts that there are no unobserved common causes among the endogenous variables V. This strong assumption is often unrealistic in practice, and Chapter 4 covers methods for addressing violations (unobserved confounding).
- Acyclicity (Typically): Most introductory treatments and many advanced methods assume the causal relationships encoded in F do not form directed cycles. That is, a variable cannot be its own ancestor. This implies that the relationships among the endogenous variables V can be represented by a Directed Acyclic Graph (DAG). We will specifically address models with cycles and feedback loops later in this chapter ("Addressing Cycles and Feedback in Causal Graphs").
- Modularity/Autonomy: This is a cornerstone assumption. It posits that each mechanism fi is distinct and can, in principle, be altered (e.g., by an intervention) without affecting the other mechanisms fj for j≠i. This property is what allows us to precisely define and analyze the effect of interventions.
From SCMs to Graphs and Distributions
An SCM naturally induces a causal graph (often a DAG). The nodes of this graph correspond to the endogenous variables V. A directed edge exists from Vj to Vi if and only if Vj is an argument in the function fi (i.e., Vj∈PAi). The absence of an edge from Vj to Vi encodes the strong causal assumption that Vj has no direct causal effect on Vi.
Consider a simple SCM:
U={UX,UY,UZ} assumed independent, standard normal.
V={X,Y,Z}
F={
X=fX(UX)=UX,
Z=fZ(X,UZ)=αX+UZ,
Y=fY(X,Z,UY)=βX+γZ+UY
}
P(U) is the joint distribution of the exogenous variables: here UX, UY, and UZ are mutually independent, each standard normal.
This SCM induces the following graph, with directed edges X→Z, X→Y, and Z→Y:
[Figure: causal graph of the example SCM. Solid arrows show direct causal influences among the endogenous variables; dashed lines indicate exogenous influences and are often omitted when the U are assumed independent.]
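The rule for reading edges off the structural equations can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the `parents` dictionary is a hypothetical encoding of the example SCM's parent sets PAi, and the standard-library `graphlib` module is used only as a convenient acyclicity check.

```python
from graphlib import TopologicalSorter

# Endogenous parents PA_i of each variable, read off the arguments of f_i
# in the example SCM (exogenous terms are not nodes of the graph).
parents = {
    "X": set(),
    "Z": {"X"},
    "Y": {"X", "Z"},
}

# An edge Vj -> Vi exists exactly when Vj is an argument of f_i.
edges = sorted((pj, vi) for vi, pa in parents.items() for pj in pa)

# TopologicalSorter takes a node -> predecessors mapping and raises
# graphlib.CycleError if the graph contains a directed cycle.
order = list(TopologicalSorter(parents).static_order())

print(edges)  # [('X', 'Y'), ('X', 'Z'), ('Z', 'Y')]
print(order)  # ['X', 'Z', 'Y']
```

The topological order is also the order in which the structural equations can be evaluated, since every variable appears after all of its parents.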
When the model is acyclic, an SCM M defines a unique observational probability distribution P(V) over the endogenous variables. This distribution arises from propagating the uncertainty from P(U) through the structural equations F.
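As a rough illustration of this propagation, the sketch below samples the exogenous variables of the example SCM and evaluates the equations in topological order. The function name `sample_observational` and the coefficient values for α, β, γ are assumptions chosen for the illustration; the text does not fix them.

```python
import numpy as np

def sample_observational(n, alpha, beta, gamma, seed=0):
    """Draw n samples of (X, Z, Y) from the example SCM."""
    rng = np.random.default_rng(seed)
    # P(U): independent standard normal exogenous variables.
    u_x, u_z, u_y = rng.standard_normal((3, n))
    # Evaluate the structural equations in a topological order.
    x = u_x
    z = alpha * x + u_z
    y = beta * x + gamma * z + u_y
    return x, z, y

# Illustrative coefficient values (assumed, not taken from the text).
alpha, beta, gamma = 0.5, 1.0, 2.0
x, z, y = sample_observational(200_000, alpha, beta, gamma)

# Substituting the equations gives Y = (beta + alpha*gamma) X + gamma U_Z + U_Y,
# so Var(Y) = (beta + alpha*gamma)^2 + gamma^2 + 1.
var_y_closed_form = (beta + alpha * gamma) ** 2 + gamma**2 + 1
print(np.var(y), var_y_closed_form)
```

With these assumed coefficients the closed-form variance is 9, and the sample variance should land close to it, which is a quick sanity check that the simulated P(V) matches the one implied by F and P(U).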
Interventions in SCMs: The do-Operator
The power of SCMs lies in their ability to formally model interventions. An intervention, denoted by the do-operator, represents an external action that forces a variable X∈V to take on a specific value x, overriding its natural causal mechanism.
Formally, performing the intervention do(X=x) on an SCM M=⟨U,V,F,P(U)⟩ creates a modified SCM, denoted Mx=⟨U,V,Fx,P(U)⟩. The only difference is in the set of functions Fx: the original equation for X, X=fX(PAX,UX), is replaced by the constant assignment X=x. All other structural equations in F remain unchanged, reflecting the modularity assumption.
The interventional distribution P(Y∣do(X=x)) is the distribution of Y in this modified model Mx. More generally, the post-intervention distribution over all endogenous variables, written PMx(V), is the distribution that Mx induces on V.
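One way to make the replacement of a single equation concrete is to store F as a mapping from variable names to functions and have the intervention return a modified copy. The sketch below does this for the example SCM; the helper names `do` and `sample`, the dictionary encoding, and the coefficient values are all assumptions made for illustration.

```python
import numpy as np

# Illustrative coefficients (assumed values).
ALPHA, BETA, GAMMA = 0.5, 1.0, 2.0

# F: one structural equation per endogenous variable, in a topological order.
# Each function receives the values computed so far (v) and the exogenous draws (u).
F = {
    "X": lambda v, u: u["X"],
    "Z": lambda v, u: ALPHA * v["X"] + u["Z"],
    "Y": lambda v, u: BETA * v["X"] + GAMMA * v["Z"] + u["Y"],
}

def do(equations, variable, value):
    """Return F_x: a copy of F with the equation for `variable` set to a constant."""
    modified = dict(equations)  # all other mechanisms are left untouched
    modified[variable] = lambda v, u: np.full_like(u[variable], value)
    return modified

def sample(equations, n, seed=0):
    """Sample all endogenous variables; P(U) is the same in M and M_x."""
    rng = np.random.default_rng(seed)
    u = {name: rng.standard_normal(n) for name in equations}
    v = {}
    for name, f in equations.items():
        v[name] = f(v, u)
    return v

F_x = do(F, "X", 1.5)
post = sample(F_x, 100_000)

# Under do(X=x), Y = (BETA + ALPHA*GAMMA) * x + GAMMA*U_Z + U_Y,
# so E[Y | do(X=1.5)] = 3.0 with the coefficients assumed here.
print(post["Y"].mean())
```

Because `do` copies the dictionary and overwrites one entry, the equations for Z and Y in `F_x` are the very same function objects as in `F`, which mirrors the modularity assumption.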
It is fundamental to understand that P(Y∣do(X=x)) is conceptually and often numerically different from the observational conditional distribution P(Y∣X=x).
- P(Y∣X=x) describes the distribution of Y in the subpopulation where X happens to be x, calculated from the original SCM M. This reflects passive observation.
- P(Y∣do(X=x)) describes the distribution of Y when we force X to be x, calculated from the modified SCM Mx. This reflects an active manipulation.
In our example graph, P(Y∣X=x) is shaped by the direct path X→Y and the indirect path X→Z→Y. Because X has no endogenous parents and the exogenous variables are independent, P(Y∣X=x) and P(Y∣do(X=x)) coincide in this particular model. If X and Y instead shared an unobserved common cause, represented by correlated UX,UY, then P(Y∣X=x) would also reflect this non-causal association. P(Y∣do(X=x)), by contrast, reflects only the causal influence flowing out of the manipulated X along the paths X→Y and X→Z→Y. It is computed in the model where X's value is fixed to x, which severs any incoming arrows to X.
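The gap between conditioning and intervening can be checked numerically in a variant of the example where UX and UY are correlated. The sketch below is a simulation under assumed values: the correlation `RHO`, the coefficients, and the function name `simulate` are hypothetical. In this linear-Gaussian setting the observational slope of E[Y∣X=x] in x is β+αγ+ρ, while the interventional slope is β+αγ.

```python
import numpy as np

# Assumed values; RHO is the correlation between U_X and U_Y.
ALPHA, BETA, GAMMA, RHO = 0.5, 1.0, 2.0, 0.6

def simulate(n, do_x=None, seed=0):
    """Sample (X, Y); if do_x is given, the equation for X is replaced by X = do_x."""
    rng = np.random.default_rng(seed)
    # Dependent exogenous variables: U_X and U_Y are correlated, U_Z is independent.
    cov = [[1.0, RHO], [RHO, 1.0]]
    u_x, u_y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    u_z = rng.standard_normal(n)
    # Observational regime uses X = U_X; do(X = x) overrides that mechanism.
    x = u_x if do_x is None else np.full(n, float(do_x))
    z = ALPHA * x + u_z
    y = BETA * x + GAMMA * z + u_y
    return x, y

# Observational slope of E[Y | X = x] in x, estimated by least squares.
x_obs, y_obs = simulate(200_000)
obs_slope = np.polyfit(x_obs, y_obs, 1)[0]

# Interventional slope: change in E[Y | do(X = x)] per unit change in x.
_, y_do1 = simulate(200_000, do_x=1.0, seed=1)
_, y_do0 = simulate(200_000, do_x=0.0, seed=2)
do_slope = y_do1.mean() - y_do0.mean()

print(obs_slope, do_slope)
```

With the values assumed here, the observational slope should come out near 2.6 and the interventional slope near 2.0; the difference is the contribution of the latent dependence between UX and UY.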
This precise definition of interventions within the SCM framework is the basis for causal effect identification. The central question becomes: can we compute P(Y∣do(X=x)) using only the observed distribution P(V) and assumptions about the causal structure (the graph), without needing full knowledge of the functions fi or the distribution P(U)? The do-calculus, explored next, provides formal rules for answering this question. Mastering the SCM formalism is the necessary first step towards applying these powerful tools.