While observational data provides the foundation for causal discovery, relying on it alone often leaves us with fundamental ambiguities. As discussed previously, multiple Directed Acyclic Graphs (DAGs) can represent the same set of conditional independencies observed in data. These graphs form a Markov Equivalence Class (MEC). Without further information, distinguishing the true causal structure within an MEC solely from observational data is generally impossible. This is where interventional data becomes indispensable. Interventions actively manipulate the system, providing direct evidence about causal directions that passive observation cannot.
Consider a simple system with three variables: X, Y, and Z. Suppose observational data reveals that X and Z are conditionally independent given Y (X⊥Z∣Y), but X and Y are dependent, and Y and Z are dependent. This independence structure is compatible with multiple DAGs within the same MEC, including:

- The chain X→Y→Z
- The reverse chain X←Y←Z
- The fork X←Y→Z
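For concreteness, take two members of this class, the chain X→Y→Z and the fork X←Y→Z. A short simulation (a sketch using linear-Gaussian mechanisms with illustrative coefficients) confirms that both produce the same observational signature: X and Z are marginally dependent but conditionally independent given Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr_xz_given_y(X, Y, Z):
    # Partial correlation of X and Z given Y; ~0 indicates X ⊥ Z | Y
    r = np.corrcoef([X, Y, Z])
    rxy, rxz, ryz = r[0, 1], r[0, 2], r[1, 2]
    return (rxz - rxy * ryz) / np.sqrt((1 - rxy**2) * (1 - ryz**2))

# Chain X -> Y -> Z
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)
chain_pc = partial_corr_xz_given_y(X, Y, Z)

# Fork X <- Y -> Z
Y = rng.normal(size=n)
X = Y + rng.normal(size=n)
Z = Y + rng.normal(size=n)
fork_pc = partial_corr_xz_given_y(X, Y, Z)

# Both structures yield a near-zero partial correlation given Y
print(abs(chain_pc) < 0.02, abs(fork_pc) < 0.02)  # True True
```

An independence test applied to either dataset would therefore return the same verdict, which is exactly why the two graphs sit in one MEC.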
Observational data alone cannot distinguish between these structures. However, performing an intervention changes the game. Imagine we can intervene on Y, setting its value to a specific constant y′, denoted as do(Y=y′). This action effectively removes all incoming causal arrows into Y in the true graph.
By comparing the interventional distribution P(X,Z∣do(Y=y′)) with its observational counterpart, we can distinguish between these structures. For instance, if the marginal distribution of X shifts when we intervene on Y, then Y must be a cause of X, which rules out the chain X→Y→Z. Conversely, if P(X) is unchanged while the distribution of Z tracks the value y′, only the chain X→Y→Z remains consistent with the evidence.
The diagram below illustrates how observing independencies under intervention helps distinguish between two observationally equivalent graphs.
Comparing two graphs from the same MEC. Observational data shows X⊥Z∣Y for both. Intervening on Y (red node) removes Y's incoming edges. In the top structure (X→Y→Z), the intervention severs the X→Y edge, so the distribution of X is unaffected while Z still responds to the value set for Y. In the bottom structure (X←Y→Z), Y has no incoming edges to remove, and fixing Y shifts the distribution of X. In both graphs X and Z become independent once Y is held fixed, but the response of X to the intervention separates the two structures.
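A small simulation makes this asymmetry concrete. The sketch below (linear-Gaussian mechanisms with illustrative coefficients) applies do(Y=3) in both structures and compares the marginal distribution of X: it shifts only in the fork, where Y is a cause of X.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

def chain(do_y=None):
    # X -> Y -> Z; do_y clamps Y, cutting the X -> Y edge
    X = rng.normal(size=n)
    Y = 2.0 * X + rng.normal(size=n) if do_y is None else np.full(n, do_y)
    Z = 1.5 * Y + rng.normal(size=n)
    return X, Y, Z

def fork(do_y=None):
    # X <- Y -> Z; do_y clamps Y, which feeds both X and Z
    Y = rng.normal(size=n) if do_y is None else np.full(n, do_y)
    X = 2.0 * Y + rng.normal(size=n)
    Z = 1.5 * Y + rng.normal(size=n)
    return X, Y, Z

x_chain, _, _ = chain(do_y=3.0)
x_fork, _, _ = fork(do_y=3.0)

# Under do(Y=3): the chain leaves P(X) untouched (mean ~ 0),
# while the fork shifts X's mean to ~ 2 * 3 = 6
print(round(x_chain.mean()), round(x_fork.mean()))  # 0 6
```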
Interventions can vary in their nature and scope, providing different kinds of information:

- Perfect (hard) interventions, do(X=x′), fix a variable to a constant and sever all of its incoming edges.
- Soft interventions modify a variable's mechanism (for example, shifting its noise distribution) without removing its dependence on its parents.
- Intervention targets may be known (as assumed by algorithms like GIES) or unknown or imprecisely characterized (a setting approaches like ICP can handle).
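The difference between a hard and a soft intervention can be made concrete with a toy mechanism (the model Y = 2X + noise and all constants below are illustrative): a soft shift preserves the X–Y dependence, while a hard clamp destroys it entirely.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
X = rng.normal(size=n)

# Observational mechanism: Y listens to X
Y_obs = 2.0 * X + rng.normal(size=n)

# Hard intervention do(Y=5): Y is a constant, the X -> Y edge is cut
Y_hard = np.full(n, 5.0)

# Soft intervention: Y's noise mean is shifted, but the edge survives
Y_soft = 2.0 * X + rng.normal(loc=3.0, size=n)

print(round(np.corrcoef(X, Y_obs)[0, 1], 1))   # 0.9: dependent
print(round(np.corrcoef(X, Y_soft)[0, 1], 1))  # 0.9: still dependent
print(Y_hard.std())                            # 0.0: no variation left
```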
Several causal discovery algorithms are designed to exploit the additional information provided by interventions:
Constraint-based Extensions (e.g., modifications to PC/FCI): Standard constraint-based algorithms rely on conditional independence tests on observational data. When interventional data is available, these tests can be applied to subsets of the data corresponding to specific interventions. For a perfect intervention do(Xk=xk′), we know that Xk has no parents in that subset. This information can be used to orient edges that were previously undirected in the observational MEC. For instance, if X−Y is an undirected edge based on observational data, and the distribution of X remains unchanged under do(Y=y′), this suggests the orientation X→Y; if P(X) shifts under the intervention, it suggests Y→X. FCI (Fast Causal Inference), which handles latent confounders, can also incorporate interventional data to refine the resulting Partial Ancestral Graph (PAG).
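A minimal sketch of this orientation logic, assuming a perfect intervention with a known target: compare the distribution of X across the observational and interventional regimes with a two-sample Kolmogorov–Smirnov test (`scipy.stats.ks_2samp`). The mechanisms and coefficients below are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n = 5_000

# True (hidden) structure: Y -> X
y_obs = rng.normal(size=n)
x_obs = y_obs + rng.normal(size=n)

# Regime with a perfect intervention do(Y=2): Y is clamped to 2
y_int = np.full(n, 2.0)
x_int = y_int + rng.normal(size=n)

# If Y causes X, clamping Y shifts P(X); a two-sample test on X
# across regimes detects this shift and orients the edge.
stat, p_value = ks_2samp(x_obs, x_int)
orientation = 'Y -> X' if p_value < 0.05 else 'X -> Y'
print(orientation)  # Y -> X: P(X) changed under the intervention
```

In practice one would correct for multiple testing across edges, but the regime-comparison idea is the same.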
Score-based Extensions (e.g., GIES): Greedy Interventional Equivalence Search (GIES) extends the Greedy Equivalence Search (GES) algorithm. GES searches through the space of MECs using a score (like BIC). GIES modifies the scoring procedure to handle a mix of observational and interventional data. It assumes known intervention targets. The score for a candidate graph is typically calculated by summing scores over different data regimes (observational and various interventions), where the likelihood term for each regime accounts for the modified graph structure under intervention (i.e., removing incoming edges for perfectly intervened nodes).
$$\text{Score}(G, D) = \sum_{j=0}^{m} \text{Score}_{\text{local}}(G, D_j)$$

Where D0 is the observational dataset, Dj for j>0 are datasets under different intervention regimes, and Score_local evaluates the fit of the (potentially modified) graph G to the data Dj.
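This decomposition can be sketched directly. The code below is a simplified illustration, not the actual GIES implementation: it uses linear-Gaussian local BIC-style scores, assumes known intervention targets, and sums local scores per regime while dropping the local term of any perfectly intervened node.

```python
import numpy as np

def local_bic(child, parents, data):
    # Gaussian local score: regress child on parents, penalize parameters
    n = data.shape[0]
    y = data[:, child]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * (len(parents) + 1) * np.log(n)

def interventional_score(graph, regimes):
    # Sum local scores over regimes; a perfectly intervened node has its
    # incoming edges removed, so its local term is skipped in that regime
    total = 0.0
    for data, targets in regimes:
        for child, parents in graph.items():
            if child not in targets:
                total += local_bic(child, parents, data)
    return total

# Simulated regimes for a two-variable system with true graph X -> Y
rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
obs = np.column_stack([x, x + rng.normal(size=n)])
intv = np.column_stack([rng.normal(size=n), np.full(n, 2.0)])  # do(Y=2)
regimes = [(obs, set()), (intv, {1})]

g_true = {0: [], 1: [0]}  # X -> Y
g_rev = {0: [1], 1: []}   # Y -> X
print(interventional_score(g_true, regimes) >
      interventional_score(g_rev, regimes))  # True
```

Observationally the two graphs score identically (they are Markov equivalent); the interventional regime is what breaks the tie in favor of the true structure.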
Invariant Causal Prediction (ICP): ICP leverages data from multiple "environments" or "settings". These environments could arise naturally (e.g., data from different locations) or be explicitly created by interventions. The core idea is that the true causal predictors Pa(Y) of a target variable Y remain invariant across these environments, meaning the conditional distribution P(Y∣Pa(Y)) is stable. Non-causal predictors (e.g., descendants or variables correlated through confounding) may exhibit statistical relationships that change across environments. ICP searches for a set of predictors S such that the conditional distribution P(Y∣XS) is invariant across all observed environments. This approach is particularly useful when interventions are present but perhaps not perfectly characterized.
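The invariance idea can be sketched in a few lines. This toy version is not the full ICP procedure: it assumes a hypothetical SCM X1 → Y → X2, two environments that shift only X1, and a crude residual-mean check in place of ICP's proper invariance tests. It accepts the predictor sets whose regression residuals look stable across environments and intersects them.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)

def make_env(shift, n=10_000):
    # Assumed true SCM: X1 -> Y -> X2; environments shift X1, never Y
    x1 = rng.normal(loc=shift, size=n)
    y = 1.5 * x1 + rng.normal(size=n)
    x2 = y + rng.normal(size=n)
    return np.column_stack([x1, x2]), y

envs = [make_env(0.0), make_env(3.0)]

def is_invariant(subset, tol=0.1):
    # Pooled regression of Y on X_S; invariance is (crudely) checked by
    # requiring a near-zero residual mean within every environment
    Xp = np.vstack([X[:, subset] for X, _ in envs])
    yp = np.concatenate([y for _, y in envs])
    D = np.column_stack([np.ones(len(yp)), Xp])
    beta, *_ = np.linalg.lstsq(D, yp, rcond=None)
    resid = yp - D @ beta
    start = 0
    for X, y in envs:
        if abs(resid[start:start + len(y)].mean()) > tol:
            return False
        start += len(y)
    return True

subsets = [list(s) for r in (1, 2) for s in itertools.combinations(range(2), r)]
accepted = [s for s in subsets if is_invariant(s)]

# ICP reports the intersection of all accepted (invariant) sets
causal = set(accepted[0]).intersection(*map(set, accepted)) if accepted else set()
print(sorted(causal))  # [0]: X1, the true parent of Y
```

The set {X2} is rejected because X2 is a descendant of Y: its relationship to Y changes when the environments shift X1, exactly the instability ICP exploits.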
Often, the most effective strategy involves using both observational and interventional data. Observational data provides broad coverage and helps identify the initial MEC. Targeted interventions can then be designed or utilized to resolve the remaining ambiguities within that class. Algorithms like GIES are explicitly designed for this mixed setting.
When working with combined data, it's important to structure it correctly. Typically, this involves pooling the data and adding columns indicating the intervention status for each sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_obs, n_int_Y = 200, 100  # sample sizes for each regime

# Sample observational data (illustrative mechanism: X -> Y -> Z)
X = rng.normal(size=n_obs)
Y = 2.0 * X + rng.normal(size=n_obs)
Z = -1.5 * Y + rng.normal(size=n_obs)
obs_data = pd.DataFrame({
    'X': X,
    'Y': Y,
    'Z': Z,
    'intervention_target': ['None'] * n_obs
})

# Sample interventional data (intervened on Y): the X -> Y edge is cut
X_i = rng.normal(size=n_int_Y)
Y_i = np.full(n_int_Y, 10.0)  # fixed value via intervention
Z_i = -1.5 * Y_i + rng.normal(size=n_int_Y)
int_data_Y = pd.DataFrame({
    'X': X_i,
    'Y': Y_i,
    'Z': Z_i,
    'intervention_target': ['Y'] * n_int_Y
})

# Combine datasets, keeping the regime label for each sample
combined_data = pd.concat([obs_data, int_data_Y], ignore_index=True)

# Conceptual usage with a hypothetical discovery function
# from causallearn.search.ScoreBased.gies import gies
# Assuming 'gies' can take intervention target information:
# result = gies(combined_data[['X', 'Y', 'Z']].values,
#               intervention_targets=combined_data['intervention_target'].tolist())
# 'result' would contain information about the learned graph structure
If you have the resources to perform interventions, the question arises: which interventions are most informative? Optimal experimental design for causal discovery aims to select a sequence of interventions that most efficiently resolves the ambiguities in the current estimated MEC or PAG. This often involves identifying interventions that are expected to maximally reduce the number of possible causal structures consistent with the data. This is an advanced topic, often involving complex scoring or information-theoretic criteria.
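As a toy illustration, one greedy criterion is to pick the target whose perfect intervention best splits the current MEC. The sketch below uses the hypothetical three-member MEC from the X, Y, Z example (graphs encoded as child → parent-set dicts): it partitions members by the descendant set of each candidate target, since those are the variables that respond to the intervention, and minimizes the worst-case remaining ambiguity.

```python
from collections import defaultdict

# The three Markov-equivalent graphs for X ⊥ Z | Y, as {child: parents}
mec = [
    {'X': set(), 'Y': {'X'}, 'Z': {'Y'}},  # chain X -> Y -> Z
    {'Z': set(), 'Y': {'Z'}, 'X': {'Y'}},  # chain X <- Y <- Z
    {'Y': set(), 'X': {'Y'}, 'Z': {'Y'}},  # fork  X <- Y -> Z
]

def descendants(g, v):
    # Variables reachable from v along directed edges: they respond to do(v)
    out, frontier = set(), [v]
    while frontier:
        u = frontier.pop()
        for child, parents in g.items():
            if u in parents and child not in out:
                out.add(child)
                frontier.append(child)
    return frozenset(out)

def worst_case_ambiguity(mec, target):
    # Members sharing the same descendant set of `target` would remain
    # indistinguishable after intervening on it
    parts = defaultdict(int)
    for g in mec:
        parts[descendants(g, target)] += 1
    return max(parts.values())

best = min('XYZ', key=lambda v: worst_case_ambiguity(mec, v))
print(best)  # Y: intervening on Y separates all three members
```

Here intervening on Y is maximally informative because each member graph predicts a different set of responding variables ({Z}, {X}, and {X, Z}), whereas intervening on X or Z leaves two members indistinguishable.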
While powerful, using interventional data comes with challenges:

- Cost and feasibility: interventions can be expensive, slow, or impossible to perform, for practical or ethical reasons.
- Imperfect interventions: real manipulations may be "leaky," affect unintended variables, or only shift a mechanism rather than fix it.
- Unknown targets: the variables actually affected by an intervention may not be precisely known.
- Limited samples: each interventional regime often contains far fewer samples than the observational data, weakening statistical tests within that regime.
Despite these challenges, the ability of interventions to break symmetries and orient edges makes interventional data a highly valuable resource for robust causal discovery, pushing beyond the limitations of purely observational approaches. When available, integrating interventional data significantly strengthens the confidence in the inferred causal structures.
© 2025 ApX Machine Learning