Having explored algorithms like PC, FCI, GES, and LiNGAM for uncovering causal structures, we face a fundamental question: how well did they actually perform? Assessing the accuracy and reliability of a discovered causal graph is as important as the discovery process itself. This is particularly challenging because, in most real-world scenarios, the true underlying causal graph is unknown.
Evaluating causal discovery algorithms rigorously often relies on simulation studies or benchmark datasets where the ground truth structure is known. In these settings, we can employ quantitative metrics to compare the estimated graph (G^) with the true graph (G).
Metrics for Graph Comparison (Ground Truth Known)
When you have the luxury of knowing the true causal graph, several metrics can quantify the performance of your discovery algorithm.
Structural Hamming Distance (SHD)
The most common metric is the Structural Hamming Distance (SHD). It measures the number of structural differences between the estimated graph and the true graph. Specifically, SHD counts the minimum number of single-edge modifications (additions, deletions, or reversals) required to make the estimated graph match the true graph.
Let G=(V,E) be the true graph and G^=(V,E^) be the estimated graph, both over the same set of vertices V. The SHD is calculated by comparing their adjacency matrices. An edge A→B in G^ is:
- A true positive (TP) if A→B exists in G.
- A false positive (FP) if neither A→B nor B→A exists in G.
- A reversed edge if B→A exists in G.
An edge A→B in G is a false negative (FN) if neither A→B nor B→A exists in G^.
SHD is the total number of single-edge operations needed to transform G^ into G: each false positive must be deleted, each false negative added, and each reversed edge flipped. A lower SHD indicates a better match. For example, if the true graph is A→B←C and the algorithm estimates A→B→C, the SHD is 1 (one edge reversal: B→C in G^ versus B←C in G).
Example comparing a true graph (A→B←C) and an estimated graph (G^:A→B→C). The SHD is 1 due to the single reversed edge B←C vs B→C.
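To make the bookkeeping concrete, here is a minimal sketch of SHD computed directly from adjacency matrices with NumPy. It assumes both graphs are DAGs over the same variable ordering, encoded so that adj[i, j] = 1 means i→j; the function shd below is our own illustration, not the API of any particular library.

```python
import numpy as np

def shd(true_adj: np.ndarray, est_adj: np.ndarray) -> int:
    """Structural Hamming Distance between two DAGs given as 0/1 adjacency
    matrices, where adj[i, j] == 1 encodes the directed edge i -> j."""
    n = true_adj.shape[0]
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair of nodes once
            true_pair = (true_adj[i, j], true_adj[j, i])
            est_pair = (est_adj[i, j], est_adj[j, i])
            if true_pair != est_pair:
                # Any mismatch on a pair costs exactly one operation:
                # an addition, a deletion, or a reversal.
                distance += 1
    return distance

# True graph A -> B <- C versus estimate A -> B -> C (variables ordered A, B, C).
G_true = np.array([[0, 1, 0],
                   [0, 0, 0],
                   [0, 1, 0]])
G_est = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
print(shd(G_true, G_est))  # 1: the single reversed edge between B and C
```

This sketch treats a reversed edge as a single modification, matching the example above; some implementations instead count a reversal as two operations (a deletion plus an addition), so check the convention before comparing reported numbers.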
Precision, Recall, and F1-Score
Borrowing from classification evaluation, we can adapt precision, recall, and F1-score to assess edge discovery. These metrics are often computed on the skeleton (the undirected graph obtained by ignoring edge orientations) or sometimes on the directed edges themselves; a small code sketch follows the list below.
- Structural Precision: The proportion of edges in the estimated graph G^ that are also present (ignoring direction) in the true graph G.
$$\text{Precision}_{\text{skel}} = \frac{|\text{Correct Skeleton Edges}|}{|\text{Edges in } \hat{G}|}$$
- Structural Recall: The proportion of edges in the true graph G that are also present (ignoring direction) in the estimated graph G^.
$$\text{Recall}_{\text{skel}} = \frac{|\text{Correct Skeleton Edges}|}{|\text{Edges in } G|}$$
- Structural F1-Score: The harmonic mean of precision and recall for the skeleton.
$$F1_{\text{skel}} = \frac{2 \cdot \text{Precision}_{\text{skel}} \cdot \text{Recall}_{\text{skel}}}{\text{Precision}_{\text{skel}} + \text{Recall}_{\text{skel}}}$$
Similar metrics can be defined for orientation accuracy, focusing only on the correctly identified directions among the correctly identified adjacencies.
- Orientation Precision: Proportion of correctly oriented edges among all oriented edges in G^.
- Orientation Recall: Proportion of correctly oriented edges in G^ relative to all oriented edges in G.
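The skeleton metrics translate directly into code. The sketch below reuses the adjacency-matrix encoding from the SHD example (adj[i, j] = 1 for i→j) and is again an illustrative helper rather than a library function.

```python
import numpy as np

def skeleton_metrics(true_adj: np.ndarray, est_adj: np.ndarray) -> dict:
    """Skeleton precision, recall, and F1 from 0/1 adjacency matrices,
    ignoring edge orientation."""
    # Symmetrize to obtain the skeletons (adjacent or not, regardless of direction).
    true_skel = (true_adj + true_adj.T) > 0
    est_skel = (est_adj + est_adj.T) > 0
    # Count each adjacency once by looking only at the upper triangle.
    iu = np.triu_indices_from(true_skel, k=1)
    true_edges = true_skel[iu]
    est_edges = est_skel[iu]

    tp = np.sum(true_edges & est_edges)         # adjacencies found in both graphs
    precision = tp / max(np.sum(est_edges), 1)  # correct / edges in the estimate
    recall = tp / max(np.sum(true_edges), 1)    # correct / edges in the truth
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# On the SHD example above, the skeletons agree exactly even though one edge
# is reversed, so all three skeleton metrics equal 1.0 while SHD is 1.
```

That contrast is exactly why the orientation metrics are worth reporting alongside the skeleton ones: a perfect skeleton score can still hide orientation errors.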
Evaluation Without Ground Truth
In practice, you rarely know the true causal graph. How can you assess performance then?
- Sensitivity Analysis: Evaluate how sensitive the discovered structure is to the algorithm's hyperparameters. For constraint-based methods (like PC or FCI), vary the significance level (α) for independence tests. For score-based methods (like GES), try different score functions (e.g., BIC, BDeu). If the core structure remains stable across reasonable parameter variations, it inspires more confidence.
- Algorithm Comparison: Apply multiple discovery algorithms with different underlying assumptions (e.g., a constraint-based method like FCI and a score-based method like GES, or methods assuming linearity like LiNGAM versus those that don't). Consistent structural features found by diverse algorithms are more likely to be real. Discrepancies highlight areas of uncertainty or potential assumption violations.
- Domain Knowledge: Consult with domain experts. Does the discovered graph align with established knowledge or plausible mechanisms? Are there edges that seem highly unlikely, or missing edges that are strongly expected? This qualitative assessment is indispensable.
- Bootstrap Resampling: Apply the discovery algorithm to multiple bootstrap samples drawn from your original data. Analyze the frequency with which specific edges appear in the discovered graphs across samples. Edges appearing with high frequency are considered more robust (a sketch of this procedure follows the list below).
- Downstream Task Evaluation: This is an indirect but practical approach. Use the discovered causal graph to inform a subsequent task, such as building a predictive model (using causally identified features) or estimating treatment effects (using the graph to identify confounders). If the causally-informed model performs better (e.g., generalizes better to new environments, yields more plausible effect estimates) than a baseline model ignoring causal structure, it provides evidence supporting the discovered graph's utility, if not its absolute correctness.
- Consistency with Interventional Data: If even limited interventional data is available (e.g., from A/B tests or natural experiments), check if the causal relationships implied by the discovered observational graph are consistent with the observed effects of these interventions.
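As a rough illustration of the bootstrap idea, the sketch below reruns a discovery algorithm on resampled data and tallies edge frequencies. The callable discover_graph is a placeholder for whatever method you use (for example, a wrapper that runs PC and returns a 0/1 adjacency matrix); it is not a specific library API.

```python
import numpy as np

def edge_stability(data: np.ndarray, discover_graph, n_bootstrap: int = 100,
                   seed: int = 0) -> np.ndarray:
    """Fraction of bootstrap resamples in which each directed edge is recovered.

    `data` is a (samples x variables) matrix; `discover_graph` maps such a matrix
    to a 0/1 adjacency matrix with adj[i, j] == 1 meaning i -> j.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_vars = data.shape
    counts = np.zeros((n_vars, n_vars))
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n_samples, size=n_samples)  # resample with replacement
        counts += discover_graph(data[idx])
    return counts / n_bootstrap

# Usage sketch:
# freqs = edge_stability(data, discover_graph, n_bootstrap=200)
# stable = freqs >= 0.8  # keep edges recovered in at least 80% of resamples
```

The 80% cutoff here is arbitrary; in practice the full frequency matrix is more informative than any single threshold and can be reported alongside the point-estimate graph.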
Practical Considerations
- Choice of Metric: The best metric depends on the goal. SHD provides a single overall distance. Precision/Recall/F1 can give more insight into specific error types (too many edges vs. too few).
- Graph Representation: Be mindful of what the algorithm outputs. Some algorithms (like PC) output a Completed Partially Directed Acyclic Graph (CPDAG), representing an equivalence class, while others might output a specific DAG. Metrics should be applied appropriately (e.g., SHD can be adapted for CPDAGs).
- Computational Cost: Some evaluation methods, like bootstrapping, can be computationally intensive, especially with complex algorithms or large datasets.
Evaluating causal discovery algorithms is inherently challenging due to the frequent absence of ground truth. Combining quantitative metrics (in simulations) with robustness checks, algorithm comparisons, domain knowledge validation, and downstream task performance assessment provides a more comprehensive picture of an algorithm's success in a specific application. No single method is foolproof, so a multi-faceted evaluation strategy is generally recommended.