While constraint-based algorithms leverage conditional independence tests to prune the space of possible causal structures, score-based discovery algorithms take a different route. They formulate causal discovery as an optimization problem: define a score function that measures how well a candidate graph structure fits the observed data, and then search the space of possible graphs for the one that maximizes (or minimizes) this score.
The primary challenge is the vast search space. The number of possible Directed Acyclic Graphs (DAGs) grows super-exponentially with the number of variables, making exhaustive search computationally infeasible for all but the smallest problems. Therefore, score-based methods rely on heuristic or greedy search strategies. We will discuss two prominent approaches: Greedy Equivalence Search (GES) and Linear Non-Gaussian Acyclic Model (LiNGAM).
Instead of searching the space of individual DAGs, the Greedy Equivalence Search (GES) algorithm cleverly operates on the space of Markov Equivalence Classes (MECs). An MEC represents a set of DAGs that encode the same set of conditional independence relations and, therefore, cannot be distinguished using observational data alone under standard assumptions. Searching over MECs significantly reduces the complexity of the search space.
GES typically proceeds in two phases:

- Forward phase: starting from the empty graph (the equivalence class with no edges), greedily apply the single edge addition that most improves the score, repeating until no addition yields an improvement.
- Backward phase: starting from the result of the forward phase, greedily apply the single edge deletion that most improves the score, repeating until no deletion yields an improvement.
The final output of GES is typically represented as a Completed Partially Directed Acyclic Graph (CPDAG), which graphically characterizes the MEC identified.
The choice of score function is fundamental to GES. Common scores include:

- The Bayesian Information Criterion (BIC), typically paired with a linear-Gaussian model for continuous data, which trades off log-likelihood (fit) against a penalty on the number of parameters (complexity).
- Bayesian Dirichlet scores such as BDeu for discrete data.

To operate on equivalence classes rather than individual DAGs, the score should be decomposable (a sum of per-node local scores) and score-equivalent (assigning the same value to every DAG in an MEC); BIC and BDeu satisfy both properties.
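As an illustration of a decomposable score, here is a minimal sketch of a BIC local score for one node under a linear-Gaussian model (a hypothetical helper, not any library's API); a graph's total score is the sum of these local scores over its nodes:

```python
import numpy as np

def local_bic_score(data, child, parents):
    """BIC local score of `child` given its candidate `parents`.

    data: (n, d) array; child: column index; parents: list of column indices.
    Returns a penalized log-likelihood, so higher is better.
    """
    n = data.shape[0]
    y = data[:, child]
    # Regress the child on an intercept plus its candidate parents
    Z = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / n  # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    penalty = 0.5 * Z.shape[1] * np.log(n)  # BIC complexity term
    return loglik - penalty
```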
GES relies on several assumptions:

- Causal sufficiency: there are no unmeasured common causes (latent confounders) of the observed variables.
- The causal Markov condition and faithfulness, which link the conditional independences in the data to the graph structure.
- The parametric assumptions behind the chosen score (e.g., linear-Gaussian relationships for the BIC score), along with i.i.d. sampling.
GES is statistically consistent under these assumptions, meaning it converges to the true MEC as the sample size increases. However, it performs a greedy search, which means it can get stuck in local optima. Its performance is sensitive to the choice of score function and potential violations of the underlying assumptions, particularly causal sufficiency.
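To make the two-phase greedy idea concrete, here is a deliberately simplified sketch that reuses the `local_bic_score` helper above: first-improvement hill climbing over individual DAG edges. This is not full GES, which applies insert/delete operators to equivalence classes, but it shows the forward/backward structure and why such a search can stall in a local optimum:

```python
import itertools
import numpy as np

def greedy_two_phase_search(data):
    """Simplified forward/backward greedy edge search over DAGs.

    Uses local_bic_score (defined above) as the decomposable score.
    Real GES instead searches over Markov equivalence classes.
    """
    d = data.shape[1]
    adj = np.zeros((d, d), dtype=bool)  # adj[j, i] means edge X_j -> X_i

    def total_score():
        return sum(local_bic_score(data, i, list(np.flatnonzero(adj[:, i])))
                   for i in range(d))

    def creates_cycle(j, i):
        # Adding j -> i creates a cycle iff a directed path i ~> j exists
        stack, seen = [i], set()
        while stack:
            node = stack.pop()
            if node == j:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(np.flatnonzero(adj[node]))
        return False

    # Forward phase: keep the first edge addition that improves the score
    improved = True
    while improved:
        improved = False
        current = total_score()
        for j, i in itertools.permutations(range(d), 2):
            if not adj[j, i] and not creates_cycle(j, i):
                adj[j, i] = True
                if total_score() > current:
                    improved = True
                    break          # accept this addition, rescan
                adj[j, i] = False  # undo: no improvement

    # Backward phase: keep the first edge deletion that improves the score
    improved = True
    while improved:
        improved = False
        current = total_score()
        for j, i in zip(*np.where(adj)):
            adj[j, i] = False
            if total_score() > current:
                improved = True
                break              # accept this deletion, rescan
            adj[j, i] = True       # undo: no improvement

    return adj
```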
LiNGAM offers a distinct approach that, under specific assumptions, can identify the exact DAG structure, not just the MEC. It achieves this by leveraging information contained in the shape of the data distribution, specifically requiring non-Gaussianity.
The core assumption of LiNGAM is that the data generating process is linear and acyclic, with non-Gaussian error terms (innovations):
$$X_i = \sum_{X_j \in \mathrm{PA}_i} \beta_{ij} X_j + \epsilon_i$$

Here, $X_i$ is the $i$-th variable, $\mathrm{PA}_i$ represents the set of its direct causal parents in the DAG, $\beta_{ij}$ are the linear causal coefficients representing the strength of the direct effect from $X_j$ to $X_i$, and $\epsilon_i$ are independent, non-Gaussian error terms with zero mean and non-zero variance. Acyclicity means that if we order the variables according to the causal structure (e.g., $X_{k_1}, X_{k_2}, \dots, X_{k_d}$, where $k_1, \dots, k_d$ is a permutation of $1, \dots, d$), the coefficient matrix $B$ (where $B_{ij} = \beta_{ij}$ if $X_j \in \mathrm{PA}_i$ and $0$ otherwise) can be made strictly lower triangular by applying this permutation.
This model can be written in matrix form:
$$X = BX + \epsilon$$

which implies:

$$X = (I - B)^{-1}\epsilon$$

where $I$ is the identity matrix.
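To make the model concrete, here is a minimal sketch that generates samples from a hypothetical three-variable LiNGAM ($X_1 \to X_2 \to X_3$, with made-up coefficients and uniform errors) via the matrix form above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 3

# Hypothetical ground truth: X1 -> X2 -> X3, so B is strictly lower triangular
B = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.5, 0.0]])

# Independent, zero-mean, non-Gaussian errors (uniform here)
eps = rng.uniform(-1.0, 1.0, size=(n, d))

# X = (I - B)^{-1} eps; rows are i.i.d. samples
X = eps @ np.linalg.inv(np.eye(d) - B).T
```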
The insight behind LiNGAM comes from Independent Component Analysis (ICA). ICA is a technique used to separate multivariate signals into additive, independent, non-Gaussian source signals. In the LiNGAM context, the observed variables $X$ are linear mixtures of the independent, non-Gaussian error terms $\epsilon$, and the matrix $A = (I - B)^{-1}$ acts as the mixing matrix.
ICA algorithms aim to find an unmixing matrix $W$ such that $WX$ recovers the original independent sources (up to permutation and scaling). That is, $W \approx A^{-1} = (I - B)$.
LiNGAM leverages the fact that if the correct unmixing matrix $W$ is found, it holds a specific relationship to the causal structure matrix $B$: $W = PD(I - B)$, where $P$ is a permutation matrix and $D$ is a diagonal scaling matrix. Critically, one can find a permutation of the rows of $W$ such that the permuted matrix $W'$ has zeros corresponding to the zero entries in $(I - B)$. Since $B$ is strictly lower triangular in the correct causal order, $(I - B)$ is lower triangular with ones on the diagonal, so $W'$ can likewise be permuted into lower-triangular form.
The LiNGAM algorithm (in its basic, ICA-based form) typically involves these steps; a sketch follows the list:

- Apply ICA to the observed data $X$ to obtain an estimated unmixing matrix $W$.
- Find the permutation of the rows of $W$ that yields a matrix with no zeros on the diagonal (in practice, the permutation whose diagonal entries have the largest magnitudes).
- Divide each row of the permuted matrix by its diagonal entry, producing $W'$ with ones on the diagonal.
- Compute the estimated coefficient matrix $\hat{B} = I - W'$.
- Find the causal order as the permutation of variables that renders $\hat{B}$ as close as possible to strictly lower triangular, then prune coefficients that are not significantly different from zero.
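A minimal sketch of these steps, assuming scikit-learn's `FastICA` and SciPy's `linear_sum_assignment` (both real functions); it omits the final causal-order search and coefficient pruning that a full implementation performs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import FastICA

def ica_lingam(X):
    """Estimate the LiNGAM coefficient matrix B from data X (n samples x d vars)."""
    d = X.shape[1]
    ica = FastICA(n_components=d, random_state=0, max_iter=2000)
    ica.fit(X)
    W = ica.components_  # estimated unmixing matrix, approximately P D (I - B)

    # Permute rows so the diagonal has large magnitudes (no zeros):
    # minimize the sum of 1/|W_ij| over a row-to-column assignment.
    rows, cols = linear_sum_assignment(1.0 / (np.abs(W) + 1e-12))
    W_perm = np.zeros_like(W)
    W_perm[cols] = W[rows]

    # Rescale each row by its diagonal entry so the diagonal becomes ones.
    W_norm = W_perm / np.diag(W_perm)[:, None]

    # Recover the coefficient matrix: W' = I - B  =>  B = I - W'.
    B_est = np.eye(d) - W_norm
    # A full implementation would now search for the variable ordering that
    # makes B_est approximately strictly lower triangular and prune small entries.
    return B_est
```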
LiNGAM's main strengths are its ability to identify the full DAG (including edge directions) and to estimate causal strengths under its assumptions:

- Linearity: each variable is a linear function of its direct causes plus an error term.
- Acyclicity: the true causal structure is a DAG.
- Non-Gaussianity: the error terms $\epsilon_i$ are non-Gaussian (with non-zero variance).
- Independence of errors: the $\epsilon_i$ are mutually independent, which amounts to assuming causal sufficiency (no hidden confounders).
If these assumptions hold, LiNGAM provides a powerful tool for discovery. Various extensions exist to relax some assumptions, such as handling time series data (VAR-LiNGAM) or improving computational efficiency (DirectLiNGAM).
The strict linearity and non-Gaussianity requirements are LiNGAM's primary limitations. It fundamentally fails when the error terms are Gaussian, because ICA cannot uniquely determine the rotation in that case; identifiability requires that at most one error term be Gaussian. Performance also depends on the quality of the ICA estimation and the permutation search. Like GES, it assumes causal sufficiency.
A conceptual illustration of the LiNGAM data-generating process: independent non-Gaussian sources ($\epsilon_i$) influence observed variables ($X_i$) through a linear acyclic structure defined by coefficients ($\beta_{ij}$). LiNGAM aims to recover this structure from the observed $X_i$ alone.
If your data is strongly believed to be generated by a linear process and exhibits non-Gaussian characteristics, LiNGAM is a compelling option for potentially recovering the full DAG structure. If linearity is questionable or data appears Gaussian, GES (with an appropriate score) might be more suitable, acknowledging that it will only identify the MEC.
Libraries like `causal-learn` provide implementations of GES and related algorithms, often allowing specification of different score functions (e.g., BIC for Gaussian data). The `lingam` Python package offers various implementations of LiNGAM and its extensions (DirectLiNGAM, VAR-LiNGAM, etc.). A brief usage sketch for both follows.
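As a sketch of typical usage, assuming the published interfaces of both packages (the `ges` entry point and `Record['G']` return value in causal-learn, and `DirectLiNGAM` with its `causal_order_` and `adjacency_matrix_` attributes in lingam; the file name `data.csv` is a placeholder):

```python
import numpy as np

# GES via causal-learn
from causallearn.search.ScoreBased.GES import ges

X = np.loadtxt("data.csv", delimiter=",")  # rows = samples, columns = variables

record = ges(X, score_func="local_score_BIC")  # BIC score for continuous data
cpdag = record["G"]  # estimated CPDAG representing the Markov equivalence class
print(cpdag)

# DirectLiNGAM via the lingam package
import lingam

model = lingam.DirectLiNGAM()
model.fit(X)
print(model.causal_order_)      # estimated causal ordering of the variables
print(model.adjacency_matrix_)  # estimated coefficient matrix (beta_ij)
```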
When applying these methods, remember their sensitivity to assumptions. Assess data distributions (e.g., using normality tests) before choosing LiNGAM. For GES, consider the implications of the chosen score function and potential sensitivity to local optima. Validating discovered structures using domain knowledge or interventional data (if available) is always advisable. Score-based methods, while optimizing a global criterion, can still be influenced by data quality and the specific heuristic search strategy employed.
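For example, a simple per-variable normality check using SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest` is a real function; note that LiNGAM's assumption concerns the error terms, so non-Gaussian observed marginals are a useful hint rather than a guarantee):

```python
import numpy as np
from scipy import stats

def check_non_gaussianity(X, alpha=0.05):
    """Test each column of X for normality; p < alpha rejects Gaussianity."""
    for i in range(X.shape[1]):
        _, p = stats.normaltest(X[:, i])  # D'Agostino-Pearson K^2 test
        verdict = "non-Gaussian" if p < alpha else "consistent with Gaussian"
        print(f"X{i}: p = {p:.4f} -> {verdict}")
```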