Traditional machine learning workflows often treat feature engineering and selection primarily as optimization problems aimed at maximizing predictive accuracy on a specific dataset. Features are selected or engineered based on their statistical association with the target variable, using techniques like correlation analysis, mutual information, recursive feature elimination, or importance scores derived from predictive models (e.g., SHAP values, permutation importance). While effective for prediction under stable conditions, this approach can falter when the goal extends to understanding underlying mechanisms, predicting the effect of interventions, or ensuring robustness across changing environments. Incorporating causal principles provides a more structured and theoretically grounded approach to managing features.
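As a point of reference, the sketch below shows this purely predictive style of selection on synthetic data, ranking features by mutual information and running recursive feature elimination with scikit-learn. The dataset and the number of features kept are illustrative choices, not recommendations.

```python
# A minimal sketch of conventional, purely predictive feature selection.
# The data is synthetic; in practice X and y come from your dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression, RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Rank features by their mutual information with the target.
mi = mutual_info_regression(X, y, random_state=0)
print("Mutual information per feature:", np.round(mi, 3))

# Recursive feature elimination keeps the k features a linear model finds most useful.
rfe = RFE(LinearRegression(), n_features_to_select=4).fit(X, y)
print("Features kept by RFE:", np.where(rfe.support_)[0])
```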
A causal graph, typically a Directed Acyclic Graph (DAG), serves as an invaluable conceptual blueprint, even if it's only partially known or hypothesized based on domain expertise. It encodes assumptions about the data-generating process, allowing us to reason about the role of each variable and the potential biases introduced by including or excluding them.
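A hypothesized DAG can be written down explicitly, for example with networkx. The variable names and edges below are illustrative domain assumptions, not structure learned from data.

```python
# A minimal sketch of encoding a hypothesized DAG with networkx.
# Edges encode assumed cause-effect relationships, written down by hand.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("age", "income"),        # age is assumed to affect income...
    ("age", "purchase"),      # ...and the outcome, making it a potential confounder
    ("income", "purchase"),   # the feature of interest
    ("purchase", "review"),   # a downstream effect of the outcome
])

assert nx.is_directed_acyclic_graph(dag)
print("Parents of purchase:", list(dag.predecessors("purchase")))
print("Descendants of purchase:", nx.descendants(dag, "purchase"))
```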
Confounders are common causes of both a feature (or treatment) X and the outcome Y. In a causal graph, this appears as X←Z→Y. Failing to account for confounders leads to biased estimates of the relationship between X and Y. Standard predictive models often implicitly capture confounding effects, which helps predictive accuracy on similar data but obscures the true causal link. For causal understanding or intervention planning, identifying potential confounders using the DAG and including them in the model's feature set is essential for adjustment (e.g., via the backdoor criterion).
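A small simulation makes the point concrete. In the sketch below the true effect of X on Y is fixed at 2.0; the naive regression that omits the confounder Z overstates it, while adjusting for Z recovers it. All coefficients are illustrative.

```python
# A minimal simulation of confounding (X <- Z -> Y), contrasting naive and adjusted estimates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
Z = rng.normal(size=n)                      # confounder
X = 1.5 * Z + rng.normal(size=n)            # feature influenced by Z
Y = 2.0 * X + 3.0 * Z + rng.normal(size=n)  # outcome influenced by both

naive = LinearRegression().fit(X.reshape(-1, 1), Y).coef_[0]
adjusted = LinearRegression().fit(np.column_stack([X, Z]), Y).coef_[0]
print(f"naive estimate:    {naive:.2f}")     # biased upward by the open backdoor path
print(f"adjusted estimate: {adjusted:.2f}")  # close to the true effect of 2.0
```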
Colliders are variables that are common effects of two other variables. A canonical example relevant to feature selection is when a feature M is caused by both the feature of interest X and the outcome Y (X→M←Y). Conditioning on a collider (i.e., including it as a feature in a model) can induce a spurious statistical association between X and Y, even if they were originally independent. This is known as collider bias or endogenous selection bias. Causal graphs help identify potential colliders. Unless specifically modeling the process that generated the collider, such variables should generally be excluded from the feature set used for estimating the causal effect of X on Y.
Conditioning on a collider M opens a non-causal path between X and Y, potentially creating misleading associations.
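The following simulation illustrates collider bias: X and Y are generated independently, yet once the collider M is included as a feature, the regression reports a clearly nonzero coefficient for X. The numbers are illustrative.

```python
# A minimal simulation of collider bias (X -> M <- Y): X and Y are independent by
# construction, but conditioning on M induces a spurious association between them.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 10_000
X = rng.normal(size=n)
Y = rng.normal(size=n)                  # independent of X
M = X + Y + 0.5 * rng.normal(size=n)    # common effect (collider)

without_m = LinearRegression().fit(X.reshape(-1, 1), Y).coef_[0]
with_m = LinearRegression().fit(np.column_stack([X, M]), Y).coef_[0]
print(f"X coefficient without M: {without_m:.2f}")  # ~0, as it should be
print(f"X coefficient with M:    {with_m:.2f}")     # spuriously negative
```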
Mediators lie on a causal pathway between a feature X and the outcome Y (X→M→Y). Including a mediator M in a model alongside X allows estimation of the direct effect of X on Y (the effect not passing through M). Excluding the mediator allows estimation of the total effect of X on Y. The decision of whether to include a mediator depends entirely on the specific causal question being asked. A causal graph makes the role of mediators explicit.
The variable M mediates the effect of X on Y. Including M blocks the indirect path, isolating the direct effect (if any).
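The simulation below contrasts the two estimands: with the mediator M excluded, the regression recovers the total effect; with M included, it recovers only the direct effect. Coefficients are illustrative.

```python
# A minimal simulation of mediation (X -> M -> Y plus a direct X -> Y edge), showing how
# including M switches the estimand from the total effect to the direct effect.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 10_000
X = rng.normal(size=n)
M = 1.0 * X + rng.normal(size=n)             # mediator
Y = 0.5 * X + 2.0 * M + rng.normal(size=n)   # direct effect 0.5, indirect effect 1.0 * 2.0

total = LinearRegression().fit(X.reshape(-1, 1), Y).coef_[0]
direct = LinearRegression().fit(np.column_stack([X, M]), Y).coef_[0]
print(f"total effect (M excluded):  {total:.2f}")   # ~2.5
print(f"direct effect (M included): {direct:.2f}")  # ~0.5
```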
Sometimes, critical confounders U are unobserved. However, we might observe proxy variables P that are caused by U (e.g., X←U→Y and P←U). Including proxies in a model requires careful consideration. While they don't fully substitute for the unobserved confounder, they can sometimes help mitigate bias. Techniques like Proximal Causal Inference (discussed in Chapter 4) provide a formal framework for using proxies under specific structural assumptions. Simple inclusion without this framework can sometimes increase bias.
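As a rough illustration (not proximal causal inference itself), the sketch below simulates an unobserved confounder U with a noisy proxy P. Adjusting for P shrinks the bias relative to no adjustment but does not eliminate it; the structural setup and coefficients are illustrative.

```python
# A minimal simulation of an unobserved confounder U with an observed proxy P (P <- U).
# Naively adjusting for the proxy reduces, but does not remove, the confounding bias.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 50_000
U = rng.normal(size=n)                       # unobserved confounder
P = U + 0.8 * rng.normal(size=n)             # noisy proxy for U
X = 1.0 * U + rng.normal(size=n)
Y = 1.0 * X + 2.0 * U + rng.normal(size=n)   # true effect of X on Y is 1.0

naive = LinearRegression().fit(X.reshape(-1, 1), Y).coef_[0]
proxy_adj = LinearRegression().fit(np.column_stack([X, P]), Y).coef_[0]
print(f"no adjustment:       {naive:.2f}")      # strongly biased (~2.0)
print(f"adjusting for proxy: {proxy_adj:.2f}")  # bias reduced, but still above 1.0
```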
Beyond selection, causal assumptions can inspire the creation of new features. For example, if domain knowledge suggests that customers respond to the price they actually pay rather than the listed price, engineering a feature such as price_with_discount alongside actual_price makes that hypothesized mechanism explicit and available to the model, as in the sketch below.
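A minimal pandas sketch of this kind of feature creation follows. The column names actual_price and discount_rate, and the assumption that the discounted price is what drives behavior, are purely illustrative.

```python
# A minimal sketch of causally motivated feature creation with pandas.
import pandas as pd

df = pd.DataFrame({
    "actual_price": [100.0, 80.0, 120.0],
    "discount_rate": [0.10, 0.00, 0.25],
})

# Encode the hypothesized mechanism explicitly: the price the customer actually sees.
df["price_with_discount"] = df["actual_price"] * (1 - df["discount_rate"])
print(df)
```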
Standard feature selection prioritizes predictive power, often measured by metrics like gain in accuracy or reduction in prediction error. However, causally relevant features might not always be the most predictive ones in a specific dataset, and highly predictive features might be causally irrelevant or even harmful (like colliders or effects of the outcome).
Models built from features selected on causal principles tend to be more robust and to generalize better under changing conditions or interventions, because the cause-effect mechanisms they rely on are more likely to remain stable than incidental correlations. The sketch below illustrates this.
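A simple way to see this is to train on one environment and evaluate on another in which a non-causal feature's mechanism has changed. In the synthetic example below, a feature S that is an effect of Y is highly predictive during training but becomes uninformative at deployment, while the causal parent X keeps working; the environments and coefficients are illustrative.

```python
# A minimal sketch of robustness under a shifted environment: a model that leans on an
# effect of the outcome degrades when that mechanism changes, while a model using only
# the causal parent does not.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

def make_env(n, s_coef):
    X = rng.normal(size=n)
    Y = 2.0 * X + rng.normal(size=n)
    S = s_coef * Y + rng.normal(size=n)   # effect of Y; its mechanism can change
    return X.reshape(-1, 1), S.reshape(-1, 1), Y

X_tr, S_tr, y_tr = make_env(5_000, s_coef=1.0)
X_te, S_te, y_te = make_env(5_000, s_coef=0.0)   # deployment: S no longer tracks Y

causal_model = LinearRegression().fit(X_tr, y_tr)
mixed_model = LinearRegression().fit(np.hstack([X_tr, S_tr]), y_tr)

print("causal-feature MSE on shifted env:",
      round(mean_squared_error(y_te, causal_model.predict(X_te)), 2))
print("effect-of-Y-feature MSE on shifted env:",
      round(mean_squared_error(y_te, mixed_model.predict(np.hstack([X_te, S_te]))), 2))
```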
Operationalizing causal feature management involves navigating several practical challenges: the causal graph is typically only partially known or hypothesized, key confounders may be unobserved, and the appropriate feature set depends on the specific causal question being asked.
Integrating causal principles into feature engineering and selection transforms it from a purely statistical optimization task into a more reasoned process grounded in assumptions about the data-generating mechanism. This shift is essential for building ML systems that are not just predictive but also reliable, interpretable, and actionable in real-world decision-making contexts.