Selection bias presents a distinct challenge compared to confounding bias, although both stem from issues related to unobserved factors. While confounding involves variables affecting both treatment and outcome, selection bias arises when the probability of being included in the analysis sample (S=1) is itself dependent on the treatment (T), the outcome (Y), or variables associated with both, often involving unobserved factors U. This means the analyzed sample is systematically different from the target population for which we want to draw causal conclusions. Standard estimators applied to such selected samples, even after adjusting for observed covariates X, can yield biased estimates of causal effects like the Average Treatment Effect (ATE) or Conditional Average Treatment Effect (CATE).
Consider estimating the effectiveness of a new online learning platform (T) on student test scores (Y). If only students who feel they benefited significantly (implying higher potential Y) choose to report their scores, the sample is selected based on factors related to the outcome. Similarly, estimating the return on investment for a startup accelerator program (T) based on the revenue (Y) of participating companies might suffer selection bias if only successful companies (S=1 depends on high Y) agree to share data.
The foundational work on addressing selection bias comes from econometrics, specifically the Heckman two-step correction procedure (often called Heckit). The core idea is to explicitly model the selection process alongside the outcome process.
Assume there is an underlying latent variable $S_i^*$ determining selection: $S_i^* = Z_i \gamma + \nu_i$. An individual $i$ is selected into the sample ($S_i = 1$) if $S_i^* > 0$. Here, $Z_i$ represents variables influencing selection, $\gamma$ is a vector of coefficients, and $\nu_i$ is an error term.
The outcome model for the entire population (selected or not) is: $Y_i = X_i \beta + T_i \alpha + \epsilon_i$, where $X_i$ are covariates, $T_i$ is the treatment, $\beta$ and $\alpha$ are parameters (with $\alpha$ being the causal effect of interest), and $\epsilon_i$ is the error term.
Selection bias occurs if the error terms of the selection and outcome models, $\nu_i$ and $\epsilon_i$, are correlated, typically because both are influenced by common unobserved factors U. If we naively estimate the outcome model using only the selected sample ($S_i = 1$), the expected value of the error term is non-zero: $E[\epsilon_i \mid S_i = 1] = E[\epsilon_i \mid Z_i \gamma + \nu_i > 0] \neq 0$.
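To make this concrete, the following small simulation (a sketch with arbitrary parameter choices) draws correlated error terms through a shared unobserved factor and shows that the mean of the outcome error is clearly non-zero within the selected subsample, even though it is approximately zero in the full population:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A common unobserved factor U drives both error terms, inducing corr(nu, eps) > 0
u = rng.normal(size=n)
nu = 0.8 * u + rng.normal(scale=0.6, size=n)    # selection error
eps = 0.8 * u + rng.normal(scale=0.6, size=n)   # outcome error
z = rng.normal(size=n)                          # observed selection variable
gamma = 1.0

selected = z * gamma + nu > 0                   # S = 1 when the latent index is positive

print(f"E[eps] in full population: {eps.mean():.3f}")           # approximately 0
print(f"E[eps | S=1] in selected sample: {eps[selected].mean():.3f}")  # clearly positive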
Under the assumption that $\nu_i$ and $\epsilon_i$ follow a bivariate normal distribution with correlation $\rho$, Heckman showed that this conditional expectation can be expressed as: $E[\epsilon_i \mid S_i = 1] = \rho \sigma_\epsilon \lambda(Z_i \gamma)$, where $\sigma_\epsilon$ is the standard deviation of $\epsilon_i$ and $\lambda(\cdot)$ is the Inverse Mills Ratio (IMR): $\lambda(c) = \dfrac{\phi(c)}{\Phi(c)}$. Here, $\phi$ is the standard normal probability density function (PDF), and $\Phi$ is the standard normal cumulative distribution function (CDF).
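The IMR is simple to compute numerically. Here is a minimal sketch using scipy (the helper name inverse_mills_ratio is just illustrative):

import numpy as np
from scipy.stats import norm

def inverse_mills_ratio(c):
    """lambda(c) = phi(c) / Phi(c) for the standard normal distribution."""
    c = np.asarray(c, dtype=float)
    return norm.pdf(c) / norm.cdf(c)

# IMR evaluated at a few values of the selection index Z*gamma:
# it is large when selection is unlikely (strong correction needed)
# and shrinks toward zero when selection is almost certain.
print(inverse_mills_ratio([-2.0, 0.0, 2.0]))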
The Heckman two-step procedure involves:
1. Estimating the selection equation with a Probit model on the full sample (selected and non-selected units) to obtain $\hat{\gamma}$, then computing the Inverse Mills Ratio $\hat{\lambda}_i = \lambda(Z_i \hat{\gamma})$ for each selected unit.
2. Estimating the outcome equation by OLS on the selected sample only, with $\hat{\lambda}_i$ included as an additional regressor. The coefficient on $\hat{\lambda}_i$ absorbs the selection term $\rho \sigma_\epsilon \lambda(Z_i \gamma)$, so the estimate of $\alpha$ is consistent for the treatment effect.
A crucial requirement for identification in the classical Heckman model is the exclusion restriction: there must be at least one variable in Z that influences selection (S) but has no direct effect on the outcome Y (i.e., it is not included in X); it can affect Y only through its effect on selection. This variable acts like an instrument for the selection process.
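As a reference point before moving to ML-based variants, here is a minimal sketch of the classical two-step procedure using statsmodels and scipy, assuming the same hypothetical dataframes (df_full with all units, df_selected with selected units only) and column names used in the code example later in this section:

import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

# Step 1: Probit selection model fit on all units (selected and not)
Z_cols = ['z1', 'z2', 'instrument']
Z_all = sm.add_constant(df_full[Z_cols])
probit_res = sm.Probit(df_full['selected'], Z_all).fit(disp=0)

# Inverse Mills Ratio evaluated at the estimated index Z*gamma_hat for the selected units
idx = sm.add_constant(df_selected[Z_cols]) @ probit_res.params
imr = pd.Series(norm.pdf(idx) / norm.cdf(idx), index=df_selected.index, name='imr')

# Step 2: OLS outcome model on the selected sample, with the IMR as an extra regressor
# (the instrument is excluded from the outcome equation)
X_out = sm.add_constant(pd.concat([df_selected[['x1', 'x2', 'treatment']], imr], axis=1))
ols_res = sm.OLS(df_selected['outcome'], X_out).fit()
print(ols_res.params['treatment'])  # alpha_hat, the estimated treatment effect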
While conceptually elegant, the classical Heckman procedure relies on strong assumptions that are often difficult to justify in complex machine learning contexts: joint (bivariate) normality of the error terms $\nu_i$ and $\epsilon_i$, correctly specified linear functional forms for both the selection and outcome equations, and a credible exclusion restriction. Violations of any of these can leave the corrected estimates substantially biased.
Modern ML applications often involve high-dimensional data, non-linear relationships, and complex interactions, demanding more flexible approaches. Adapting selection bias correction involves replacing the rigid parametric models with ML algorithms while retaining the core idea of modeling selection and correcting the outcome prediction.
Instead of a Probit model, we can use flexible binary classifiers to model the selection probability $P(S=1 \mid Z)$. Suitable models include gradient boosting machines, random forests, and neural networks, or any other probabilistic classifier that produces well-calibrated selection probabilities.
The input features Z would include all observed variables believed to influence sample inclusion. The output of this model gives us an estimated probability of selection, $\hat{p}(Z_i) = \hat{P}(S=1 \mid Z_i)$.
With a selection model based on ML, we can adapt the correction step:
Using Inverse Probability of Selection Weighting (IPSW): If selection depends only on observed pre-treatment covariates Z (which might include X and, if selection happens after treatment assignment, the treatment T), one option is to weight each selected unit by the inverse of its estimated selection probability, $w_i = 1 / \hat{p}(Z_i)$, and train the outcome model on the selected sample with these weights. However, this approach requires the selection mechanism to be independent of the outcome Y given Z ("selection on observables"), a strong assumption that is often violated in practice (e.g., when people self-select into the study based on their perceived outcome). It also requires positivity: $P(S=1 \mid Z) > 0$ for all relevant Z. A minimal weighting sketch follows.
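This sketch assumes hypothetical dataframes and column names analogous to those in the control function example below; it illustrates the weighting mechanics only and does not relax the selection-on-observables assumption. The clipping threshold is an arbitrary choice to limit extreme weights.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Z_features = ['z1', 'z2']          # observed covariates assumed to drive selection
X_features = ['x1', 'x2']
T_feature = 'treatment'

# Selection model fit on all units (selected and not)
selector = RandomForestClassifier(n_estimators=100, random_state=42)
selector.fit(df_full[Z_features], df_full['selected'])

# Inverse probability of selection weights for the selected sample
p_sel = selector.predict_proba(df_selected[Z_features])[:, 1]
p_sel = np.clip(p_sel, 0.01, 1.0)  # truncate tiny probabilities to limit extreme weights
weights = 1.0 / p_sel

# Weighted outcome model trained on the selected sample only
ipsw_model = RandomForestRegressor(n_estimators=100, random_state=42)
ipsw_model.fit(df_selected[X_features + [T_feature]],
               df_selected['outcome'],
               sample_weight=weights)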
Control Function Approach (Generalized IMR): This is closer in spirit to the original Heckman idea. We need a way to capture the conditional expectation of the outcome error given selection, $E[\epsilon_i \mid S_i = 1, Z_i]$. Without the normality assumption, the IMR derived earlier does not apply directly. However, we can use functions of the estimated selection probability $\hat{p}(Z_i)$, or the classifier's raw score, as a "control function" included as an additional feature in the outcome model.
Let's outline the steps using hypothetical Python library calls for the control function approach:
# Assume:
# df_full contains features Z for selection modeling (all units)
# df_selected contains features X, treatment T, outcome Y, and Z (selected units only)
# 'selected' column in df_full indicates S=1 or S=0
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import numpy as np
import pandas as pd
# 1. Train Selection Model
# Features for selection model
Z_features = ['z1', 'z2', 'instrument'] # 'instrument' ideally satisfies exclusion restriction
# Target for selection model
selection_target = 'selected'
selector_model = RandomForestClassifier(n_estimators=100, random_state=42)
selector_model.fit(df_full[Z_features], df_full[selection_target])
# 2. Compute Control Function for Selected Sample
# Predict selection probabilities for the selected units
selection_probs_selected = selector_model.predict_proba(df_selected[Z_features])[:, 1]
# Define a control function (e.g., the logit of the selection probability)
# Clip probabilities away from 0 and 1 to avoid log(0) and division by zero
epsilon = 1e-6
probs_clipped = np.clip(selection_probs_selected, epsilon, 1 - epsilon)
control_function = np.log(probs_clipped / (1 - probs_clipped))
# Alternative: control_function = probs_clipped
df_selected['control_func'] = control_function
# 3. Train Outcome Model with Control Function
# Features for outcome model (excluding the instrument from Z if used)
X_features = ['x1', 'x2']
T_feature = 'treatment'
outcome_target = 'outcome'
# Combine features for the outcome model
outcome_features = X_features + [T_feature, 'control_func']
outcome_model = RandomForestRegressor(n_estimators=100, random_state=123)
outcome_model.fit(df_selected[outcome_features], df_selected[outcome_target])
# 4. Estimate Causal Effect (e.g., ATE)
# Create counterfactual dataframes for prediction
df_cf_treated = df_selected[outcome_features].copy()
df_cf_treated[T_feature] = 1
df_cf_control = df_selected[outcome_features].copy()
df_cf_control[T_feature] = 0
# Predict potential outcomes
y_hat_treated = outcome_model.predict(df_cf_treated)
y_hat_control = outcome_model.predict(df_cf_control)
# Estimate ATE on the selected sample
ate_estimate = np.mean(y_hat_treated - y_hat_control)
print(f"Estimated ATE (adjusted for selection bias): {ate_estimate:.4f}")
This code sketch illustrates the workflow. Real-world implementation requires careful feature engineering, hyperparameter tuning, cross-validation, and robust standard error estimation (often requiring bootstrapping due to the two-stage nature).
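Because the control function is itself estimated, naive standard errors from the second stage understate uncertainty. Below is a minimal bootstrap sketch continuing from the variables defined above; it assumes, for simplicity, that df_full also carries the X, treatment, and outcome columns for the selected rows so that a single resample of units can feed both stages (the helper two_stage_ate is hypothetical):

def two_stage_ate(full_b):
    """Refit both stages on one bootstrap resample and return the ATE estimate."""
    sel_b = full_b[full_b['selected'] == 1].copy()

    # Stage 1: refit the selection model and recompute the control function
    sel_model = RandomForestClassifier(n_estimators=100, random_state=0)
    sel_model.fit(full_b[Z_features], full_b['selected'])
    p = np.clip(sel_model.predict_proba(sel_b[Z_features])[:, 1], 1e-6, 1 - 1e-6)
    sel_b['control_func'] = np.log(p / (1 - p))

    # Stage 2: refit the outcome model and compute the ATE on the resample
    out_model = RandomForestRegressor(n_estimators=100, random_state=0)
    out_model.fit(sel_b[outcome_features], sel_b[outcome_target])
    cf_treated = sel_b[outcome_features].copy()
    cf_treated[T_feature] = 1
    cf_control = sel_b[outcome_features].copy()
    cf_control[T_feature] = 0
    return np.mean(out_model.predict(cf_treated) - out_model.predict(cf_control))

# Resample units with replacement, refit the whole pipeline each time,
# and use the spread of the resulting ATE estimates as a standard error
n_boot = 200
ate_draws = [
    two_stage_ate(df_full.sample(frac=1.0, replace=True, random_state=b))
    for b in range(n_boot)
]
print(f"Bootstrap standard error of the ATE: {np.std(ate_draws, ddof=1):.4f}")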
Adapting selection bias correction methods using ML offers more flexibility than classical approaches but inherits the fundamental challenge of relying on strong, often untestable, assumptions about the selection process and the role of unobserved variables. The requirement for an exclusion restriction is particularly difficult to meet convincingly in many ML settings.
Therefore, these methods should be applied cautiously and always accompanied by rigorous sensitivity analyses. When possible, alternative strategies discussed in this chapter, such as Instrumental Variables (if a valid instrument exists), Regression Discontinuity Designs (if a sharp assignment threshold exists), Difference-in-Differences (if panel data and parallel trends assumptions hold), or Proximal Causal Inference (if suitable proxy variables are available), may offer more credible routes to identification. These designs exploit different data structures and assumptions and avoid explicitly modeling the selection mechanism. The choice of method depends heavily on the specific problem context, data availability, and the plausibility of the underlying assumptions.