Estimating the Average Treatment Effect (ATE), E[Y(1)−Y(0)], is a fundamental goal in causal inference. However, achieving reliable estimates in the presence of high-dimensional confounders X presents significant statistical challenges. Simply including all potential confounders in a standard regression model, like Y∼θT+f(X), often fails. High dimensionality can lead to overfitting, unstable estimates, and importantly, regularization bias. When using methods like Lasso to select variables or shrink coefficients in f(X), the estimated treatment effect θ^ can become biased, even if the model for f(X) is correctly specified. Furthermore, specifying the functional form f(X) correctly is difficult when X is high-dimensional.
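To see this concretely, the toy simulation below (the data-generating process, dimensions, and learner settings are illustrative assumptions, not taken from any particular study) fits a single Lasso over the treatment and all confounders; the penalty shrinks the treatment coefficient along with everything else.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Toy simulation: a handful of the many candidate confounders drive both T and Y.
rng = np.random.default_rng(0)
n, p, theta_true = 500, 200, 1.0
X = rng.normal(size=(n, p))
confound = X[:, :5].sum(axis=1)
T = confound + rng.normal(size=n)                  # treatment depends on confounders
Y = theta_true * T + confound + rng.normal(size=n)

# Naive approach: one Lasso over [T, X]. The L1 penalty shrinks every coefficient,
# including the treatment coefficient, and distorts the confounder adjustment;
# this is the mechanism behind regularization bias.
naive = LassoCV(cv=5, random_state=0).fit(np.column_stack([T, X]), Y)
print(f"naive Lasso estimate: {naive.coef_[0]:.3f} (true effect: {theta_true})")
```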
Double Machine Learning (DML) provides a powerful framework to address these issues, enabling the use of flexible machine learning methods for confounder adjustment without introducing substantial bias into the final ATE estimate. The core idea relies on two pillars: Neyman Orthogonality and Cross-Fitting.
The intuition behind DML comes from the concept of orthogonalization, reminiscent of the Frisch-Waugh-Lovell theorem in econometrics. Instead of directly modeling Y as a function of T and X, DML focuses on modeling the parts of Y and T that are not explained by the confounders X.
Consider a simplified setting, the Partially Linear Model (PLM):
Y = θ0 T + g0(X) + ϵ,   E[ϵ ∣ X, T] = 0
T = m0(X) + ν,   E[ν ∣ X] = 0

Here, Y is the outcome, T is the treatment (often binary, but DML handles continuous treatments too), X represents the high-dimensional confounders, and θ0 is the target causal parameter (the ATE under the usual identification assumptions). The function g0(X) captures the effect of the confounders on the outcome (for binary T, g0(X) = E[Y ∣ X, T = 0]), and m0(X) = E[T ∣ X] is the propensity score for binary T or, more generally, the conditional expectation of T given X. The errors ϵ and ν are mean-zero given their respective conditioning variables.
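To make the setup concrete, the following sketch simulates data from a PLM with nonlinear nuisance functions; the specific g0, m0, dimensions, and θ0 = 0.5 are illustrative choices, and later sketches reuse the same data-generating process.

```python
import numpy as np

def simulate_plm(n=2000, p=20, theta0=0.5, seed=0):
    """Draw (X, T, Y) from an illustrative partially linear model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    m0 = np.sin(X[:, 0]) + 0.5 * X[:, 1]       # E[T | X]
    g0 = np.cos(X[:, 0]) + X[:, 1] ** 2        # direct effect of confounders on Y
    T = m0 + rng.normal(size=n)                # continuous treatment
    Y = theta0 * T + g0 + rng.normal(size=n)
    return X, T, Y

X, T, Y = simulate_plm()
```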
The naive approach of estimating θ0 by regressing Y on T and some estimate g^(X) suffers if g^ is poorly estimated or if regularization is used. DML circumvents this by estimating θ0 using residuals.
Define the residuals:

Y~ = Y − ℓ^0(X), where ℓ^0(X) is an estimate of ℓ0(X) = E[Y ∣ X] (in the PLM, ℓ0(X) = θ0 m0(X) + g0(X)), and
T~ = T − m^0(X), where m^0(X) is an estimate of m0(X) = E[T ∣ X].
Substituting the model equations (and using the condition E[ϵ ∣ X, T] = 0), we can show that:

Y~ ≈ θ0 T~ + ϵ′,

where ϵ′ is an error term approximately orthogonal to T~; the relation is exact at the true nuisance functions and approximate when they are estimated. This suggests that we can estimate θ0 by regressing the outcome residuals Y~ on the treatment residuals T~.
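Here is a minimal sketch of this residual-on-residual regression on the simulated PLM data, using random forests for the two conditional expectations (learners and hyperparameters are illustrative). It deliberately fits and predicts on the same sample, which is exactly the overfitting problem that cross-fitting addresses below.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same illustrative DGP as in the PLM sketch above (true theta0 = 0.5).
rng = np.random.default_rng(0)
n, p, theta0 = 2000, 20, 0.5
X = rng.normal(size=(n, p))
T = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
Y = theta0 * T + np.cos(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

# Estimate the two conditional expectations and residualize.
# NOTE: predicting on the training sample invites overfitting bias; cross-fitting is the fix.
y_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y).predict(X)
t_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, T).predict(X)
Y_res, T_res = Y - y_hat, T - t_hat

# No-intercept least squares of the outcome residuals on the treatment residuals.
theta_hat = np.sum(T_res * Y_res) / np.sum(T_res ** 2)
print(f"residual-on-residual estimate: {theta_hat:.3f} (true theta0 = {theta0})")
```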
Why does this residualization help? The key property is Neyman orthogonality. An estimation problem is Neyman-orthogonal if the first-order effect of small errors in the estimated nuisance functions (here, ℓ0(X) and m0(X)) on the target parameter estimate θ^0 is zero. The final regression step in DML (Y~ on T~) possesses this property.
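To make this concrete, here is a brief, informal check (a sketch rather than a full proof) of the "partialling-out" score that the residual regression solves. Perturbing each nuisance function by a small amount r·h(X) and differentiating the expected score at the true values gives zero in both cases:

$$\psi(W; \theta, \ell, m) = \bigl(Y - \ell(X) - \theta\,(T - m(X))\bigr)\,\bigl(T - m(X)\bigr), \qquad W = (Y, T, X),$$

$$\frac{\partial}{\partial r}\, E\bigl[\psi(W; \theta_0, \ell_0 + r h, m_0)\bigr]\Big|_{r=0} = -\,E\bigl[h(X)\,\nu\bigr] = 0,$$

$$\frac{\partial}{\partial r}\, E\bigl[\psi(W; \theta_0, \ell_0, m_0 + r h)\bigr]\Big|_{r=0} = E\bigl[\theta_0\, h(X)\,\nu - \epsilon\, h(X)\bigr] = 0,$$

because Y − ℓ0(X) = θ0 ν + ϵ, E[ν ∣ X] = 0, and E[ϵ ∣ X, T] = 0. By contrast, the naive plug-in regression of Y on T and g^(X) lacks this property, so first-order errors in g^ feed directly into θ^.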
This means we can use sophisticated, potentially complex machine learning models to estimate the nuisance functions ℓ^0(X) ≈ E[Y ∣ X] and m^0(X) ≈ E[T ∣ X] without their inherent biases (e.g., regularization bias) directly contaminating the estimate of θ0. As long as our ML estimates of ℓ0 and m0 converge reasonably fast (typically faster than n^(−1/4) in terms of root mean squared error), the final estimate θ^0 will be consistent and approximately normally distributed under standard conditions, allowing for valid statistical inference (confidence intervals, hypothesis tests).
A critical component of DML is cross-fitting (or sample splitting). If we use the same data sample both to estimate the nuisance functions (ℓ^0, m^0) and to compute the residuals and estimate θ0, we introduce bias due to overfitting: the nuisance models may inadvertently fit noise in the data that is correlated with the treatment residual, biasing the final estimate.
Cross-fitting breaks this dependency:

1. Split the sample into K folds (for example, K = 5).
2. For each fold k, fit the nuisance models ℓ^0 and m^0 using only the data in the other K − 1 folds.
3. Use those models to compute the residuals Y~i and T~i for the observations in fold k, so every residual comes from models that never saw that observation.
4. Pool the residuals across all folds and regress Y~ on T~ to obtain θ^0.
Or more generally, solve the orthogonal moment condition (1/n) ∑ (Y~i − θ^0 T~i) T~i = 0, where the sum runs over all n observations.
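Putting the pieces together, the sketch below implements cross-fitted DML for the partially linear model with scikit-learn's out-of-fold predictions; the data-generating process, learners, K = 5, and the plug-in standard error are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Same illustrative DGP as before (true theta0 = 0.5).
rng = np.random.default_rng(0)
n, p, theta0 = 2000, 20, 0.5
X = rng.normal(size=(n, p))
T = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
Y = theta0 * T + np.cos(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

# Cross-fitting: every prediction comes from a model fit on the other folds.
K = 5
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, Y, cv=K)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, T, cv=K)
Y_res, T_res = Y - y_hat, T - t_hat

# Solve (1/n) * sum((Y_res - theta * T_res) * T_res) = 0 for theta.
theta_hat = np.sum(T_res * Y_res) / np.sum(T_res ** 2)

# Plug-in standard error based on the orthogonal score.
psi = (Y_res - theta_hat * T_res) * T_res
se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(T_res ** 2)
print(f"cross-fitted DML estimate: {theta_hat:.3f} +/- {1.96 * se:.3f} (true theta0 = {theta0})")
```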
Diagram illustrating the K-fold cross-fitting procedure in Double Machine Learning. Nuisance models are trained on out-of-fold data and used to predict residuals for the current fold, ensuring predictions for each observation come from a model it wasn't trained on.
Libraries such as EconML and DoubleML implement various DML estimators for different model structures.
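For example, a DoubleML version of the analysis might look like the sketch below; the argument names ml_l and ml_m (learners for E[Y ∣ X] and E[T ∣ X]) follow recent DoubleML releases and may differ in older versions, and the learners and fold count are again illustrative.

```python
import numpy as np
import pandas as pd
import doubleml as dml
from sklearn.ensemble import RandomForestRegressor

# Illustrative PLM data (true theta0 = 0.5), packaged as a DataFrame.
rng = np.random.default_rng(0)
n, p, theta0 = 2000, 20, 0.5
X = rng.normal(size=(n, p))
T = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
Y = theta0 * T + np.cos(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)
df = pd.DataFrame(np.column_stack([Y, T, X]),
                  columns=["y", "d"] + [f"x{j}" for j in range(p)])

# Partially linear regression with cross-fitted random-forest nuisances.
data = dml.DoubleMLData(df, y_col="y", d_cols="d")   # remaining columns are covariates
plr = dml.DoubleMLPLR(data,
                      ml_l=RandomForestRegressor(n_estimators=200),  # learner for E[Y | X]
                      ml_m=RandomForestRegressor(n_estimators=200),  # learner for E[T | X]
                      n_folds=5)
plr.fit()
print(plr.summary)  # coefficient, standard error, confidence interval
```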
In summary, Double Machine Learning provides a principled and practical approach to estimating average treatment effects in high-dimensional settings. By combining the predictive power of machine learning with the theoretical guarantees of Neyman orthogonality and the bias reduction from cross-fitting, DML allows researchers and practitioners to obtain more reliable causal estimates from complex observational data. This technique forms a foundation for effect estimation before we move on to exploring how effects vary across individuals using methods designed for Conditional Average Treatment Effects (CATE).