I am interested in machine learning, causal inference, algorithmic game theory, and related topics.

Variable Selection For Causal Inference | Causal Inference: What If, Chapter 18

This is a personal note for my own study. There might be some errors, so if you notice any, I'd be happy if you let me know.

The notation here follows Causal Inference: What if.

Here is a link to this book.

Also, the figures shown in the note come from Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

There are several methods to estimate the causal effect of a treatment A on an outcome Y. For example,

  • stratification
  • outcome regression
  • standardization
  • parametric g-formula
  • g-estimation

All of the methods above need to adjust for covariates L to achieve conditional exchangeability, though in different ways. How do we select such covariates?

Methods for selecting covariates to predict an outcome Y cannot simply be reused to select covariates for causal analyses. This chapter therefore summarizes how to select variables for adjustment.

18.1 The different goals of variable selection

Prediction tasks do not need to consider confounding, whereas causal inference tasks do. So, applying automatic variable selection procedures designed for predictive models to causal inference may introduce bias.

Here are methods for the selection of variables for predictive purposes.

Selection of variables for predictive purposes

Brute force

Try all possible combinations of the available variables and compare their predictive power using some pre-defined criterion (e.g. Akaike's information criterion). This approach becomes infeasible as the number of covariates increases.

Forward selection

Start with no variables. In each step, add the variable that leads to the greatest improvement. Stop when no variable yields further improvement.
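The forward selection procedure can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the improvement criterion is AIC from an ordinary least squares fit, the data are simulated, and the helper names (`rss_of_fit`, `aic`, `forward_selection`) are hypothetical, not taken from the book.

```python
import math
import random


def rss_of_fit(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    aug = [
        [sum(row[r] * row[c] for row in design) for c in range(k)]
        + [sum(row[r] * yi for row, yi in zip(design, y))]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(aug[r][p]))
        aug[p], aug[pivot] = aug[pivot], aug[p]
        for r in range(k):
            if r != p:
                factor = aug[r][p] / aug[p][p]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[p])]
    beta = [aug[r][k] / aug[r][r] for r in range(k)]
    return sum(
        (yi - sum(b * x for b, x in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )


def aic(columns, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(number of coefficients)."""
    n = len(y)
    return n * math.log(rss_of_fit(columns, y) / n) + 2 * (len(columns) + 1)


def forward_selection(candidates, y):
    """Greedily add the variable that lowers AIC the most; stop when none does."""
    selected = []
    best = aic([], y)
    while True:
        trials = {
            name: aic([candidates[s] for s in selected + [name]], y)
            for name in candidates if name not in selected
        }
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        selected.append(name)
        best = trials[name]
    return selected


rng = random.Random(0)
n = 500
candidates = {f"x{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(5)}
# Hypothetical data: only x0 and x3 actually predict y.
y = [2.0 * candidates["x0"][i] - 1.5 * candidates["x3"][i] + rng.gauss(0, 1)
     for i in range(n)]
chosen = forward_selection(candidates, y)
print(chosen)
```

Note that the procedure only looks at predictive fit. Nothing in it knows whether a selected variable is a confounder, a collider, or a mediator.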

Backward elimination

Start with all available variables. In each step, eliminate the variable whose removal leads to the greatest improvement. Stop when no removal yields further improvement.

Stepwise selection

A combination of forward selection and backward elimination.


Lasso

Fit a lasso regression. Eliminate the variables whose parameter estimates are shrunk to (nearly) zero, then fit a regression model again using the remaining variables.

18.2 Variables that induce or amplify bias

Imagine you have unlimited computational power and a dataset with a quasi-infinite number of individuals and many variables measured for each individual, including a treatment A, an outcome Y, and a large number of variables X. Some of X might be confounders of the effect of A on Y.

Then, shouldn't we adjust for all measured variables X using stratification/outcome regression, standardization/g-formula, IP weighting, or g-estimation? The answer is no. Let's see some examples where adjustment introduces bias.

Selection bias under the null

Consider the figure shown below.


There is no effect of A on Y, so the average causal effect is zero. That is,

\mathrm{E}[Y^{a=1}] - \mathrm{E}[Y^{a=0}] = 0

There is no confounding by L. Hence, we can unbiasedly estimate the causal effect by

\mathrm{E}[Y|A=1] - \mathrm{E}[Y|A=0].

See what happens if we adjust for L. We estimate the causal effect via the g-formula:

\sum_l \mathrm{E}[Y | A=1, L = l]\Pr(L=l) - \sum_l \mathrm{E}[Y|A=0, L=l]\Pr(L=l).

In the above DAG, L is associated with A, so

\Pr(L = l|A) \neq \Pr(L=l).

Therefore, in general, the g-formula estimate is not equal to E[Y|A=1] - E[Y|A=0] and is thus biased. When the A-Y association adjusted for L is expected to be non-null even though the causal effect of treatment A on the outcome Y is null, we say that there is selection bias under the null.
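A small simulation can make this concrete. The figure is not reproduced in this note, so the sketch below assumes the structure A → L ← Y (L is a common effect of A and Y) with no arrow from A to Y. All probabilities are hypothetical values chosen for illustration.

```python
import random

rng = random.Random(1)
n = 100_000
rows = []
for _ in range(n):
    a = int(rng.random() < 0.5)                      # treatment
    y = int(rng.random() < 0.5)                      # outcome, NOT affected by a
    l = int(rng.random() < 0.1 + 0.4 * a + 0.4 * y)  # common effect of a and y
    rows.append((a, y, l))


def mean_y(rows, a, l=None):
    """Sample mean of Y among rows with A == a (and L == l if given)."""
    ys = [y for (ai, y, li) in rows if ai == a and (l is None or li == l)]
    return sum(ys) / len(ys)


def pr_l(rows, l):
    """Sample proportion of rows with L == l."""
    return sum(1 for (_, _, li) in rows if li == l) / len(rows)


# Unadjusted contrast: E[Y|A=1] - E[Y|A=0].
unadjusted = mean_y(rows, 1) - mean_y(rows, 0)
# g-formula (standardization) over L.
adjusted = sum(
    (mean_y(rows, 1, l) - mean_y(rows, 0, l)) * pr_l(rows, l) for l in (0, 1)
)
print(f"unadjusted: {unadjusted:.3f}, adjusted for L: {adjusted:.3f}")
```

Under these assumed probabilities the unadjusted contrast should be close to zero, while the L-adjusted contrast should be clearly negative even though A has no effect on Y.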

Selection bias under the alternative or off the null

Consider another DAG.


There is no confounding by L, so

\mathrm{E}[Y^{a=1}] - \mathrm{E}[Y^{a=0}]  = \mathrm{E}[Y|A=1] - \mathrm{E}[Y|A=0].

Adjusting for L via the g-formula, the estimate of the average causal effect is

\sum_l \mathrm{E}[Y | A=1, L = l]\Pr(L=l) - \sum_l \mathrm{E}[Y|A=0, L=l]\Pr(L=l).

For the same reason,

\Pr(L = l|A) \neq \Pr(L=l).

Therefore, the g-formula estimate is biased. If the causal effect of A on Y were null, A and Y would be independent whether or not we adjust for L. Bias arises only when A has a causal effect on Y, which is why this kind of bias is called selection bias off the null.
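To see why the bias disappears under the null, here is a short derivation. It assumes the missing figure has the structure A → Y → L (L is affected by Y only), which is my reading and should be checked against the book. If the arrow from A to Y is removed, A is independent of (Y, L), so the conditional mean of Y no longer depends on A:

```latex
\mathrm{E}[Y | A=a, L=l] = \mathrm{E}[Y | L=l] \quad \text{for } a \in \{0, 1\},

\sum_l \mathrm{E}[Y | A=1, L=l]\Pr(L=l) - \sum_l \mathrm{E}[Y | A=0, L=l]\Pr(L=l)
  = \sum_l \left( \mathrm{E}[Y | L=l] - \mathrm{E}[Y | L=l] \right) \Pr(L=l) = 0.
```

So under the null the adjusted estimate is zero, which equals the true effect. Once A affects Y, the first equality fails and the adjusted contrast no longer needs to match the causal effect.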

Overadjustment for mediators


Next, consider a DAG in which L lies on a causal path from A to Y. An adjustment variable that is affected by A and in turn affects Y is called a mediator. Adjusting for a mediator blocks the part of the effect of A on Y that goes through L, which biases the estimate of the total effect.
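A worked example may help. As an assumption for illustration only (the book does not use this parametrization here), take linear structural equations with hypothetical coefficients α, β, γ, a randomized A, and mean-zero errors that are independent of A and of each other:

```latex
L = \alpha A + \varepsilon_L, \qquad Y = \beta A + \gamma L + \varepsilon_Y

% Total effect of A on Y (direct path plus the path through L):
\mathrm{E}[Y^{a=1}] - \mathrm{E}[Y^{a=0}] = \beta + \alpha\gamma

% Contrast within levels of the mediator L:
\mathrm{E}[Y | A=1, L=l] - \mathrm{E}[Y | A=0, L=l] = \beta
```

Standardizing the second contrast over L still gives β, so the adjusted estimate misses the αγ part of the effect. In the extreme case β = 0, where the whole effect goes through L, the adjusted estimate is zero although A does affect Y.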

Common feature of adjustment variables that introduce bias

The adjustment variables mentioned above share a common feature: they are affected by treatment A and are therefore post-treatment variables. So, you might come up with the rule of thumb that adjustment for post-treatment variables is prohibited.

However, there is an exception: a post-treatment variable can sometimes be used to block a backdoor path between treatment A and the outcome Y, as shown in the figure below.

So, we cannot decide whether or not to adjust for L from the data alone; the decision has to rely on information outside the data (e.g. experts' advice).

The myth of selection of adjustment variables

Next, we consider adjustment for pre-treatment variables L. Assume that the temporal order is L, A, Y. It is often said that an estimator that adjusts for all such variables L minimizes the bias, but this is a false belief. Consider the following DAG.


L is a collider, so adjustment for L introduces bias. There is another reason not to adjust for all variables.

Think about the following DAG.


If we could adjust for U, we could estimate the average causal effect without bias. However, U is an unmeasured confounder, so the bias cannot be removed. Adjusting for Z does not eliminate the bias introduced by U. Rather, adjustment for Z may amplify it. That is, the A-Y association adjusted for Z may be further from the true causal effect than the A-Y association not adjusted for Z. This kind of bias is called Z-bias.
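The amplification can be checked with a simulation. The figure is missing from this note, so the sketch assumes the structure Z → A, U → A, U → Y, A → Y with linear Gaussian equations. The coefficients and the helper names `slope` and `residualize` are hypothetical choices of mine.

```python
import random

rng = random.Random(2)
n = 100_000
true_effect = 1.0
z = [rng.gauss(0, 1) for _ in range(n)]   # affects A only
u = [rng.gauss(0, 1) for _ in range(n)]   # unmeasured confounder
a = [2.0 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * ai + ui + rng.gauss(0, 1) for ai, ui in zip(a, u)]


def slope(x, t):
    """OLS slope of t on x (with intercept)."""
    mx, mt = sum(x) / len(x), sum(t) / len(t)
    sxt = sum((xi - mx) * (ti - mt) for xi, ti in zip(x, t))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxt / sxx


def residualize(t, x):
    """Residuals of t after regressing it on x (with intercept)."""
    b = slope(x, t)
    mx, mt = sum(x) / len(x), sum(t) / len(t)
    return [(ti - mt) - b * (xi - mx) for ti, xi in zip(t, x)]


unadjusted = slope(a, y)
# Frisch-Waugh-Lovell: the coefficient of A in a regression of Y on (A, Z)
# equals the slope of residualized Y on residualized A.
adjusted = slope(residualize(a, z), residualize(y, z))
bias_unadjusted = unadjusted - true_effect
bias_adjusted = adjusted - true_effect
print(f"bias without Z: {bias_unadjusted:.3f}, bias with Z: {bias_adjusted:.3f}")
```

With these assumed coefficients the bias without Z should be roughly 1/6 and the bias with Z roughly 1/2. Adjusting for Z removes variation in A that was free of confounding, so the remaining variation is relatively more contaminated by U.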

18.3 Causal Inference and Machine Learning

Putting aside the problems discussed previously, we assume that X includes no variables that may induce or amplify bias, and that X contains all confounders L of the average causal effect of A on Y. The next problem arises when X is very high-dimensional.

When using the plug-in g-formula, we need to estimate the mean of the outcome Y conditional on the treatment A and the variables X, which we denote by b(X).

b(X) = \mathrm{E}[Y|A=a, X].

When using IP weighting, we need to estimate the probability of receiving the treatment conditional on the variables X, which we denote by π(X).

\pi(X) = \Pr(A=1|X).

Traditionally, we use a parametric model such as linear or logistic regression to estimate b(X) and π(X). But when X is high-dimensional, for example when the number of samples is much smaller than the dimension of X, traditional parametric models do not work. Therefore, we need other solutions.


Lasso

Use lasso to select variables automatically.

Machine learning algorithms

Use machine learning algorithms originally used for predictive purposes.

However, there are two major problems.

  • No guarantee that the selected variables will eliminate confounding (a doubly robust (DR) estimator can mitigate this problem because it is robust to misspecification of one of the two models)
  • Statistical black boxes with largely unknown statistical properties (variance estimates may be wrong, so the confidence intervals lose their frequentist interpretation)

In the next section, we discuss how to solve these two problems with DR estimators.

18.4 Doubly Robust Machine Learning Estimators

The book does not explain why sample splitting and cross-fitting, which we describe below, work. Under certain conditions, applying these two methods enables us to construct a consistent doubly robust estimator that asymptotically follows a normal distribution with mean at the true value of the causal parameter.
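For reference, one common form of the doubly robust (augmented IP weighted) estimator of E[Y^{a=1}] is written out below. Here I take \hat{b}(X) to be an estimate of E[Y|A=1, X] and \hat{\pi}(X) an estimate of Pr(A=1|X). This is the standard textbook form as I understand it, not a formula copied from Chapter 18.

```latex
\hat{\mathrm{E}}_{DR}[Y^{a=1}]
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \hat{b}(X_i) + \frac{A_i}{\hat{\pi}(X_i)} \left( Y_i - \hat{b}(X_i) \right) \right]
```

The estimator is consistent if either \hat{b} or \hat{\pi} is consistent. The analogous expression with 1 - A_i and 1 - \hat{\pi}(X_i) estimates E[Y^{a=0}], and the difference of the two estimates the average causal effect.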

Sample splitting

  1. Split the dataset into an estimation sample and a training sample.
  2. Estimate b(x) and π(x) using the training sample.
  3. Compute the DR estimator using the estimation sample.


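Cross-fitting

  1. Split the dataset into two halves.
  2. Apply sample splitting, using one half as the training sample and the other as the estimation sample.
  3. Swap the roles of the estimation and training halves and repeat.
  4. Average the two DR estimates.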
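The steps above can be sketched as follows. This is a toy illustration under my own assumptions: a single binary confounder x, simulated data with a true effect of 1.0, and cell means standing in for the machine learning estimates of b and π. The function names are hypothetical. A real analysis would use a flexible learner and would split the data at random.

```python
import random

rng = random.Random(3)
n = 20_000
data = []
for _ in range(n):
    x = int(rng.random() < 0.5)                   # binary confounder
    a = int(rng.random() < (0.7 if x else 0.3))   # treatment depends on x
    y = 1.0 * a + 2.0 * x + rng.gauss(0, 1)       # true effect of a is 1.0
    data.append((x, a, y))


def mean(values):
    return sum(values) / len(values)


def fit_nuisances(train):
    """Estimate b(a, x) = E[Y|A=a, X=x] and pi(x) = Pr(A=1|X=x) by cell means."""
    b = {
        (a, x): mean([yi for (xi, ai, yi) in train if ai == a and xi == x])
        for a in (0, 1) for x in (0, 1)
    }
    pi = {x: mean([ai for (xi, ai, _) in train if xi == x]) for x in (0, 1)}
    return b, pi


def dr_estimate(estimation, b, pi):
    """Doubly robust (AIPW) estimate of E[Y^{a=1}] - E[Y^{a=0}]."""
    terms = []
    for x, a, y in estimation:
        mu1 = b[1, x] + a / pi[x] * (y - b[1, x])
        mu0 = b[0, x] + (1 - a) / (1 - pi[x]) * (y - b[0, x])
        terms.append(mu1 - mu0)
    return mean(terms)


def cross_fit(data):
    """Two-fold cross-fitting: fit on one half, estimate on the other, swap, average."""
    half = len(data) // 2
    first, second = data[:half], data[half:]
    est_1 = dr_estimate(second, *fit_nuisances(first))
    est_2 = dr_estimate(first, *fit_nuisances(second))
    return (est_1 + est_2) / 2


estimate = cross_fit(data)
print(f"cross-fitted DR estimate: {estimate:.3f}")
```

The point of the split is that the nuisance functions are never evaluated on the observations used to fit them. Averaging over the swap recovers the efficiency lost by using only half of the data in each estimate.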

18.5 Variable selection is a difficult problem

Doubly Robust Machine Learning (DRML) does not solve all our problems for at least four reasons.

  • It is not possible to identify all important confounders or to rule out variables that induce or amplify bias.
  • The choice of machine learning algorithms depends on the causal structure, which is usually unknown.
  • The implementation is very hard.
  • The variance might be too large for the estimate to be useful.

The last problem could be mitigated by removing the variables that are strongly associated with the treatment A. But the removed variables might be confounders that need to be adjusted for to eliminate the bias.

The best way is to carry out multiple sensitivity analyses and look at all the results. If we get similar results, we can be more confident in them. If not, we should try to understand why.


Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.