Skip to contents

A new approach

This vignette introduces a new way to design, specify, and implement various estimators for controlled direct effects using a coding style that we refer to as the “assembly line.” Users can specify the various components of the estimators sequentially and easily swap out implementation details such as what exact estimators are used for the estimation of each nuisance function or what arguments to pass to these functions. A unified structure and output across different research designs and estimators should make comparing estimators very easy. Finally, this implementation is built around the idea of cross-fitting for nuisance function estimation.

Let’s see how this works in practice. We first load a subset of the jobcorps data from the package:

data("jobcorps")
jc <- jobcorps[
  1:200,
  c("treat", "female", "age_cat", "work2year2q", "pemplq4", "emplq4", "exhealth30")
]

This is the Job Corps experiment data used in Huber (2014) to estimate the effect of a randomized job training program on self-assessed health holding constant the intermediate outcome of employment. We use a subset of the variables from their study, but the basic variable we use here:

  • \(Y_i\) is an indicator for whether participants reported “very good” health after 2.5 years out after randomization (exhealth30)
  • \(A_i\) is an indicator for assignment to the job training program (treat)
  • \(M_i\) is an indicator for employment 1 to 1.5 years after assignment (work2year2q)
  • \(Z_i\) are the post-treatment, pre-mediator intermediate confounders (emplq4, pemplq4)
  • \(X_i\) are the pre-treatment characteristics (female, age_cat)

In context of our CDE estimators, we refer to be \(A_i\) (the treatment) and \(M_i\) (the mediator) as treatment variables because we are interested in causal effects that contrast different combinations of these two variables.

The first step in our assembly line is to choose the type of CDE estimator (or really estimation strategy) that we want to pursue. These choices include IPW (cde_ipw()), regression imputation (cde_reg_impute()), and augmented IPW (cde_aipw()) among others. Some of these different functions will take different arguments to specify different aspects of the estimator. The AIPW estimator is useful to demonstrate because it uses many of the features of the assembly line approach.

my_aipw <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)

The call here consists of several steps connected by R’s pipe operator. The first call cde_aipw() simply initializes the estimator as taking the AIPW form, meaning that it will use both propensity score models and outcome regressions to estimate the causal effects.

Following this we have two blocks of 3 lines each that look similar with a few differences. These “treatment blocks” specify the causal variables of interest in their causal ordering. In set_treatment() we set the name of the treatment variable and give a formula describing the pre-treatment covariates for this treatment. This formula, unless overridden by the next two functions, will be used in the fitting of the propensity score and outcome regression models. treat_model() allows the user to specify how to fit the propensity score model for this treatment with a choice of engine and optional arguments to pass to each engine. outreg_model() does the same for the outcome regression. We repeat this specification for the mediator.

The estimate() function finally implements the specifications given in the previous commands. Users use a formula to set the outcome and which of the treatment variables we want effects for and indicate the data frame where all of these variables can be found. By default, the nuisance functions (the propensity score model and the outcome regression) are fit using cross-fitting.

We can either use the summary() or tidy() functions to get a summary of the different effects being estimated:

summary(my_aipw)
## 
##  aipw CDE Estimator
## ---------------------------
## Causal variable: treat 
## 
## Treatment model: treat ~ female + age_cat 
## Treatment engine: logit 
## 
## Outcome model: ~female + age_cat 
## Outcome engine: lm 
## ---------------------------
## Causal variable: work2year2q 
## 
## Treatment model: work2year2q ~ emplq4 + pemplq4 + female + age_cat 
## Treatment engine: logit 
## 
## Outcome model: ~emplq4 + pemplq4 + female + age_cat 
## Outcome engine: lm 
## ---------------------------
## Cross-fit: TRUE 
## Number of folds: 5 
## 
## Estimated Effects:
##                                  Estimate Std. Error  t value Pr(>|t|)
## treat [(1, 0) vs. (0, 0)]        0.032042    0.02250  1.42395  0.15454
## treat [(1, 1) vs. (0, 1)]        0.030775    0.01397  2.20285  0.02764
## work2year2q [(0, 1) vs. (0, 0)]  0.001804    0.02058  0.08765  0.93016
## work2year2q [(1, 1) vs. (1, 0)] -0.001721    0.01667 -0.10324  0.91778
##                                  CI Lower CI Upper   DF
## treat [(1, 0) vs. (0, 0)]       -0.012075  0.07616 3818
## treat [(1, 1) vs. (0, 1)]        0.003388  0.05816 6207
## work2year2q [(0, 1) vs. (0, 0)] -0.038554  0.04216 3991
## work2year2q [(1, 1) vs. (1, 0)] -0.034399  0.03096 6034
tidy(my_aipw)
##                                            term     estimate  std.error
## treat_1_0             treat [(1, 0) vs. (0, 0)]  0.032042203 0.02250226
## treat_1_1             treat [(1, 1) vs. (0, 1)]  0.030774747 0.01397043
## work2year2q_0_1 work2year2q [(0, 1) vs. (0, 0)]  0.001804211 0.02058485
## work2year2q_1_1 work2year2q [(1, 1) vs. (1, 0)] -0.001720936 0.01666937
##                  statistic    p.value     conf.low  conf.high   df
## treat_1_0        1.4239549 0.15454126 -0.012075402 0.07615981 3818
## treat_1_1        2.2028488 0.02764203  0.003387865 0.05816163 6207
## work2year2q_0_1  0.0876475 0.93016125 -0.038553597 0.04216202 3991
## work2year2q_1_1 -0.1032394 0.91777636 -0.034398851 0.03095698 6034

Because the estimate() function included both treat and work2year2q in the formula, the output includes both the controlled direct effects of the treatment and the conditional average treatment effect of the mediator. Furthermore, by default, the output contains estimates for an effect for each combination of other treatment variables. This can be changed by setting the eval_vals argument in the relevant set_treatment() call.

The modular structure of the call allow users to easily swap out parts of the models. For example, we could easily alter the above call to use lasso versions of the regression and logit model.

my_aipw_lasso <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)

Here we use a “rigorous” version of the lasso from the hdm package to speed up computation. There are a number of different engines for different outcome and treatment types including:

  • GLMs: lm (OLS), logit (logistic regression), multinom (mulitnomial logit)
  • LASSO estimators (from glmnet): lasso, lasso_logit, lasso_multinom
  • Rigorous LASSO estimator (from hdm): rlasso, rlasso_logit

Huber, Martin. 2014. “Identifying Causal Mechanisms (Primarily) Based on Inverse Probability Weighting.” Journal of Applied Econometrics 29 (6): 920–43.