One way to estimate the ACDE with linear models is by sequential g-estimation, a type of structural nested mean model. The key logic of sequential g-estimation is that, under the sequential unconfoundedness assumption, the ACDE can be identified as follows:
The function above is called the “demediation function” (or “blip-down function”) and is defined as follows:
This function is the effect of switching from some level of the mediator to 0, and thus is an estimate of the causal effect of the mediator for a fixed value of the treatment and within levels of the covariates. Subtracting its estimates from the outcome will effectively remove the variation due to the mediator from the outcome, or what Acharya, Blackwell, and Sen (2016) call “demediation.”
The identification of the ACDE hold under the sequential unconfoundedness condition, using the definitions by Acharya, Blackwell, and Sen (2016).
Assumption 1 (Sequential Unconfoundedness)
First, no omitted variables for the effect of treatment () on the outcome (), conditional on the pretreatment confounders (). Second, no omitted variables for the effect of the mediator on the outcome, conditional on the treatment, pretreatment confounders, and intermediate confounders ().
While the ACDE is estimated nonparametrically with just this assumption, we need to make a further assumption to be able to use sequential g-estimation.
Assumption 2 (No intermediate interactions)
The effect of the mediator () on the outcome () is independent of the intermediate confounders ().
Without this assumption, we would have to model the multivariate distribution of the intermediate confounders in order to estimate the ACDE.
Estimating the demediation function
Under the assumption of sequential unconcoundedness, the demediation function can be estimated by a regression of the outcome on the variables in the demediation function () plus the intermediate confounders (), treatment (), and baseline confounders (). If this regression model is correctly specified, the coefficients on the variables in the demediation function are unbiased for the parameters of the demediation function. For example, when there are no interactions between the mediator and the treatment, nor between the mediator and the pretreatment confounders, the demediation function is straightforward to estimate. In this case, the demediation function is and the OLS coefficient on the mediator is an the estimate of .
Using the demediation function to estimate ACDE
The second stage of sequential g-estimation uses the demediated
outcome
If there are no interactions or nonlinearities in the demediation
function, then this becomes
.
With this demediated outcome in hand, we can estimate the ACDE by
regressing this outcome on the treatment and pre-treatment covariates:
Under the above assumptions and assuming this regression is correctly
specified, the
is an unbiased estimate of the ACDE. The sequential_g
function that we demonstrate below will complete both the estimation of
the demediation function and the estimation of the ACDE. For more
technical details on sequential g-estimation, see Acharya, Blackwell, and Sen (2016).
Illustrative example of sequential g-estimation
We now work through one example of using sequential_g
to
estimate the ACDE. The data comes from Alesina,
Giuliano, and Nunn (2013). The
dataset comes with the package:
data(ploughs)
The paper’s main argument is that the advent of the capital-intensive agricultural practice of the plow favored men over women participating in agriculture, which have affected gender inequality in modern societies. These authors find that strong effects of plow-based agriculture on female labor-force participation, but not on share of political positions held by women. The authors speculate that this might be due to the (positive) effects of the plow on modern-day income levels, which could offset the direct effects of the plow. The authors control for income and show that a significant effect of the plow appears. We evaluate this approach and try to estimate the ACDE more formally. Thus, in this case, we have the following variables, where indexes countries:
-
is the share of political positions held by women in 2000
(
women_politics
) -
is the relative proportion of ethnic groups that traditionally used the
plow within a country (
plow
) -
is the log GDP per capita in 2000, mean-centered
(
centered_ln_inc
) -
are the post-treatment, pre-mediator intermediate confounders
(
years_civil_conflict
,years_interstate_conflict
,oil_pc
,european_descent
,communist_dummy
,polity2_2000
,serv_va_gdp2000
) -
is the pre-treatment characteristics of the country, which are mostly
geographic. (
tropical_climate
,agricultural_suitability
,large_animals
,political_hierarchies
,economic_complexity
, andrugged
)
As a baseline, we first regress on controlling for the pre-treatment variables .
ate.mod <- lm(women_politics ~ plow + agricultural_suitability +
tropical_climate + large_animals + political_hierarchies +
economic_complexity + rugged, data = ploughs)
Notice that the effect of the plow is negative and insignificant.
summary(ate.mod)
##
## Call:
## lm(formula = women_politics ~ plow + agricultural_suitability +
## tropical_climate + large_animals + political_hierarchies +
## economic_complexity + rugged, data = ploughs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.060 -5.292 -2.000 3.752 25.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.5604 4.9038 3.581 0.000466 ***
## plow -2.1032 2.1270 -0.989 0.324422
## agricultural_suitability 0.9712 2.5761 0.377 0.706721
## tropical_climate -7.5137 1.9621 -3.830 0.000191 ***
## large_animals -8.7104 4.0510 -2.150 0.033201 *
## political_hierarchies 0.9359 0.9722 0.963 0.337344
## economic_complexity 1.0284 0.5430 1.894 0.060202 .
## rugged -0.6370 0.4961 -1.284 0.201137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 145 degrees of freedom
## (81 observations deleted due to missingness)
## Multiple R-squared: 0.1831, Adjusted R-squared: 0.1436
## F-statistic: 4.642 on 7 and 145 DF, p-value: 0.0001028
Implementation
In this example, we would like to estimate the controlled direct
effect of plows fixing the value of the mediator, current national
income. To do so, we must choose a value at which to fix this mediator.
The standard sequential g-estimation approach will fix the value to 0,
which is often not a substantively interesting value—it certainly is not
for income or logged income. To avoid this problem, we use the
mean-centered version of our logged income,
centered_ln_inc
, so that when we demediate with
,
it will be equivalent to demediation with
set to the mean of logged national income. Furthermore, we also use a
squared term of this centered mediator, centered_ln_incsq
,
so that we can include it in the demediation function to account for
nonlinear effects of logged income.
Next, we specify the main formula that contains a left-hand side that
specifies the dependent variable, and a right-hand side that
distinguishes between the baseline variables, the intermediate
variables, and the variables in the demediation function. Each of these
sets of variables on the left-hand side is separated by the vertical bar
|
. Thus, the formula will have the form
yvar ~ treat + xvars | zvars | mvars
. . For example, here
we specify that the demediation variables are
and
:
form_main <- women_politics ~ plow + agricultural_suitability + tropical_climate +
large_animals + political_hierarchies + economic_complexity + rugged |
years_civil_conflict + years_interstate_conflict + oil_pc + european_descent +
communist_dummy + polity2_2000 + serv_va_gdp2000 |
centered_ln_inc + centered_ln_incsq
sequential_g()
, implements sequential g-estimation in
the way outlined in the previous section. Specifically, it runs the
first stage model, constructs the demediator function, and passes it to
the second-stage model. At its simplest, it takes in the formula and a
data frame:
direct <- sequential_g(form_main, data = ploughs)
Output
As usual, we can use the summary
function to the output
of the sequential_g
function:
summary(direct)
##
## t test of coefficients:
##
## Estimate Std. Err. t value Pr(>|t|)
## (Intercept) 12.18450 3.64442 3.3433 0.001121 **
## plow -4.83879 2.34467 -2.0637 0.041312 *
## agricultural_suitability 4.57388 3.10477 1.4732 0.143458
## tropical_climate -2.18919 2.10505 -1.0400 0.300554
## large_animals -1.33001 3.40008 -0.3912 0.696401
## political_hierarchies 0.49575 1.09060 0.4546 0.650283
## economic_complexity -0.10521 0.42973 -0.2448 0.807029
## rugged -0.30869 0.47821 -0.6455 0.519888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The standard errors here are based on the asymptotic variance derived
in Acharya, Blackwell, and Sen (2016), but it is also possible to use
the boots_g
function to compute standard errors by a simple
bootstrap:
##
## Summary of bootstrapped coefficients:
##
## Estimate Std. Err. t value 2.5 % 97.5 %
## (Intercept) 12.89400 4.40575 2.92663 5.13182 20.7075
## plow -4.84306 2.71915 -1.78109 -10.48374 -0.1912
## agricultural_suitability 4.73279 3.33064 1.42098 -0.78854 12.2190
## tropical_climate -2.07431 2.27446 -0.91200 -6.21752 2.5859
## large_animals -1.05959 4.71610 -0.22467 -11.79015 6.7012
## political_hierarchies 0.36607 1.19504 0.30633 -2.21079 2.6765
## economic_complexity -0.19011 0.53036 -0.35846 -1.16760 0.7744
## rugged -0.35215 0.53592 -0.65709 -1.38329 0.7083
The coefficient on the treatment variable (here, plow
),
is the estimate of the ACDE. The results show a negative ACDE of
plows—in other words, the effect of the plough is negative for fixed
income levels. This controlled direct effect is larger in magnitude than
the estimated the total effect:
Output attributes
There are various quantities and output objects available from the
sequential_g
function. As usual, the coefficient estimates
from the second stage can be accessed using the coef()
function:
coef(direct)
## (Intercept) plow agricultural_suitability
## 12.1845047 -4.8387928 4.5738843
## tropical_climate large_animals political_hierarchies
## -2.1891894 -1.3300131 0.4957518
## economic_complexity rugged
## -0.1052101 -0.3086924
One can access confidence intervals for the coefficients using the
usual confint
function:
confint(direct, "plow")
## 2.5 % 97.5 %
## plow -9.434262 -0.2433234
The vcov()
function returns the variance
covariance-matrix that accounts for the first stage estimation,
vcov(direct)[1:3, 1:3]
## (Intercept) plow agricultural_suitability
## (Intercept) 13.2818030 -0.5383896 0.1744604
## plow -0.5383896 5.4974789 1.2047839
## agricultural_suitability 0.1744604 1.2047839 9.6395922
Sensitivity
The cdesens
function can perform a sensitivity analysis
to evaluate how the estimated ACDE changes for various levels of
post-treatment confounding of the mediator-outcome relationship.
Currently, this function only supports a model with only one
mediator.
# update with only one mediator
direct1 <- sequential_g(women_politics ~ plow + agricultural_suitability + tropical_climate +
large_animals + political_hierarchies + economic_complexity + rugged |
years_civil_conflict + years_interstate_conflict + oil_pc + european_descent +
communist_dummy + polity2_2000 + serv_va_gdp2000 |
centered_ln_inc,
ploughs)
out_sens <- cdesens(direct1, var = "plow")
plot(out_sens)
The black line shows how the estimated ACDE would be at various
levels of correlation between the mediator and outcome errors, while the
gray bands show the 95% confidence intervals. Note that when this
correlation is 0, we recover the estimated effect from
sequential_g
. See Acharya,
Blackwell, and Sen (2016) for more
information.