Estimating average effects in regression discontinuities with covariate interactions

Regression discontinuity designs (RDDs) are now a widely used tool for program evaluation in economics and many other fields. RDDs occur in situations where some treatment/program of interest is assigned on the basis of a numerical score (called the running variable), with all units scoring above a certain threshold receiving treatment and all units scoring at or below the threshold having treatment withheld (or vice versa, with treatment assigned to units scoring below the threshold). This mechanism provides a way to identify the marginal average treatment effect (MATE): the average effect of treatment assignment for units on the cusp of the threshold.

RDDs are appealing for a couple of reasons. First and foremost, RDD-like mechanisms occur all over the place, since providing treatment on the basis of a numerical measure of need/eligibility is a natural way to allocate resources. Furthermore, analysis of the designs is straightforward, as it involves nothing more complicated than a linear regression model, estimated using (weighted or unweighted) least squares, and the results can be represented graphically using a simple scatterplot. Things get a little more complicated if you are trying to account for imperfect compliance with treatment assignment—as in the “fuzzy” RDD—but for the moment let me focus on “sharp” RDDs.

The simplest approach to estimating the MATE is to use a local linear regression in the neighborhood of the threshold, with the outcome regressed on the running variable, treatment indicator, and their interaction. However, in practice it is quite common to also include additional covariates in the local linear regression. If the covariates are also interacted with the treatment indicator, there is no longer a single regression coefficient corresponding to the treatment effect. In my last post, I suggested a “centering trick” for estimating the MATE based on a model that included covariate-by-treatment interactions. In this post, I’ll explain the reasoning behind this proposal.

G’day, MATE

I think it’s helpful to start by thinking about the definition of the MATE in non-parametric terms. Let \(R\) be the running variable, assumed to be centered at the threshold; \(T\) be an indicator for treatment assignment, with \(T = I(R > 0)\); and \(X\) be a covariate, which may be vector-valued. Denote the potential outcomes as \(Y^0\) (a unit’s outcome if not assigned to treatment) and \(Y^1\) (a unit’s outcome if assigned to treatment), so that the observed outcome is \(Y = Y^0 (1 - T) + Y^1 T\). Now consider the potential response surfaces

\[\begin{aligned}\mu_0(x, r) &= \text{E}\left(\left.Y^0 \right|X = x, R = r\right) \\ \mu_1(x, r) &= \text{E}\left(\left.Y^1 \right|X = x, R = r\right).\end{aligned}\]

In an RDD, the average treatment effect at a given point \((x, r)\) on the response surface is not generally identified by conditioning, because one of the potential outcomes is never observed: if \(r < 0\) then \(\text{Pr}( T = 0 \vert X = x, R = r) = 1\) and \(\text{Pr}( T = 1 \vert X = x, R = r) = 0\) (and vice versa for \(r > 0\)). However, the treatment effect for the subpopulation where \(R = 0\) can be identified under the assumption that the potential response surfaces are continuous in a neighborhood of the threshold. The MATE can thus be written as

\[\begin{aligned} \delta_M &= \text{E}\left(\left. Y^1 - Y^0 \right| R = 0\right) \\ &= \text{E}\left[\left. \mu_1(X, 0) - \mu_0(X,0) \right| R = 0\right]. \end{aligned}\]

Regression estimation

Now assume that we have a simple random sample \(\left(y_i, r_i, t_i, x_i\right)_{i=1}^n\) of units and that each unit has a weight \(w_i\), defined in terms of some measure of distance from the threshold. We can use these data to estimate the response surfaces (somehow…more on that in a minute) on each side of the cut-point, with \(\hat\mu_0(x, r)\) for \(r < 0\) and \(\hat\mu_1(x, r)\) for \(r > 0\). If we then use the sample distribution of \(X\) in the neighborhood of \(R = 0\) in place of the conditional density \(d\left(X = x \vert R = 0\right)\), we can estimate the MATE as

\[\hat\delta_M = \frac{1}{W} \sum_{i=1}^n w_i \left[\hat\mu_1(x_i, 0) - \hat\mu_0(x_i, 0)\right],\]

where \(W = \sum_{i=1}^n w_i\). This is a regression estimator for \(\delta_M\). It could be non-, semi-, or fully parametric, depending on the technique used to estimate the response surfaces. Note that this estimator is a little different from the regression estimator that would be used in the context of an observational study (see, e.g., Schafer & Kang, 2008). In that context, one would use \(\hat\mu_j(x_i, r_i)\) rather than \(\hat\mu_j(x_i, 0)\), but in an RDD doing so would involve extrapolating beyond the cut-point (i.e., using \(\hat\mu_1(x_i, r_i)\) for \(r_i < 0\)).
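To make this concrete, here is a minimal sketch of the plug-in estimator, using weighted linear regressions fit separately on each side of the threshold. The data frame dat and its columns (y, r, x, w) are hypothetical names, not tied to any particular package:

# sketch of the regression estimator, assuming a data frame dat with
# outcome y, running variable r (centered at the threshold), covariate x,
# and kernel weights w
fit_0 <- lm(y ~ r + x, data = subset(dat, r <= 0 & w > 0), weights = w)
fit_1 <- lm(y ~ r + x, data = subset(dat, r > 0 & w > 0), weights = w)

# predict both potential outcomes at r = 0 for every unit in the
# neighborhood, then take the weighted average of the differences
nbhd <- subset(dat, w > 0)
mu0_hat <- predict(fit_0, newdata = transform(nbhd, r = 0))
mu1_hat <- predict(fit_1, newdata = transform(nbhd, r = 0))
delta_M_hat <- weighted.mean(mu1_hat - mu0_hat, w = nbhd$w)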

Now suppose that we again use a linear regression in some neighborhood of the cut-point to estimate the response surfaces. For the (weighted) sample in this neighborhood, we assume that

\[\mu_{t_i}(x_i, r_i) = \beta_0 + \beta_1 r_i + \beta_2 t_i + \beta_3 r_i t_i + \beta_4 x_i + \beta_5 x_i t_i.\]

Substituting this into the formula for \(\hat\delta_M\) leads to

\[\begin{aligned}\hat\delta_M &= \frac{1}{W} \sum_{i=1}^n w_i \left[\hat\beta_2 + \hat\beta_5 x_i \right] \\ &= \hat\beta_2 + \hat\beta_5 \sum_{i=1}^n \frac{w_i x_i}{W}.\end{aligned}\]

Now, the centering trick involves nothing more than re-centering the covariate so that \(\sum_{i=1}^n w_i x_i = 0\), in which case \(\hat\delta_M = \hat\beta_2\). Of course, one could just use the non-parametric form of the regression estimator, but the centering trick is useful because it comes with an easy-to-calculate standard error (since \(\hat\delta_M\) is then just a regression coefficient estimate).

Multiple covariates

All of this works out in exactly the same way if you have interactions between the treatment and multiple covariates. However, there are a few tricky cases worth noting. If you include interactions between the treatment indicator and a polynomial function of a covariate, each term of the polynomial has to be centered. For example, if you want to control for \(x\), \(x^2\), and their interactions with treatment, you will need to calculate

\[\tilde{x}_{1i} = x_i - \frac{1}{W} \sum_{j=1}^n w_j x_j, \qquad \tilde{x}_{2i} = x_i^2 - \frac{1}{W} \sum_{j=1}^n w_j x_j^2\]

and then use these re-centered covariates in the regression

\[\mu_{t_i}(x_i, r_i) = \beta_0 + \beta_1 r_i + \beta_2 t_i + \beta_3 r_i t_i + \beta_4 \tilde{x}_{1i} + \beta_5 \tilde{x}_{2i} + \beta_6 \tilde{x}_{1i} t_i + \beta_7 \tilde{x}_{2i} t_i.\]

The same principle also holds if you want to include higher-order interactions between covariates and the treatment: calculate the interaction term first, then re-center it. There is one exception, though. If you want to include an interaction between a covariate \(x\), the running variable, and the treatment indicator (who knows…you might aspire to do this some day…), then all you need to do is center \(x\) before forming the product. In particular, you should not calculate the interaction \(x_i r_i\) and then re-center it, because the re-centered term would no longer vanish at the threshold \(R = 0\), where the MATE is defined.
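To illustrate, using the same hypothetical naming as the sketch above (a data frame dat with outcome y, running variable r, treatment t, covariate x, and kernel weights w), the re-centering might look like:

# re-center each polynomial term at its weighted mean
dat$x1_c <- dat$x   - weighted.mean(dat$x,   w = dat$w)
dat$x2_c <- dat$x^2 - weighted.mean(dat$x^2, w = dat$w)

# expands to r + t + r:t + x1_c + x2_c + x1_c:t + x2_c:t
fit <- lm(y ~ r * t + (x1_c + x2_c) * t, data = subset(dat, w > 0), weights = w)
coef(fit)["t"]  # estimate of the MATE

# for a covariate-by-running-variable-by-treatment interaction, center x
# first and then form the products, e.g. lm(y ~ r * t * x1_c, ...)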

R, MATEs!

Here’s some R code that implements the centering trick for the simulated example from my last post:

library(sandwich)
library(lmtest)
library(rdd)

# simulate an RDD
set.seed(20160124)
simulate_RDD <- function(n = 2000, R = rnorm(n, mean = qnorm(.2))) {
  n <- length(R)
  T <- as.integer(R > 0)
  X1 <- 10 + 0.6 * (R - qnorm(.2)) + rnorm(n, sd = sqrt(1 - 0.6^2))
  X2 <- sample(LETTERS[1:4], n, replace = TRUE, prob = c(0.2, 0.3, 0.35, 0.15))
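  # potential outcomes: the effects of R, X1, and X2 differ by treatment arm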
  Y0 <- 0.4 * R + 0.1 * (X1 - 10) + c(A = 0, B = 0.30, C = 0.40, D = 0.55)[X2] + rnorm(n, sd = 0.9)
  Y1 <- 0.35 + 0.3 * R + 0.18 * (X1 - 10) + c(A = -0.50, B = 0.30, C = 0.20, D = 0.60)[X2] + rnorm(n, sd = 0.9)
  Y <- (1 - T) * Y0 + T * Y1
  data.frame(R, T, X1, X2, Y0, Y1, Y)
}
RD_data <- simulate_RDD(n = 2000)

# calculate kernel weights
bw <- with(RD_data, IKbandwidth(R, Y, cutpoint = 0))
RD_data$w <- kernelwts(RD_data$R, center = 0, bw = bw)

# center the covariates
X_mat <- model.matrix(~ 0 + X2 + X1, data = RD_data)
X_cent <- as.data.frame(apply(X_mat, 2, function(x) x - weighted.mean(x, w = RD_data$w)))
RD_data_aug <- cbind(X_cent, subset(RD_data, select = c(-X1, -X2)))
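# drop the first dummy column to avoid collinearity with the intercept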
cov_names <- paste(names(X_cent)[-1], collapse = " + ")

# calculate the MATE using RDestimate
RD_form <- paste("Y ~ R |", cov_names)
summary(RDestimate(as.formula(RD_form), data = RD_data_aug))
## 
## Call:
## RDestimate(formula = as.formula(RD_form), data = RD_data_aug)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       1.0894     1177          0.2981    0.10659     2.797    0.0051559
## Half-BW    0.5447      611          0.2117    0.14846     1.426    0.1539482
## Double-BW  2.1787     1832          0.2734    0.08305     3.292    0.0009949
##               
## LATE       ** 
## Half-BW       
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p
## LATE       23.30  11        1165        0
## Half-BW    10.97  11         599        0
## Double-BW  47.41  11        1820        0

# or using lm
lm_form <- paste("Y ~ R + T + R:T + T*(", cov_names,")")
lm_fit <- lm(as.formula(lm_form), weights = w, data = subset(RD_data_aug, w > 0))
coeftest(lm_fit, vcov. = vcovHC(lm_fit, type = "HC1"))["T",]
##    Estimate  Std. Error     t value    Pr(>|t|) 
## 0.298142798 0.106588790 2.797130893 0.005240719
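
As a quick check on the algebra above, the plug-in form of \(\hat\delta_M\) can be computed directly from the fitted model by predicting both potential outcomes at \(R = 0\) and averaging the differences with the kernel weights. Because the covariates were centered at their weighted means, this should reproduce the coefficient on T:

# plug-in form of the regression estimator, using the objects created above
nbhd <- subset(RD_data_aug, w > 0)
mu1_hat <- predict(lm_fit, newdata = transform(nbhd, R = 0, T = 1))
mu0_hat <- predict(lm_fit, newdata = transform(nbhd, R = 0, T = 0))
weighted.mean(mu1_hat - mu0_hat, w = nbhd$w)  # matches the estimate for T above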

Comments

I’ve shown that the “centering trick” is just a way to express a certain regression estimator for the marginal average treatment effect in an RDD. Having suggested that this is a good idea, I should also note a few points that might bear further investigation.

  1. My regression estimator uses the sample distribution of \(X\) in the neighborhood of the threshold as an estimate of \(d(X = x \vert R = 0)\). This seems reasonable, but I wonder whether there might be a better approach to estimating this conditional density.
  2. As far as I understand, the current best practice for defining the “neighborhood” of the threshold is to use weights based on a triangular kernel and an “optimal” bandwidth proposed by Imbens and Kalyanaraman (2012). The optimal bandwidth is derived for the simple RDD model with no covariates, though the authors comment that inclusion of additional covariates should not greatly affect the result unless the covariates are strongly correlated with the outcome, conditional on the running variable. However, what if interest centers on the covariate-by-treatment interaction itself, rather than just the MATE? It is not clear that the bandwidth is optimal for estimation/inference on the interaction term.
  3. So far I’ve considered the MATE identified by a sharp RDD, in which we examine the effects of treatment assignment, regardless of whether units assigned to treatment actually received/participated in it. In fuzzy RDDs, the target parameter is the average effect of treatment receipt for those on the threshold of eligibility and who comply with the assignment rule. The effect is estimated using two-stage least squares, taking treatment assignment as an instrument for treatment receipt. I’m not entirely sure how the regression estimator approach would work in this instrumental variables setting.