# A handmade clubSandwich for multi-site trials

I’m just back from the Society for Research on Educational Effectiveness meetings, where I presented work on small-sample corrections for cluster-robust variance estimators in two-stage least squares models, which I’ve implemented in the clubSandwich R package. Here’s my presentation. So I had “clubSandwich” estimators on the brain when a colleague asked me about whether the methods were implemented in SAS.

The moderately longer answer is “not unless we can find funding to pay someone who knows how to program properly in SAS.” However, for the specific model that my colleague was interested in, it turns out that the small-sample corrections implemented in clubSandwich can be expressed in closed form, and they’re simple enough that they could easily be hand-calculated. I’ll sketch out the calculations in the remainder of this post.

## A multi-site trial

Consider a multi-site trial conducted across $$J$$ sites, which we take as a sample from a larger super-population of sites. Each site consists of $$n_j$$ units, of which $$p_j n_j$$ are randomized to treatment and the remainder $$(1 - p_j) n_j$$ are randomized to control. For each unit $$i$$ in each site $$j$$, we have an outcome $$y_{ij}$$ and a treatment indicator $$t_{ij}$$.

A conventional approach to estimating the overall average impact in this setting is to use a model with a treatment indicator and fixed effects for each site: $y_{ij} = \beta_j + \delta t_{ij} + e_{ij}$ and then to cluster the standard errors by site. Clustering by site makes sense here if (and only if) we’re interested in generalizing to the super-population of sites.

Let $$\hat\delta_j$$ denote the impact estimate from site $$j$$, calculated as the difference in means between treated and untreated units at site $$j$$: $\hat\delta_j = \frac{1}{n_j p_j} \left(\sum_{i=1}^{n_j} t_{ij} y_{ij}\right) - \frac{1}{n_j (1 - p_j)} \left(\sum_{i=1}^{n_j} (1 - t_{ij}) y_{ij}\right).$ for $$j = 1,..,J$$. The overall impact estimate here is a precision-weighted average of the site-specific impacts: $\hat\delta = \frac{1}{W} \sum_{j=1}^J w_j \hat\delta_j,$ where $$w_j = n_j p_j (1 - p_j)$$ and $$W = \sum_j w_j$$.

## Sandwich estimators

The conventional clustered variance estimator (or sandwich estimator) for $$\hat\delta$$ is a simple function of the (weighted) sample variance of the site-specific effects. It can be calculated directly as: $V^{CR0} = \frac{1}{W^2} \sum_{j=1}^J w_j^2 \left(\hat\delta_j - \hat\delta\right)^2.$ Under a conventional random effects model for the $$\delta_j$$s, this estimator has a downward bias in finite samples.

The clubSandwich variance estimator here uses an estimator for the sample variance of site-specific effects that is unbiased under a certain working model. It is only slightly more complicated to calculate: $V^{CR2} = \frac{1}{W^2} \sum_{j=1}^J \frac{w_j^2 \left(\hat\delta_j - \hat\delta\right)^2}{1 - w_j / W}.$

The other difference between conventional methods and the clubSandwich approach is in the reference distribution used to calculate hypothesis tests and confidence intervals. The conventional approach uses a standard normal reference distribution (i.e., a z-test) that is asymptotically justified. The clubSandwich approach uses a $$t$$ reference distribution, with degrees of freedom estimated using a Satterthwaite approximation. In the present context, the degrees of freedom are a little bit ugly but still not hard to calculate: $df = \left[\sum_{j=1}^J \frac{w_j^2}{(W - w_j)^2} - \frac{2}{W}\sum_{j=1}^J \frac{w_j^3}{(W - w_j)^2} + \frac{1}{W^2} \left(\sum_{j=1}^J \frac{w_j^2}{W - w_j} \right)^2 \right]^{-1}.$

In the special case that all sites are of the same size and use a constant treatment allocation, the weights become equal. The clubSandwich variance estimator then reduces to $V^{CR2} = \frac{S_\delta^2}{J} \qquad \text{where} \qquad S_\delta^2 = \frac{1}{J - 1}\sum_{j=1}^J \left(\hat\delta_j - \hat\delta\right)^2,$ and the degrees of freedom reduce to simply $$df = J - 1$$.

## Tennessee STAR

Here is a worked example of the calculations (using R of course, because my SAS programming skills atrophied years ago). I’ll use data from the famous Tennessee STAR class size experiment, which was a multi-site trial in which students were randomized to small or regular-sized kindergarten classes within each of several dozen schools. To make the small-sample issues more pronounced, I’ll limit the sample to urban schools and look at impacts of small class-size on reading and math scores at the end of kindergarten. STAR was actually a three-arm trial—the third arm being a regular-sized class but with an additional teacher aide. For simplicity (and following convention), I’ll collapse the teacher-aide condition and the regular-sized class condition into a single arm and also limit the sample to students with complete outcome data on both tests.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'readr' was built under R version 4.0.3
data(STAR, package = "AER")

STAR_urban <-
STAR %>%
filter(
# limit to urban/inner city schools
schoolk %in% c("urban","inner-city"),
# limit to complete outcome data
) %>%
droplevels() %>%
# collapse control conditions
mutate(stark = fct_collapse(stark, regular = c("regular","regular+aide"))) %>%

STAR_summary <-
STAR_urban %>%
count(schoolidk)

After these exclusions, the data include a total of 1810 students from 23 schools, ranging in size from 34 to 134 students.

For starters, let’s get the average impacts using a seeming unrelated regression specification, with both conventional and clubSandwich standard errors.

library(clubSandwich)
## Warning: package 'clubSandwich' was built under R version 4.0.3
## Registered S3 method overwritten by 'clubSandwich':
##   method    from
##   bread.mlm sandwich
STAR_fit <- lm(cbind(readk, mathk) ~ 0 + schoolidk + stark, data = STAR_urban)

# conventional SEs
CR0 <-
coef_test(STAR_fit, vcov = "CR0",
cluster = STAR_urban$schoolidk, test = "z", coefs = c("readk:starksmall","mathk:starksmall")) CR0 ## Coef. Estimate SE t-stat p-val (z) Sig. ## 1 readk:starksmall 6.16 2.73 2.25 0.0241 * ## 2 mathk:starksmall 12.13 4.79 2.53 0.0113 * # clubSandwich SEs CR2 <- coef_test(STAR_fit, vcov = "CR2", cluster = STAR_urban$schoolidk,

CR2
##              Coef. Estimate   SE t-stat d.f. p-val (Satt) Sig.
## 1 readk:starksmall     6.16 2.81   2.19   19       0.0409    *
## 2 mathk:starksmall    12.13 4.92   2.47   19       0.0234    *

Now I’ll do it “by hand”—or rather, with a bit of dplyr:

# summary statistics by site

school_summaries <-
STAR_urban %>%
group_by(schoolidk, stark) %>%
summarise(
# means by arm and site
mathk = mean(mathk),
n_arm = n()
) %>%
summarise(
# impact estimates by site
mathk = diff(mathk),
n = sum(n_arm),
p = n_arm[stark=="small"] / n
) %>%
mutate(w = n * p * (1 - p))
## summarise() regrouping output by 'schoolidk' (override with .groups argument)
## summarise() ungrouping output (override with .groups argument)
# overall impacts

school_summaries %>%
group_by(subject) %>%
summarise(
impact = weighted.mean(impact_j, w = w),
SE_CR0 = sqrt(sum(w^2 * (impact_j - impact)^2) / sum(w)^2),
SE_CR2 = sqrt(sum(w^2 * (impact_j - impact)^2 / (1 - w / sum(w))) / sum(w)^2),
df_CR2 = 1 / (sum(w^2 / (sum(w) - w)^2) -
2 * sum(w^3 / (sum(w) - w)^2) / sum(w) +
sum(w^2 / (sum(w) - w))^2 / sum(w)^2)
) %>%
knitr::kable(digits = 2)
## summarise() ungrouping output (override with .groups argument)
subject impact SE_CR0 SE_CR2 df_CR2
mathk 12.13 4.79 4.92 18.99
The CR0 and CR2 standard errors match the results from coef_test, as do the Satterthwaite degrees of freedom. Note that the degrees of freedom are equal to 19 in this case, a bit less than $$J - 1 = 22$$ due to variation in the weight assigned to each school.
Some analysts might not like the approach of using precision-weighted average of the site-specific impacts, as I’ve examined here. Instead, one might choose to weight the site-specific effects by the site-specific sample sizes, or to use some sort of random effects weighting that allows for random heterogeneity across sites. The formulas given above for conventional and clubSandwich clustered variance estimators apply directly to other weighting schemes too. Just substitute your favorite weights in place of $$w_j$$. When doing so, the clubSandwich estimator will be exactly unbiased under the assumption that your preferred weighting scheme corresponds to inverse-variance weighting, and the Satterthwaite degrees of freedom approximation will be derived under the same model.