A colleague recently asked me about how to apply cluster-robust hypothesis tests and confidence intervals, as calculated with the clubSandwich package, when dealing with multiply imputed datasets.
Standard methods (i.e., Rubin’s rules) for pooling estimates from multiple imputed datasets are developed under the assumption that the full-data estimates are approximately normally distributed. However, this might not be reasonable when working with test statistics based on cluster-robust variance estimators, which can be imprecise when the number of clusters is small or the design matrix of predictors is unbalanced in certain ways. Barnard and Rubin (1999) proposed a small-sample correction for tests and confidence intervals based on multiple imputed datasets. In this post, I’ll show how to implement their technique using the output of `clubSandwich`

, with multiple imputations generated using the `mice`

package.

### Setup

To begin, let me create missingness in a dataset containing multiple clusters of observations:

```
library(mlmRev)
library(mice)
library(dplyr)
data(bdf)
bdf <- bdf %>%
select(schoolNR, IQ.verb, IQ.perf, sex, ses, langPRET, aritPRET, aritPOST) %>%
mutate(
schoolNR = factor(schoolNR),
sex = as.numeric(sex)
) %>%
filter(as.numeric(schoolNR) <= 30) %>%
droplevels()
bdf_missing <-
bdf %>%
select(-schoolNR) %>%
ampute(run = TRUE)
bdf_missing <-
bdf_missing$amp %>%
mutate(schoolNR = bdf$schoolNR)
summary(bdf_missing)
```

```
## IQ.verb IQ.perf sex ses
## Min. : 4.00 Min. : 5.333 Min. :1.000 Min. :10.00
## 1st Qu.:10.50 1st Qu.: 9.333 1st Qu.:1.000 1st Qu.:20.00
## Median :11.50 Median :10.667 Median :1.000 Median :28.00
## Mean :11.66 Mean :10.747 Mean :1.471 Mean :28.65
## 3rd Qu.:13.00 3rd Qu.:12.333 3rd Qu.:2.000 3rd Qu.:38.00
## Max. :18.00 Max. :16.667 Max. :2.000 Max. :50.00
## NA's :38 NA's :40 NA's :28 NA's :47
## langPRET aritPRET aritPOST schoolNR
## Min. :15.00 Min. : 1.00 Min. : 2.00 40 : 35
## 1st Qu.:30.00 1st Qu.: 9.00 1st Qu.:12.00 54 : 31
## Median :35.00 Median :11.00 Median :18.00 55 : 30
## Mean :33.93 Mean :11.57 Mean :17.62 38 : 28
## 3rd Qu.:39.00 3rd Qu.:14.00 3rd Qu.:23.00 1 : 25
## Max. :48.00 Max. :20.00 Max. :30.00 18 : 24
## NA's :51 NA's :51 NA's :38 (Other):354
```

Now I’ll use `mice`

to create 10 multiply imputed datasets:

`Impute_bdf <- mice(bdf_missing, m=10, meth="norm.nob", seed=24)`

Am I imputing while ignoring the hierarchical structure of the data? Yes, yes I am. Is this is a good way to do imputation? Probably not. But this is a quick and dirty example, so we’re going to have to live with it.

### Model

Suppose that the goal of our analysis is to estimate the coefficients of the following regression model:

\[ \text{aritPOST}_{ij} = \beta_0 + \beta_1 \text{aritPRET}_{ij} + \beta_2 \text{langPRET}_{ij} + \beta_3 \text{sex}_{ij} + \beta_4 \text{SES}_{ij} + e_{ij}, \]

where \(i\) indexes students and \(j\) indexes schools, and where we allow for the possibility that errors from the same cluster are correlated in an unspecified way. With complete data, we could estimate the model by ordinary least squares and then use `clubSandwich`

to get standard errors that are robust to within-cluster dependence and heteroskedasticity. The code for this is as follows:

```
library(clubSandwich)
lm_full <- lm(aritPOST ~ aritPRET + langPRET + sex + ses, data = bdf)
coef_test(lm_full, cluster = bdf$schoolNR, vcov = "CR2")
```

```
## Coef Estimate SE d.f. p-val (Satt) Sig.
## 1 (Intercept) -2.1921 1.3484 22.9 0.1177
## 2 aritPRET 1.0053 0.0833 23.4 <0.001 ***
## 3 langPRET 0.2758 0.0294 24.1 <0.001 ***
## 4 sex -1.2040 0.4706 23.8 0.0173 *
## 5 ses 0.0233 0.0266 20.5 0.3909
```

If cluster dependence were no concern, we could simply use the model-based standard errors and test statistics. The `mice`

package provides functions that will fit the model to each imputed dataset and then combine them by Rubin’s rules. The code is simply:

```
with(data = Impute_bdf,
lm(aritPOST ~ aritPRET + langPRET + sex + ses)
) %>%
pool(method = "Rubin") %>%
summary() %>%
round(3)
```

```
## est se t df Pr(>|t|) lo 95 hi 95 nmis fmi
## (Intercept) -1.718 1.148 -1.497 1759.752 0.135 -3.969 0.533 NA 0.073
## aritPRET 0.975 0.074 13.111 336.195 0.000 0.828 1.121 51 0.169
## langPRET 0.259 0.038 6.757 532.218 0.000 0.184 0.334 51 0.133
## sex -1.172 0.430 -2.728 494.551 0.007 -2.016 -0.328 28 0.138
## ses 0.040 0.021 1.863 208.718 0.064 -0.002 0.082 47 0.215
## lambda
## (Intercept) 0.072
## aritPRET 0.164
## langPRET 0.130
## sex 0.135
## ses 0.208
```

However, this approach ignores the possibility of correlation in the errors of units in the same cluster, which is clearly a concern in this dataset:

```
# ratio of CRVE to conventional variance estimates
diag(vcovCR(lm_full, cluster = bdf$schoolNR, type = "CR2")) /
diag(vcov(lm_full))
```

```
## (Intercept) aritPRET langPRET sex ses
## 1.5296837 1.5493134 0.6938735 1.4567650 2.0053186
```

So we need a way to pool results based on the cluster-robust variance estimators, while also accounting for the relatively small number of clusters in this dataset.

### Barnard & Rubin (1999)

Barnard and Rubin (1999) proposed a small-sample correction for tests and confidence intervals based on multiple imputed datasets that seems to work in this context. Rather than using large-sample normal approximations for inference, they derive an approximate degrees-of-freedom that combines uncertainty in the standard errors calculated from each imputed dataset with between-imputation uncertainty. The method is as follows.

Suppose that we have \(m\) imputed datasets. Let \(\hat\beta_{(j)}\) be the estimated regression coefficient from imputed dataset \(j\), with (in this case cluster-robust) sampling variance estimate \(V_{(j)}\). Further, let \(\eta_{(j)}\) be the degrees of freedom corresponding to \(V_{(j)}\). To combine these estimates, calculate the averages across multiply imputed datasets:

\[ \bar\beta = \frac{1}{m}\sum_{j=1}^m \hat\beta_{(j)}, \qquad \bar{V} = \frac{1}{m}\sum_{j=1}^m V_{(j)}, \qquad \bar\eta = \frac{1}{m}\sum_{j=1}^m \eta_{(j)}. \]

Also calculate the between-imputation variance

\[ B = \frac{1}{m - 1} \sum_{j=1}^m \left(\hat\beta_{(j)} - \bar\beta\right)^2 \]

And then combine the between- and within- variance estimates using Rubin’s rules:

\[ V_{total} = \bar{V} + \frac{m + 1}{m} B. \]

The degrees of freedom associated with \(V_{total}\) modify the estimated complete-data degrees of freedom \(\bar\eta\) using quantities that depend on the fraction of missing information in a coefficient. The fraction of missing information is given by

\[ \hat\gamma_m = \frac{(m+1)B}{m V_{total}} \]

The degrees of freedom are then given by

\[ \nu_{total} = \left(\frac{1}{\nu_m} + \frac{1}{\nu_{obs}}\right)^{-1}, \]

where

\[ \nu_m = \frac{(m - 1)}{\hat\gamma_m^2}, \quad \text{and} \quad \nu_{obs} = \frac{\bar\eta (\bar\eta + 1) (1 - \hat\gamma)}{\bar\eta + 3}. \]

Hypothesis tests and confidence intervals are based on the approximation

\[ \frac{\bar\beta - \beta_0}{\sqrt{V_{total}}} \ \stackrel{\cdot}{\sim} \ t(\nu_{total}) \]

### Implementation

Here is how to carry out these calculations using the results of `clubSandwich::coef_test`

and a bit of `dplyr`

:

```
# fit results with clubSandwich standard errors
models_robust <- with(data = Impute_bdf,
lm(aritPOST ~ aritPRET + langPRET + sex + ses) %>%
coef_test(cluster=bdf$schoolNR, vcov="CR2")
)
# pool results with clubSandwich standard errors
robust_pooled <-
models_robust$analyses %>%
# add coefficient names as a column
lapply(function(x) {
x$coef <- row.names(x)
x
}) %>%
bind_rows() %>%
as.data.frame() %>%
# summarize by coefficient
group_by(coef) %>%
summarise(
m = n(),
B = var(beta),
beta_bar = mean(beta),
V_bar = mean(SE^2),
eta_bar = mean(df)
) %>%
mutate(
# calculate intermediate quantities to get df
V_total = V_bar + B * (m + 1) / m,
gamma = ((m + 1) / m) * B / V_total,
df_m = (m - 1) / gamma^2,
df_obs = eta_bar * (eta_bar + 1) * (1 - gamma) / (eta_bar + 3),
df = 1 / (1 / df_m + 1 / df_obs),
# calculate summary quantities for output
se = sqrt(V_total),
t = beta_bar / se,
p_val = 2 * pt(abs(t), df = df, lower.tail = FALSE),
crit = qt(0.975, df = df),
lo95 = beta_bar - se * crit,
hi95 = beta_bar + se * crit
)
robust_pooled %>%
select(coef, est = beta_bar, se, t, df, p_val, lo95, hi95, gamma) %>%
mutate_at(vars(est:gamma), round, 3)
```

```
## # A tibble: 5 x 9
## coef est se t df p_val lo95 hi95 gamma
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.72 1.42 - 1.21 19.6 0.241 -4.69 1.25 0.0470
## 2 aritPRET 0.975 0.0930 10.5 18.5 0 0.779 1.17 0.104
## 3 langPRET 0.259 0.0360 7.29 17.8 0 0.184 0.333 0.151
## 4 ses 0.0400 0.0320 1.24 16.8 0.233 -0.0280 0.108 0.0920
## 5 sex -1.17 0.469 - 2.50 18.9 0.0220 -2.15 -0.191 0.113
```

It is instructive to compare the calculated `df`

to `eta_bar`

and `df_m`

:

```
robust_pooled %>%
select(coef, df, df_m, eta_bar) %>%
mutate_at(vars(df, df_m, eta_bar), round, 1)
```

```
## # A tibble: 5 x 4
## coef df df_m eta_bar
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 19.6 4138 22.4
## 2 aritPRET 18.5 827 22.9
## 3 langPRET 17.8 393 23.8
## 4 ses 16.8 1069 20.5
## 5 sex 18.9 701 23.7
```

Here, `eta_bar`

is the average of the complete data degrees of freedom, and it can be seen that the total degrees of freedom are somewhat less than the average complete-data degrees of freedom. This is by construction. Further `df_m`

is the conventional degrees of freedom used in multiple-imputation, which assume that the complete-data estimates are normally distributed, and in this example they are way far off.

### Further thoughts

How well does this method perform in practice? I’m not entirely sure—I’m just trusting that Barnard and Rubin’s approximation is sound and would work in this setting (I mean, they’re smart people!). Are there other, better approaches? Totally possible. I have done zero literature review beyond the Barnard and Rubin paper. In any case, exploring the performance of this method (and any other alternatives) seems like it would make for a very nice student project.

There’s also the issue of how to do tests of multi-dimensional constraints (i.e., F-tests). The `clubSandwich`

package implements Wald-type tests for multi-dimensional constraints, using a small-sample correction that we developed (Tipton & Pustejovsky, 2015; Pustejovsky & Tipton, 2016). But it would take some further thought to figure out how to handle multiply imputed data with this type of test…