Corrigendum to Pustejovsky and Tipton (2018)

Theorem 2 is incorrect as stated

\[ \def\Pr{{\text{Pr}}} \def\E{{\text{E}}} \def\Var{{\text{Var}}} \def\Cov{{\text{Cov}}} \def\bm{\mathbf} \def\bs{\boldsymbol} \] In my 2018 paper with Beth Tipton, published in the Journal of Business and Economic Statistics, we considered how to do cluster-robust variance estimation in fixed effects models estimated by weighted (or unweighted) least squares. A careful reader, Dr. Michael Pfaffermayr, recently alerted us to a problem with Theorem 2 in the paper, which concerns a computational short cut for a certain cluster-robust variance estimator in models with cluster-specific fixed effects. The theorem is incorrect as stated, and we are currently working on issuing a correction for the published version of the paper. In the interim, this post details the problem with Theorem 2. I’ll first review the CR2 variance estimator, then describe the assertion of the theorem, and then provide a numerical counter-example demonstrating that the assertion is not correct as stated.

A fixed effects model

For data that can be grouped into \(m\) clusters of observations, we considered the model \[ \bm{y}_i = \bm{R}_i \bs\beta + \bm{S}_i \bs\gamma + \bm{T}_i \bs\mu + \bs\epsilon_i, \tag{1} \] where \(\bm{y}_i\) is an \(n_i \times 1\) vector of responses for cluster \(i\), \(\bm{R}_i\) is an \(n_i \times r\) matrix of focal predictors, \(\bm{S}_i\) is an \(n_i \times s\) matrix of additional covariates that vary across multiple clusters, and \(\bm{T}_i\) is an \(n_i \times t\) matrix encoding cluster-specific fixed effects, all for \(i = 1,...,m\). The cluster-specific fixed effects satisfy \(\bm{T}_h \bm{T}_i' = \bm{0}\) for \(h \neq i\). Interest centers on inference for the coefficients on the focal predictors \(\bs\beta\).

We considered estimation of Model (1) by weighted least squares (WLS), possibly under a working model for the distribution of \(\bs\epsilon_i\). Let \(\bm{W}_1,...,\bm{W}_m\) be a set of symmetric weight matrices used for WLS estimation. Sometimes, these weight matrices may be diagonal, consisting of sampling weights for each observation. Other times, the weight matrices may involve off-diagonal terms as well. Consider a working model \(\Var\left(\bs\epsilon_i | \bm{R}_i, \bm{S}_i, \bm{T}_i\right) = \sigma^2 \bs\Phi_i\) where \(\bs\Phi_i\) is a symmetric \(n_i \times n_i\) matrix that may be a function of a low-dimensional, estimable parameter. Based on this working model, the weight matrices might be taken as \(\bm{W}_i = \bs{\hat\Phi}_i^{-1}\), where \(\bs{\hat\Phi}_i\) is an estimate of \(\bs\Phi_i\).
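
As a concrete illustration (not from the paper), suppose the working model is compound symmetric with a common intra-cluster correlation. A minimal sketch of forming the corresponding weight matrices might look like the following, where the function name and the value of rho are purely illustrative:

make_weights <- function(cluster_sizes, rho) {             # rho stands in for an estimate
  lapply(cluster_sizes, function(n) {
    Phi_i <- (1 - rho) * diag(n) + rho * matrix(1, n, n)   # compound-symmetry block
    solve(Phi_i)                                           # W_i = Phi_i^{-1}
  })
}
W_list <- make_weights(cluster_sizes = c(3, 5, 4), rho = 0.2)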

The CR2 variance estimator

In the paper, we provide a generalization of the bias-reduced linearization estimator introduced by McCaffrey et al. (2001) and Bell & McCaffrey (2002) that can be applied to Model (1). The variance estimator is effectively a generalization of the HC2 correction for heteroskedasticity-robust standard errors, extended to models with within-cluster dependence and cluster-specific fixed effects, and so we refer to it as the “CR2” estimator.

In order to define the CR2 variance estimator and explain the issue with Theorem 2, I’ll need to lay down a bit more notation. Let \(N = \sum_{i=1}^m n_i\) be the total sample size. Let \(\bm{U}_i = \left[ \bm{R}_i \ \bm{S}_i \right]\) be the set of predictors that vary across clusters and \(\bm{X}_i = \left[ \bm{R}_i \ \bm{S}_i \ \bm{T}_i \right]\) be the full set of predictors. Let \(\bm{R}\), \(\bm{S}\), \(\bm{T}\), \(\bm{U}\), and \(\bm{X}\) denote the stacked versions of the cluster-specific matrices (i.e., \(\bm{R} = \left[\bm{R}_1' \ \bm{R}_2' \ \cdots \ \bm{R}_m'\right]'\), etc.). Let \(\bm{W} = \bigoplus_{i=1}^m \bm{W}_i\) and \(\bs\Phi = \bigoplus_{i=1}^m \bs\Phi_i\). For a generic matrix \(\bm{Z}\), let \(\bm{M}_{Z} = \left(\bm{Z}'\bm{W}\bm{Z}\right)^{-1}\) and \(\bm{H}_{\bm{Z}} = \bm{Z} \bm{M}_{\bm{Z}}\bm{Z}'\bm{W}\). Let \(\bm{C}_i\) be the \(n_i \times N\) matrix that selects the rows of cluster \(i\) from the full set of observations, such that \(\bm{X}_i = \bm{C}_i \bm{X}\). These operators provide an easy way to define absorbed versions of the predictors. Specifically, let \(\bm{\ddot{S}} = \left(\bm{I} - \bm{H}_{\bm{T}}\right) \bm{S}\) be the covariates after absorbing (i.e., partialling out) the cluster-specific effects, let \(\bm{\ddot{U}} = \left(\bm{I} - \bm{H}_{\bm{T}}\right) \bm{U}\) be an absorbed version of the focal predictors and the covariates, and let \(\bm{\ddot{R}} = \left(\bm{I} - \bm{H}_{\bm{\ddot{S}}}\right)\left(\bm{I} - \bm{H}_{\bm{T}}\right) \bm{R}\) be the focal predictors after absorbing the covariates and the cluster-specific fixed effects.
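
To make the absorption operators concrete, here is a minimal sketch using identity weights and a small toy design; the object names (Tmat, S_mat, R_mat) and the helper H() are my own, for illustration only, and are not part of the paper.

set.seed(1)
id_toy <- rep(1:3, times = c(3, 4, 5))
Tmat  <- model.matrix(~ 0 + factor(id_toy))           # cluster-specific fixed effect dummies
S_mat <- matrix(rnorm(length(id_toy) * 2), ncol = 2)  # covariates that vary across clusters
R_mat <- matrix(rnorm(length(id_toy)), ncol = 1)      # focal predictor

# H(Z) computes the (weighted) hat matrix H_Z; W defaults to the identity
H <- function(Z, W = diag(nrow(Z))) Z %*% solve(t(Z) %*% W %*% Z, t(Z) %*% W)

S_dd <- (diag(nrow(Tmat)) - H(Tmat)) %*% S_mat                 # absorbed covariates
U_dd <- (diag(nrow(Tmat)) - H(Tmat)) %*% cbind(R_mat, S_mat)   # absorbed [R S]
R_dd <- (diag(nrow(S_dd)) - H(S_dd)) %*%
  (diag(nrow(Tmat)) - H(Tmat)) %*% R_mat                       # absorbed focal predictor

With identity weights, absorbing the cluster dummies amounts to within-cluster demeaning, which is how the numerical example below constructs the absorbed predictor.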

With this notation established, the CR2 variance estimator has the form \[ \bm{V}^{CR2} = \bm{M}_{\bm{\ddot{R}}} \left(\sum_{i=1}^m \bm{\ddot{R}}_i' \bm{W}_i \bm{A}_i \bm{e}_i \bm{e}_i' \bm{A}_i \bm{W}_i \bm{\ddot{R}}_i \right) \bm{M}_{\bm{\ddot{R}}}, \] where \(\bm{\ddot{R}}_i = \bm{C}_i \bm{\ddot{R}}\) is the cluster-specific matrix of absorbed focal predictors, \(\bm{e}_i\) is the vector of weighted least squares residuals from cluster \(i\), and \(\bm{A}_1,...,\bm{A}_m\) are a set of adjustment matrices that correct the bias of the residual cross-products. The adjustment matrices are calculated as follows. Let \(\bm{D}_i\) be the upper-right Cholesky factorization of \(\bm{\Phi}_i\) and define the matrices \[ \bm{B}_i = \bm{D}_i \bm{C}_i \left(\bm{I} - \bm{H}_{\bm{X}}\right) \bs\Phi \left(\bm{I} - \bm{H}_{\bm{X}}\right)'\bm{C}_i' \bm{D}_i' \tag{2} \] for \(i = 1,...,m\). The adjustment matrices are then calculated as \[ \bm{A}_i = \bm{D}_i' \bm{B}_i^{+1/2} \bm{D}_i, \tag{3} \] where \(\bm{B}_i^{+1/2}\) is the symmetric square root of the Moore-Penrose inverse of \(\bm{B}_i\). Theorem 1 in the paper shows that, if the working model \(\bs\Phi\) is correctly specified and some conditions on the rank of \(\bm{U}\) are satisfied, then the CR2 estimator is exactly unbiased for the sampling variance of the weighted least squares estimator of \(\bs\beta\). Across multiple simulation studies, it’s been observed that the CR2 estimator also works well and outperforms alternative sandwich estimators even when the working model is not correctly specified.
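
For concreteness, here is a hedged sketch of how the adjustment matrices in Equations (2) and (3) might be computed for a general working model, assuming the weights are taken as the inverse of the working model, \(\bm{W}_i = \bs\Phi_i^{-1}\). The function names and the use of Matrix::bdiag are my own choices for illustration, not code from the paper.

library(Matrix)

# Symmetric matrix power via eigen-decomposition; eigenvalues near zero are
# set to exactly zero, giving a Moore-Penrose power
sym_power <- function(x, p, tol = 1e-12) {
  eig <- eigen((x + t(x)) / 2, symmetric = TRUE)
  vals <- ifelse(eig$values > tol, eig$values^p, 0)
  eig$vectors %*% (vals * t(eig$vectors))
}

CR2_adjustment <- function(X, Phi_list, id) {
  Phi <- as.matrix(bdiag(Phi_list))                      # block-diagonal working model
  W <- as.matrix(bdiag(lapply(Phi_list, solve)))         # weights W = Phi^{-1}
  M_X <- solve(t(X) %*% W %*% X)
  IH <- diag(nrow(X)) - X %*% M_X %*% t(X) %*% W         # I - H_X
  Mid <- IH %*% Phi %*% t(IH)                            # (I - H_X) Phi (I - H_X)'
  rows <- split(seq_len(nrow(X)), id)
  mapply(function(r, Phi_i) {
    D_i <- chol(Phi_i)                                   # upper-triangular factor, t(D_i) %*% D_i = Phi_i
    B_i <- D_i %*% Mid[r, r, drop = FALSE] %*% t(D_i)    # Equation (2)
    t(D_i) %*% sym_power(B_i, -1/2) %*% D_i              # Equation (3)
  }, r = rows, Phi_i = Phi_list, SIMPLIFY = FALSE)
}

With \(\bs\Phi_i = \bm{I}_i\), this reduces to the simplified formulas used in the numerical example below.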

Theorem 2

The adjustment matrices given in (3) can be expensive to compute directly because the \(\bm{B}_i\) matrices involve computing a “residualized” version of the \(N \times N\) matrix \(\bs\Phi\) involving the full set of predictors \(\bm{X}\), including the cluster-specific fixed effects \(\bm{T}_1,...,\bm{T}_m\). Theorem 2 considered whether one can take a computational short cut by omitting the cluster-specific fixed effects from the calculation of the \(\bm{B}_i\) matrices. Specifically, define the modified matrices \[ \bm{\tilde{B}}_i = \bm{D}_i \bm{C}_i \left(\bm{I} - \bm{H}_{\bm{\ddot{U}}}\right) \bs\Phi \left(\bm{I} - \bm{H}_{\bm{\ddot{U}}}\right)'\bm{C}_i' \bm{D}_i' \tag{4} \] and \[ \bm{\tilde{A}}_i = \bm{D}_i' \bm{\tilde{B}}_i^{+1/2} \bm{D}_i. \tag{5} \] Theorem 2 claims that if the weight matrices are the inverse of the working model, such that \(\bm{W}_i = \bs\Phi_i^{-1}\) for \(i = 1,...,m\), then \(\bm{\tilde{B}}_i^{+1/2} = \bm{B}_i^{+1/2}\) and hence \(\bm{\tilde{A}}_i = \bm{A}_i\). The implication is that the cluster-specific fixed effects can be ignored when calculating the adjustment matrices. However, the claimed equivalence does not actually hold.

Here is a simple numerical example that contradicts the assertion of Theorem 2. I first create a predictor matrix consisting of 4 clusters, a single focal predictor, and cluster-specific fixed effects.

set.seed(20220926)
m <- 4                                             # number of clusters
ni <- 2 + rpois(m, 3.5)                            # cluster sizes
N <- sum(ni)                                       # total sample size
id <- factor(rep(LETTERS[1:m], ni))                # cluster ID
R <- rnorm(N)                                      # focal predictor
dat <- data.frame(R, id)                           # create raw data frame
X <- model.matrix(~ R + id + 0, data = dat)        # full predictor matrix
Ui <- tapply(R, id, \(x) x - mean(x))              # absorbed version of R
U <- unsplit(Ui, id)                               # stacked absorbed predictor

Consider a model estimated by ordinary least squares under a working model with homoskedastic, independent errors, so that \(\bs\Phi_i = \bm{I}_i\), an \(n_i \times n_i\) identity matrix (with no parameters to estimate). In this case, the adjustment matrices simplify considerably, to \[ \bm{A}_i = \left(\bm{I}_i - \bm{X}_i \bm{M}_{X} \bm{X}_i' \right)^{+1/2} \qquad \text{and} \qquad \bm{\tilde{A}}_i = \left(\bm{I}_i - \bm{\ddot{U}}_i \bm{M}_{\ddot{U}} \bm{\ddot{U}}_i' \right)^{+1/2}. \] I calculate these directly as follows:

# Symmetric matrix power via eigen-decomposition, treating eigenvalues near
# zero as exactly zero (i.e., a Moore-Penrose power)
matrix_power <- function(x, p) {
  eig <- eigen(x, symmetric = TRUE)
  val_p <- with(eig, ifelse(values > 10^-12, values^p, 0))
  with(eig, vectors %*% (val_p * t(vectors)))
}

# Adjustment matrices based on the full predictor matrix X
MX <- solve(crossprod(X))
B <- 
  by(X, id, as.matrix) |>
  lapply(\(x) diag(nrow(x)) - x %*% MX %*% t(x))
A <- lapply(B, matrix_power, p = -1/2)

# Adjustment matrices based on the absorbed predictor only
MU <- 1 / crossprod(U)
Btilde <- lapply(Ui, \(x) diag(length(x)) - x %*% MU %*% t(x))
Atilde <- lapply(Btilde, matrix_power, p = -1/2)

Here are the adjustment matrices based on the full predictor matrix \(\bm{X}\):

print(A, digits = 3)
## $A
##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,]  0.853 -0.198 -0.207 -0.191 -0.257
## [2,] -0.198  0.800 -0.200 -0.200 -0.202
## [3,] -0.207 -0.200  0.801 -0.201 -0.192
## [4,] -0.191 -0.200 -0.201  0.802 -0.210
## [5,] -0.257 -0.202 -0.192 -0.210  0.860
## 
## $B
##        [,1]   [,2]   [,3]
## [1,]  0.668 -0.338 -0.330
## [2,] -0.338  0.683 -0.345
## [3,] -0.330 -0.345  0.675
## 
## $C
##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
## [1,]  0.873 -0.206 -0.163 -0.105 -0.233 -0.166
## [2,] -0.206  0.873 -0.171 -0.229 -0.100 -0.167
## [3,] -0.163 -0.171  0.834 -0.160 -0.173 -0.167
## [4,] -0.105 -0.229 -0.160  0.931 -0.271 -0.166
## [5,] -0.233 -0.100 -0.173 -0.271  0.946 -0.168
## [6,] -0.166 -0.167 -0.167 -0.166 -0.168  0.833
## 
## $D
##        [,1]   [,2]   [,3]
## [1,]  0.797 -0.342 -0.455
## [2,] -0.342  0.667 -0.325
## [3,] -0.455 -0.325  0.780

Compare the above with the adjustment matrices based on the absorbed predictors only:

print(Atilde, digits = 3)
## $A
##          [,1]      [,2]     [,3]      [,4]     [,5]
## [1,]  1.05313  0.001860 -0.00742  0.008995 -0.05657
## [2,]  0.00186  1.000065 -0.00026  0.000315 -0.00198
## [3,] -0.00742 -0.000260  1.00104 -0.001257  0.00790
## [4,]  0.00900  0.000315 -0.00126  1.001523 -0.00958
## [5,] -0.05657 -0.001980  0.00790 -0.009576  1.06022
## 
## $B
##          [,1]     [,2]     [,3]
## [1,]  1.00139 -0.00478  0.00339
## [2,] -0.00478  1.01642 -0.01163
## [3,]  0.00339 -0.01163  1.00824
## 
## $C
##           [,1]      [,2]      [,3]      [,4]     [,5]      [,6]
## [1,]  1.039180 -0.039378  4.00e-03  0.061921 -0.06632  5.94e-04
## [2,] -0.039378  1.039577 -4.02e-03 -0.062234  0.06665 -5.97e-04
## [3,]  0.003999 -0.004019  1.00e+00  0.006320 -0.00677  6.07e-05
## [4,]  0.061921 -0.062234  6.32e-03  1.097861 -0.10481  9.39e-04
## [5,] -0.066317  0.066651 -6.77e-03 -0.104808  1.11225 -1.01e-03
## [6,]  0.000594 -0.000597  6.07e-05  0.000939 -0.00101  1.00e+00
## 
## $D
##          [,1]     [,2]    [,3]
## [1,]  1.13078 -0.00914 -0.1216
## [2,] -0.00914  1.00064  0.0085
## [3,] -0.12165  0.00850  1.1131

The matrices differ:

all.equal(A, Atilde)
## [1] "Component \"A\": Mean relative difference: 0.6073885"
## [2] "Component \"B\": Mean relative difference: 0.7403564"
## [3] "Component \"C\": Mean relative difference: 0.5671847"
## [4] "Component \"D\": Mean relative difference: 0.6682793"

Thus, Theorem 2 is incorrect as stated. (I have yet to identify the mis-step in the proof as given in the supplementary materials of the paper.)

Further thoughts

For this particular model specification, it is interesting to note that \(\bm{\tilde{A}}_i = \bm{A}_i + \bm{T}_i \bm{M}_{\bm{T}} \bm{T}_i'\). Because \(\bm{\ddot{U}}_i' \bm{T}_i = \bm{0}\), it follows that \[ \bm{\ddot{U}}_i' \bm{\tilde{A}}_i = \bm{\ddot{U}}_i' \left(\bm{A}_i + \bm{T}_i \bm{M}_{\bm{T}} \bm{T}_i' \right) = \bm{\ddot{U}}_i' \bm{A}_i. \] This holds in the numerical example:

UiAtilde <- mapply(\(u, a) t(u) %*% a, u = Ui, a = Atilde, SIMPLIFY = FALSE)
UiA <- mapply(\(u, a) t(u) %*% a, u = Ui, a = A, SIMPLIFY = FALSE)
all.equal(UiAtilde, UiA)
## [1] TRUE
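
The first identity can be checked as well. In this example the fixed effects are plain cluster dummies and the weights are the identity, so \(\bm{T}_i \bm{M}_{\bm{T}} \bm{T}_i'\) is simply the \(n_i \times n_i\) matrix with every entry equal to \(1 / n_i\). A quick sketch of the check (object names are mine):

offsets <- lapply(split(seq_len(N), id), \(r) matrix(1 / length(r), length(r), length(r)))
A_plus_offset <- mapply(\(a, o) a + o, a = A, o = offsets, SIMPLIFY = FALSE)
all.equal(A_plus_offset, Atilde)

This comparison should confirm the identity for this example.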

Thus, although the exact statement of Theorem 2 is incorrect, the substantive implication actually still holds. For this particular example, computing the CR2 variance estimator using the short-cut adjustment matrices \(\bm{\tilde{A}}_1,...,\bm{\tilde{A}}_m\) is equivalent to computing the CR2 variance estimator using the full model adjustment matrices \(\bm{A}_1,...,\bm{A}_m\). However, I have not yet been able to work out the general conditions under which this equivalence holds. It may require stricter conditions than those assumed in Theorem 2.
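
As a check on this equivalence for the running example, one can generate an outcome (the example above never defines one), fit the fixed effects model by OLS, and compute the CR2 estimator both ways. The sketch below is illustrative only: the simulated y is hypothetical, and because there are no additional covariates here, \(\bm{\ddot{R}} = \bm{\ddot{U}}\), so the bread matrix is just MU.

y <- rnorm(N)                                      # hypothetical outcome, for illustration only
dat$y <- y
fit <- lm(y ~ R + id + 0, data = dat)              # OLS with cluster fixed effects
e_i <- split(residuals(fit), id)                   # cluster-specific residuals

CR2_meat <- function(adjust) {
  terms <- mapply(\(u, a, e) tcrossprod(t(u) %*% a %*% e),
                  u = Ui, a = adjust, e = e_i, SIMPLIFY = FALSE)
  Reduce(`+`, terms)
}

V_full  <- MU %*% CR2_meat(A) %*% MU               # CR2 with full-model adjustment matrices
V_short <- MU %*% CR2_meat(Atilde) %*% MU          # CR2 with short-cut adjustment matrices
all.equal(V_full, V_short)

Because \(\bm{\ddot{U}}_i' \bm{\tilde{A}}_i = \bm{\ddot{U}}_i' \bm{A}_i\), the two estimates agree no matter what outcome is used, so the comparison should return TRUE.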

References

Bell, R. M., & McCaffrey, D. F. (2002). Bias reduction in standard errors for linear regression with multi-stage samples. Survey Methodology, 28(2), 169–181.
McCaffrey, D. F., Bell, R. M., & Botts, C. H. (2001). Generalizations of bias reduced linearization. Proceedings of the Annual Meeting of the American Statistical Association.