Small-sample adjustments for tests of moderators and model fit using robust variance estimation in meta-regression

March 6, 2015

Meta-analysis and meta-regression

When many intervention studies have been conducted on a single topic, we may want to pool their results:

Meta-analysis lets us pool results across studies to obtain estimates of overall efficacy.
Meta-regression lets us answer further questions about variation in efficacy.
For example, "Do the results vary in relation to…"
- Features of the participants in the experiment (e.g., children, teenagers)
- Dosage (e.g., weeks)
- Outcomes measured (e.g., total math, subscale scores, science)
- Study design (e.g., RCT, quasi-experiment)

Dependent effect sizes

In meta-analysis, studies often report multiple effect sizes
- Outcomes from multiple tests on the same participants (e.g., math, reading)
- Multiple measures of performance on the same participants (e.g., accuracy, response time)
- Outcomes at multiple time points (e.g., 1-week, 1-month, 1-year)
- Outcomes from multiple experiments (with different participants, but in the same lab)

Model-based meta-analysis has provided two methods for pooling:
- Univariate meta-analysis, where each study contributes a single effect size, or
- Multivariate meta-analysis, where the covariance structure of the multiple effect sizes is known.

Neither approach is ideal
- Univariate meta-analysis results in a loss of information
- Multivariate meta-analysis requires information that is rarely reported in studies.

In this talk, we will focus on this second approach and its robust alternative.

Meta-regression model

If each study contributes multiple effect sizes, then the general meta-regression model can be written in vector form: \[\mathbf{T}_j = \mathbf{X}_j \beta + \epsilon_j\] for \(j = 1,...,m\) studies, where

\(\mathbf{T}_j\) is a vector of \(n_j\) effect size estimates from study \(j\)
\(\mathbf{X}_j\) is an \(n_j \times p\) matrix of covariates for study \(j\)
\(\beta\) is a vector of \(p\) meta-regression coefficients
\(\epsilon_j\) is a vector of residual errors for study \(j\) with covariance matrix \(\Sigma_j\)

Given a set of weights, we can estimate \(\beta\) using weighted least squares: \[\mathbf{b} = \mathbf{M} \sum_{j=1}^m \mathbf{X}_j' \mathbf{W}_j \mathbf{T}_j, \qquad \text{where} \qquad \mathbf{M} = \left(\sum_{j=1}^m \mathbf{X}_j' \mathbf{W}_j \mathbf{X}_j \right)^{-1}\]
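As a minimal sketch of this estimator, consider the special case of an intercept-only model (\(p = 1\), every \(\mathbf{X}_j\) a column of ones) with one scalar weight per study, \(\mathbf{W}_j = w_j \mathbf{I}\). The data, study labels, and function name below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical toy data (not from the talk): effect sizes grouped by study.
effects = {"A": [0.30, 0.50], "B": [0.10], "C": [0.40, 0.20, 0.60]}
# Assumed working weights: one scalar per study, so W_j = w_j * I.
weights = {"A": 2.0, "B": 4.0, "C": 1.0}


def wls_intercept(effects, weights):
    """Weighted least squares for an intercept-only meta-regression.

    With X_j a column of ones and W_j = w_j * I:
      X_j' W_j X_j = w_j * n_j   and   X_j' W_j T_j = w_j * sum(T_j).
    Returns (b, M), where M = (sum_j X_j' W_j X_j)^{-1}.
    """
    xwx = sum(weights[j] * len(t) for j, t in effects.items())
    xwt = sum(weights[j] * sum(t) for j, t in effects.items())
    M = 1.0 / xwx
    return M * xwt, M


b, M = wls_intercept(effects, weights)
print(f"b = {b:.4f}, M = {M:.4f}")
```

With covariates, the same two sums become \(p \times p\) and \(p \times 1\) matrices, and the division becomes a matrix inverse.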

Model-based meta-regression

Estimating the standard error of \(\mathbf{b}\) is more difficult.

If we assume that the weights are inverse variance, with \(\mathbf{W}_j = \Sigma_j^{-1}\), then \(\text{Var}\left(\mathbf{b}\right) = \mathbf{M}\).
This is the multivariate meta-analysis approach, which is "model based."
- It requires correct specification of the covariance matrices \(\Sigma_j\) and the associated weights \(\mathbf{W}_j\).
If the true structure of the errors is unknown or mis-specified, then \(\text{Var}\left(\mathbf{b}\right)\) is wrong.

Robust variance estimation

Robust variance estimation (RVE; Hedges, Tipton, & Johnson, 2010) produces asymptotically valid estimates of the variance of \(\mathbf{b}\), even if the error structure is mis-specified.

RVE uses a "sandwich" estimator: \[\mathbf{V}^R = \mathbf{M} \left(\sum_{j=1}^m \mathbf{X}_j' \mathbf{W}_j \mathbf{e}_j \mathbf{e}_j' \mathbf{W}_j \mathbf{X}_j \right) \mathbf{M}\] where \(\mathbf{e}_j = \mathbf{T}_j - \mathbf{X}_j \mathbf{b}\).
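The sandwich formula can be sketched for the simplest special case, which is an assumption of this example: an intercept-only model with one scalar weight per study, so that \(\mathbf{X}_j' \mathbf{W}_j \mathbf{e}_j = w_j \sum_i e_{ij}\). The toy data and the function name are hypothetical.

```python
# Hypothetical toy data (not from the talk): effect sizes grouped by study.
effects = {"A": [0.30, 0.50], "B": [0.10], "C": [0.40, 0.20, 0.60]}
# Assumed working weights: one scalar per study, so W_j = w_j * I.
weights = {"A": 2.0, "B": 4.0, "C": 1.0}


def rve_intercept(effects, weights):
    """Robust (sandwich) variance of the intercept-only WLS estimate."""
    xwx = sum(weights[j] * len(t) for j, t in effects.items())
    M = 1.0 / xwx                                   # the "bread"
    b = M * sum(weights[j] * sum(t) for j, t in effects.items())
    # Study-level scores X_j' W_j e_j, with residuals e_j = T_j - X_j b.
    scores = {
        j: weights[j] * sum(t_ij - b for t_ij in t) for j, t in effects.items()
    }
    meat = sum(s ** 2 for s in scores.values())     # sum of outer products
    return b, M * meat * M, scores


b, v_r, scores = rve_intercept(effects, weights)
print(f"b = {b:.4f}, robust SE = {v_r ** 0.5:.4f}")
```

The residuals are summed within each study before squaring, so the study, not the individual effect size, is the independent unit. That is why no within-study covariance needs to be specified.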

Hypothesis testing

In large samples, we can use this variance estimator to construct hypothesis tests. For testing \(\beta_s = 0\), \[z = b_s / \sqrt{V^R_{ss}}\] follows a standard normal distribution if \(m\) is "big enough."

In smaller samples, Hedges et al. (2010) suggested that a t-distribution may be more appropriate, with \[t = b_s / \sqrt{V^R_{ss}\left(\frac{m}{m-p}\right)}\] compared to a t-distribution with \(m - p\) degrees of freedom.
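A small sketch of the two statistics side by side; the function name and input values are hypothetical.

```python
import math


def rve_t_stat(b_s, v_ss, m, p):
    """Return (z, t, df) for testing beta_s = 0 with a robust variance v_ss.

    z uses the large-sample variance directly; t inflates the variance by
    m / (m - p) and is referred to a t-distribution with m - p df.
    """
    z = b_s / math.sqrt(v_ss)
    t = b_s / math.sqrt(v_ss * m / (m - p))
    return z, t, m - p


# Hypothetical inputs: 20 studies, 4 coefficients.
z, t, df = rve_t_stat(b_s=0.25, v_ss=0.01, m=20, p=4)
print(f"z = {z:.3f}, t = {t:.3f} on {df} df")
```

The correction shrinks the statistic by the factor \(\sqrt{(m - p)/m}\), which matters only when \(p\) is not small relative to \(m\).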

Tests of multiple meta-regression coefficients

Some hypotheses involve more than one meta-regression coefficient
- Test equality of several levels of a moderator
- Test of overall model fit
We consider linear hypotheses of the form \[\mathbf{C} \beta = \mathbf{c}\] for \(q \times p\) contrast matrix \(\mathbf{C}\) and \(q \times 1\) vector \(\mathbf{c}\).
We can construct a Wald test statistic: \[Q = \left(\mathbf{C}\mathbf{b} - \mathbf{c}\right)' \left(\mathbf{C} \mathbf{V}^R \mathbf{C}'\right)^{-1} \left(\mathbf{C}\mathbf{b} - \mathbf{c}\right)\]
In large samples, we would expect \(Q\) to follow a chi-squared distribution with \(q\) degrees of freedom.
In smaller samples, an F-test might be better, with
\[Q / q \quad \dot{\sim} \quad F(q, m - p)\] But how does this test perform?
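To make the Wald statistic concrete, here is a minimal standard-library sketch. The coefficient vector, robust covariance matrix, and number of studies are hypothetical values, and the helper names (`matmul`, `solve`, `wald_q`) are introduced only for this illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve(S, d):
    """Solve S x = d by Gaussian elimination with partial pivoting."""
    n = len(d)
    aug = [list(S[i]) + [d[i]] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(aug[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (aug[k][n] - tail) / aug[k][k]
    return x


def wald_q(C, b, c, V):
    """Q = (Cb - c)' (C V C')^{-1} (Cb - c)."""
    d = [sum(ci * bi for ci, bi in zip(row, b)) - c_k for row, c_k in zip(C, c)]
    S = matmul(matmul(C, V), [list(col) for col in zip(*C)])
    return sum(d_k * x_k for d_k, x_k in zip(d, solve(S, d)))


# Hypothetical fit: intercept plus two dummy-coded levels of a moderator.
b = [0.20, 0.30, -0.10]
V = [[0.01, 0.00, 0.00],
     [0.00, 0.04, 0.01],
     [0.00, 0.01, 0.02]]
C = [[0, 1, 0],      # beta_2 = 0
     [0, 0, 1]]      # beta_3 = 0
c = [0.0, 0.0]
m, p, q = 20, 3, len(C)

Q = wald_q(C, b, c, V)
naive_f, naive_df = Q / q, (q, m - p)
print(f"Q = {Q:.3f}, naive F = {naive_f:.3f} on {naive_df} df")
```

Each row of \(\mathbf{C}\) encodes one constraint, so testing a moderator with three levels uses \(q = 2\) rows that pick out its two dummy coefficients.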

In many research syntheses, it will be of interest to test hypotheses that involve more than one meta-regression coefficient.
For example, the meta-regression might include a categorical moderator with three or more levels, such as whether the study design was a randomized experiment, a strong quasi-experiment, or a weak quasi-experiment. One might want to test whether this categorical moderator explains variation in the effect sizes, which can be done by testing whether a set of meta-regression coefficients are equal to zero.
One might also want to test whether the overall meta-regression specification has any predictive power–a so-called omnibus test for the model.
Formally, such hypotheses can be expressed as the hypothesis that the \(q \times p\) matrix big \(\mathbf{C}\) times the vector of coefficients is equal to the vector of constants little \(\mathbf{c}\).
A standard way to test hypotheses of this form is with the Wald statistic \(Q\).
In large samples, \(Q\) follows a chi-squared distribution with degrees of freedom equal to little \(q\), the number of constraints in the hypothesis.
Following the logic of Hedges' correction for the t-statistic, we might expect that in small samples, an F test would be better. Specifically, we would compare \(Q / q\) to an F distribution with \(q\) and \(m - p\) degrees of freedom. But this is an ad hoc correction.
To find out whether this is a reasonable thing to do, we put together a rather large simulation study. Rather than explaining all the nitty-gritty of how we designed it, I will ask you to take my word that it was large (as in tens of thousands of hours of computing time) and that it investigated a variety of covariate types, combinations of covariate constraints, sample sizes, degrees of imbalance, and degrees of model mis-specification.

Simulated type-I error rate of F-test

Small-sample corrections

The originally proposed t-tests have inflated Type-I error with fewer than 40 studies (Hedges et al., 2010; Tipton, 2013, 2014; Williams, 2012).

Tipton (in press) devised small-sample corrections for t-tests. These corrections involve two parts:
- Adjustments to the variance estimator \(V^R\)
- Estimated degrees of freedom for the t-distribution

The focus of this paper is on developing similar small-sample methods for F-tests.

Corrections to the RVE covariance matrix

The corrections to the RVE estimator are based on McCaffrey, Bell, & Botts' (2001) "bias-reduced linearization" approach, which uses a working model for the error structure: \[\mathbf{V}^R = \mathbf{M} \left(\sum_{j=1}^m \mathbf{X}_j' \mathbf{W}_j \mathbf{A}_j \mathbf{e}_j \mathbf{e}_j' \mathbf{A}_j' \mathbf{W}_j \mathbf{X}_j \right) \mathbf{M}\] where the adjustment matrices \(\mathbf{A}_1,...,\mathbf{A}_m\) are chosen so that \(\text{E}\left(\mathbf{V}^R\right) = \mathbf{M}\) when the working model is correct.

Simulation results (for both the t-test and F-test) indicate that the correction helps even if the working model is incorrect.
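To see what the adjustment matrices do, here is a sketch for the simplest special case, which is an assumption of this example rather than the general setting of the talk: an intercept-only model with one effect size per study and inverse-variance weights \(w_j = 1/v_j\). In that case each \(\mathbf{A}_j\) reduces to the scalar \((1 - h_j)^{-1/2}\), where \(h_j = w_j / \sum_k w_k\) is the leverage of study \(j\). The sampling variances are hypothetical, and the expectation is computed analytically rather than by simulation.

```python
# Hypothetical sampling variances for five single-effect-size studies.
variances = [0.04, 0.02, 0.10, 0.05, 0.08]
weights = [1.0 / v for v in variances]   # inverse-variance weights
M = 1.0 / sum(weights)                   # model-based Var(b) under the working model


def expected_vr(adjust):
    """Expected value of the sandwich estimator under the working model.

    For the intercept-only case, Var(e_j) = v_j * (1 - h_j), so
    E(V^R) = M^2 * sum_j w_j^2 * A_j^2 * v_j * (1 - h_j).
    """
    total = 0.0
    for w, v in zip(weights, variances):
        h = w * M                                   # leverage of study j
        a = (1.0 - h) ** -0.5 if adjust else 1.0    # scalar adjustment A_j
        total += w ** 2 * a ** 2 * v * (1.0 - h)
    return M ** 2 * total


unadjusted = expected_vr(adjust=False)
adjusted = expected_vr(adjust=True)
print(f"M = {M:.5f}, unadjusted = {unadjusted:.5f}, adjusted = {adjusted:.5f}")
```

Without the adjustment the estimator is biased downward, because residuals are less variable than the true errors. The adjustment rescales each residual so that the bias disappears when the working model holds.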

Potential corrections for F-tests

The small-sample t-test developed by Tipton (in press) also adjusted the degrees of freedom.
- These were estimated using a Satterthwaite approximation.
- These degrees of freedom vary in relation to the sample size \(m\), the number of parameters \(p\) and features of the covariate.
By extension, we will look for a degrees-of-freedom correction for the F-test.
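The Satterthwaite idea can be sketched in a few lines: approximate a variance estimator \(V\) by a scaled chi-squared variable \(a\,\chi^2_\nu\) and choose \(\nu\) by matching the first two moments, which gives \(\nu = 2\,\text{E}(V)^2 / \text{Var}(V)\). The function name and numbers are illustrative. The slides do not spell out how the mean and variance of \(V^R_{ss}\) are obtained, so they are treated as inputs here.

```python
def satterthwaite_df(mean_v, var_v):
    """Degrees of freedom nu such that a * chi2(nu) matches the mean and variance."""
    return 2.0 * mean_v ** 2 / var_v


# Sanity check with a variable that is exactly a scaled chi-squared:
# if V = a * chi2(nu), then E(V) = a * nu and Var(V) = 2 * a^2 * nu.
a, nu = 0.003, 12.0
df = satterthwaite_df(mean_v=a * nu, var_v=2.0 * a ** 2 * nu)
print(f"recovered df = {df:.1f}")
```

The more variable the variance estimator is relative to its mean, the fewer degrees of freedom result, which is how features of the covariates enter the correction.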
Drawing on extant literature, we investigated a wide variety of possible corrections.
Eigenvalue decompositions
- Fai-Cornelius (1996): mixed models
- Cai-Hayes (2008): heteroskedasticity robust standard errors
Hotelling's T-squared approximation
- Zhang (2012, 2013): heteroskedastic ANOVA/MANOVA
- Pan-Wall (2002): generalized estimating equations

The Winner: \(T^2_Z\)

The paper provides results for five different corrections. Here, however, we'll focus on only the one that works best.
The \(T^2_Z\) approach involves:
- Finding the mean and variance of the robust covariance matrix (under a working model)
- Approximating the distribution of the robust covariance matrix using a Wishart distribution
- Matching the mean and total variance of the robust covariance matrix to estimate the Wishart degrees of freedom
- Testing \(Q/q\) against Hotelling's \(T^2\) distribution
The \(T^2_Z\) test is best in two regards:
- It is (almost) always level-alpha
- It is more powerful than any of the other level-alpha tests (i.e., its error rates are always closer to nominal)
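The final testing step can be sketched using the standard relation between Hotelling's \(T^2\) and the F distribution: a statistic that follows a \(T^2\) distribution with dimension \(q\) and \(\eta\) degrees of freedom becomes an F statistic with \(q\) and \(\eta - q + 1\) degrees of freedom after multiplying by \((\eta - q + 1)/(\eta q)\). The sketch assumes that the Wishart degrees of freedom \(\eta\) have already been estimated. The slides do not give that formula, so it is an input here, and the function name and numbers are hypothetical.

```python
def hotelling_to_f(Q, q, eta):
    """Rescale a Wald statistic Q for reference to an F distribution.

    Assumes Q is approximately Hotelling's T^2 with dimension q and eta
    degrees of freedom. Returns (F, numerator df, denominator df).
    """
    denom_df = eta - q + 1.0
    if denom_df <= 0:
        raise ValueError("eta must exceed q - 1 for the F reference distribution")
    return denom_df / (eta * q) * Q, q, denom_df


# Hypothetical values: Wald statistic 4.0 with 2 constraints and eta = 10.
F, df1, df2 = hotelling_to_f(Q=4.0, q=2, eta=10.0)
print(f"F = {F:.3f} on ({df1}, {df2:.0f}) df")

# With a very large eta the rescaled statistic approaches Q / q.
F_large, _, _ = hotelling_to_f(Q=4.0, q=2, eta=1e9)
```

As \(\eta\) grows, the rescaled statistic approaches \(Q/q\), so the correction fades away in large samples.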

Simulation results: \(T^2_Z\)

Here is some evidence about the performance of the T-squared Z test.

The figure displays the range of actual type-I error rates for T-squared Z, across a variety of sample sizes and hypotheses with between 2 and 5 constraints.
The nominal rate is depicted by the solid line, and the dashed line above it is the upper bound on the Monte Carlo error in the simulation.
We see that the error rate is nearly always at or below the nominal level. We looked at the few cases where it is not, and these appear to arise when the working model is drastically mis-specified. Some of these cases may also simply reflect Monte Carlo error.
The other thing we note is that the test approaches the nominal error rate from below: it becomes conservative in small samples, particularly for hypotheses with more constraints. That may seem like a disadvantage (we'd certainly prefer to have exactly nominal error all the time), but it is preferable to the alternative of a liberal test. It's a bit like having Mr. Gadget's self-destructing hypothesis test: if your sample is so small that the test won't provide close-to-nominal error, it will automatically self-destruct rather than risk providing a mistaken conclusion.

Example: Wilson et al. (2011)

Wilson, Lipsey, Tanner-Smith, Huang, & Steinka-Fry (2011) synthesis of effects of dropout prevention/intervention programs.
- Primary outcomes: school completion, school dropout
\(m = 152\) studies, containing 385 effect size estimates
- Some studies included effect sizes for multiple outcomes, measured on the same sample
- Some studies included effect sizes from multiple samples
Meta-regression model including several categorical moderators
- Study design: 3 levels (non-experimental, matched groups, randomized experiment)
- Outcome measure: 4 levels (school enrollment, dropout, graduation, graduation or GED)
- Evaluator independence: 4 levels (involved in delivery, involved in planning, indirect involvement, independent)
- Implementation quality: 3 levels (clear problems, possible problems, no apparent problems)
- Program format: 4 levels (community-based, classroom-based, school-based, multiple formats)

To demonstrate how these methods work in practice, we applied them to a recent systematic review by Sandra Jo Wilson and her colleagues, which looked at the effects of dropout prevention programs on rates of school completion and school dropout. The study was published in the Campbell Collaboration library. We are grateful to Sandra Jo for sharing the underlying data, which was flat out the most well organized and meticulously maintained systematic review data that I have ever encountered.
This synthesis is quite large–152 separate studies–and included a total of 385 effect size estimates. Many studies reported effect sizes for multiple outcome measures (such as school completion and school dropout) based on the same sample of participants. Some studies also reported effect sizes for multiple samples within the same study, as would occur in a multi-site trial where effects are provided for each site.
The original analysis involved a meta-regression model that included several categorical moderators: the study design, the type of outcome measure on which the effect size was calculated, the degree to which the evaluator was independent of the program delivery team, the level of quality with which the program was delivered, and the format of the program.
It is of interest to test whether each of these moderators explains variation in the effect sizes.
I should also note that the original analysis used robust variance estimation, but did not actually report any tests of these moderators. That's because the software currently available for RVE doesn't provide functionality for F testing.

Wilson et al. (2011) test results

Moderator	q	Naive F	p-value (naive F)	T-squared Z	d.f. (T-squared Z)	p-value (T-squared Z)
Study design	2	0.23	0.796	0.22	43	0.800
Outcome measure	3	0.91	0.436	0.84	22	0.488
Evaluator independence	3	3.11	0.029	2.78	17	0.073
Implementation quality	2	14.15	<0.001	13.78	37	<0.001
Program format	3	3.85	0.011	3.65	38	0.021

Naive F test uses \(m - p = 130\) degrees of freedom.
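As a rough check on the table, the p-values can be recomputed from the reported statistics. This sketch assumes that the d.f. column gives the denominator degrees of freedom of the reference F distribution, with \(q\) as the numerator degrees of freedom. Because the statistics are rounded to two decimals, the recomputed values should agree with the table only approximately. The helper functions are a standard-library implementation of the F upper-tail probability via the regularized incomplete beta function; their names are chosen for this example.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),           # even step
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),  # odd step
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_sf(f_stat, df1, df2):
    """Upper-tail probability P(F > f_stat) for an F(df1, df2) distribution."""
    return _betainc(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f_stat))


NAIVE_DF = 130  # m - p, as stated under the table
# (moderator, q, naive F, T-squared Z statistic, T-squared Z d.f.) from the table
rows = [
    ("Study design", 2, 0.23, 0.22, 43),
    ("Outcome measure", 3, 0.91, 0.84, 22),
    ("Evaluator independence", 3, 3.11, 2.78, 17),
    ("Implementation quality", 2, 14.15, 13.78, 37),
    ("Program format", 3, 3.85, 3.65, 38),
]
for name, q, naive_f, t2z_f, t2z_df in rows:
    p_naive = f_sf(naive_f, q, NAIVE_DF)
    p_t2z = f_sf(t2z_f, q, t2z_df)
    print(f"{name:24s} naive p = {p_naive:.3f}   T-squared Z p = {p_t2z:.3f}")
```

Most of the change in p-values comes from the smaller denominator degrees of freedom rather than from the slightly smaller test statistics.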

Here are the results of testing each moderator using the naive F test and the T-squared Z test.

Looking first at the naive F test results, we see that the last three moderators (evaluator independence, implementation quality, and program format) are all statistically significant at conventional levels. Note that the F test uses 130 denominator degrees of freedom.
Now turning to the T-squared Z results, you'll see that the test statistic itself is a little bit smaller than the naive F, and the degrees of freedom are all much smaller than 130.
The p-values in the final column indicate that the final two moderators (implementation quality and program format) are still statistically significant, but that evaluator independence edges above the conventional cut-off, with a p-value of .073. So even in this very large meta-analysis, the small-sample corrections can make a difference.
Beyond differences in significance stars, it's also interesting to note that the five moderators have quite different degrees of freedom associated with them–ranging from as few as 17 to as many as 43 (though still not as many as 130). This is because the moderators have different degrees of imbalance and different distributions in this set of studies. In particular, evaluator independence was quite unevenly distributed across studies, which is why its degrees of freedom are the smallest of the five moderators.

Conclusions and future work

As Tipton (in press) found with the small-sample t-test…
- The performance of the large-sample test depends on features of the covariates, not just on the number of studies.
- Consequently, it is hard to know a priori what constitutes a "big enough" sample.
We therefore recommend that small-sample corrections should always be used in practice.
We provide prototype software in R (upon request), and are working on implementing it fully in the robumeta R package and Stata macro, and in the metafor R package (Viechtbauer, 2010).
Future work
- Investigate power of tests based on RVE versus model-based methods.
- Investigate other areas of application beyond meta-analysis, including
  - Hierarchical linear models
  - Econometric panel data models

Our general conclusions regarding F-tests for meta-regression run in close parallel to Beth's conclusions regarding small-sample corrections for t-tests.
First, we found that the performance of the large-sample tests (the chi-squared test or the naive F test) depends strongly on features of the covariate distribution, not just on sample size.
As a result, it is hard to know a priori what constitutes a "big enough" sample to trust the large-sample tests–there's no simple rule of the form, "you can use the naive F test if you have at least X studies."
Consequently, we recommend that the small-sample correction–the T-squared Z test–should always be used in practice. This is because we are in a situation where it's not possible to determine whether you need the correction until you've gone ahead and calculated it. And if your study is large enough that the small-sample correction isn't needed, then it will make very little difference.
To make such application possible, we currently have prototype software available in R, and we are working on implementing the methods for the robumeta and metafor R packages, as well as in Stata.
To conclude, I'll mention two lines of further work to be done on these small-sample corrections. First, we noted earlier that the test can be conservative in very small samples. If you (the analyst) were in a situation where you could potentially apply model-based methods, it would be useful to understand just how much power you would sacrifice by using RVE.
Second, the small-sample corrections that we have described here may have applications beyond meta-analysis. Hypothesis tests based on cluster-robust standard errors are routinely used in many other areas of application, including hierarchical linear models and econometric panel data models, and we are currently investigating the performance of our methods in these contexts.

References

Cai, L., & Hayes, A. F. (2008). A new test of linear hypotheses in OLS regression under heteroscedasticity of unknown form. Journal of Educational and Behavioral Statistics, 33(1), 21–40.

Fai, A. H.-T., & Cornelius, P. (1996). Approximate F-tests of multiple degree of freedom hypotheses in generalized least squares analyses of unbalanced split-plot experiments. Journal of Statistical Computation and Simulation, 54(4), 363–378.

Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39–65.

McCaffrey, D. F., Bell, R. M., & Botts, C. H. (2001). Generalizations of biased reduced linearization. In Proceedings of the Annual Meeting of the American Statistical Association.

Pan, W., & Wall, M. M. (2002). Small-sample adjustments in using the sandwich variance estimator in generalized estimating equations. Statistics in Medicine, 21(10), 1429–1441.

Tipton, E. (in press). Small sample adjustments for robust variance estimation with meta-regression. Psychological Methods.

Wilson, S. J., Lipsey, M. W., Tanner-Smith, E., Huang, C. H., & Steinka-Fry, K. T. (2011). Dropout prevention and intervention programs: Effects on school completion and dropout among school-aged children and youth: A systematic review. Campbell Systematic Reviews, 7(8).

Zhang, J.-T. (2012). An approximate Hotelling \(T^2\)-test for heteroscedastic one-way MANOVA. Open Journal of Statistics, 2, 1–11.

Zhang, J.-T. (2013). Tests of linear hypotheses in the ANOVA under heteroscedasticity. International Journal of Advanced Statistics and Probability, 1(2), 9–24.

Elizabeth Tipton - tipton@tc.columbia.edu

James E. Pustejovsky - pusto@austin.utexas.edu

Additional results

Simulation results: EDT test

Comparison of small-sample corrections
