Alternative formulas for the standardized mean difference

2016-06-03

The standardized mean difference (SMD) is surely one of the best known and most widely used effect size metrics in meta-analysis. In generic terms, the SMD parameter is defined as the difference in population means between two groups (often this difference represents the effect of some intervention), scaled by the population standard deviation of the outcome metric. Estimates of the SMD can be obtained from a wide variety of experimental designs, ranging from simple, completely randomized designs, to repeated measures designs, to cluster-randomized trials.

There’s some nuance involved in figuring out how to calculate estimates of the SMD from each design, mostly to do with exactly what sort of standard deviation to use in the denominator of the effect size. I’ll leave that discussion for another day. Here, I’d like to look at the question of how to estimate the sampling variance of the SMD. An estimate of the sampling variance is needed in order to meta-analyze a collection of effect sizes, and so getting the variance calculations right is an important (and sometimes time consuming) part of any meta-analysis project. However, the standard textbook treatments of effect size calculations cover this question only for a limited number of simple cases. I’d like to suggest a different, more general way of thinking about it, which provides a way to estimate the SMD and its variance in some non-standard cases (and also leads to slight differences from conventional formulas for the standard ones). All of this will be old hat for seasoned synthesists, but I hope it might be useful for students and researchers just getting started with meta-analysis.

To start, let me review (regurgitate?) the standard presentation.

SMD from a simple, independent groups design

Textbook presentations of the SMD estimator almost always start by introducing the estimator in the context of a simple, independent groups design. Call the groups T and C, the sample sizes \(n_T\) and \(n_C\), the sample means \(\bar{y}_T\) and \(\bar{y}_C\), and the sample variances \(s_T^2\) and \(s_C^2\). A basic moment estimator of the SMD is then

\[ d = \frac{\bar{y}_T - \bar{y}_C}{s_p} \]

where \(s_p^2 = \frac{\left(n_T - 1\right)s_T^2 + \left(n_C - 1\right) s_C^2}{n_T + n_C - 2}\) is a pooled estimator of the population variance. The standard estimator for the sampling variance of \(d\) is

\[ V_d = \frac{n_T + n_C}{n_T n_C} + \frac{d^2}{2\left(n_T + n_C - 2\right)}, \]

or some slight variant thereof. This estimator is based on a delta-method approximation for the asymptotic variance of \(d\).

It is well known that \(d\) has a small sample bias that depends on sample sizes. Letting

\[ J(x) = 1 - \frac{3}{4x - 1}, \]

the bias-corrected estimator is

\[ g = J\left(n_T + n_C - 2\right) \times d, \]

and is often referred to as Hedges’ \(g\) because it was proposed in Hedges (1981). Some meta-analysts use \(V_d\), but with \(d^2\) replaced by \(g^2\), as an estimator of the large-sample variance of \(g\); others use

\[ V_g = J^2\left(n_T + n_C - 2\right) \left(\frac{n_T + n_C}{n_T n_C} + \frac{g^2}{2\left(n_T + n_C - 2\right)}\right). \]

Viechtbauer (2007) provides further details on variance estimation and confidence intervals for the SMD in this case.
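As a concrete illustration of the textbook formulas above, here is a minimal Python sketch that computes \(d\), \(V_d\), \(g\), and \(V_g\) from summary statistics. The function names and the input values are my own, made up for illustration; they do not come from any particular package or study.

```python
import math


def j_correction(x):
    """Small-sample correction factor J(x) = 1 - 3 / (4x - 1)."""
    return 1.0 - 3.0 / (4.0 * x - 1.0)


def smd_independent_groups(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return d, V_d, g, and V_g for a simple independent-groups design."""
    df = n_t + n_c - 2
    # Pooled estimate of the population variance
    s_p2 = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df
    d = (mean_t - mean_c) / math.sqrt(s_p2)
    inv_n = (n_t + n_c) / (n_t * n_c)
    v_d = inv_n + d**2 / (2 * df)
    j = j_correction(df)
    g = j * d
    v_g = j**2 * (inv_n + g**2 / (2 * df))
    return {"d": d, "V_d": v_d, "g": g, "V_g": v_g}


# Hypothetical summary statistics, for illustration only
result = smd_independent_groups(
    mean_t=105.0, mean_c=100.0, sd_t=10.0, sd_c=10.0, n_t=20, n_c=20
)
print(result)
```

With equal standard deviations of 10 and a 5-point mean difference, \(d = 0.5\), and the bias correction shrinks it by about 2%.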

A general formula for \(g\) and its sampling variance

The above formulas are certainly useful, but in practice meta-analyses often include studies that use other, more complex designs. Good textbook presentations also cover computation of \(g\) and its variance for some other cases (e.g., Borenstein, 2009, also covers one-group pre/post designs and analysis of covariance). Less careful presentations only cover the simple, independent groups design and thus may inadvertently leave the impression that the variance estimator \(V_d\) given above applies in general. With other types of studies, \(V_d\) can be a wildly biased estimator of the actual sampling variance of \(d\), because it is derived under the assumption that the numerator of \(d\) is estimated as the difference in means of two simple random samples. In some designs (e.g., ANCOVA designs, randomized block designs, repeated measures designs), the treatment effect estimate will be much more precise than this; in other designs (e.g., cluster-randomized trials), it will be less precise.

Here’s what I think is a more useful way to think about the sampling variance of \(d\). Let’s suppose that we have an unbiased estimator for the difference in means that goes into the numerator of the SMD. Call this estimator \(b\), its sampling variance \(\text{Var}(b)\), and its standard error \(se_{b}\). Also suppose that we have an unbiased (or reasonably close-to-unbiased) estimator of the population variance of the outcome, the square root of which goes into the denominator of the SMD. Call this estimator \(S^2\), with expectation \(\text{E}\left(S^2\right) = \sigma^2\) and sampling variance \(\text{Var}(S^2)\). Finally, suppose that \(b\) and \(S^2\) are independent (which will often be a pretty reasonable assumption). A delta-method approximation for the sampling variance of \(d = b / S\) is then

\[ \text{Var}\left(d\right) \approx \frac{\text{Var}(b)}{\sigma^2} + \frac{\delta^2}{2 \nu}, \]

where \(\delta\) is the population SMD and \(\nu = 2 \left[\text{E}\left(S^2\right)\right]^2 / \text{Var}\left(S^2\right)\). Plugging in sample estimates of the relevant parameters provides a reasonable estimator for the sampling variance of \(d\):

\[ V_d = \left(\frac{se_b}{S}\right)^2 + \frac{d^2}{2 \nu}. \]

This estimator has two parts. The first part involves \(se_b / S\), which is just the standard error of \(b\), but re-scaled into standard deviation units; this part captures the variability in \(d\) from its numerator. This scaled standard error can be calculated directly if an article reports \(se_b\).

The second part of \(V_d\) is \(d^2 / (2 \nu)\), which captures the variability in \(d\) due to its denominator. More precise estimates of \(\sigma\) will have larger degrees of freedom, so that the second part will be smaller. For some designs, the degrees of freedom \(\nu\) depend only on sample sizes, and thus can be calculated exactly. For some other designs, \(\nu\) must be estimated.

The same degrees of freedom can also be used in the small-sample correction for the bias of \(d\), as given by

\[ g = J(\nu) \times d. \]

This small-sample correction is based on a Satterthwaite-type approximation to the distribution of \(d\).
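To make the general recipe concrete, here is a rough Python sketch of the three quantities \(d\), \(V_d\), and \(g\) as functions of \(b\), \(se_b\), \(S\), and \(\nu\). The names `smd_general` and `scale_sd` are hypothetical. The example at the bottom feeds in the components for a simple independent-groups design, in which case the result should match the textbook \(V_d\).

```python
import math


def j_correction(nu):
    """Small-sample correction factor J(nu) = 1 - 3 / (4 nu - 1)."""
    return 1.0 - 3.0 / (4.0 * nu - 1.0)


def smd_general(b, se_b, scale_sd, nu):
    """General-form SMD: d = b / S, V_d = (se_b / S)^2 + d^2 / (2 nu), g = J(nu) d."""
    d = b / scale_sd
    scaled_se = se_b / scale_sd            # standard error in SD units
    v_d = scaled_se**2 + d**2 / (2.0 * nu)
    g = j_correction(nu) * d
    return d, v_d, g


# Hypothetical inputs: two independent groups of 20, pooled SD of 10
n_t, n_c, s_p = 20, 20, 10.0
se_b = s_p * math.sqrt(1.0 / n_t + 1.0 / n_c)
d, v_d, g = smd_general(b=5.0, se_b=se_b, scale_sd=s_p, nu=n_t + n_c - 2)
print(d, v_d, g)
```

Only the last three arguments change from one design to another, which is the whole point of the formulation.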

Here’s another way to express the variance estimator for \(d\):

\[ V_d = d^2 \left(\frac{1}{t^2} + \frac{1}{2 \nu}\right), \]

where \(t\) is the test statistic corresponding to the hypothesis test for no difference between groups. I’ve never seen that formula in print before, but it could be convenient if an article reports the \(t\) statistic (or \(F = t^2\) statistic).
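The \(t\)-statistic form follows because \(t = b / se_b\) and \(d = b / S\), so that \(se_b / S = d / t\). Here is a small sketch of how one might code it, with hypothetical function names. Note that the expression breaks down when \(t = 0\), so the sketch simply refuses that input.

```python
import math


def v_d_from_t(d, t, nu):
    """V_d = d^2 * (1 / t^2 + 1 / (2 nu)), using se_b / S = d / t."""
    if t == 0:
        raise ValueError("t must be non-zero to recover the scaled standard error")
    return d**2 * (1.0 / t**2 + 1.0 / (2.0 * nu))


def v_d_from_f(d, f_stat, nu):
    """Same calculation when a one-degree-of-freedom F = t^2 is reported."""
    return v_d_from_t(d, math.sqrt(f_stat), nu)


# Hypothetical example: d = 0.5 from two independent groups of 20,
# for which t = d / sqrt(1/20 + 1/20)
t_stat = 0.5 / math.sqrt(0.1)
v_from_t = v_d_from_t(d=0.5, t=t_stat, nu=38)
v_from_f = v_d_from_f(d=0.5, f_stat=2.5, nu=38)
print(v_from_t, v_from_f)
```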

Non-standard estimators of \(d\)

The advantage of this formulation of \(d\), \(g\), and \(V_d\) is that it can be applied in quite a wide variety of circumstances, including cases that aren’t usually covered in textbook treatments. Rather than having to use separate formulas for every combination of design and analytic approach under the sun, the same formulas apply throughout. What changes are the components of the formulas: the scaled standard error \(se_b / S\) and the degrees of freedom \(\nu\). The general formulation also makes it easier to swap in different estimates of \(b\) or \(S\)—i.e., if you estimate the numerator a different way but keep the denominator the same, you’ll need a new scaled standard error but can still use the same degrees of freedom. A bunch of examples:

Independent groups with different variances

Suppose that we’re looking at two independent groups but do not want to assume that their variances are the same. In this case, it would make sense to standardize the difference in means by the control group standard deviation (without pooling), so that \(d = \left(\bar{y}_T - \bar{y}_C\right) / s_C\). Since \(s_C^2\) has \(\nu = n_C - 1\) degrees of freedom, the small-sample bias correction will then need to be \(J(n_C - 1)\). The scaled standard error will be

\[ \frac{se_b}{s_C} = \sqrt{\frac{s_T^2}{s_C^2 n_T} + \frac{1}{n_C}}. \]

This is then everything that we need to calculate \(V_d\), \(g\), \(V_g\), etc.
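A minimal sketch of this case in Python, assuming only summary statistics are available. The function name and the numbers are made up for illustration.

```python
import math


def smd_control_sd(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """SMD scaled by the control-group SD, without pooling variances."""
    nu = n_c - 1                                   # df of s_C^2
    d = (mean_t - mean_c) / sd_c
    # Squared scaled standard error: s_T^2 / (s_C^2 n_T) + 1 / n_C
    scaled_se2 = sd_t**2 / (sd_c**2 * n_t) + 1.0 / n_c
    v_d = scaled_se2 + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d         # J(n_C - 1) * d
    return {"d": d, "V_d": v_d, "g": g, "nu": nu}


# Hypothetical data: the treatment group is more variable than the control group
res = smd_control_sd(mean_t=106.0, mean_c=100.0, sd_t=15.0, sd_c=10.0, n_t=25, n_c=25)
print(res)
```

Because only the control group contributes to the denominator, \(\nu\) is roughly half of what it would be with pooling, so the second term of \(V_d\) is roughly doubled.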

Multiple independent groups

Suppose that the study involves \(K - 1\) treatment groups, 1 control group, and \(N\) total participants. If the meta-analysis will include SMDs comparing each treatment group to the control group, it would make sense to pool the sample variance across all \(K\) groups rather than just the pair of groups, so that a common estimate of scale is used across all the effect sizes. Indexing the control group as \(k = 0\) (so that \(n_0 = n_C\)) and the treatment groups as \(k = 1, \ldots, K - 1\), the pooled variance is then calculated as

\[ s_p^2 = \frac{1}{N - K} \sum_{k=0}^{K - 1} (n_k - 1) s_k^2. \]

For a comparison between treatment group \(k\) and the control group, we would then use

\[ d = \frac{\bar{y}_k - \bar{y}_C}{s_p}, \qquad \nu = N - K, \qquad \frac{se_b}{s_p} = \sqrt{\frac{1}{n_C} + \frac{1}{n_k}}, \]

where \(n_k\) is the sample size for treatment group \(k\) (cf. Gleser & Olkin, 2009).
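Here is a rough sketch of the multiple-groups calculation, assuming the groups are stored in lists with the control group in position 0. That layout, the function names, and the numbers are my own choices for illustration. The sketch does not compute the covariance between effect sizes that share a control group.

```python
import math


def pooled_variance(ns, sds):
    """Pool sample variances across all K groups, with N - K degrees of freedom."""
    big_n, k_groups = sum(ns), len(ns)
    return sum((n - 1) * s**2 for n, s in zip(ns, sds)) / (big_n - k_groups)


def smd_vs_control(means, sds, ns, k):
    """SMD for treatment group k versus the control group (index 0)."""
    nu = sum(ns) - len(ns)                      # N - K
    s_p = math.sqrt(pooled_variance(ns, sds))
    d = (means[k] - means[0]) / s_p
    scaled_se2 = 1.0 / ns[0] + 1.0 / ns[k]
    v_d = scaled_se2 + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d
    return {"d": d, "V_d": v_d, "g": g, "nu": nu}


# Hypothetical three-arm study: control, then two treatment groups
means, sds, ns = [100.0, 104.0, 108.0], [10.0, 10.0, 10.0], [20, 20, 20]
es1 = smd_vs_control(means, sds, ns, k=1)
es2 = smd_vs_control(means, sds, ns, k=2)
print(es1, es2)
```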

Single group, pre-test post-test design

Suppose that a study involves taking pre-test and post-test measurements on a single group of \(n\) participants. Borenstein (2009) recommends calculating the standardized mean difference for this study as the difference in means between the post-test and pre-test, scaled by the pooled (across pre- and post-test measurements) standard deviation. With obvious notation:

\[ d = \frac{\bar{y}_{post} - \bar{y}_{pre}}{s_p}, \qquad \text{where} \qquad s_p^2 = \frac{1}{2}\left(s_{pre}^2 + s_{post}^2\right). \]

In this design,

\[ \frac{se_b}{s_p} = \sqrt{\frac{2(1 - r)}{n}}, \]

where \(r\) is the sample correlation between the pre- and post-tests. The remaining question is what to use for \(\nu\). Borenstein (2009) uses \(\nu = n - 1\). My previous post on the sampling covariance of sample variances gave the result that \(\text{Var}(s_p^2) = \sigma^4 (1 + \rho^2) / (n - 1)\), which would instead suggest using

\[ \nu = \frac{2 (n - 1)}{1 + r^2}. \]

This formula will give larger degrees of freedom (anywhere between \(n - 1\) and \(2(n - 1)\), depending on \(r\)), but the resulting estimates probably won’t be that discrepant from Borenstein’s approach except in quite small samples. It would be interesting to investigate which approach is better in small samples (i.e., leading to less biased estimates of the SMD and more accurate estimates of sampling variance, and by how much), although it’s possible that neither is all that good because the variance estimator itself is based on a large-sample approximation.
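To see how much the choice of \(\nu\) matters in the single-group design, here is a small Python sketch that computes \(V_d\) both ways. The function name, the `nu_rule` argument, and the data are hypothetical.

```python
import math


def smd_pre_post(mean_pre, mean_post, sd_pre, sd_post, r, n, nu_rule="pooled"):
    """Single-group pre/post SMD scaled by the pooled pre/post SD."""
    s_p = math.sqrt((sd_pre**2 + sd_post**2) / 2.0)
    d = (mean_post - mean_pre) / s_p
    if nu_rule == "pooled":
        nu = 2.0 * (n - 1) / (1.0 + r**2)   # based on Var(s_p^2)
    else:
        nu = n - 1                          # Borenstein (2009)
    scaled_se2 = 2.0 * (1.0 - r) / n
    v_d = scaled_se2 + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d
    return {"d": d, "V_d": v_d, "g": g, "nu": nu}


# Hypothetical study: n = 21, pre/post correlation of 0.5
a = smd_pre_post(50.0, 55.0, 10.0, 10.0, r=0.5, n=21, nu_rule="pooled")
b = smd_pre_post(50.0, 55.0, 10.0, 10.0, r=0.5, n=21, nu_rule="borenstein")
print(a["nu"], a["V_d"])
print(b["nu"], b["V_d"])
```

In this made-up example the degrees of freedom are 32 versus 20, but the two variance estimates differ only in the third decimal place because the numerator term dominates.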

Two group, pre-test post-test design: ANCOVA estimation

Suppose that a study involves taking pre-test and post-test measurements on two groups of participants, with sample sizes \(n_T\) and \(n_C\) respectively. One way to analyze this design is via ANCOVA using the pre-test measure as the covariate, so that the treatment effect estimate is the difference in adjusted post-test means. In this design, the scaled standard error will be approximately

\[ \frac{se_b}{S} = \sqrt{ \frac{(n_C + n_T)(1 - r^2)}{n_C n_T} }, \]

where \(r\) is the pooled, within-group sample correlation between the pre-test and the post-test measures (this approximation assumes that the pre-test SMD between groups is relatively small). Alternately, if \(se_b\) is provided then the scaled standard error could be calculated directly.

Borenstein (2009) suggests calculating \(d\) as the difference in adjusted means, scaled by the pooled sample variances on the post-test measures. The post-test pooled sample variance will have the same degrees of freedom as in the two-sample t-test case: \(\nu = n_C + n_T - 2\). (Borenstein instead uses \(\nu = n_C + n_T - 2 - q\), where \(q\) is the number of covariates in the analysis, but this won’t usually make much difference unless the total sample size is quite small.)

Scaling by the pooled post-test sample variance isn’t the only reasonable way to estimate the SMD though. If the covariate is a true pre-test, then why not scale by the pooled pre-test sample variance instead? To do so, you would need to calculate \(se_b / S\) directly and use \(\nu = n_C + n_T - 2\). If it is reasonable to assume that the pre- and post-test population variances are equal, then another alternative would be to pool across the pre-test and post-test sample variances in each group. Using this approach, you would again need to calculate \(se_b / S\) directly and then use \(\nu = 2(n_C + n_T - 2) / (1 + r^2)\).
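A minimal sketch of the ANCOVA case scaled by the pooled post-test SD, assuming the adjusted mean difference is available. If `se_b` is not supplied, the sketch falls back on the approximation based on \(r\). All names and numbers are hypothetical.

```python
import math


def smd_ancova(b_adj, sd_post_pooled, n_t, n_c, r=None, se_b=None):
    """SMD from an ANCOVA-adjusted mean difference, scaled by the pooled post-test SD."""
    nu = n_t + n_c - 2
    d = b_adj / sd_post_pooled
    if se_b is not None:
        # Scaled standard error calculated directly from the reported se_b
        scaled_se2 = (se_b / sd_post_pooled) ** 2
    elif r is not None:
        # Approximation, assuming a small pre-test difference between groups
        scaled_se2 = (n_c + n_t) * (1.0 - r**2) / (n_c * n_t)
    else:
        raise ValueError("supply either se_b or the pooled within-group correlation r")
    v_d = scaled_se2 + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d
    return {"d": d, "V_d": v_d, "g": g, "nu": nu}


# Hypothetical study: 30 per group, pre/post correlation of 0.6
anc = smd_ancova(b_adj=4.0, sd_post_pooled=10.0, n_t=30, n_c=30, r=0.6)
unadjusted_v = 60.0 / 900.0 + anc["d"] ** 2 / (2.0 * anc["nu"])
print(anc["V_d"], unadjusted_v)
```

Comparing the two printed values shows how badly the simple independent-groups \(V_d\) would overstate the variance here.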

Two group, pre-test post-test design: repeated measures estimation

Another way to analyze the data from the same type of study design is to use repeated measures ANOVA. I’ve recently encountered a number of studies that use this approach (here’s a recent example from a highly publicized study in PLOS ONE—see Table 2). The studies I’ve seen typically report the sample means and variances in each group and at each time point, from which the difference in change scores can be calculated. Let \(\bar{y}_{gt}\) and \(s_{gt}^2\) denote the sample mean and sample variance in group \(g = T, C\) at time \(t = 0, 1\). The numerator of \(d\) would then be calculated as

\[ b = \left(\bar{y}_{T1} - \bar{y}_{T0}\right) - \left(\bar{y}_{C1} - \bar{y}_{C0}\right), \]

which has sampling variance \(\text{Var}(b) = 2(1 - \rho)\sigma^2\left(n_C + n_T \right) / (n_C n_T)\), where \(\rho\) is the correlation between the pre-test and the post-test measures. Thus, the scaled standard error is

\[ \frac{se_b}{S} = \sqrt{\frac{2(1 - r)(n_C + n_T)}{n_C n_T}}. \]

As with ANCOVA, there are several potential options for calculating the denominator of \(d\):

Using the pooled sample variances on the post-test measures, with \(\nu = n_C + n_T - 2\);
Using the pooled sample variances on the pre-test measures, with \(\nu = n_C + n_T - 2\); or
Using the pooled sample variances at both time points and in both groups, i.e.,

\[ S^2 = \frac{(n_C - 1)(s_{C0}^2 + s_{C1}^2) + (n_T - 1)(s_{T0}^2 + s_{T1}^2)}{2(n_C + n_T - 2)}, \]

with \(\nu = 2(n_C + n_T - 2) / (1 + r^2)\).
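Here is a sketch of the third option, which pools the sample variances at both time points and in both groups. The dictionary layout keyed by group and time point is just an assumption about how the reported summary statistics might be stored.

```python
import math


def smd_repeated_measures(means, sds, r, n_t, n_c):
    """Difference in change scores, scaled by the SD pooled over groups and times.

    means and sds are dicts keyed by "T0", "T1", "C0", "C1" (group, then time).
    """
    b = (means["T1"] - means["T0"]) - (means["C1"] - means["C0"])
    s2 = (
        (n_c - 1) * (sds["C0"] ** 2 + sds["C1"] ** 2)
        + (n_t - 1) * (sds["T0"] ** 2 + sds["T1"] ** 2)
    ) / (2.0 * (n_c + n_t - 2))
    d = b / math.sqrt(s2)
    nu = 2.0 * (n_c + n_t - 2) / (1.0 + r**2)
    scaled_se2 = 2.0 * (1.0 - r) * (n_c + n_t) / (n_c * n_t)
    v_d = scaled_se2 + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d
    return {"d": d, "V_d": v_d, "g": g, "nu": nu}


# Hypothetical summary statistics, 25 participants per group
means = {"T0": 50.0, "T1": 58.0, "C0": 50.0, "C1": 53.0}
sds = {"T0": 10.0, "T1": 10.0, "C0": 10.0, "C1": 10.0}
rm = smd_repeated_measures(means, sds, r=0.5, n_t=25, n_c=25)
print(rm)
```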

The range of approaches to scaling is the same as for ANCOVA. This makes sense because both analyses are based on data from the same study design, so the parameter of interest should be the same (i.e., the target parameter should not change based on the analytic method). Note that all of these approaches are a bit different than the effect size estimator proposed by Morris and DeShon (2002) for the two-group, pre-post design; their approach does not fit into my framework because it involves taking a difference between standardized effect sizes (and therefore involves two separate estimates of scale, rather than just one).

Randomized trial with longitudinal follow-up

Many independent-groups designs—especially randomized trials in field settings—involve repeated, longitudinal follow-up assessments. An increasingly common approach to analysis of such data is through hierarchical linear models, which can be used to account for the dependence structure among measurements taken on the same individual. In this setting, Feingold (2009) proposes that the SMD be calculated as the model-based estimate of the treatment effect at the final follow-up time, scaled by the within-groups variance of the outcome at that time point. Let \(\hat\beta_1\) denote the estimated difference in slopes (change per unit time) between groups in a linear growth model, \(F\) denote the duration of the study, and \(s_{pF}^2\) denote the pooled sample variance of the outcome at the final time point. For this model, Feingold (2009) proposes to calculate the standardized mean difference as

\[ d = \frac{F \hat\beta_1}{s_{pF}}. \]

In a later paper, Feingold (2015) proposes that the sampling variance of \(d\) be estimated as \(\left(F \times se_{\hat\beta_1} / s_{pF}\right)^2\), where \(se_{\hat\beta_1}\) is the standard error of the estimated slope. My framework suggests that a better estimate of the sampling variance, which accounts for the uncertainty of the scale estimate, would be to use

\[ V_d = \left(\frac{F \times se_{\hat\beta_1}}{s_{pF}}\right)^2 + \frac{d^2}{2 \nu}, \]

with \(\nu = n_T + n_C - 2\). The same \(\nu\) could be used to bias-correct the effect size estimate.
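A short sketch of the growth-model calculation, showing how much the second term adds relative to using the numerator variance alone. The function name, argument names, and inputs are hypothetical.

```python
import math


def smd_growth_model(beta1, se_beta1, duration, sd_final_pooled, n_t, n_c):
    """SMD at final follow-up from a difference in slopes, with scale uncertainty."""
    d = duration * beta1 / sd_final_pooled
    # Variance from the numerator only (squared scaled standard error)
    v_numerator = (duration * se_beta1 / sd_final_pooled) ** 2
    nu = n_t + n_c - 2
    v_d = v_numerator + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d
    return {"d": d, "V_numerator": v_numerator, "V_d": v_d, "g": g, "nu": nu}


# Hypothetical trial: slope difference of 0.5 points per month over 12 months
gm = smd_growth_model(
    beta1=0.5, se_beta1=0.2, duration=12.0, sd_final_pooled=12.0, n_t=40, n_c=40
)
print(gm)
```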

If estimates of the variance components of the HLM are reported, one could use them to construct a model-based estimate of the scale parameter in the denominator of \(d\). I explored this approach in a paper that uses HLM to model single-case designs, which are a certain type of longitudinal experiment that typically involve a very small number of participants (Pustejovsky, Hedges, & Shadish, 2014). Estimates of the scale parameter can usually be written as

\[ S_{model}^2 = \mathbf{r}'\boldsymbol\omega, \]

where \(\boldsymbol\omega\) is a vector of all the variance components in the model and \(\mathbf{r}\) is a vector of weights that depend on the model specification and length of follow-up. This estimate of scale will usually be more precise than \(s_{pF}^2\) because it makes use of all of the data (and modeling assumptions). However, it can be challenging to determine appropriate degrees of freedom for \(S_{model}^2\). For single-case designs, I used estimates of \(\text{Var}(\boldsymbol\omega)\) based on the inverse of the expected information matrix—call the estimate \(\mathbf{V}_{\boldsymbol\omega}\)—in which case

\[ \nu = \frac{2 S_{model}^4}{\mathbf{r}' \mathbf{V}_{\boldsymbol\omega} \mathbf{r}}. \]

However, most published articles will not provide estimates of the sampling variances of the variance components—in fact, a lot of software for estimating HLMs does not even provide these. It would be useful to work out some reasonable approximations for the degrees of freedom in these models—approximations that can be calculated based on the information that’s typically available—and to investigate the extent to which there’s any practical benefit to using \(S_{model}^2\) over \(s_{pF}^2\).
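When the sampling variances of the variance components are available, the degrees of freedom formula for \(S_{model}^2\) is easy to compute. Here is a sketch in plain Python, with made-up weights, variance component estimates, and covariance matrix; none of these numbers come from a real model fit.

```python
def model_based_scale_df(weights, omega_hat, v_omega):
    """Return S_model^2 = r' omega and nu = 2 S_model^4 / (r' V_omega r)."""
    p = len(weights)
    s2_model = sum(weights[i] * omega_hat[i] for i in range(p))
    # Quadratic form r' V r: estimated sampling variance of S_model^2
    var_s2 = sum(
        weights[i] * v_omega[i][j] * weights[j] for i in range(p) for j in range(p)
    )
    nu = 2.0 * s2_model**2 / var_s2
    return s2_model, nu


# Hypothetical two-component model: between- and within-person variance
weights = [1.0, 1.0]                       # scale = sum of the two components
omega_hat = [0.4, 0.6]                     # made-up variance component estimates
v_omega = [[0.02, 0.0], [0.0, 0.005]]      # made-up sampling covariance matrix
s2_model, nu = model_based_scale_df(weights, omega_hat, v_omega)
print(s2_model, nu)
```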

Cluster-randomized trials

Hedges (2007) addresses estimation of standardized mean differences for cluster-randomized trials, in which the units of measurement are nested within higher-level clusters that comprise the units of randomization. Such designs involve two variance components (within- and between-cluster variance), and thus there are three potential approaches to scaling the treatment effect: standardize by the total variance (i.e., the sum of the within- and between-cluster components), standardize by the within-cluster variance, or standardize by the between-cluster variance. Furthermore, some of the effect sizes can be estimated in several different ways, each with a different sampling variance. Hedges (2007) gives sampling variance estimates for each estimator of each effect size, but they all follow the same general formula as given above. (The appendix of the article actually gives the same formula as above, but using a more abstract formulation.)

For example, suppose the target SMD parameter uses the total variance and that we have data from a two-level, two-arm cluster randomized trial with \(M\) clusters, \(n\) observations per cluster, and total sample sizes in each arm of \(N_T\) and \(N_C\), respectively. Let \(\tau^2\) be the between-cluster variance, \(\sigma^2\) be the within-cluster variance, and \(\rho = \tau^2 / (\tau^2 + \sigma^2)\). The target parameter is \(\delta = \left(\mu_T - \mu_C\right) / \sqrt{\tau^2 + \sigma^2}\). The article assumes that the treatment effect will be estimated by the difference in grand means, \(\bar{\bar{y}}_T - \bar{\bar{y}}_C\). Letting \(S_B^2\) be the pooled sample variance of the cluster means within each arm and \(S_W^2\) be the pooled within-cluster sample variance, the total variance is estimated as

\[ S_{total}^2 = S_B^2 + \frac{n - 1}{n} S_W^2. \]

An estimate of the SMD is then

\[ d = \left(\bar{\bar{y}}_T - \bar{\bar{y}}_C \right) / \sqrt{S_{total}^2}. \]

The scaled standard error of \(\bar{\bar{y}}_T - \bar{\bar{y}}_C\) is

\[ \frac{se_b}{S_{total}} = \sqrt{\left(\frac{N_C + N_T}{N_C N_T}\right)\left[1 + (n - 1)\rho\right]}. \]

The appendix of the article demonstrates that \(\text{E}\left(S_{total}^2\right) = \tau^2 + \sigma^2\) and

\[ \text{Var}\left( S_{total}^2 \right) = \frac{2}{n^2}\left(\frac{(n \tau^2 + \sigma^2)^2}{M - 2} + \frac{(n - 1)^2 \sigma^4}{N_C + N_T - M}\right), \]

by which it follows that

\[ \nu = \frac{n^2 M (M - 2)}{M[(n - 1)\rho + 1]^2 + (M - 2)(n - 1)(1 - \rho)^2}. \]

Substituting \(se_b / S_{total}\) and \(\nu\) into the formula for \(V_d\) gives the same as Expression (14) in the article.
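Here is a rough Python sketch of the balanced cluster-randomized case with scaling by the total variance. It treats the intraclass correlation \(\rho\) as a supplied value, which is an assumption on my part about how one would apply the formulas in practice. The function name and all inputs are hypothetical.

```python
import math


def smd_cluster_total(mean_t, mean_c, s_b2, s_w2, n, m, n_t, n_c, rho):
    """SMD scaled by total variance in a balanced two-level cluster-randomized trial.

    s_b2: pooled variance of cluster means; s_w2: pooled within-cluster variance;
    n: observations per cluster; m: total number of clusters;
    n_t, n_c: total observations per arm; rho: intraclass correlation (supplied).
    """
    s_total2 = s_b2 + (n - 1) / n * s_w2
    d = (mean_t - mean_c) / math.sqrt(s_total2)
    design_effect = 1.0 + (n - 1) * rho
    scaled_se2 = (n_c + n_t) / (n_c * n_t) * design_effect
    nu = (n**2 * m * (m - 2)) / (
        m * design_effect**2 + (m - 2) * (n - 1) * (1.0 - rho) ** 2
    )
    v_d = scaled_se2 + d**2 / (2.0 * nu)
    g = (1.0 - 3.0 / (4.0 * nu - 1.0)) * d
    return {"d": d, "scaled_se2": scaled_se2, "nu": nu, "V_d": v_d, "g": g}


# Hypothetical trial: 20 clusters of 20, 10 clusters per arm, rho = 0.1
out = smd_cluster_total(
    mean_t=104.0, mean_c=100.0, s_b2=14.5, s_w2=90.0,
    n=20, m=20, n_t=200, n_c=200, rho=0.1,
)
print(out)
```

Even with 400 observations, the degrees of freedom here are noticeably fewer than the 398 that the simple independent-groups formula would use, and the numerator term is nearly three times larger.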

A limitation of Hedges (2007) is that it only covers the case where the treatment effect is estimated by the difference in grand means (although it does cover the case of unequal cluster sizes, which gets quite messy). In practice, every cluster-randomized trial I’ve ever seen uses baseline covariates to adjust the mean difference (often based on a hierarchical linear model) and improve the precision of the treatment effect estimate. The SMD estimate should also be based on this covariate-adjusted estimate, scaled by the total variance without adjusting for the covariate. An advantage of the general formulation given above is that it’s clear how to estimate the sampling variance of \(d\). I would guess that it will often be possible to calculate the scaled standard error directly, given the standard error of the covariate-adjusted treatment effect estimate. And since \(S_{total}\) would be estimated just as before, its degrees of freedom remain the same.

Hedges (2011) discusses estimation of SMDs in three-level cluster-randomized trials—an even more complicated case. However, the general approach is the same; all that’s needed are the scaled standard error and the degrees of freedom \(\nu\) of whatever combination of variance components go into the denominator of the effect size. In both the two-level and three-level cases, the degrees of freedom get quite complicated in unbalanced samples and are probably not calculable from the information that is usually provided in an article. Hedges (2007, 2011) comments on a couple of cases where more tractable approximations can be used, although it seems like there might be room for further investigation here.

Closing thoughts

I think this framework is useful in that it unifies a large number of cases that have been treated separately, and can also be applied (more-or-less immediately) to \(d\) estimators that haven’t been widely considered before, such as the \(d\) that involves scaling by the pooled pre-and-post, treatment-and-control sample variance. I hope it also illustrates that, while the point estimator \(d\) can be applied across a large number of study designs, the sampling variance of \(d\) depends on the details of the design and estimation methods. The same is true for other families of effect sizes as well. For example, in other work I’ve demonstrated that the sampling variance of the correlation coefficient depends on the design from which the correlations are estimated (Pustejovsky, 2014).

If you have read this far, I’d love to get your feedback about whether you think this is a useful way to organize the calculations of \(d\) estimators. Is this helpful? Or nothing you didn’t already know? Or still more complicated than it should be? Leave a comment!

References

Borenstein, M. (2009). Effect sizes for continuous data. In H. M. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The Handbook of Research Synthesis and Meta-Analysis (pp. 221–236). New York, NY: Russell Sage Foundation.

Feingold, A. (2009). Effect sizes for growth-modeling analysis for controlled clinical trials in the same metric as for classical analysis. Psychological Methods, 14(1), 43–53. doi:10.1037/a0014699

Feingold, A. (2015). Confidence interval estimation for standardized effect sizes in multilevel and latent growth modeling. Journal of Consulting and Clinical Psychology, 83(1), 157–168. doi:10.1037/a0037721

Gleser, L. J., & Olkin, I. (2009). Stochastically dependent effect sizes. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The Handbook of Research Synthesis and Meta-Analysis (2nd ed., pp. 357–376). New York, NY: Russell Sage Foundation.

Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational and Behavioral Statistics, 32(4), 341–370. doi:10.3102/1076998606298043

Hedges, L. V. (2011). Effect sizes in three-level cluster-randomized experiments. Journal of Educational and Behavioral Statistics, 36(3), 346–380. doi:10.3102/1076998610376617

Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7(1), 105–125. doi:10.1037//1082-989X.7.1.105

Pustejovsky, J. E. (2014). Converting from d to r to z when the design uses extreme groups, dichotomization, or experimental control. Psychological Methods, 19(1), 92–112. doi:10.1037/a0033788

Pustejovsky, J. E., Hedges, L. V., & Shadish, W. R. (2014). Design-comparable effect sizes in multiple baseline designs: A general modeling framework. Journal of Educational and Behavioral Statistics, 39(5), 368–393. doi:10.3102/1076998614547577

Viechtbauer, W. (2007). Approximate confidence intervals for standardized effect sizes in the two-independent and two-dependent samples design. Journal of Educational and Behavioral Statistics, 32(1), 39–60. doi:10.3102/1076998606298034