Simulating correlated standardized mean differences for meta-analysis

As I’ve discussed in previous posts, meta-analyses in psychology, education, and other areas often include studies that contribute multiple, statistically dependent effect size estimates. I’m interested in methods for meta-analyzing and meta-regressing effect sizes from data structures like this, and studying this sort of thing often entails conducting Monte Carlo simulations. Monte Carlo simulations involve generating artificial data—in this case, a set of studies, each of which has one or more dependent effect size estimates—that follows a certain distributional model, applying different analytic methods to the artificial data, and then repeating the process a bunch of times. Because we know the true parameters that govern the data-generating process, we can evaluate the performance of the analytic methods in terms of bias, accuracy, hypothesis test calibration and power, confidence interval coverage, and the like.

In this post, I’ll discuss two alternative methods to simulate meta-analytic datasets that include studies with multiple, dependent effect size estimates: simulating individual participant-level data or simulating summary statistics. I’ll focus on the case of the standardized mean difference (SMD) because it is so common in meta-analyses of intervention studies. For simplicity, I’ll assume that the effect sizes all come from simple, two-group comparisons (without any covariate adjustment or anything like that) and that the individual observations are multi-variate normally distributed within each group. Our goal will be to simulate a set of \(K\) studies, where study \(k\) is based on measuring \(J_k\) outcomes on a sample of \(N_k\) participants, all for \(k = 1,...,K\). Let \(\boldsymbol\delta_k = (\delta_{1k} \cdots \delta_{J_k k})'\) be the \(J_k \times 1\) vector of true standardized mean differences for study \(k\). I’ll assume that we know these true effect size parameters for all \(K\) studies, so that I can avoid committing to any particular form of random effects model.

Simulating individual participant-level data

The most direct way to simulate this sort of effect size data is to generate outcome data for every artificial participant in every artificial study. Let \(\mathbf{Y}_{ik}^T\) be the \(J_k \times 1\) vector of outcomes for treatment group participant \(i\) in study \(k\), and let \(\mathbf{Y}_{ik}^C\) be the \(J_k \times 1\) vector of outcomes for control group participant \(i\) in study \(k\), for \(i=1,...,N_k / 2\) and \(k = 1,...,K\). Assuming multi-variate normality of the outcomes, we can generate these outcome vectors as \[ \mathbf{Y}_{ik}^T \sim N\left(\boldsymbol\delta_k, \boldsymbol\Psi_k\right) \qquad \text{and}\qquad \mathbf{Y}_{ik}^C \sim N\left(\mathbf{0}, \boldsymbol\Psi_k\right), \] where \(\boldsymbol\Psi_k\) is the population correlation matrix of the outcomes in study \(k\). Note that I am setting the mean outcomes of the control group participants to zero and also specifying that the outcomes all have unit variance within each group. After simulating data based on these distributions, the effect size estimates for each outcome can be calculated directly, following standard formulas.

Here’s what this approach looks like in code. It is helpful to simplify things by focusing on simulating just a single study with multiple, correlated effect sizes. Focusing first on just the input parameters, a function might look like the following:

r_SMDs_raw <- function(delta, J, N, Psi) {
  # stuff
  return(ES_data)  
}

In the above function skeleton, delta is the true effect size parameter \(\boldsymbol\delta_k\), J is the number of effect sizes to generate \((J_k)\), N is the total number of participants \((N_k)\), and Psi is a matrix of correlations between the outcomes \((\boldsymbol\Psi_k)\). From these parameters, we’ll generate raw data, calculate effect size estimates and standard errors, and return the results in a little dataset.

To make the function a little bit easier to use, I’m going to overload the Psi argument so that it can be a single number, indicating a common correlation between the outcomes. Thus, instead of having to feed in a \(J_k \times J_k\) matrix, you can specify a single correlation \(r_k\), and the function will assume that all of the outcomes are equicorrelated. In code, the logic is:

if (!is.matrix(Psi)) Psi <- Psi + diag(1 - Psi, nrow = J)

Here’s the function with the innards:

r_SMDs_raw <- function(delta, J, N, Psi) {

  require(mvtnorm) # for simulating multi-variate normal data
  
  # create Psi matrix assuming equicorrelation
  if (!is.matrix(Psi)) Psi <- Psi + diag(1 - Psi, nrow = J)
  
  # generate control group summary statistics
  Y_C <- rmvnorm(n = N / 2, mean = rep(0, J), sigma = Psi)
  ybar_C <- colMeans(Y_C)
  sd_C <- apply(Y_C, 2, sd)
  
  # generate treatment group summary statistics
  delta <- rep(delta, length.out = J)
  Y_T <- rmvnorm(n = N / 2, mean = delta, sigma = Psi)
  ybar_T <- colMeans(Y_T)
  sd_T <- apply(Y_T, 2, sd)

  # calculate Cohen's d
  sd_pool <- sqrt((sd_C^2 + sd_T^2) / 2)
  ES <- (ybar_T - ybar_C) / sd_pool
  
  # calculate SE of d
  SE <- sqrt(4 / N + ES^2 / (2 * (N - 2)))

  data.frame(ES = ES, SE = SE, N = N)

}

In action:

delta <- rnorm(4, mean = 0.2, sd = 0.1)
r_SMDs_raw(delta = delta, J = 4, N = 40, Psi = 0.6)
## Loading required package: mvtnorm
##            ES        SE  N
## 1 -0.19106514 0.3169863 40
## 2  0.18427227 0.3169334 40
## 3  0.25646209 0.3175932 40
## 4  0.00210429 0.3162279 40

Or if you’d rather specify the full \(\Psi_k\) matrix yourself:

Psi_k <- 0.6 + diag(0.4, nrow = 4)
Psi_k
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.6  0.6  0.6
## [2,]  0.6  1.0  0.6  0.6
## [3,]  0.6  0.6  1.0  0.6
## [4,]  0.6  0.6  0.6  1.0
r_SMDs_raw(delta = delta, J = 4, N = 40, Psi = Psi_k)
##           ES        SE  N
## 1 -0.1597097 0.3167580 40
## 2 -0.1717717 0.3168410 40
## 3 -0.4369032 0.3201744 40
## 4  0.0657410 0.3163177 40

Exercises

The function above is serviceable but quite basic. I can think of several additional features that one might like to have for use in research simulations, but I’m feeling both cheeky and lazy at the moment, so I’ll leave them for you, dear reader. Here are some suggested exercises:

  1. Add an argument to the function, Hedges_g = TRUE, which controls whether the simulated effect size is Hedges’ \(g\) or Cohen’s \(d\). If it is Hedges’ \(g\), make sure that the standard error is corrected too. (See the hint sketch after this list.)

  2. Add an argument to the function, p_val = TRUE, which allows the user to control whether or not to return \(p\)-values from the test of mean differences for each outcome. Note that the p-values should be for a test of the raw mean differences between groups, rather than a test of the effect size \(\delta_{jk} = 0\).

  3. Add an argument to the function, corr_mat = FALSE, which controls whether the function returns just the simulated effect sizes and SEs or both the simulated effect sizes and the full sampling variance-covariance matrix of the effect sizes. See here for the relevant formulas.
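
As a hint for the first of these exercises: Hedges’ \(g\) applies a small-sample correction factor to both Cohen’s \(d\) and its standard error. A minimal sketch of the extra lines inside r_SMDs_raw, assuming the Hedges_g argument from the exercise, might be:

  # apply Hedges' small-sample correction to d and its SE
  if (Hedges_g) {
    J_nu <- 1 - 3 / (4 * (N - 2) - 1) # approximate correction factor
    ES <- J_nu * ES                   # Hedges' g
    SE <- J_nu * SE                   # corrected standard error
  }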

Simulating summary statistics

Another approach to simulating SMDs is to sample from the distribution of the summary statistics used in calculating the effect size. This approach should simplify the code, at the cost of having to use a bit of distribution theory. Let \(\mathbf{\bar{y}}_{Tk}\) and \(\mathbf{\bar{y}}_{Ck}\) be the \(J_k \times 1\) vectors of sample means for the treatment and control groups, respectively. Let \(\mathbf{S}_k\) be the \(J_k \times J_k\) sample covariance matrix of the outcomes, pooled across the treatment and control groups. Again assuming multi-variate normality, and following the same notation as above: \[ \mathbf{\bar{y}}_{Ck} \sim N\left(\mathbf{0}, \frac{2}{N_k} \boldsymbol\Psi_k\right), \qquad \mathbf{\bar{y}}_{Tk} \sim N\left(\boldsymbol\delta_k, \frac{2}{N_k} \boldsymbol\Psi_k\right), \] and \[ \left(\mathbf{\bar{y}}_{Tk} - \mathbf{\bar{y}}_{Ck}\right) \sim N\left(\boldsymbol\delta_k, \frac{4}{N_k} \boldsymbol\Psi_k\right). \] This shows how we could directly simulate the numerator of the standardized mean difference.

A further bit of distribution theory says that the pooled sample covariance matrix follows a multiple of a Wishart distribution with \(N_k - 2\) degrees of freedom and scale matrix \(\boldsymbol\Psi_k\): \[ (N_k - 2) \mathbf{S}_k \sim Wishart\left(N_k - 2, \boldsymbol\Psi_k \right). \] Thus, to simulate the denominators of the SMD estimates, we can simulate a single Wishart matrix, pull out the diagonal entries, divide by \(N_k - 2\), and take the square root. In all, we draw a single \(J_k \times 1\) observation from a multi-variate normal distribution and a single \(J_k \times J_k\) observation from a Wishart distribution. In contrast, the raw data approach requires simulating \(N_k\) observations from a multi-variate normal distribution, then calculating \(4 J_k\) summary statistics (a mean and an SD for each group on each outcome).
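
To make the distribution theory concrete, here’s a bare-bones sketch of a summary-statistics generator, using the same arguments as r_SMDs_raw (the full-featured version is left as Exercise 4 below):

r_SMDs_stats <- function(delta, J, N, Psi) {

  require(mvtnorm) # for rmvnorm(); rWishart() is in base R's stats package
  
  # create Psi matrix assuming equicorrelation
  if (!is.matrix(Psi)) Psi <- Psi + diag(1 - Psi, nrow = J)
  delta <- rep(delta, length.out = J)
  
  # simulate the vector of mean differences directly
  mean_diff <- rmvnorm(n = 1, mean = delta, sigma = 4 * Psi / N)[1, ]
  
  # simulate the pooled sample covariance matrix via the Wishart distribution
  S <- rWishart(n = 1, df = N - 2, Sigma = Psi)[ , , 1] / (N - 2)
  
  # calculate Cohen's d and its SE
  ES <- mean_diff / sqrt(diag(S))
  SE <- sqrt(4 / N + ES^2 / (2 * (N - 2)))
  
  data.frame(ES = ES, SE = SE, N = N)

}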

Exercises

Once again, I’ll leave it to you, dear reader, to do the fun programming bits:

  4. Create a modified version of the function r_SMDs_raw that simulates summary statistics instead of raw data (call it r_SMDs_stats). The sketch above is a starting point.
  5. Use the microbenchmark package (or your preferred benchmarking tool) to compare the computational efficiency of both versions of the function.
  6. Check your work! Verify that both versions of the function generate the same distributions if the same parameters are used as input.

Which approach is better?

Like many things in research, there’s no clearly superior method here. The advantage of the summary statistics approach is computational efficiency. It should generally be faster than the raw data approach, and if you need to generate 10,000 meta-analyses, each with 80 studies, the computational savings can add up. On the other hand, computational efficiency isn’t everything.
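
For a rough sense of the difference, a benchmark along the lines of Exercise 5 might look like this (assuming you’ve written r_SMDs_stats):

library(microbenchmark)
microbenchmark(
  raw   = r_SMDs_raw(delta = 0.3, J = 3, N = 50, Psi = 0.5),
  stats = r_SMDs_stats(delta = 0.3, J = 3, N = 50, Psi = 0.5),
  times = 1000
)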

I see two potential advantages of the raw data approach. First is interpretability: simulating raw data is likely easier to understand. It feels tangible and familiar, harkening back to those bygone days we spent learning ANOVA, whereas the summary statistics approach requires a bit of distribution theory to follow (bookmark this blog post!). Second is extensibility: it is relatively straightforward to extend the approach to use other distributional models for the raw data (perhaps you want to look at outcomes that follow a multi-variate \(t\) distribution?) or more complicated estimators of the SMD (difference-in-differences? covariate-adjusted? cluster-randomized trial?). To use the summary statistics approach in more complicated scenarios, you’d have to work out the sampling distributions for yourself, or locate the right reference.

Of course, there’s also no need to choose between these two approaches. As I’m trying to hint at in Exercise 6, it’s actually useful to write both. Then, you can use the (potentially slower) raw data version to verify that the summary statistics version is correct.
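
As a concrete version of the check that Exercise 6 calls for, you could simulate many replicates from each generator with identical parameters and compare the moments of the resulting effect size estimates; up to Monte Carlo error, they should agree:

reps_raw   <- replicate(10000, r_SMDs_raw(delta = 0.3, J = 3, N = 50, Psi = 0.5)$ES)
reps_stats <- replicate(10000, r_SMDs_stats(delta = 0.3, J = 3, N = 50, Psi = 0.5)$ES)

# means and SDs of the ES for each outcome, by generator
rowMeans(reps_raw)
rowMeans(reps_stats)
apply(reps_raw, 1, sd)
apply(reps_stats, 1, sd)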

Simulating full meta-analyses

So far we’ve got a data-generating function that simulates a single study’s worth of effect size estimates. To study meta-analytic methods, we’ll need to build out the function to simulate multiple studies. To do so, I think it’s helpful to use the technique of mapping, as implemented in the purrr package’s map_* functions. The idea is to first generate a “menu” of study-specific parameters for each of \(K\) studies, then apply the r_SMDs function to each parameter set.

Let’s consider how to do this for a simple random effects model without covariates, where the true effect size parameter is constant within each study (i.e., \(\boldsymbol\delta_k = (\delta_k \cdots \delta_k)'\)). We’ll need to generate a true effect for each study, along with a sample size, an outcome dimension, and a correlation between outcomes. I’ll assume that \[ \delta_k \sim N(\mu, \tau^2), \qquad J_k \sim 2 + Poisson(\mu_J - 2), \qquad N_k \sim 20 + 2 \times Poisson\left(\tfrac{1}{2}(\mu_N - 20)\right), \] and \[ r_k \sim Beta\left(\rho \nu, (1 - \rho)\nu\right), \] where \(\mu_J\) is the average number of outcomes per study (J_mean in the code below), \(\mu_N\) is the average total sample size per study (N_mean below), \(\rho = \text{E}(r_k)\), and \(\nu > 0\) controls the variability of \(r_k\) across studies, with smaller \(\nu\) corresponding to more variable correlations. Specifically, \(\text{Var}(r_k) = \rho (1 - \rho) / (1 + \nu)\). The offsets ensure that every study has at least two outcomes and at least 20 participants. These distributions are just made up, without any particular justification.

Here’s what these distributional models look like in R code:

K <- 6
mu <- 0.2
tau <- 0.05
J_mean <- 5
N_mean <- 45
rho <- 0.6
nu <- 39

study_data <- 
  data.frame(
    delta = rnorm(K, mean = mu, sd = tau),
    J = 2 + rpois(K, J_mean - 2),
    N = 20 + 2 * rpois(K, (N_mean - 20) / 2),
    Psi = rbeta(K, rho * nu, (1 - rho) * nu)
  )

study_data
##       delta J  N       Psi
## 1 0.1749657 6 56 0.6670410
## 2 0.1371771 4 52 0.7952095
## 3 0.1430044 2 46 0.5551301
## 4 0.1953675 6 46 0.5339670
## 5 0.1653242 4 42 0.5623903
## 6 0.1419457 7 40 0.6615825

Once we have the “menu” of study-level characteristics, it’s just a matter of mapping the parameters to the data-generating function. One way to do this is with pmap_df:

library(purrr)
meta_data <- pmap_df(study_data, r_SMDs_raw, .id = "study")
meta_data
##    study           ES        SE  N
## 1      1  0.427048814 0.2704019 56
## 2      1  0.206502285 0.2679989 56
## 3      1  0.270244756 0.2685234 56
## 4      1  0.423149362 0.2703451 56
## 5      1  0.525878094 0.2720096 56
## 6      1  0.746186579 0.2767383 56
## 7      2 -0.005809721 0.2773507 52
## 8      2 -0.082222645 0.2774719 52
## 9      2  0.114670949 0.2775871 52
## 10     2 -0.001432641 0.2773501 52
## 11     3 -0.031231291 0.2949027 46
## 12     3  0.302264458 0.2966391 46
## 13     4  0.085338908 0.2950242 46
## 14     4 -0.062511255 0.2949592 46
## 15     4 -0.040178730 0.2949150 46
## 16     4 -0.082519741 0.2950151 46
## 17     4  0.207953122 0.2957160 46
## 18     4 -0.005713721 0.2948845 46
## 19     5  0.293666394 0.3103483 42
## 20     5  0.258312309 0.3099551 42
## 21     5  0.362126706 0.3112512 42
## 22     5  0.177656049 0.3092452 42
## 23     6 -0.115158991 0.3165035 40
## 24     6  0.094349350 0.3164129 40
## 25     6 -0.052996601 0.3162862 40
## 26     6 -0.042766762 0.3162658 40
## 27     6 -0.314584445 0.3182800 40
## 28     6  0.078519103 0.3163560 40
## 29     6 -0.103034241 0.3164486 40
table(meta_data$study)
## 
## 1 2 3 4 5 6 
## 6 4 2 6 4 7

Putting it all together into a function, we have

r_meta <- function(K, mu, tau, J_mean, N_mean, rho, nu) {
  require(purrr)
  
  study_data <- 
    data.frame(
      delta = rnorm(K, mean = mu, sd = tau),
      J = 2 + rpois(K, J_mean - 2),
      N = 20 + 2 * rpois(K, (N_mean - 20) / 2),
      Psi = rbeta(K, rho * nu, (1 - rho) * nu)
    )
  
  pmap_df(study_data, r_SMDs_raw, .id = "study")
}

Exercises

  7. Modify r_meta so that it uses r_SMDs_stats.

  8. Add options to r_meta for Hedges_g = TRUE, p_val = TRUE, and corr_mat = FALSE, and ensure that these get passed along to the r_SMDs function.

  9. One way to check that the r_meta function is working properly is to generate a very large meta-analytic dataset, then to verify that the generated distributions align with expectations. Here’s a very large meta-analytic dataset:

    meta_data <- 
      r_meta(100000, mu = 0.2, tau = 0.05, 
             J_mean = 5, N_mean = 40, 
             rho = 0.6, nu = 39)

    Compare the distribution of the simulated dataset against what you would expect to get based on the input parameters.

  10. Modify the r_meta function so that \(J_k\) and \(N_k\) are correlated, according to \[ \begin{align} J_k &\sim 2 + Poisson(\mu_J - 2) \\ N_k &\sim 20 + 2 \times Poisson\left(\frac{1}{2}(\mu_N - 20) + \alpha (J_k - \mu_J) \right) \end{align} \] for user-specified values of \(\mu_J\) (the average number of outcomes per study), \(\mu_N\) (the average total sample size per study), and \(\alpha\), which controls the degree of dependence between \(J_k\) and \(N_k\). A literal translation of this model into code is sketched after this list.
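
For the last of these, the core of a solution might look like the following two lines (mu_J, mu_N, and alpha are the user-specified values; note that alpha needs to be small enough that the Poisson mean stays positive for all plausible values of J):

# draw J first, then draw N with a mean that shifts with J
J <- 2 + rpois(K, mu_J - 2)
N <- 20 + 2 * rpois(K, (mu_N - 20) / 2 + alpha * (J - mu_J))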

A challenge

The meta-analytic model that we’re using here is quite simple—simplistic, even—and for some simulation studies, something more complex might be needed. For example, we might need to generate data from a model that includes within-study random effects, as in: \[ \delta_{jk} = \mu + u_k + v_{jk}, \quad \text{where}\quad u_k \sim N(0, \tau^2), \quad v_{jk} \sim N(0, \omega^2). \] Even more complex would be to simulate from a multi-level meta-regression model \[ \delta_{jk} = \mathbf{x}_{jk} \boldsymbol\beta + u_k + v_{jk}, \quad \text{where}\quad u_k \sim N(0, \tau^2), \quad v_{jk} \sim N(0, \omega^2), \] where \(\mathbf{x}_{jk}\) is a \(1 \times p\) row-vector of covariates describing outcome \(j\) in study \(k\) and \(\boldsymbol\beta\) is a \(p \times 1\) vector of meta-regression coefficients. In past work, I’ve done this by writing a data-generating function that takes a fixed design matrix \(\mathbf{X} = \left(\mathbf{x}_{11}' \cdots \mathbf{x}_{J_K K}'\right)'\) as an input argument, along with \(\boldsymbol\beta\). The design matrix would also include an identifier for each unique study. There are surely better (simpler, easier to follow) ways to implement the multi-level meta-regression model. I’ll once again leave it to you to work out an approach.
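
For what it’s worth, here is one rough sketch of the random-effects piece under that design-matrix convention (the function name and argument conventions are my own, purely illustrative):

# generate true effect parameters delta_jk from the multi-level
# meta-regression model, given a design matrix X whose rows are the
# x_jk vectors and a study identifier for each row
r_delta_multilevel <- function(X, study, beta, tau, omega) {
  study <- as.integer(factor(study))
  u <- rnorm(max(study), mean = 0, sd = tau)      # between-study random effects
  v <- rnorm(length(study), mean = 0, sd = omega) # within-study random effects
  as.vector(X %*% beta) + u[study] + v
}

# example: 4 studies with 3, 2, 2, and 3 outcomes, plus one covariate
study <- rep(1:4, times = c(3, 2, 2, 3))
X <- cbind(1, x = rnorm(length(study)))
delta <- r_delta_multilevel(X, study, beta = c(0.2, 0.1), tau = 0.05, omega = 0.03)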
