September 28, 2016

Simulation

  1. Generate data from a probability model where you know the true parameters.
  2. Apply one or more estimation procedures to the data, record the results.
  3. Repeat many times ("in repeated samples…").
  4. Compare the results to the truth (bias, accuracy, error/coverage rates).
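
In R, this recipe often reduces to a few lines (a minimal skeleton; generate_data(), estimate(), and true_parameters are hypothetical stand-ins for problem-specific pieces):

# skeleton of a simulation study (generate_data() and estimate()
# are placeholders for problem-specific functions)
results <- replicate(1000, {
  dat <- generate_data(true_parameters)  # 1. generate data
  estimate(dat)                          # 2. apply estimation procedure
}, simplify = FALSE)                     # 3. repeat many times
# 4. compare `results` to true_parameters (bias, accuracy, coverage)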

Why is simulation useful?

  • Building understanding & intuition about statistical models
  • Assessing/checking modeling strategies
  • Formal research

Modeling batting averages

h/t David Robinson for this example (more details on his blog)

  • Baseball batting average = # hits / # at-bats
  • Correlated with # at-bats because better hitters get more opportunities
  • Beta-binomial regression is a useful way to model this relationship

Batting averages vs. at-bats

[Plot: batting average plotted against number of at-bats]

Beta-binomial regression

  • \(Y_i\) is number of hits by player \(i\)
  • \(n_i\) is number of at-bats
  • \(Y_i \sim Binomial(n_i, \pi_i)\), where \(\pi_i\) is the true batting ability of player \(i\)
  • \(\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)\), where \(\mu_{n_i}\) is average batting ability of players with \(n_i\) at-bats and \(\sigma\) describes the variability in true ability.
  • \(\mu_{n_i} = \beta_0 + \beta_1 \log(n_i)\)
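
Under this parameterization, \(E(\pi_i) = \mu_{n_i}\) and \(Var(\pi_i) = \mu_{n_i}(1 - \mu_{n_i})\,\sigma / (1 + \sigma)\) (a standard Beta-distribution fact, added here for clarity), so \(\sigma\) controls how tightly true abilities cluster around the regression curve.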

Simulate to build understanding

library(dplyr)  # for data_frame() and the pipe, used throughout

set.seed(20160928)
players <- 1000
# sample career at-bat counts for 1000 players (career_at_bats loaded earlier)
at_bats <- sample(career_at_bats, size = players)
summary(at_bats)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00    30.75   299.00  1302.00  1730.00 10200.00
mu <- 0.140 + 0.015 * log(at_bats)
summary(mu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1400  0.1914  0.2255  0.2195  0.2518  0.2784

Simulate true abilities

sigma <- 1 / 500
ability <- rbeta(players, 
                 shape1 = mu / sigma, 
                 shape2 = (1 - mu) / sigma)
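
As a quick sanity check (an addition, not on the original slides), the simulated abilities should be centered on \(\mu\), with a spread of roughly \(\sqrt{\mu(1-\mu)\sigma} \approx 0.019\):

# abilities should be centered on mu,
# with sd near sqrt(mu * (1 - mu) * sigma)
summary(ability - mu)
sd(ability - mu)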

Simulate batting averages

# collect simulated values, then draw hits given each player's true ability
dat <- data_frame(at_bats, ability)
dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
dat$batting_avg <- with(dat, hits / at_bats)

Fit the beta-binomial regression

library(gamlss)
bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat, family = BB(mu.link = "identity"))
coef(bb_fit)
##  (Intercept) log(at_bats) 
##   0.14042102   0.01496819
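
The dispersion parameter \(\sigma\) is fit on a log link by default, so (an extra check, not shown on the original slide) its estimate can be recovered and compared to the true value of \(1/500 = 0.002\):

# sigma is modeled on a log link; the estimate should be near 1/500 = 0.002
exp(coef(bb_fit, what = "sigma"))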

Simulate to check modeling strategies

  • In real-world data analysis, there are almost always multiple ways to approach a problem.
  • Small simulations are a useful way to test out strategies for use in a given setting.
  • For modeling batting averages, beta-binomial regression is useful but SLOW.
  • Would it work to use a quasi-binomial glm instead?

Data-generating function

simulate_batting_avgs <- function(players, beta, sigma) {
  at_bats <- sample(career_at_bats, size = players)
  mu <- beta[1] + beta[2] * log(at_bats)
  ability <- rbeta(players, 
                   shape1 = mu / sigma, 
                   shape2 = (1 - mu) / sigma)
  hits <- rbinom(n = players, size = at_bats, prob = ability)
  data_frame(at_bats, hits)
}

new_dat <- simulate_batting_avgs(players = 400, 
                                 beta = c(0.140, 0.015), 
                                 sigma = 1 / 500)

Data-generating function

new_dat
## # A tibble: 400 x 2
##    at_bats  hits
##      <int> <int>
## 1      676   148
## 2     1481   415
## 3      153    30
## 4      665   163
## 5        3     0
## 6       51    11
## 7       44    14
## 8     5127  1342
## 9     4101  1074
## 10       2     0
## # ... with 390 more rows

Modeling function

quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcov(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

quasibinomial_CI(new_dat)
## # A tibble: 2 x 3
##           term          L          U
##          <chr>      <dbl>      <dbl>
## 1  (Intercept) 0.11609691 0.15552616
## 2 log(at_bats) 0.01294839 0.01773535

Put them together!

lots_of_CIs <- 
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
    quasibinomial_CI(dat)
  }, simplify = FALSE)

Confidence interval coverage

# true parameter values from the data-generating process
true_dat <- data_frame(term = c("(Intercept)", "log(at_bats)"),
                       val = c(0.140, 0.015))

bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
## # A tibble: 2 x 2
##           term coverage
##          <chr>    <dbl>
## 1  (Intercept)   0.8980
## 2 log(at_bats)   0.8675

Coverage falls well below the nominal 95% level for both coefficients, so the quasi-binomial intervals are too narrow for this problem.

Simulation studies in formal research

Questions about quantitative methodology:

  • Which type of confidence intervals should be used for the beta-binomial model?
  • Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
  • How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
  • Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?

Why focus on simulation studies?

Few alternatives for assessing

  • small-sample performance of estimation methods
  • performance of combinations of methods (data analysis pipelines)
  • robustness under model mis-specification
  • comparison of competing methods

Simulation design

Simulation design strategy

  • Write separate functions for each component of the simulation
    • Makes it easier to debug, modify, or re-use code
  • Test each component
  • Run in parallel where possible (see the sketch below)
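
For instance, the replication loop from earlier could be spread across cores using the parallel package (a sketch, reusing simulate_batting_avgs() and quasibinomial_CI() from above; mclapply() relies on forking, which is unavailable on Windows):

library(parallel)

# same 2000 replications as before, split across 4 cores
lots_of_CIs <- mclapply(1:2000, function(i) {
  dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
  quasibinomial_CI(dat)
}, mc.cores = 4)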

Learning more

  • Spring 2017: Data Analysis, Simulation, & Programming in R
  • My blog: http://jepusto.github.io/
  • Ask faculty for articles with good simulation studies
  • November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"