Simulating Multilevel Linear Data with Interaction

Multi-Level Modeling MLM Plotting Simulated Data Tutorial

function to simulate multilevel data with interaction.

Christopher Loan
2021-07-28
show code
library(tidyverse)
library(glmertree)

Disclaimer

My code is available for use and alteration, but please give credit with a citation & link to the page.

Goal

Share code for simulating multilevel data, with optional interaction term.

Function to simulate 2 level data with an interaction

This function allows you to specify sample sizes, two main effects (main_x & main_z), an interaction, and other features of the data. If you specify interaction = 0, you can simply specify multilevel data with no interaction term. Of course the code could be extended to include as many effects as you need. It also provides a variable (w) that’s not associated with the outcome (outcome).

show code
simulate_interaction_w_2_levels <-
  function(
    n, ## number level 1 units
    j, ## number level 2 units
    intercept_lv1, ## intercept at level 1
    main_x, ## main effect of x
    main_z, ## main effect of z
    interaction, ## interaction of x and z
    residual_var_sd_lv1, ## standard deviation of residuals at level 1
    random_int_mean_lv2, ## mean of random intercept at level 2
    random_int_sd_lv2, ## standard deviation of random intercept at level 2,
    start_seed = 123
  ){
    
    ## placeholder for level 2 outcomes
    outcomes_j <- vector("list", j)
    
    ## for each variable, make list 
    ## as long as level 2
    ## fill each element with a list of level 1 outcomes
    
    
    x <- vector("list", j) 
    z <- vector("list", j) 
    w <- vector("list", j) 
    
    ## School distribution (intercept at level 2)
    ## Standard deviation of random intercept at level 2
    set.seed(start_seed)
    
    a_j <- rnorm(j, random_int_mean_lv2, random_int_sd_lv2)
    
    for(i in 1:j) {
      
      ## make a level 1 predictor variable:
      ## set multiple seeds that change each iteration
      ## prevents identical cases with seed
      
      set.seed(start_seed+i)
      x[[i]] <- rnorm(n)
      set.seed(start_seed-i)
      z[[i]] <- rnorm(n)
      set.seed(-start_seed + i)
      outcomes_j[[i]] <- 
        rnorm(
          n, 
          intercept_lv1 + 
            interaction*z[[i]]*x[[i]] + 
            main_x*x[[i]] + 
            main_z*z[[i]] + 
            a_j[i], 
          ## standard deviation of residual variance
          residual_var_sd_lv1
        )
      set.seed(start_seed*197+197*i)
      w[[i]] <- rnorm(n)
    }
    
    outcomes_df <- 
      data.frame(
        id = rep(1:j, each = n),
        x = unlist(x),
        z = unlist(z),
        outcome = unlist(outcomes_j),
        w = unlist(w)
      )
    
    return(outcomes_df)
  }
show code
# baseline theme for plots here
project_theme <- 
  theme(
    axis.line = element_line(size = 3, lineend = 'round'), 
    legend.position = 'none',
    plot.title.position = 'plot', 
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(size = 18),
    axis.text = element_text(size = 16),
    axis.title.y = element_text(size = 20),
    plot.title = element_text(size = 25),
    strip.text = element_text(size = 25),
  )

Simulating data

We have 15 level 1 units, 71 level 2 units. The level 1 intercept of the outcome is 4.25 with a main effect of 1.25 for x, 2.00 for z, and 0.75 for their interaction. The function also makes a nuisance variable w, which does not contribute to outcome at all.

show code
sim_dat <- 
  simulate_interaction_w_2_levels(
    n = 15, ## number level 1 units at each level 2
    j = 71, ## number level 2 units
    intercept_lv1 = 4.25, ## intercept at level 1
    main_x = 1.25, ## main effect of x
    main_z = 2.00, ## main effect of z
    interaction = 0.75, ## interaction of x and z
    residual_var_sd_lv1 = 2.00, ## standard deviation of residuals at level 1
    random_int_mean_lv2 = 0, ## mean of random intercept at level 2
    random_int_sd_lv2 = 1.00, ## standard deviation of random intercept at level 2,
    start_seed = 123 ## ensure you can reproduce 
  ) %>% 
  tibble()

Put formally, the simulation is approximating:

\[ \begin{aligned} \operatorname{outcome}_{i} &\sim N \left(\mu, \sigma^2 \right) \\ \mu &=\alpha_{j[i]} + \beta_{1}(\operatorname{x}) + \beta_{2}(\operatorname{z}) + \beta_{3}(\operatorname{w}) + \beta_{4}(\operatorname{x} \times \operatorname{z}) \\ \alpha_{j} &\sim N \left(\mu_{\alpha_{j}}, \sigma^2_{\alpha_{j}} \right) \text{, for id j = 1,} \dots \text{,J} \end{aligned} \]

where

\[\mu = 4.25\] \[\sigma = 2.00\] \[\beta_{1} = 1.25\] \[\beta_{2} = 2.00\] \[\beta_{3} = 0\] \[\beta_{4} = 0.75\] \[J = 71\] \[\mu_{\alpha_{j}} = 0\] \[\sigma_{\alpha_{j}}= 1.00\]

Describing data & verifying simulation results

Here are the overall averages for the 1065 cases, with 71 level two units.

show code
sim_dat %>% 
  summarize(
    x_mean = mean(x), 
    x_sd = sd(x),
    z_mean = mean(z),
    z_sd = sd(z),
    outcome_mean = mean(outcome), 
    outcome_sd = sd(outcome), 
    w_mean = mean(w), # w has no relation to outcome
    w_sd = sd(w) # w has no relation to outcome
  )
# A tibble: 1 x 8
  x_mean  x_sd  z_mean  z_sd outcome_mean outcome_sd  w_mean  w_sd
   <dbl> <dbl>   <dbl> <dbl>        <dbl>      <dbl>   <dbl> <dbl>
1 0.0900 0.946 -0.0197  1.00         4.44       3.38 -0.0208 0.993

Distribution of Variables

show code
figure_1 <- 
  sim_dat %>% 
  pivot_longer(
    cols = -id, 
    names_to = 'variable',
    values_to = 'statistic'
    ) %>% 
  group_by(variable) %>% 
  mutate(average_value = mean(statistic)) %>% 
  ungroup() %>% 
  ggplot(
    aes(
      x = statistic, 
      fill = variable)
  ) +
  colorblindr::scale_fill_OkabeIto() +
  colorblindr::scale_color_OkabeIto() +
  geom_vline(aes(xintercept = average_value), size = 1.25, linetype = 2) +
  geom_density(alpha = 0.7, color = 'black', size = 2) +
  facet_wrap(~variable, scales = 'free') +
  labs(
    y = element_blank(),
    x = 'Distribution of Simulated Values', 
    title = 'Figure 1. Distribution of Simulated Variables', 
    caption = 'Note the scales are free on each plot\nDotted Lines = Average Values') +
  theme_minimal() +
  project_theme + 
  theme(axis.text.y = element_blank())
figure_1

When you group by the id, the simulated averages are normally distributed around the values (Figure 1).

Distribution of Group Averages

show code
figure_2 <- 
  sim_dat %>% 
  group_by(id) %>% 
  summarize(
    x = mean(x), 
    z = mean(z),
    outcome = mean(outcome), 
    w = mean(w)
    ) %>% 
  pivot_longer(
    cols = -id, 
    names_to = 'variable',
    values_to = 'statistic'
    ) %>% 
  group_by(variable) %>% 
  mutate(average_value = mean(statistic)) %>% 
  ungroup() %>% 
  ggplot(
    aes(
      x = statistic, 
      fill = variable)
  ) +
  colorblindr::scale_fill_OkabeIto() +
  colorblindr::scale_color_OkabeIto() +
  geom_vline(aes(xintercept = average_value), size = 1.25, linetype = 2) +
  geom_density(alpha = 0.7, color = 'black', size = 2) +
  facet_wrap(~variable, scales = 'free') +
  labs(
    y = element_blank(),
    x = 'Distribution of Simulated Values (group averages)', 
    title = 'Figure 2. Distribution of Group Averages for\nSimulated Variables', 
    caption = 'Note the scales are free on each plot') +
  theme_minimal() +
  project_theme + 
  theme(axis.text.y = element_blank())
figure_2

show code
figure_3a <-   
  sim_dat %>% 
  ggplot(aes(x = x, y = outcome)) + 
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(
    title = 'Figure 3. Association of Outcome & Each Covariate',
    subtitle = '3a. Association of Outcome & x'
    ) +
  theme(plot.subtitle = element_text(size = 15))

figure_3b <-   
  sim_dat %>%
  ggplot(aes(x = z, y = outcome)) + 
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(subtitle = '3b. Association of Outcome & z') +
  theme(plot.subtitle = element_text(size = 15))

figure_3c <- 
  sim_dat %>% 
  ggplot(aes(x = x*z, y = outcome)) + 
  geom_point() +
  geom_smooth(method = 'lm') +
  theme_minimal() +
  project_theme +
  labs(subtitle = '3c. Association of Outcome & Interaction') +
  theme(plot.subtitle = element_text(size = 15))

ggpubr::ggarrange(figure_3a, figure_3b, figure_3c, ncol = 1)

Fitting multilevel model to the data

The simulation looks to have worked as planned. With larger samples, the effects becomes closer to what we specified, obviously, but that was omitted for brevity. Let’s see what this data looks like with an {lme4} to be sure.

show code
confirmation_lmer <- 
  lmer(
    data = sim_dat, 
    formula = outcome ~ x * z + w + # equivalent to x + z + x : z
      (1 | id)
  )
summary(confirmation_lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: outcome ~ x * z + w + (1 | id)
   Data: sim_dat

REML criterion at convergence: 4729.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4943 -0.6663 -0.0260  0.6799  2.9690 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.7374   0.8587  
 Residual             4.5271   2.1277  
Number of obs: 1065, groups:  id, 71

Fixed effects:
            Estimate Std. Error t value
(Intercept)  4.37559    0.12117  36.111
x            1.09797    0.07135  15.389
z            1.95442    0.06737  29.011
w            0.06962    0.06750   1.031
x:z          0.80677    0.07160  11.268

Correlation of Fixed Effects:
    (Intr) x      z      w     
x   -0.053                     
z    0.012 -0.014              
w    0.013 -0.024 -0.027       
x:z -0.005 -0.049 -0.141 -0.003

We see how closely the MLM estimates identify the simulated effects. We see small standard errors and large t-values for these estimates, too. As we’d like to see, w does not have significant associations with the outcome.