# Introduction to Growth & Parallel Process Models in {lavaan}

A tutorial to parameterize and plot results of a parallel process model (PPM) in `{lavaan}`.

Christopher Loan
2020-09-08
show code
``````knitr::include_graphics(here::here("images", "tomas-sobek.jpeg"))
`````` photo by Tomas Sobek

# Introduction:

During my MS, I had some research questions that were easily answered with structural equation modeling (SEM). I decided to learn `{lavaan}` because it’s open source and allows simple integration with other `R` packages.

Because I relied heavily on publicly available tutorials, I wanted to create a `{lavaan}` tutorial based on a method I used several times: Growth and Parallel Process Models (PPMs).

Most of the tutorials I have found explain only one factor (e.g., the growth curve page on the `{lavaan}` website) and putting two of these together can be daunting to a new `R` or `{lavaan}` user.

Think of this more as a tutorial of the steps needed to take when fitting growth models in `{lavaan}`, rather than a tutorial solely on either the method or the software.

As a quick terminology aside: I refer to growth models where 2+ factors are modeled “in parallel over time” as parallel process models (PPMs), but they are sometimes referred to differently.

1. Being a SEM method, this assumes some familiarity with SEM, though I hope to present this in a way that is clear to as many people as possible.

2. I expect you to understand or to be able to refer to `{lavaan}` syntax for basic syntax/interpretation.

3. Consider what I give you the nuts and bolts in how to make this work, and you should refer to other resources for more in depth interpretations.

show code
``````library(lavaan)
library(tidyverse)
library(ggthemes)
theme_set(theme_economist())
``````

# Simulating Data for Demonstration

In order to complete this project, I simulated some data (n = 500). For simplicity, I actually used `{lavaan}` to make my data, but other software can do this too. I am reserving the code for this until the end, in case you would like to know how I did that, or if you’re curious to see how closely our PPM approximates the structure of the data we made. Specifically, I made two variables across 4 waves (growth modeling assumes time to be equally spaced). I named the variables `{x1}` =`{x}`at time 1, `{x2}` = `{x}` at time 2, etc. & `{y1}` = `{y}` at time 1, etc.

# Fit statistics

In SEM, goodness-of-fit testing is not free; you need degrees of freedom. Specifically, you can calculate fit on models where the highest term is `{x^(n-2)}` where `{n}` is the number of waves you have. I wanted to show a quadratic fitting process, so I included 4 waves (i.e., x^(4-2) = x^2-order models can be assessed for fit).

It’s good practice to have a set of fit statistcs specified a priori so you aren’t cherry picking good and bad results. Here’s a spread of them I’ve selected, along with ideal levels of fit (commented out).

show code
``````selected_fit_stats <-
c("chisq.scaled",
"df.scaled", ## must be >0 to test G.O.F.
"pvalue.scaled", ## ideally n.s.
"cfi.scaled", ## ideally ≥ 0.95
"rmsea.scaled", ## ideally ≤ 0.05
"rmsea.pvalue.scaled", ## ideally n.s.
"srmr" ## ideally < 0.08
)
``````

# Steps for a PPM:

2. Assess fit of various models for both variables (i.e., no growth, linear growth, quadratic growth, etc.).
3. Combine individual ideal models
4. Assess fit (& modify syntax, typically, but that is nuanced & beyond our scope)
5. Plot the predicted data & compare to step 1

# STEP 1. Plot your data

This should give you a guess at what forms probably will fit. I did some work in `{tidyr}` to move the data around, then plotted in `{ggplot2}`, where we can see individual trajectories for each construct over time. I set the `alpha` parameter very low (0.2) so lines are transparent and we can get at density of distribution.

show code
``````sim_growth_dat\$participant_n <-

x_plot <-
pivot_longer(sim_growth_dat,
cols = x1:x4,
names_to = 'x',
names_prefix = 'x')

individual_x_trajectories <-
ggplot(x_plot,
aes(x = as.numeric(x),
y = value,
group = participant_n,
color = participant_n)) +
geom_line(alpha = 0.2) +
labs(title = 'Observed Trajectories of x',
x = 'Timepoint',
y = 'x') +
xlim(1,4) +
theme(legend.position = 'none')

individual_x_trajectories
`````` You can see that the `{x}` variable is probably linearly increasing, but a very “shallow” quadratic model might fit better so we should check if it fits well. If both fit well, we will compare them with likelihood ratio tests (LRTs).

show code
``````y_plot <-
pivot_longer(sim_growth_dat,
cols = y1:y4,
names_to = 'y',
names_prefix = 'y')

individual_y_trajectories <-
ggplot(y_plot,
aes(x = as.numeric(y),
y = value,
group = participant_n,
color = participant_n)) +
geom_line(alpha = 0.2) +
labs(title = 'Observed trajectories of y',
x = 'Timepoint',
y = 'y') +
xlim(1,4) +
theme(legend.position = 'none')

individual_y_trajectories
`````` It is clear that `{y}` decreases with time on average, but there is heterogeneity (or variability) in what happens for an individual case. It looks like a either a linear or quadratic model will fit best.

Let’s dive in!

# STEP 2. Assess fit of various models

Time to first fit the model to an intercept-only model (i.e., no growth observed).

Next, we’ll look at linear growth (linear + intercept), then quadratic growth (quadratic + linear + intercept). We won’t go higher than that because we cannot assess fit with only 4 waves.

There are lots of nuances here, but I discuss some common choices here.

### Function

The `growth()` call in `{lavaan}` is an easy way to specify a growth model. By default, it sets means/intercepts of observed variables to 0 and determines mean structure for latent variables.

### Estimator

All data used here is continuous, so we’ll stick with maximum likelihood (`estimator = ML`) estimation, and I prefer to use the “robust” variant (`estimator = MLR`) whenever possible to account for non-normal data.

### Missingness

For simplicity, there is no missingness in this data. If there were, we could do Little’s Test of Missing data and see patterns of missingness. Assuming data was missing (completely) at random and we didn’t have too much missingness, we could use Full-information Maximum Likelihood to impute missing data.

show code
``````int_x_mod <-
'
iX =~ 1*x1 + 1*x2 + 1*x3 + 1*x4

'
int_x_fit <-
growth(model = int_x_mod,
estimator = 'MLR',
data = sim_growth_dat)

int_x_fit_stats <-
fitmeasures(
int_x_fit,
selected_fit_stats) %>%
data.frame()

round(int_x_fit_stats, 2)
``````
``````                          .
chisq.scaled        1192.66
df.scaled              8.00
pvalue.scaled          0.00
cfi.scaled             0.02
rmsea.scaled           0.54
rmsea.pvalue.scaled    0.00
srmr                   0.63``````

The no growth model for `{x}` does not fit well by any measure. So we’ll move on.

show code
``````int_y_mod <-
'
iY =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
'
int_y_fit <-
growth(model = int_y_mod,
estimator = 'MLR',
data = sim_growth_dat)

int_y_fit_stats <-
fitmeasures(
int_y_fit,
selected_fit_stats) %>%
data.frame()

round(int_y_fit_stats, 2)
``````
``````                          .
chisq.scaled        3136.54
df.scaled              8.00
pvalue.scaled          0.00
cfi.scaled             0.00
rmsea.scaled           0.88
rmsea.pvalue.scaled    0.00
srmr                   0.98``````

The no growth model for `{y}` does not fit well by any measure either. Next, we will check how well a linear-only growth model fits for the `{x}` & `{y}` variables

show code
``````linear_x_mod <-
'
iX =~ 1*x1 + 1*x2 + 1*x3 + 1*x4
sX =~ 0*x1 + 1*x2 + 2*x3 + 3*x4

'
linear_x_fit <-
growth(model = linear_x_mod,
estimator = 'MLR',
data = sim_growth_dat)

linear_x_fit_stats <-
fitmeasures(linear_x_fit,
selected_fit_stats) %>%
data.frame()

round(linear_x_fit_stats, 2)
``````
``````                       .
chisq.scaled        4.03
df.scaled           5.00
pvalue.scaled       0.54
cfi.scaled          1.00
rmsea.scaled        0.00
rmsea.pvalue.scaled 0.92
srmr                0.02``````

The linear growth model for `{x}` fits well by all measures. This is what we expected based on the graph of observed data. We should test quadratic models for fit, though.

show code
``````linear_y_mod <-
'
iY =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
sY =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
'
linear_y_fit <-
growth(model = linear_y_mod,
estimator = 'MLR',
data = sim_growth_dat)

linear_y_fit_stats <-
fitmeasures(
linear_y_fit,
selected_fit_stats) %>%
data.frame()

round(linear_y_fit_stats, 2)
``````
``````                         .
chisq.scaled        842.85
df.scaled             5.00
pvalue.scaled         0.00
cfi.scaled            0.61
rmsea.scaled          0.58
rmsea.pvalue.scaled   0.00
srmr                  0.55``````

Looks like the linear `{y}` is inadequate across all measures.

##### NOTE:

Quadratic terms represent the average rate of change of the slope across waves. This is synonymous with acceleration in other disciplines, though this terminology is not often used and ‘quadratic growth’ is typically preferred in SEM.

show code
``````quad_x_mod <-
'
iX =~ 1*x1 + 1*x2 + 1*x3 + 1*x4
sX =~ 0*x1 + 1*x2 + 2*x3 + 3*x4
qX =~ 0*x1 + 1*x2 + 4*x3 + 9*x4

'
estimator = 'MLR',
data = sim_growth_dat)

selected_fit_stats)

``````
``````       chisq.scaled           df.scaled       pvalue.scaled
0.02                1.00                0.89
cfi.scaled        rmsea.scaled rmsea.pvalue.scaled
1.00                0.00                0.94
srmr
0.00 ``````

Here we have an interesting issue, as `{x}` fits fairly well to both a linear & a quadratic model, but we have a Heywood case, specifically the estimated observed variable variance is negative. Let’s inspect

show code
``````lavInspect(quad_x_fit, "est")\$theta
``````
``````   x1     x2     x3     x4
x1  1.171
x2  0.000  0.718
x3  0.000  0.000  1.213
x4  0.000  0.000  0.000 -0.064``````

Our `{x4}` variable has a negative estimate for the observed variable’s variance. This is a problem, and indicates the model may not represent the data well. There are ways to use parameter constraints to see if the fit is good if we “force” a change to this parameter, but that’s beyond our scope today.

For teaching purposes, we’ll pretend we didn’t have a Heywood case to do the next step.

When you have two well-fitting nested models (in our case linear & quadratic models), we can compare the fit formally with Likelihood Ratio Tests (LRTs). The null hypothesis is essentially that there is no difference in the variance explained by the two models. Parsimony is preferred in a situation where the variance explained is equivalent.

##### NOTE:

Some will say this test is only valid if the models have n.s. p-values and there are no Heywood cases.

show code
``````lavTestLRT(linear_x_fit, quad_x_fit)
``````
``````Scaled Chi-Squared Difference Test (method = "satorra.bentler.2001")

lavaan NOTE:
The "Chisq" column contains standard test statistics, not the
robust test that should be reported per model. A robust difference
test is a function of two standard (not robust) statistics.

Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
linear_x_fit  5 7566.4 7604.3 4.0675     4.0026       4     0.4056``````

Here, a non-significant p-value tells us that the linear model is preferred here.

Now onto the quadratic `{y}` model.

show code
``````quad_y_mod <-
'
iY =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
sY =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
qY =~ 0*y1 + 1*y2 + 4*y3 + 9*y4

'
growth(
estimator = 'MLR',
data = sim_growth_dat)

fitmeasures(
selected_fit_stats
)

``````
``````       chisq.scaled           df.scaled       pvalue.scaled
0.190               1.000               0.663
cfi.scaled        rmsea.scaled rmsea.pvalue.scaled
1.000               0.000               0.813
srmr
0.001 ``````

It’s clear that only the `{x}` linear model was a good one by all fit statistics. We could actually easily see this earlier in Step 1, but it is good that we checked.

When using real—world data this may or may not be the case; I simulated this data specifically to be clean linear and quadratic models for `{x}` & `{y}` respectively.

# STEP 3. Building the PPM

Now that we are fairly confident in functional forms, let’s model them together.

##### NOTE:

For simplicity we are not specifying conditional models which make future growth terms conditional upon earlier growth terms. The distinction between the two is subtle and deserves its own conversation later. For now, just assume growth terms are simply correlated, rather than causally linked.

show code
``````full_model <-
'
## intercept & slope growth terms for X
iX =~ 1*x1 + 1*x2 + 1*x3 + 1*x4
sX =~ 0*x1 + 1*x2 + 2*x3 + 3*x4

## intercept, slope, & quadratic terms for Y
iY =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
sY =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
qY =~ 0*y1 + 1*y2 + 4*y3 + 9*y4

## regress growth terms on predictor
qY + iY + sX + iX ~ predictor
sY ~ a1*predictor

## regress outcome on growth terms
outcome ~ iX + sX + iY + b1*sY + qY

## testing indirect effect
## predictor --> sY --> outcome

predictor_sY_outcome:=a1*b1
'

full_fit <-
growth(
model = full_model,
estimator = 'MLR',
data = sim_growth_dat)

full_fit_stats <-
fitmeasures(
full_fit,
selected_fit_stats)

round(full_fit_stats,2)
``````
``````       chisq.scaled           df.scaled       pvalue.scaled
28.23               34.00                0.75
cfi.scaled        rmsea.scaled rmsea.pvalue.scaled
1.00                0.00                1.00
srmr
0.03 ``````

# Full Model Fit

We can see that this model fits well across all fit statistics chosen. This model is deemed acceptable and we can now interpret the model results

show code
``````summary(full_fit)
``````
``````lavaan 0.6-9 ended normally after 65 iterations

Estimator                                         ML
Optimization method                           NLMINB
Number of model parameters                        29

Number of observations                           500

Model Test User Model:
Standard      Robust
Test Statistic                                 28.456      28.233
Degrees of freedom                                 34          34
P-value (Chi-square)                            0.736       0.746
Scaling correction factor                                   1.008
Yuan-Bentler correction (Mplus variant)

Parameter Estimates:

Standard errors                             Sandwich
Observed information based on                Hessian

Latent Variables:
Estimate  Std.Err  z-value  P(>|z|)
iX =~
x1                1.000
x2                1.000
x3                1.000
x4                1.000
sX =~
x1                0.000
x2                1.000
x3                2.000
x4                3.000
iY =~
y1                1.000
y2                1.000
y3                1.000
y4                1.000
sY =~
y1                0.000
y2                1.000
y3                2.000
y4                3.000
qY =~
y1                0.000
y2                1.000
y3                4.000
y4                9.000

Regressions:
Estimate  Std.Err  z-value  P(>|z|)
qY ~
predictor         0.041    0.048    0.850    0.395
iY ~
predictor         0.034    0.060    0.564    0.573
sX ~
predictor         0.008    0.051    0.161    0.872
iX ~
predictor         0.067    0.060    1.119    0.263
sY ~
predictor (a1)    2.022    0.079   25.545    0.000
outcome ~
iX                1.994    0.145   13.755    0.000
sX               -0.017    0.146   -0.115    0.908
iY               -0.028    0.186   -0.149    0.881
sY        (b1)    2.973    0.104   28.596    0.000
qY               -0.242    0.161   -1.510    0.131

Intercepts:
Estimate  Std.Err  z-value  P(>|z|)
.x1                0.000
.x2                0.000
.x3                0.000
.x4                0.000
.y1                0.000
.y2                0.000
.y3                0.000
.y4                0.000
.outcome           0.000
.iX                1.976    0.059   33.700    0.000
.sX                0.991    0.051   19.350    0.000
.iY                0.031    0.061    0.508    0.611
.sY               -1.153    0.082  -14.019    0.000
.qY               -1.548    0.048  -32.151    0.000

Variances:
Estimate  Std.Err  z-value  P(>|z|)
.x1                1.067    0.099   10.777    0.000
.x2                0.885    0.069   12.924    0.000
.x3                0.983    0.104    9.492    0.000
.x4                1.113    0.197    5.646    0.000
.y1                0.961    0.107    8.991    0.000
.y2                0.868    0.083   10.470    0.000
.y3                0.996    0.179    5.557    0.000
.y4                2.467    0.949    2.601    0.009
.outcome           0.814    0.927    0.878    0.380
.iX                1.043    0.100   10.444    0.000
.sX                1.094    0.076   14.393    0.000
.iY                1.040    0.108    9.595    0.000
.sY                1.131    0.125    9.078    0.000
.qY                0.927    0.068   13.684    0.000

Defined Parameters:
Estimate  Std.Err  z-value  P(>|z|)
predctr_sY_tcm    6.011    0.159   37.874    0.000``````

# Interpretation

I am going to trust you understand how to read `{lavaan}` here, so I’m not going to walk through this line by line. The headlines are `{iX}` & `{sY}` significantly predict `{outcome}`, with the latter predicting it to a much larger degree than the former.

This means baseline levels of `{x}` predict outcome, as does the linear rate of change of `{y}`, both positively such that higher levels of initial `{x}` or faster rates of change in `{y}` lead to higher levels of `{outcome}`.

In our example, I included a time-invariant covariate `{predictor}`. Since `{predictor}` significantly predicted `{sY}` and `{sY}` predicted `{outcome}`, we would want to look for mediation. You may trust the estimate given by simply the product of each coefficient and it’s significance test, but it is better to rely on other methods to test indirect effects.

# Bootstrapping

Bootstrapping is a resampling method (with replacement) that is appreciated for building confidence intervals around parameter estimates of indirect effects, at least in part, because they do not make distributional assumptions about the indirect effect. This creates a more reliable test than standard significance tests. A quick Google Scholar search can inform you on this topic more deeply than this post, but I used this to test the indirect effects using 5000 simulated data sets.

show code
``````final_fit_boot<-
growth(full_fit,
data = sim_growth_dat,
estimator = "ML",
meanstructure = T,
se = "bootstrap",
bootstrap = 100, # ~5000 better
parallel = "multicore")

parameterEstimates(
final_fit_boot,
level = .95,
boot.ci.type = "bca.simple",
stand = T)[61,c(4,5,9,10)]
``````
``````                  label   est ci.lower ci.upper
61 predictor_sY_outcome 6.011    5.707    6.355``````

Since this is a tutorial, I set our model to only run 100 times for speed, but you’d likely want to set `bootstrap = 5000` for better practice.

Furthermore, looking at the 95% confidence intervals, we find the same result as we did regarding the indirect effect as without bootstrapping. These are the results I trust more, especially since we used `bca.simple`.

Most likely this is what we wanted to do with this model (at least in my field). Assess if a predictor influences growth in a construct, then see if there is an indirect effect (and quanitify its magnitude).

##### NOTE:

I made it so only the label, unstandardized estimate and confidence intervals showed up with the `[61,c(4,5,9,10)]` portion of code, but you should not trust the p-values from this parameterEstimates() if you go with anything but the default (i.e., `boot.ci.type = norm`). This is because the p-value is built assuming a normal distribution; this eliminates some of the benefits of bootstrapping gained by `bca.simple` and other methods, but gives you a p-value, which matters a lot to some.

# Plotting

We can plot the growth of these traits, as well with `{ggplot2}`, which helps verify you did this correctly. I won’t go into it here, but these plots can be modified with facet wrapping and other grouping/coloring methods to see how different groups have different slopes, for example.

First we extract factor scores & save this as a data frame, then we can plot. For a linear model, this is pretty simple:

show code
``````plot_dat <-
lavPredict(full_fit) %>%
data.frame()

plot_dat\$participant_n <-
1:nrow(plot_dat)

plot_dat\$predictor <-
sim_growth_dat\$predictor

predicted_x_trajectories <-
ggplot(data=plot_dat) +
scale_x_continuous(
name = "Timepoint",
limits=c(0, 3),
breaks = c(0, 1, 2, 3),
labels = c('1', '2', '3', '4')
) +
scale_y_continuous(
name = "x",
limits=c(-5,15)
)+
geom_abline(
data = plot_dat,
mapping =
aes(slope=sX,
intercept=iX,
color = participant_n),
alpha = 0.2) +
labs(title = 'Predicted trajectories of x')  +
theme(legend.position = 'none')
``````

We can compare those to the observed trajectories (top) to the predicted trajectories (bottom) with `{gridExtra}` to merge our plots.

show code
``````gridExtra::grid.arrange(
individual_x_trajectories,
predicted_x_trajectories,
ncol = 1)
`````` For a quadratic model, this is more challenging, but here’s some code to do it:

show code
``````plot_dat <-
lavPredict(full_fit) %>%
data.frame()

plot_dat\$participant_n <-
1:nrow(plot_dat)

plot_dat\$predictor <-
sim_growth_dat\$predictor

qY <- plot_dat\$qY
sY <- plot_dat\$sY
iY <- plot_dat\$iY

test <- function(y) {qY*y ^ 2 + sY*y + iY}

plot_dat_2 <-
data.frame(`T1` = test(0),
`T2` = test(1),
`T3` = test(2),
`T4` = test(3),
row.names = 1:nrow(plot_dat))

names(plot_dat) <- c('1', '2', '3', '4')

test0 <- test(0)
test1 <- test(1)
test2 <- test(2)
test3 <- test(3)

Y_df <-
c(test0,
test1,
test2,
test3)

df_long_wave <-
c(rep(1,500),
rep(2, 500),
rep(3, 500),
rep(4, 500)
)

plot_dat_2 <- data.frame(
participant_n =
rep(1:500,4),
Timepoint =
df_long_wave,
y = Y_df)

predicted_y_trajectories <-
plot_dat_2 %>%
ggplot(
aes(x = Timepoint,
y = y,
group = participant_n,
color = participant_n)) +
geom_line(formula = y ~ x + I(x^2),
method = "lm",
se = F,
alpha = 0.2) +
labs(title ='Predicted trajectories of y') +
theme(legend.position = 'none')
``````

Again, can compare those to the observed trajectories (top) to the predicted trajectories (bottom).

show code
``````gridExtra::grid.arrange(
individual_y_trajectories,
predicted_y_trajectories,
ncol = 1)
`````` # Next Steps:

If you were to be working on this, I would work further to make the model more parsimoneous, as well as seeing if alternate modifications would increase model fit. Ways to do this include, but are not limited to, constraining residuals, assessing the role of the predictor on the observed vs. manifest model, making the growth terms conditional upon earlier terms, etc.

# Looking at the simulation call

Here is the exact simulation call I did in case anyone is curious. I wanted the data to be interesting, so I made sure to set some things (e.g., variances, latent means/intercepts, relationships between variables, etc.)

show code
``````growth_mod <-
'

## intercept & slope growth terms for X
iX =~ 1*x1 + 1*x2 + 1*x3 + 1*x4
sX =~ 0*x1 + 1*x2 + 2*x3 + 3*x4

## intercept, slope, & quadratic terms for Y
iY =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
sY =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
qY =~ 0*y1 + 1*y2 + 4*y3 + 9*y4

## set variances

y4 ~~ 2*y4
x4 ~~ 1*x4

## set latent means/intercepts
iX ~ 2*1
sX ~ 1*1
sY ~ -1*1
qY ~ -1.5*1

sY ~ 2*predictor

outcome ~ 2*iX + 3*sY
'
``````

In the data I simulated, Variable `{x}` is changing linearly over time and variable `{y}` is changing quadratically, which his what we found.

show code
``````sim_growth_dat <- simulateData(
model = growth_mod,
model.type = "growth",
seed = 82020,
orthogonal = F,
auto.cov.y = T,
auto.var = T
)
``````