A Structural Equation Model from Scratch

In this post, I will build a latent variable Structural Equation Model (SEM) from scratch. Latent variable models are useful for reducing many observations into a single variable. For example, let’s say there is a battery of questionnaires that assess depressive symptoms, and each questionnaire consists of many different items. It is possible to construct a latent variable of depression from those questionnaires and items.

I use SEM a ton in my research, but I’ve always used external software to fit the models (e.g., lavaan, MPlus, OpenMX), so I wanted to dig into the mechanics of SEM to understand it a bit better. Moreover, SEM is incredibly flexible, and understanding its mechanics can also help to derive all sorts of other models, such as random effects models or mixture models.

The material from this post is borrowed from Rubin & Thayer (1982) and Craig Enders’ book on missing data.

The latent variable model is given by:

\(Y = ZB + \epsilon\)

where Y is an \(n\) X \(p\) observed data matrix; Z is the \(n\) X \(q\) unobserved factor matrix; \(B\) is the \(q\) X \(p\) factor loading matrix; and \(\epsilon\) is the residual term, with covariance \(\Sigma\). A key assumption in factor analysis is that once the factors are accounted for, the variables are independent, which means that there will be no covariances in the off-diagonal of \(\Sigma\), so we can replace \(\Sigma\) with \(\tau^2\) = diag(\(\tau_1^2\), … \(\tau_p^2\)).

Note: in the above equation, we are not including the intercept because we will center the Y variables at their sample means, removing the intercept.

This equation is the same as a regression equation. The issue is that Z is unobserved, so we need to estimate the entire right hand side of the equation from Y. In order to estimate the right hand side, we will use the Expectation-Maximization (E-M) algorithm.
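To make the model concrete, here is a minimal simulation of this generative process in R (a sketch: the sample size, dimensions, loading values, and the _sim variable names are arbitrary choices of mine, not values from Rubin & Thayer):

# simulate Y = ZB + error for n observations, p observed variables, q factors
set.seed(1)
n = 200; p = 6; q = 2

Z_sim = matrix(rnorm(n * q), nrow = n, ncol = q)                 # latent factors, standardized
B_sim = matrix(runif(q * p, .4, .8), nrow = q, ncol = p)         # q x p loading matrix
E_sim = matrix(rnorm(n * p, sd = sqrt(.3)), nrow = n, ncol = p)  # residuals, tau^2 = .3

Y_sim = Z_sim %*% B_sim + E_sim
Y_sim = scale(Y_sim, center = TRUE, scale = FALSE)               # center Y, removing the intercept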

The E-M algorithm is a two-step algorithm. It starts with the E-Step, which uses an initial estimate of the mean vector and covariance matrix to predict the missing variables from the observed variables. The M-Step then applies standard complete-data formulas to generate updated estimates of the mean vector and covariance matrix. These two steps are repeated until the mean vector and covariance matrix no longer change between consecutive M-steps.

This post will detail the E-M algorithm, but another helpful blog post can be found here.

Before we get to the E-M algorithm, however, it is important to review some distributions.

The Univariate Normal Distribution

The likelihood of the normal distribution is expressed by the following equation:

\(L_i = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-.5(y_i - u)^2}{\sigma^2}}\)

The most important part of this equation is the Mahalanobis distance term that quantifies the distance between a score and the mean (\(y_i - u\)). We won’t be using the univariate normal distribution in this example because factor analysis uses the multivariate normal distribution, but it is worthwhile to review in order to better understand the multivariate normal.
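As a quick sanity check on this formula, we can compute the likelihood of a single score by hand and compare it to R’s built-in dnorm() (the values of y, mu, and sigma2 below are arbitrary):

# likelihood of one score under a univariate normal, computed by hand
y = 1.5; mu = 0; sigma2 = 2

L_manual = 1 / sqrt(2 * pi * sigma2) * exp(-.5 * (y - mu)^2 / sigma2)
L_dnorm = dnorm(y, mean = mu, sd = sqrt(sigma2))

c(L_manual, L_dnorm)  # the two values should agree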

The Multivariate Normal Distribution

The likelihood of the multivariate normal distribution is represented by the following equation:

\(L_i = \frac{1}{(2\pi)^{k/2}|\Sigma|^{\frac{1}{2}}}e^{-.5(Y_i-u)^T\Sigma^{-1}(Y_i-u)}\)

The univariate density has three components: the score value (\(y_i\)), the population mean (\(u\)), and the population variance (\(\sigma^2\)). In the multivariate normal, these quantities appear as vectors and matrices:

  • Each individual has a set of k scores in the score vector \(Y_i\)
  • \(u\) is a mean vector
  • \(\Sigma\) is a covariance matrix

Just as in the univariate example, the most important part of this equation is the Mahalanobis distance value: \((Y_i-u)^T\Sigma^{-1}(Y_i-u)\)

The log-likelihood of the multivariate normal distribution is:

\(\log L_i = -\frac{k}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma| - \frac{1}{2}(Y_i - u)^T\Sigma^{-1}(Y_i - u)\)
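Translating this log-likelihood directly into base R (a sketch; the score vector, mean vector, and covariance matrix below are arbitrary):

# log-likelihood of one score vector under a multivariate normal
Y_i = c(1, 2)
u = c(0, 0)
Sigma = matrix(c(1, .5, .5, 1), nrow = 2)

k = length(Y_i)
md = t(Y_i - u) %*% solve(Sigma) %*% (Y_i - u)  # Mahalanobis distance term

logL_i = -k / 2 * log(2 * pi) - .5 * log(det(Sigma)) - .5 * md
logL_i

If the mvtnorm package is installed, dmvnorm(Y_i, u, Sigma, log = TRUE) should return the same value.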

EM Algorithm

As stated above, the E-M algorithm iterates between calculating the expected values of the missing variables given the observed data (E-Step) and finding the most likely values of the parameters to be estimated (M-Step). In a factor analysis, the E-M algorithm maximizes the likelihood of the data, integrated over the missing data Z. The complete-data log-likelihood that is maximized at each M-Step is (up to an additive constant):

  • \(-\frac{n}{2}\sum_{j=1}^{p}\log\tau_{j}^2 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{p}\frac{(Y_{ij} - Z_iB_j)^2}{\tau_j^2} - \frac{n}{2}\log|R| - \frac{1}{2}\sum_{i = 1}^nZ_iR^{-1}Z'_i\)

Breaking this equation down a bit further, we have the first part:

  • \(-\frac{n}{2}\sum_{j=1}^{p}\log\tau_{j}^2 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{p}\frac{(Y_{ij} - Z_iB_j)^2}{\tau_j^2}\)

This is the log density of the multivariate normal distribution of the predicted values of Y (see the log-likelihood of the multivariate normal above). The differences between the above equation and the original multivariate normal are that we

  1. replaced \(u\) with \(Z_iB_j\)
  2. replaced the determinant of \(\Sigma\) with \(\tau^2\) because the residual covariance is assumed to be 0, so all we are left with is the residual variance for each variable
  3. dropped the \(2\pi\) term because it’s a constant and does not affect the maximization process

The second part of this equation:

  • \(-\frac{n}{2}\log|R| - \frac{1}{2}\sum_{i = 1}^nZ_iR^{-1}Z'_i\)

is the log density of the factors Z.
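As a hedged sketch, this complete-data log-likelihood can be written as an R function (the function and argument names are mine; Y is assumed to be an n x p matrix, Z an n x q matrix, tau2 a length-p vector of residual variances, and R the q x q factor correlation matrix):

# complete-data log-likelihood of the factor model (constant 2*pi terms dropped)
complete_loglik = function(Y, Z, B, tau2, R){
  n = nrow(Y)
  resid = Y - Z %*% B                                # Y_ij - Z_i B_j
  part_y = -n / 2 * sum(log(tau2)) -
            .5 * sum(sweep(resid^2, 2, tau2, '/'))   # log density of Y given Z
  part_z = -n / 2 * log(det(R)) -
            .5 * sum(diag(Z %*% solve(R) %*% t(Z)))  # log density of Z
  part_y + part_z
}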

The E-Step

Generally, the purpose of the E-step is to fill in the missing values. In our case, we want to fill in the entire right hand side of \(Y = ZB + \epsilon\), since Z is unobserved.

Fortunately, we can fill in the right hand side entirely with a few sufficient statistics, which will also allow us to find the expectation of the above equation. Sufficient statistics are functions of the data that can completely summarize the data likelihood. The sufficient statistics are given by:

  • \(C_{yy} = \sum_{i=1}^n\frac{Y'_iY_i}{n}\), a \(p\) X \(p\) observed matrix.
  • \(C_{zz} = \sum_{i=1}^n\frac{Z'_iZ_i}{n}\), a \(q\) X \(q\) matrix
  • \(C_{yz} = \sum_{i=1}^n\frac{Y'_iZ_i}{n}\), a \(p\) X \(q\) matrix

With these sufficient statistics, we can fill in the mean of Z (we assume this is 0 because we assume Z is standardized), the variance of Z, and the cross product of Y and Z.
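If we could observe both Y and Z, these sufficient statistics would just be scaled cross-product matrices. A sketch using the simulated Y_sim and Z_sim from the beginning of the post:

# sufficient statistics from (hypothetically) complete data
C_yy_sim = t(Y_sim) %*% Y_sim / n  # p x p
C_zz_sim = t(Z_sim) %*% Z_sim / n  # q x q
C_yz_sim = t(Y_sim) %*% Z_sim / n  # p x q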

Given \(\tau^2\), \(R\), and \(B\), the \((Y_i, Z_i)\) are independent and identically distributed (\(p + q\))-variate normal random variables. Thus, given \(\tau^2\), \(R\), and \(B\), the conditional distribution of \(Z_i\) given \(Y_i\) is q-variate normal with mean \(Y_i\delta\) and covariance \(\Delta\), where the regression coefficient \(\delta\) and residual covariance matrix \(\Delta\) are given by:

\(\delta = (\tau^2 + B'RB)^{-1}B'R\)

\(\Delta = R - RB(\tau^2 + B'RB)^{-1}B'R\)

It is important to remember that Z is standardized (mean 0 and variance 1), and in our example we will also assume the factors are uncorrelated, so \(R = I\). In that case, \(\delta\) reduces to \((\tau^2 + B'B)^{-1}B'\): we take \(B'\) and divide it by the total covariance of Y (\(\tau^2 + B'B\)). Similarly, \(\Delta\) reduces to \(I - B(\tau^2 + B'B)^{-1}B'\): the identity (the covariance of Z) minus the part of Z that is explained by Y.
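In R, \(\delta\) and \(\Delta\) can be computed directly from the current parameter values. Below is a sketch of the general formulas (the function name compute_delta is my own); with R = diag(q) it reduces to the expressions used in the worked example later:

# regression coefficient delta and residual covariance Delta of Z given Y
# B is q x p, tau2 is a length-p vector, R is the q x q factor correlation matrix
compute_delta = function(B, tau2, R){
  M = solve(diag(tau2) + t(B) %*% R %*% B)        # (tau^2 + B'RB)^-1
  list(delta = M %*% t(B) %*% R,                  # p x q
       Delta = R - R %*% B %*% M %*% t(B) %*% R)  # q x q
}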

And finally, the conditional expectations of the sufficient statistics given the parameters \(\phi = (\tau^2, B, R)\) and the observed data \(Y_1, ..., Y_n\) are

  • \(E(C_{yy}| Y, \phi) = C_{yy}\)
  • \(E(C_{yz}| Y, \phi) = C_{yy}\delta\)
  • \(E(C_{zz}| Y, \phi) = \delta'C_{yy}\delta + \Delta\)

The M-Step

Because the \(Y_j, j = 1, ..., p\) given Z are conditionally independent, we can deal with each Y variable separately.

Consider the \(j\)th Y-variable with regression coefficient \(B_j\) on Z. The maximum likelihood estimator of \(B_j\) from the estimated sufficient statistics is \(B_j = [\delta'C_{yy}\delta + \Delta]^{-1}(\delta'C_{yy})_j\), and the maximum likelihood estimator of \(\tau_j^2\) from the estimated sufficient statistics is:

\(\tau_j^2 = C_{yyj} - (C_{yy}\delta)_j[\delta'C_{yy}\delta + \Delta]^{-1}(\delta'C_{yy})_j\)

where \(C_{yyj}\) is the \(j\)th diagonal element of \(C_{yy}\).

Finally, the maximum likelihood estimator of R is the expectation of \(C_{zz}\) normed to be a correlation matrix. However, in our example, we are going to assume the factors are uncorrelated, so \(R = I\), where \(I\) is the identity matrix.

An Example!

In this example, we are going to recreate the factors derived from Rubin & Thayer (1982).

Let’s start with the following covariance matrix \(C_{yy}\):

C_yy = matrix(
  c(1.0, .554, .227, .189, .461, .506, .408, .280, .241,
  .554, 1.0, .296, .219, .479, .530, .425, .311, .311,
  .227, .296, 1.0, .769, .237, .243, .304, .718, .730,
  .189, .219, .769, 1.0, .212, .226, .291, .681, .661,
  .461, .479, .237, .212, 1.0, .520, .514, .313, .245,
  .506, .530, .243, .226, .520, 1.0, .473, .348, .290,
  .408, .425, .304, .291, .514, .473, 1.0, .374, .306,
  .280, .311, .718, .681, .313, .348, .374, 1.0, .692,
  .241, .311, .730, .661, .245, .290, .306, .692, 1.0),
  nrow = 9, ncol = 9)

And now we need to make some starting values for the residual variances \(\tau^2\) and the regression coefficient matrix \(B\).

In our model, we are assuming that these 9 variables can be represented by 4 factors. More specifically, we assume that variables 1-4 have 0 coefficients on factor 4, variables 5-9 have 0 coefficients on factor 3, and otherwise there are no restrictions.

B = matrix(ncol = 9, nrow = 4, 
                         c(.7, .6, .3, 0,
                         .7, .6, .3, 0,
                         .7, .6, .3, 0,
                         .7, .6, .3, 0, 
                         .7, .6, 0, .3,
                         .7, .6, 0, .3,
                         .7, .6, 0, .3,
                         .7, .6, 0, .3,
                         .7, .6, 0, .3))


t = rep(.06, 9)

The E-Step

Again, we are computing \(\delta\): taking \(B'\) and dividing it by the total covariance of Y (\(\tau^2 + B'B\)). With this new parameter, we can now generate the predicted values of Z.

beta_est = solve(diag(t) + t(B) %*% (B)) %*% t(B)

Here we subtract the part of Z explained by Y from the identity matrix to get \(\Delta\). Again, we use the identity matrix because we assume the factors are uncorrelated with a mean of 0 and a variance of 1.

resid_est = diag(4) - B %*% solve(diag(t) + t(B) %*% (B)) %*% t(B)

Finally, we have the expectations of the sufficient statistics.

Y is observed, so its expectation is the same as the observed covariance matrix.

E_cyy = C_yy

Multiply the covariance matrix of Y by the parameter for Z to get the covariance between Y and Z.

E_cyz = C_yy %*% beta_est

And finally, we have the covariance of Z.

E_czz = t(beta_est) %*% C_yy %*% beta_est + resid_est

The M-Step

Take the covariance between Y and Z and divide by the variance of Z to get the new estimate for B.

beta_max = solve(E_czz) %*% t(E_cyz)

Since we are assuming factor loadings of 0 for some variables, we set those to 0 manually here:

beta_max[3,5:9] = 0
beta_max[4,1:4] = 0

Take the covariance of Y, subtract the covariance explained by the predicted factors, and keep the diagonal to get the updated residual variances:

resid_max = diag(E_cyy - (E_cyz %*% solve(E_czz) %*% t(E_cyz)))
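As a sanity check, the jth element of resid_max should match the scalar formula for \(\tau_j^2\) given earlier. Using the matrices already defined above and picking j = 1 arbitrarily:

# tau_j^2 = C_yyj - (C_yy delta)_j [delta' C_yy delta + Delta]^-1 (delta' C_yy)_j
j = 1
tau_j = E_cyy[j, j] - E_cyz[j, ] %*% solve(E_czz) %*% t(E_cyz)[, j]

c(tau_j, resid_max[j])  # the two values should agree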

And that’s it!

And now it’s time to put the E and M steps into functions.

E-M equations

e_step = function(B, t){
  
  # delta: regression of Z on Y, (tau^2 + B'B)^-1 B' (with R = I)
  beta_est = solve(diag(t) + t(B) %*% (B)) %*% t(B)
  # Delta: residual covariance of Z given Y, I - B (tau^2 + B'B)^-1 B'
  resid_est = diag(4) - B %*% (solve(diag(t) + t(B) %*% (B))) %*% t(B)
  
  # expected sufficient statistics (C_yy is taken from the global environment)
  E_cyy = C_yy
  E_cyz = C_yy %*% beta_est
  E_czz = t(beta_est) %*% C_yy %*% beta_est + resid_est
  
  return (list('E_cyy' = E_cyy, 'E_cyz' =  E_cyz, 'E_czz' = E_czz, 
               'beta_est' = beta_est, 'resid_est' = resid_est))
  
}



m_step = function(estimates){

  E_cyy = estimates[['E_cyy']]
  E_czz = estimates[['E_czz']]
  E_cyz = estimates[['E_cyz']]
  
  # updated loadings: [E(C_zz)]^-1 E(C_yz)'
  beta_max = solve(E_czz) %*% t(E_cyz)
  # impose the zero-loading restrictions
  beta_max[3, 5:9] = 0
  beta_max[4, 1:4] = 0
  
  # updated residual variances: diagonal of C_yy minus the explained covariance
  resid_max = diag(E_cyy - (E_cyz %*% solve(E_czz) %*% t(E_cyz)))
  
  return (list('beta_max' = beta_max, 'resid_max' = resid_max))
  
}

And now let’s try this out!

Implementing the algorithm

new_estimates = NULL

for( run in c(1:50)){
  
  if (! is.null(new_estimates)){
    B = new_estimates[['beta_max']]
    t = new_estimates[['resid_max']]
  }else{
    B = matrix(ncol = 9, nrow = 4, 
                         c(.7, .6, .3, 0,
                         .7, .6, .3, 0,
                         .7, .6, .3, 0,
                         .7, .6, .3, 0, 
                         .7, .6, 0, .3,
                         .7, .6, 0, .3,
                         .7, .6, 0, .3,
                         .7, .6, 0, .3,
                         .7, .6, 0, .3))

    t = rep(.06, 9)
  }
  estimates <- e_step(B, t)
  new_estimates <- m_step(estimates)
  
}

And if we check Rubin & Thayer (1982), we see that our estimates are the same as theirs!

new_estimates
## $beta_max
##            [,1]       [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,]  0.3115840  0.3600271 0.6601979 0.6142955 0.2955415 0.3219204 0.3391404
## [2,]  0.2670720  0.3085947 0.5658840 0.5265390 0.2533213 0.2759317 0.2906917
## [3,] -0.4892112 -0.5088666 0.1757652 0.2462395 0.0000000 0.0000000 0.0000000
## [4,]  0.0000000  0.0000000 0.0000000 0.0000000 0.5662935 0.4252747 0.4707045
##            [,8]        [,9]
## [1,] 0.62069630  0.61473108
## [2,] 0.53202541  0.52691236
## [3,] 0.00000000  0.00000000
## [4,] 0.03858147 -0.07123802
## 
## $resid_max
## [1] 0.4850380 0.4158390 0.1994205 0.2729801 0.4270281 0.5112771 0.5241707
## [8] 0.3264132 0.3334939
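One last note: the loop above runs a fixed 50 iterations, while the E-M algorithm is usually described as iterating until the estimates stop changing. A hedged variant with a simple convergence check might look like this (the tolerance of 1e-8 and the cap of 1000 iterations are arbitrary choices):

# E-M with a convergence check on the change in the loadings and residual variances
B = matrix(ncol = 9, nrow = 4,
           c(.7, .6, .3, 0, .7, .6, .3, 0, .7, .6, .3, 0, .7, .6, .3, 0,
             .7, .6, 0, .3, .7, .6, 0, .3, .7, .6, 0, .3, .7, .6, 0, .3,
             .7, .6, 0, .3))
t = rep(.06, 9)

for (run in 1:1000){
  estimates = e_step(B, t)
  new_estimates = m_step(estimates)
  change = max(abs(new_estimates[['beta_max']] - B),
               abs(new_estimates[['resid_max']] - t))
  B = new_estimates[['beta_max']]
  t = new_estimates[['resid_max']]
  if (change < 1e-8) break
}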