Random Effects Model From Scratch

Random Effects Model by Hand

I use random effects models (aka mixed-effects models aka multi-level models aka hierarchical linear models) frequently in my research. Although I understood the intuition behind them for a long time, I was lost on the mechanics, so I decided to finally sit down and try to code one from scratch. The random effects model is given by the equation:

\(y_i = X_i\alpha + Z_ib_i + e_i\)

where \(X_i\) and \(Z_i\) are known \(n_i\) X \(p\) and \(n_i\) X \(q\) matrices, \(\alpha\) is a vector of fixed effects to be estimated and \(b_i\) and \(e_i\) are independent random vectors distributed as \(N(0, D)\) and \(N(0, \sigma^2I)\). D is a positive semidefinite \(q\) X \(q\) matrix of unknown parameters to be estimated, and \(I_i\) is the \(n_i\) X \(n_i\) identity matrix.

It follows that

\(\mathbb{E}(y_i) = X_i\alpha\)
\(var(y_i) = \Sigma_i\)
\(\Sigma_i = \sigma^2_i + Z_iDZ_i^T\)

The issue in evaluating a random effects model is that the \(b_i\) (and \(e_i\)) unobserved. Unlike a fixed effects that is constant across all units; \(b_i\) varies across each unit.

In the present example, I am going to use the sleepstudy data from the lme4 package. In this dataset, participants’ reaction times are measured over time.

We are going to estimate the following model:

\(y_i = X_i\alpha + Z_ib_i + e_i\)

where \(y_i\) = reaction time, \(X_i\) is a vector of 1’s, and \(Z_i\) is a vector of 1’s. In short, we are going to predict reaxction time from the mean reaction time and each participants’ mean reaction time. As such, we have to estimate \(b_i\) for every participant. However, as stated earlier \(b_i\) is unobserved. To solve the equation, we will use the EM algorithm.

The EM Algorithm

The E-M algorithm is a two-step algorithm that starts with the E-Step: an initial estimate of the mean vector and covariance matrix that predict the missing variables from the observed variables. The M-Step subsequently applies standard concrete data formulas to generate updated estimates of the mean vector and covariance matrix. These two-steps are repeated until the mean vector and covariance matrix no longer change between consecutive M-steps.

In the context of mixed-effects models, we estimate \(\sigma^2\) (variance of the residual) and D (covariance matrix of random effects).

First, we perform the E-Step, where we find the expectation of \(e_i^2\) and \(b_i^2\). The expecation of \(e_i^2\) is found via:

\(\mathbb{E}[\sum_{1}^{m} e_i^Te_i|y_i, \alpha(\hat{\theta}), \hat{\theta}]\)

= \(\sum_{1}^{m}\hat{e_i}^T\hat{e_i} + \text{ tr var }[e_i|y_i, \alpha(\hat{\theta}), \hat{\theta}]\)

= \(\sum_{1}^m[(y - X_i\alpha - Z_ib_i)^T(y - X_i\alpha - Z_ib_i) + \sigma^2tr(I - \sigma^2W_i)\)

where \(\hat{\theta}\) is the current estimate of \(e_i\) and \(b_i\), and \(W_i\) is the inverse of the variance of y

The expectation of \(b_i^2\) is found via:

\(\mathbb{E}[\sum_{1}^{m}b_ib_i^T|y_i, \alpha(\hat{\theta}), \hat{\theta}]\)

=\(\sum_{1}^{m}\hat{b_i(\theta)}\hat{b_i(\theta)^T} + var(b_i|y_i, \alpha(\hat{\theta}), \hat{\theta})\)

=\(\sum_{1}^m[\hat{b_i(\theta)}\hat{b_i(\theta)^T} + D *(I - Z_i^TWZ_iD)]\)

these expectations satisfy the E-step.

A Pit-Stop

Let’s break down these equations a bit, starting with

\(\sum_{1}^m[(y - X_i\alpha - Z_ib_i)^T(y - X_i\alpha - Z_ib_i) + \sigma^2tr(I - \sigma^2W_i)\)

This gives us the expectation of the square of the residuals. It is also referred to as a sufficient statistic because it contains all of the information you need to solve for to build out the mixed model equation.

Above, we said that the E-Step of the EM algorithm gives initial estimate of the mean vector and covariance matrix that predict the missing variables from the observed variables. From the quadratic form (i.e., \(e^2\)), we know that:

\(\mathbb{E}(e'e) = u'u + tr(V)\)

where V is the variance and u is the mean

The first part of the above equation (\((y - X_i\alpha - Z_ib_i)^T(y - X_i\alpha - Z_ib_i)\)) is clearly the square of the mean. But what about the second? It looks the variance, but some more detail is needed.

In this bivariate analysis, we have the following distibutions:

\[ \left(\begin{array}{c} y \\ e \end{array}\right) \text{~} \text{N} [ \left(\begin{array}{c} \bar{y}\\ \bar{e} \end{array}\right) \text{,} \left(\begin{array}{cc} V11 & V12\\ V21 & V22 \end{array}\right) ] \]

and the conditional distribution of e given y is:

\(y|e \text{~} N[\bar{y} + V_{12}V_{22}^{-1}(e - \bar{e}), W_{11}^{-1}]\)

where \(W_{11}^{-1}\) = \((V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1}\)

The expression for the mean looks a little strange, but it’s just a regression predicting y from e (Proof is below):

x = rnorm(10000, 0, 5)
e = rnorm(10000, 0, .25)

y = .5 * x + e

lm.mod = lm(y ~ e)
summary(lm.mod)

## 
## Call:
## lm(formula = y ~ e)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6293 -1.6760  0.0318  1.6787  9.0365 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01545    0.02515  -0.614    0.539    
## e            0.88120    0.10074   8.747   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.515 on 9998 degrees of freedom
## Multiple R-squared:  0.007595,   Adjusted R-squared:  0.007495 
## F-statistic: 76.51 on 1 and 9998 DF,  p-value: < 2.2e-16

predicted = (mean(y) + (cov(y,e)/var(e)) * (e - mean(e)))

stopifnot(all(round(as.numeric(predicted), 6) == round(as.numeric(predict(lm.mod)), 6)))

The variance is also tough to parse, but it’s just the residual variance! (Proof below)

stopifnot(round((var(y) - (cov(e,y) * 1/var(e) * cov(e, y))), 4) == round(var(resid(lm.mod)), 4))

Now, let’s go back to the expectation for \(b_i^2\):

= \(\sum_{1}^m[\hat{b_i(\theta)}\hat{b_i(\theta)^T} + D *(I - Z_i^TWZ_iD)]\)

This is nearly the same as \(\sum_{1}^m[(y - X_i\alpha - Z_ib_i)^T(y - X_i\alpha - Z_ib_i) + \sigma^2tr(I - \sigma^2W_i)\), the only differences are that the random effects are multivariate normal, so D estimates their covariance, and Z indicates the design of the random effects.

So, we can see that how the EM algorithm predicts the missing variables from the observed variables; the mean is the prediction of the regression of y on e, and the variance is the residual variance of the regression of y on e.

Back to the EM-Algorithm

After the E-step, the M-Step is simple: We simply take the mean of \(e_i^2\) over all units to get the residual variance, and we take the mean of \(b_i^2\) over all units to get the covariance matrix of the random effects.

The Full Set of Equations

\(\alpha = (\sum_{1}^mX_i^TW_iX_i)^{-1}(\sum_{1}^mX_i^TW_iy_i)\)

So \(\alpha\) is just the normal least squares estimator of \(\alpha\).

\(W_i = \Sigma_i^{-1}\)

\(\Sigma_i = \sigma^2 + Z_iDZ_i^T\)

So \(\Sigma\) is the variance of y.

\(b_i = DZ_i^TW_ir_i\)

and

\(r_i = y_i - X_i\alpha\)

\(\sigma^2 = [\sum_{1}^m[(y - X_i\alpha - Z_ib_i)^T(y - X_i\alpha - Z_ib_i) + \sigma^2tr(I - \sigma^2W_i)]/N\)

\(D = [\sum_{1}^m[\hat{b_i(\theta)}\hat{b_i(\theta)^T} + D *(I - Z_i^TWZ_iD)]]/m\)

Shrinkage

In mixed-effects models, the random effects are often said to exhibit “shrinkage to the mean”, also referred to as empirical bayes, also referred to as partial pooling, meaning that for random effects without many observations are pulled towards the mean. With these equations, we can see how this occurs. Code is below.

In this code, I am going to simulate a multilevel model, where the only thing that changes is the size of the cluster.

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'dplyr' was built under R version 3.6.2

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# level 1 variance
D = matrix(1)
# level 2 variance
sigma = 1

# here is where i change the size of the cluster
n_times_list = seq(2, 21, by = 1)

# going to store the response variable here
ys = list()
# variance of the response var
Vs = list()
# inverse of the variance of the response var here
Ws = list()
# resiudal from the fixed effects
rs = list()
# design matrix for the random matrix
Zs = list()
# store ids
ids = list()

for (j in 1:20){
  
  # cluster size
  n_times = n_times_list[j]
  
  # save ids
  ids[[j]] = rep(j, n_times)
  
  # design matrix
  Z = rep(1, n_times)
  Zs[[j]] = Z
  
  # i want variance in the random effects,
  # so its either 1 or -1
  if (j %% 2 == 0){
    u = 1
  }else{
    u = -1
  }  
  # only estimate intercept
  X =  rep(1, n_times)
  
  # identity matrix
  I = diag(n_times)
  
  # variance of y 
  V = sigma * I + Z %*% D %*% t(Z)
  Vs[[j]] = V
  # inverse variance of y
  W =  solve(V) 
  Ws[[j]] = W
  
  # error 
  e = seq(-1, 1, by = 2)
  e = rep(e, n_times)
  e = e[1:n_times]
  
  # full equation
  y =  X * .5 + e + Z %*% t(u)
  
  ys[[j]]= y

  # residual from ols
  r = y - X * .5
  rs[[j]] = r
  
  # random effect
  b = D %*% t(Z) %*% W %*% r

  
  
  
}

Let’s check to make sure we simulated the data correctly.

library(lme4)

## Warning: package 'lme4' was built under R version 3.6.2

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

library(sjPlot)

## Warning: package 'sjPlot' was built under R version 3.6.2

y = unlist(ys)
ids = unlist(ids)

data = tibble(y, ids)

lmer.mod = lmer(y ~ 1 + (1|ids), data = data, REML = F)
tab_model(lmer.mod)

	y
Predictors	Estimates	CI	p
(Intercept)	0.46	0.04 – 0.87	0.033
Random Effects
σ²	1.09
τ₀₀ _ids	0.78
ICC	0.42
N _ids	20
Observations	230
Marginal R² / Conditional R²	0.000 / 0.418

D and sigma were set to 1, which line up with the intercept and residual variance, respectively. Also, the fixed-effect intercept (grand-mean) was set to .5, which checks out with the model.

Okay, now we can investigate the partial pooling.

The equation for the random effects is

\(b_i = DZ_i^TW_ir_i\)

We get this equation from the conditional distrubtion of b given y (see Morrison, 2005; pg. 16), but the equation for the expectation of a variable (x1) given another (x2) is:

\(u_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - u_2)\)

so in our example, we have

\(DZ_i^TW_ir_i\)

As stated earlier, the expectation of \(b_i\) is 0, so our estimate of \(b_i\) is a weighted combination of the prior mean (0) and the posterior mean (\(DZ_i^TW_ir_i\)). Since the prior mean is 0, it’s not really getting a weight – we don’t actually give 0 any extra weight, we just ignore it. However, we can see how the estimate for \(b_i\) is shrunk towards 0 based on the variance of y. As the variance of y gets larger, the estimate for \(b_i\) gets smaller.

Let’s see how this works below. We start with a group with two datapoints.

# get the design effect from group 1
Z = Zs[[1]]
# group 1 has 2 entries
I = diag(2)
# the variance of the group intercept is .78
D = matrix(0.7801)
# the variance of the residual is 1.0878
sigma = 1.0878
# our equation for the variance of y
V = sigma * I + Z %*% D %*% t(Z) 
# inverse variance
W = solve(V)
# get y 
y = data$y[data$ids == '1']
# substract from fixed effect (grand intercept from model)
r = y - .4559

D %*% t(Z) %*% W %*% r

##            [,1]
## [1,] -0.5632157

Let’s look at a group with 4 datapoints. Remember, we simulated this data so the group intercept is identical in the odd and even cases.

# get the design effect from group 3
Z = Zs[[3]]
# group 2 has 4 entries
I = diag(4)
# the variance of the group intercept is .78
D = matrix(0.7801)
# the variance of the residual is 1.0878
sigma = 1.0878
# our equation for the variance of y
V = sigma * I + Z %*% D %*% t(Z) 
# inverse variance
W = solve(V)
# get y 
y = data$y[data$ids == '3']
# substract from fixed effect (grand intercept from model)
r = y - .4559

D %*% t(Z) %*% W %*% r

##            [,1]
## [1,] -0.7088043

What we do here is multiply D by each element of the design matrix D and then multiply by the inverse variance of y, so our estimate gets smaller the larger the variance of y becomes. Finally, we multiply by the residual from the fixed effect to get our estimate of b, which in our case is how far off the group mean is away from the overall mean.

This is probably the most important part of the mixed effects model. We are fitting a linear regression on the error for each individual group, while shrinking that estimate relative to the variance of y. This also happens to be why economists dislike mixed effects models for studying causal effects: a mixed effects model further partitions the error variance into separate parts. It does not partial out the effect of each individual group so that the causal effect of interest can be isolated.

Let’s see how these results compare with the lmer. The ranef function gives us deviation from the overall intercept.

ranef(lmer.mod)[[1]][[1]][3]

## [1] -0.7088351

We’re in agreement!

Solving a Model

And finally, we solve a model ourselves. We’ll use the sleepstudy data from lme4.

library(lme4)
data('sleepstudy')
sleepstudy$id = as.factor(as.numeric(sleepstudy$Subject))

N_subjects = 18
N_effects = 1
N_obs = 10

# initial guesses for our estimates
alpha = 0 
error_var = 1
random_cov = diag(1)




for (j in 1:100){
  
  error_var_list = c()
  random_cov_list = c()
  cov_list = c()
  var_list = c()
  
  for (i in as.numeric(unique(sleepstudy$id))){
    

    sleep_i = sleepstudy[sleepstudy$id == i,]
    
    
    y = sleep_i$Reaction
    x = rep(1, nrow(sleep_i)) 
    Z = rep(1, nrow(sleep_i))
    I = diag(nrow(sleep_i))

    
    V = error_var * I + Z %*% random_cov %*% t(Z)
    W =  solve(V)
    
    # calculate residual from ols
    r = y - (x) %*% t(alpha)
    
    # now calculate random intercept (b)
    b = random_cov %*% t(Z) %*% W %*% (r)
    
    # now can calculate e 
    e = r - (Z %*% b)
    
    # stash cov
    cov_list = c(cov_list, t(x) %*% W %*% y)
    # stash variance
    var_list = c(var_list, t(x) %*% W %*% x)
    
    # stash error
    error_var_list = c(error_var_list, t(e) %*% e + error_var * sum(diag(I - error_var * W)))
    
    # stash d
    random_cov_list = c(random_cov_list, 
                        b %*% t(b) + random_cov * (diag(1) - t(Z) %*% W %*% Z %*% random_cov))

  }
  
  # recalculate alpha
  alpha = solve(sum(var_list)) %*% sum(cov_list)
  # and recalculate error_var
  error_var = sum(error_var_list)/nrow(sleepstudy)
  # and recalculate random_cov
  random_cov = diag(sum(random_cov_list)/length(random_cov_list), 1)
  
}


lmer.mod = lmer(Reaction ~ 1 + (1|id), data = sleepstudy, REML = F)
tab_model(lmer.mod)

	Reaction
Predictors	Estimates	CI	p
(Intercept)	298.51	281.27 – 315.75	<0.001
Random Effects
σ²	1958.87
τ₀₀ _id	1196.44
ICC	0.38
N _id	18
Observations	180
Marginal R² / Conditional R²	0.000 / 0.379

Let’s see how our results compare against the results of lmer to see how we did.

random_cov

##          [,1]
## [1,] 1196.436

error_var

## [1] 1958.865

Perfect! I hope this has been a useful tutorial on how mixed effects models work.