In a recent post I pitted listwise deletion against Full Information Maximum Likelihood (FIML) to see which outperformed which in an instrumental variables analysis (listwise deletion won). However, a big caveat of that analysis was that I didn’t use FIML to generate predicted values of x, because lavaan can’t produce predicted values with incomplete data. So, this post repeats the same analysis, but with multiple imputation instead of FIML, so that we can generate predicted values of x from the incomplete data. Analysis below:

``````{% raw %}

library(ggplot2)
library(lavaan)
library(plyr)
library(mice)

true_est <- NULL
mi_est <- NULL
miss_est <- NULL

set.seed(35)

#generate the instrument

i <- rnorm(2000, sd = 1)

# generate the covariate

c <- rnorm(2000, sd = 1)

# generate x. it has a moderate relation with c and i

x <- 0 + .3 * c + .3 * i + rnorm(2000, sd = .5)

# generate y. it has a moderate relation with c and x.

y <- 0 + .3 * c + .3 * x + rnorm(2000, sd = .5)

# create the data frame

dat <- data.frame(i, x, c, y)

# now get the predicted values of x, controlling for c

pred <- predict(lm(x ~ i + c, data = dat))

dat$pred <- pred

# and predict y from the predicted values of x, controlling for c

mod.iv <- lm(y ~ pred + c, data = dat)

# store the true estimate

true_est <- c(true_est, coef(mod.iv)['pred'])

# now remove values

dat <- dat[,names(dat) != 'pred']

for (j in 5:1995)
{

dat_miss <- dat

dat_miss[sample(nrow(dat_miss), j), 'c'] <- NA

# predict x from i, controlling for c

pred <- predict(lm(x ~ i + c, data = dat_miss))

# get ready to merge the predicted values into the dataset

pred <- data.frame(names(pred), pred)

names(pred) <- c('ID', 'pred')

dat_miss$ID <- rownames(dat_miss)

# merge the predicted values

dat_miss <- join(dat_miss, pred, by='ID')

# predict y from the predicted values, controlling for c. this will listwise delete cases with missing values

mod.miss <- lm(y ~ pred + c, data = dat_miss)

# store the estimate

miss_est <- c(miss_est, coef(mod.miss)['pred'])

# create an imputed dataset

dat_miss.mi <- dat_miss[, ! names(dat_miss) %in% c('pred', 'ID')]

# impute the dataset

dat_miss.mi <- mice(dat_miss.mi, printFlag = FALSE)

# make a completed dataset

dat_miss.mi <- complete(dat_miss.mi)

# get predicted x values

pred <- predict(lm(x ~ i + c, data = dat_miss.mi))

dat_miss.mi$pred <- pred

# and predict y from the predicted values of x, controlling for (imputed) c

mod.iv <- lm(y ~ pred + c, data = dat_miss.mi)

# store the mi estimate

mi_est <- c(mi_est, coef(mod.iv)['pred'])

}

frame <- data.frame(miss_est, mi_est)

frame$true_est <- true_est

frame$dist_miss <- frame$true_est - frame$miss_est

frame$dist_mi <- frame$true_est - frame$mi_est

frame$x <- 5:1995

{% endraw %}
``````

Now we can plot the distance from the true estimate for multiple imputation and listwise deletion. Listwise deletion is in red, and multiple imputation is in blue. The y axis is the distance from the true estimate, and the x axis is the number of missing data points.
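The plotting code itself isn't shown above; a minimal ggplot2 sketch that draws the described figure from `frame` (the colour choices and point transparency here are my assumptions) might look like:

```r
library(ggplot2)

# distance from the true estimate (y axis) against the number of
# missing data points (x axis); red = listwise deletion, blue = MI
ggplot(frame, aes(x = x)) +
  geom_point(aes(y = dist_miss), colour = 'red', alpha = .3) +
  geom_point(aes(y = dist_mi), colour = 'blue', alpha = .3) +
  labs(x = 'Number of missing data points',
       y = 'Distance from the true estimate')
```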

With this one, we see that multiple imputation outperforms listwise deletion. So the takeaway is that a missing data procedure can outperform listwise deletion, but only if you use that procedure to generate your predicted x values, in addition to using it when predicting y.
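To put a single number on the gap the plot shows (this summary isn't part of the original analysis, just an illustration), you could compare the mean absolute distance from the true estimate for each method across all levels of missingness:

```r
# mean absolute distance from the true estimate; smaller means the
# method's estimates sit closer to the full-data estimate on average
mean(abs(frame$dist_miss))  # listwise deletion
mean(abs(frame$dist_mi))    # multiple imputation
```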