Analyzing Imputed Data with Multilevel Models and merTools

Jared Knowles

2018-06-04

Introduction

Multilevel models are valuable in a wide array of problem areas that involve non-experimental, or observational, data. In many of these cases the data on individual observations may be incomplete. The analyst may then turn to one of many methods for filling in missing data, depending on the specific problem at hand, disciplinary norms, and prior research.

One of the most common approaches is multiple imputation. Multiple imputation involves fitting a model to the data and using it to estimate the missing values several times, so that the uncertainty about the missing values is reflected in the variation across the estimates. For details on multiple imputation, and a discussion of some of the main implementations in R, see the documentation and vignettes for the mice and Amelia packages.

The key difficulty multiple imputation creates for users of multilevel models is that the result of multiple imputation is K replicated datasets corresponding to different estimated values for the missing data in the original dataset.

For the purposes of this vignette, I will describe how to use one flavor of multiple imputation and the functions in merTools to obtain estimates from a multilevel model in the presence of missing and multiply imputed data.

Missing Data and its Discontents

To demonstrate this workflow, we will use the hsb dataset in the merTools package, which includes data on the math achievement of a wide sample of students nested within schools. The data has no missingness, so we will first simulate some missing data.

# Load merTools, which provides the hsb data
library(merTools)
data(hsb)

# Create a function to randomly assign NA values

add_NA <- function(x, prob){
  z <- rbinom(length(x), 1, prob = prob)
  x[z==1] <- NA
  return(x)
}

hsb$minority <- add_NA(hsb$minority, prob = 0.05)
table(is.na(hsb$minority))
#> 
#> FALSE  TRUE 
#>  6836   349

hsb$female <- add_NA(hsb$female, prob = 0.05)
table(is.na(hsb$female))
#> 
#> FALSE  TRUE 
#>  6830   355

hsb$ses <- add_NA(hsb$ses, prob = 0.05)
table(is.na(hsb$ses))
#> 
#> FALSE  TRUE 
#>  6821   364

hsb$size <- add_NA(hsb$size, prob = 0.05)
table(is.na(hsb$size))
#> 
#> FALSE  TRUE 
#>  6848   337
# Load imputation library
library(Amelia)
# Declare the variables to include in the imputation data
varIndex <- names(hsb)
# Declare ID variables to be excluded from imputation
IDS <- c("schid", "meanses")
# Impute
impute.out <- amelia(hsb[, varIndex], idvars = IDS,
                     noms = c("minority", "female"),
                     m = 5)
#> -- Imputation 1 --
#> 
#>   1  2  3
#> 
#> -- Imputation 2 --
#> 
#>   1  2  3
#> 
#> -- Imputation 3 --
#> 
#>   1  2  3
#> 
#> -- Imputation 4 --
#> 
#>   1  2  3
#> 
#> -- Imputation 5 --
#> 
#>   1  2  3
summary(impute.out)
#> 
#> Amelia output with 5 imputed datasets.
#> Return code:  1 
#> Message:  Normal EM convergence. 
#> 
#> Chain Lengths:
#> --------------
#> Imputation 1:  3
#> Imputation 2:  3
#> Imputation 3:  3
#> Imputation 4:  3
#> Imputation 5:  3
#> 
#> Rows after Listwise Deletion:  5865 
#> Rows after Imputation:  7185 
#> Patterns of missingness in the data:  13 
#> 
#> Fraction Missing for original variables: 
#> -----------------------------------------
#> 
#>          Fraction Missing
#> schid          0.00000000
#> minority       0.04857342
#> female         0.04940849
#> ses            0.05066110
#> mathach        0.00000000
#> size           0.04690327
#> schtype        0.00000000
#> meanses        0.00000000
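
The "Rows after Listwise Deletion" figure in the summary above can be sanity-checked with a little arithmetic. The sketch below assumes, as in the add_NA() calls, that each of the four variables loses values independently with probability 0.05; the variable names are illustrative only.

```r
# Expected number of complete rows when four variables are each
# missing independently with probability 0.05 (as set in add_NA()).
n_rows <- 7185   # rows in hsb
p_miss <- 0.05   # per-variable missingness probability
n_vars <- 4      # minority, female, ses, size

p_complete        <- (1 - p_miss)^n_vars   # share of rows with no NA at all
expected_complete <- n_rows * p_complete   # expected complete-case count

round(c(share_complete = p_complete, expected_rows = expected_complete), 3)
```

The expected count is close to the 5865 rows Amelia reports, which shows how quickly modest per-variable missingness erodes the complete-case sample.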

Fitting and Summarizing a Model List

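Fitting a model to the imputed datasets is very similar to fitting a single model. lmerModList takes the same formula as lmer, but its data argument is the list of imputed datasets rather than a single data frame: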

fmla <- "mathach ~ minority + female + ses + meanses + (1 + ses|schid)"
mod <- lmer(fmla, data = hsb)
modList <- lmerModList(fmla, data = impute.out$imputations)

The resulting object modList is a list of merMod objects with the same length as the number of imputed datasets. This object is assigned the class merModList, and merTools provides some convenience functions for reporting its results.
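
Conceptually, a model list is one fit per dataset. The sketch below mimics that idea by hand using lme4 and its built-in sleepstudy data; the toy_imputations list is a made-up stand-in for impute.out$imputations, and this is not the internal implementation of lmerModList.

```r
library(lme4)

# Hypothetical stand-in for a list of imputed datasets: three copies of
# sleepstudy with a little noise added to one predictor.
set.seed(1)
toy_imputations <- lapply(1:3, function(i) {
  d <- sleepstudy
  d$Days <- d$Days + rnorm(nrow(d), sd = 0.1)
  d
})
names(toy_imputations) <- paste0("imp", 1:3)

# Fit the same model to every dataset in the list
fits <- lapply(toy_imputations, function(d) {
  lmer(Reaction ~ Days + (1 | Subject), data = d)
})

# Average the fixed effects across the fits
pooled_fixef <- rowMeans(sapply(fits, fixef))
pooled_fixef
```

The merModList class adds pooled summaries and print methods on top of this basic pattern, so in practice lmerModList is the more convenient route.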

Using this, we can directly compare the model fit with missing data excluded to the aggregate from the imputed models:

fixef(mod) # model with dropped missing
#> (Intercept)    minority      female         ses     meanses 
#>   14.042814   -2.701367   -1.182849    1.899925    2.992339
fixef(modList)
#> (Intercept)    minority      female         ses     meanses 
#>   13.966791   -2.507946   -1.191014    1.919880    3.228822
VarCorr(mod) # model with dropped missing
#>  Groups   Name        Std.Dev. Corr  
#>  schid    (Intercept) 1.53617        
#>           ses         0.64942  -0.566
#>  Residual             6.00409
VarCorr(modList) # aggregate of imputed models
#> $stddev
#> $stddev$schid
#> (Intercept)         ses 
#>    1.520475    0.659457 
#> 
#> 
#> $correlation
#> $correlation$schid
#>             (Intercept)        ses
#> (Intercept)   1.0000000 -0.5147035
#> ses          -0.5147035  1.0000000

If you want to inspect the individual models, or you prefer not to take the mean across the imputation replications, you can take the merModList apart easily:

lapply(modList, fixef)
#> $imp1
#> (Intercept)    minority      female         ses     meanses 
#>   14.000272   -2.681289   -1.174556    1.889865    3.185328 
#> 
#> $imp2
#> (Intercept)    minority      female         ses     meanses 
#>   13.976422   -2.474407   -1.216113    1.926781    3.204234 
#> 
#> $imp3
#> (Intercept)    minority      female         ses     meanses 
#>   13.969942   -2.458477   -1.213627    1.940791    3.232341 
#> 
#> $imp4
#> (Intercept)    minority      female         ses     meanses 
#>   13.938151   -2.391582   -1.184232    1.907684    3.301779 
#> 
#> $imp5
#> (Intercept)    minority      female         ses     meanses 
#>   13.949169   -2.533975   -1.166541    1.934279    3.220429

You can also operate on any single element of the list:

fixef(modList[[1]])
#> (Intercept)    minority      female         ses     meanses 
#>   14.000272   -2.681289   -1.174556    1.889865    3.185328
fixef(modList[[2]])
#> (Intercept)    minority      female         ses     meanses 
#>   13.976422   -2.474407   -1.216113    1.926781    3.204234

Output of a Model List

print(modList)
#> [1] "Linear mixed model fit by REML"
#> Model family: 
#> lmer(formula = mathach ~ minority + female + ses + meanses + 
#>     (1 + ses | schid), data = d)
#> 
#> Fixed Effects:
#>             estimate std.error statistic         df
#> (Intercept)   13.967     0.174    80.318 378818.246
#> female        -1.191     0.159    -7.479 416213.147
#> meanses        3.229     0.360     8.966 144227.982
#> minority      -2.508     0.210   -11.924   1342.313
#> ses            1.920     0.121    15.891 334071.727
#> 
#> Random Effects:
#> 
#> Error Term Standard Deviations by Level:
#> 
#> schid
#> (Intercept)         ses 
#>       1.520       0.659 
#> 
#> 
#> Error Term Correlations:
#> 
#> schid
#>             (Intercept) ses   
#> (Intercept)  1.000      -0.515
#> ses         -0.515       1.000
#> 
#> 
#> Residual Error = 5.988 
#> 
#> ---Groups
#> number of obs: 7185, groups: schid, 160
#> 
#> Model Fit Stats
#> AIC = 46374.6
#> Residual standard deviation = 5.988
summary(modList)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ minority + female + ses + meanses + (1 + ses | schid)
#>    Data: d
#> 
#> REML criterion at convergence: 46343.5
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.2140 -0.7239  0.0350  0.7628  2.9151 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr 
#>  schid    (Intercept)  2.3529  1.5339        
#>           ses          0.4588  0.6773   -0.47
#>  Residual             35.7671  5.9806        
#> Number of obs: 7185, groups:  schid, 160
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  14.0003     0.1741  80.420
#> minority     -2.6813     0.2000 -13.408
#> female       -1.1746     0.1590  -7.389
#> ses           1.8899     0.1210  15.620
#> meanses       3.1853     0.3612   8.819
#> 
#> Correlation of Fixed Effects:
#>          (Intr) minrty female ses   
#> minority -0.318                     
#> female   -0.481  0.013              
#> ses      -0.210  0.144  0.041       
#> meanses  -0.094  0.124  0.024 -0.231
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ minority + female + ses + meanses + (1 + ses | schid)
#>    Data: d
#> 
#> REML criterion at convergence: 46364
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.2124 -0.7202  0.0367  0.7565  2.9167 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr 
#>  schid    (Intercept)  2.3179  1.5225        
#>           ses          0.3793  0.6159   -0.60
#>  Residual             35.9233  5.9936        
#> Number of obs: 7185, groups:  schid, 160
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  13.9764     0.1735  80.542
#> minority     -2.4744     0.1989 -12.443
#> female       -1.2161     0.1589  -7.652
#> ses           1.9268     0.1191  16.176
#> meanses       3.2042     0.3567   8.984
#> 
#> Correlation of Fixed Effects:
#>          (Intr) minrty female ses   
#> minority -0.319                     
#> female   -0.485  0.015              
#> ses      -0.236  0.136  0.047       
#> meanses  -0.101  0.123  0.021 -0.239
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ minority + female + ses + meanses + (1 + ses | schid)
#>    Data: d
#> 
#> REML criterion at convergence: 46346.2
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.2167 -0.7297  0.0396  0.7633  2.9230 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr 
#>  schid    (Intercept)  2.2662  1.5054        
#>           ses          0.4036  0.6353   -0.55
#>  Residual             35.8290  5.9857        
#> Number of obs: 7185, groups:  schid, 160
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  13.9699     0.1722  81.129
#> minority     -2.4585     0.1975 -12.445
#> female       -1.2136     0.1584  -7.661
#> ses           1.9408     0.1193  16.267
#> meanses       3.2323     0.3549   9.109
#> 
#> Correlation of Fixed Effects:
#>          (Intr) minrty female ses   
#> minority -0.318                     
#> female   -0.485  0.012              
#> ses      -0.228  0.146  0.048       
#> meanses  -0.096  0.116  0.021 -0.238
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ minority + female + ses + meanses + (1 + ses | schid)
#>    Data: d
#> 
#> REML criterion at convergence: 46379.7
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.2050 -0.7232  0.0327  0.7602  2.9187 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr 
#>  schid    (Intercept)  2.2941  1.5146        
#>           ses          0.4589  0.6774   -0.51
#>  Residual             35.9703  5.9975        
#> Number of obs: 7185, groups:  schid, 160
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  13.9382     0.1732  80.464
#> minority     -2.3916     0.1983 -12.059
#> female       -1.1842     0.1588  -7.459
#> ses           1.9077     0.1212  15.743
#> meanses       3.3018     0.3576   9.233
#> 
#> Correlation of Fixed Effects:
#>          (Intr) minrty female ses   
#> minority -0.321                     
#> female   -0.483  0.012              
#> ses      -0.216  0.138  0.038       
#> meanses  -0.098  0.122  0.025 -0.233
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ minority + female + ses + meanses + (1 + ses | schid)
#>    Data: d
#> 
#> REML criterion at convergence: 46349.7
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.1958 -0.7228  0.0332  0.7635  2.9254 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr 
#>  schid    (Intercept)  2.3287  1.5260        
#>           ses          0.4781  0.6914   -0.44
#>  Residual             35.7935  5.9828        
#> Number of obs: 7185, groups:  schid, 160
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  13.9492     0.1736  80.350
#> minority     -2.5340     0.1995 -12.703
#> female       -1.1665     0.1587  -7.348
#> ses           1.9343     0.1214  15.933
#> meanses       3.2204     0.3607   8.927
#> 
#> Correlation of Fixed Effects:
#>          (Intr) minrty female ses   
#> minority -0.319                     
#> female   -0.482  0.017              
#> ses      -0.201  0.134  0.047       
#> meanses  -0.091  0.126  0.023 -0.231
fastdisp(modList)
#> lmer(formula = mathach ~ minority + female + ses + meanses + 
#>     (1 + ses | schid), data = d)
#>             estimate std.error
#> (Intercept)    13.97      0.17
#> female         -1.19      0.16
#> meanses         3.23      0.36
#> minority       -2.51      0.21
#> ses             1.92      0.12
#> 
#> Error terms:
#>  Groups   Name        Std.Dev. Corr  
#>  schid    (Intercept) 1.52           
#>           ses         0.66     -0.47 
#>  Residual             5.99           
#> ---
#> number of obs: 7185, groups: schid, 160
#> AIC = 46374.6

The standard errors reported for the model list include Rubin’s correction (see the package documentation), which accounts for both the within-imputation and the between-imputation variance.
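
To make the idea of combining within- and between-imputation variance concrete, here is a minimal sketch of the textbook form of Rubin's rules, applied to the five minority estimates and standard errors printed by summary(modList) above. It illustrates the general technique only; the computation inside merTools may differ, so the result need not reproduce the std.error column printed for the model list.

```r
# Per-imputation estimates and standard errors for `minority`,
# copied from the summary(modList) output above.
est <- c(-2.6813, -2.4744, -2.4585, -2.3916, -2.5340)
se  <- c(0.2000, 0.1989, 0.1975, 0.1983, 0.1995)
m   <- length(est)   # number of imputations

pooled_est  <- mean(est)    # pooled point estimate
within_var  <- mean(se^2)   # average sampling variance within imputations
between_var <- var(est)     # variance of the estimates across imputations

# Textbook Rubin's rules: total variance = within + (1 + 1/m) * between
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

round(c(estimate = pooled_est, std.error = pooled_se), 4)
```

The pooled estimate is simply the mean across imputations, while the pooled standard error is larger than any single model's standard error because it also carries the disagreement between the imputed datasets.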

Specific Model Information Summaries

modelRandEffStats(modList)
#>                        term    group   estimate   std.error
#> 1 cor_(Intercept).ses.schid    schid -0.5147035 0.065851800
#> 2      sd_(Intercept).schid    schid  1.5204747 0.010929919
#> 3   sd_Observation.Residual Residual  5.9880388 0.007245653
#> 4              sd_ses.schid    schid  0.6594570 0.032203227
modelFixedEff(modList)
#>          term  estimate std.error  statistic         df
#> 1 (Intercept) 13.966791 0.1738931  80.318271 378818.246
#> 2      female -1.191014 0.1592574  -7.478547 416213.147
#> 3     meanses  3.228822 0.3601076   8.966271 144227.982
#> 4    minority -2.507946 0.2103196 -11.924455   1342.313
#> 5         ses  1.919880 0.1208161  15.890936 334071.727
VarCorr(modList)
#> $stddev
#> $stddev$schid
#> (Intercept)         ses 
#>    1.520475    0.659457 
#> 
#> 
#> $correlation
#> $correlation$schid
#>             (Intercept)        ses
#> (Intercept)   1.0000000 -0.5147035
#> ses          -0.5147035  1.0000000

Diagnostics of List Components

Let’s apply this to our model list.

Cautions and Notes

It is often desirable to include aggregate values in the level two or level three part of the model, for example level 1 SES together with the level 2 mean SES for the group. When there is missingness in either the level 1 SES values or the level 2 mean SES values, careful thought needs to be given to how to proceed with the imputation routine.
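
One possible way to keep a level 2 aggregate consistent with imputed level 1 values is to recompute the group mean inside each imputed dataset. The sketch below uses a small made-up list of imputed data frames (toy_imputed) in place of real imputation output; it illustrates one option under those assumptions and is not a recommendation from the merTools documentation.

```r
# Hypothetical imputed datasets: two schools, three students each.
# The second and sixth ses values stand in for imputed values and
# differ between the two imputations.
toy_imputed <- list(
  imp1 = data.frame(schid = c(1, 1, 1, 2, 2, 2),
                    ses   = c(0.2, 0.3, 0.4, -0.5, -0.1, -0.3)),
  imp2 = data.frame(schid = c(1, 1, 1, 2, 2, 2),
                    ses   = c(0.2, 0.6, 0.4, -0.5, -0.1, 0.0))
)

# Recompute the school-level mean of ses within each imputed dataset
toy_imputed <- lapply(toy_imputed, function(d) {
  d$meanses <- ave(d$ses, d$schid)   # ave() defaults to the group mean
  d
})

toy_imputed$imp1
```

Whether recomputing the aggregate is appropriate depends on the substantive model; the alternative used earlier in this vignette is to hold the aggregate fixed by listing it among the ID variables.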