# How to Evaluate Models

#### 2018-10-30

A big part of the purpose of idealstan is to give people different options for fitting ideal point models to diverse data. Along with that, idealstan makes use of Bayesian model evaluation via the loo package and can also analyze the posterior predictive distribution using bayesplot. loo approximates the predictive error of a Bayesian model via leave-one-out cross-validation (LOO-CV). True LOO-CV on Bayesian models is computationally prohibitive because it involves estimating a new model for each data point; for IRT models incorporating thousands or even millions of observations, this is practically infeasible.
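To see why exact LOO-CV scales so badly, here is a toy sketch in base R using a simple `lm` fit rather than a Bayesian model (the data and model here are made up for illustration): every held-out observation requires a complete refit.

```r
# Toy illustration of exact leave-one-out cross-validation:
# one full model refit per observation. For a Bayesian IRT model,
# each "refit" would be an entire MCMC run, hence the infeasibility.
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

loo_sq_errors <- vapply(seq_len(nrow(df)), function(i) {
  fit <- lm(y ~ x, data = df[-i, ])        # refit without observation i
  pred <- predict(fit, newdata = df[i, ])  # predict the held-out point
  (df$y[i] - pred)^2
}, numeric(1))

mean(loo_sq_errors)  # LOO estimate of out-of-sample squared error
```

What loo does instead is approximate this quantity from a single fitted model using Pareto-smoothed importance sampling, which is why it only needs the pointwise log-likelihood values.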

bayesplot allows us to analyze the data we used to estimate the model compared to data produced by the model, or what is called the posterior predictive distribution. This is very useful as a general summary of model fit to see whether there are values of the outcome that we are over or under predicting.

idealstan implements functions for each ideal point model that calculate the log-posterior probability of the data, which is the necessary input to use loo’s model evaluation features. This vignette demonstrates the basic usage. I also discuss best practices for evaluating models.

We first begin by simulating data for a standard IRT 2-PL ideal point model but with strategically missing data:

irt_2pl <- id_sim_gen(inflate=TRUE)

We can then fit two ideal point models to the same data: one that uses the correct binomial model for a 0/1 binary outcome, and a second that incorrectly fits a Poisson count model to the same binary data.

# Because of CRAN limitations, only using 2 cores & 2 chains
irt_2pl_correct <- id_estimate(idealdata=irt_2pl,
                               model_type=2,
                               restrict_ind_high = as.character(sort(irt_2pl@simul_data$true_person,
                                                                     decreasing=TRUE,
                                                                     index=TRUE)$ix[1]),
                               restrict_ind_low = as.character(sort(irt_2pl@simul_data$true_person,
                                                                    decreasing=FALSE,
                                                                    index=TRUE)$ix[1]),
                               fixtype='vb_partial',
                               ncores=2,
                               nchains=2,
                               niters=500)
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1:    100        -1057.420             1.000            1.000
## Chain 1:    200        -1045.939             0.505            1.000
## Chain 1:    300        -1044.276             0.338            0.011
## Chain 1:    400        -1041.217             0.254            0.011
## Chain 1:    500        -1042.919             0.203            0.003   MEDIAN ELBO CONVERGED
## Chain 1: Drawing a sample of size 1000 from the approximate posterior...
irt_2pl_incorrect <- id_estimate(idealdata=irt_2pl,
                                 model_type=8,
                                 restrict_ind_high = as.character(sort(irt_2pl@simul_data$true_person,
                                                                       decreasing=TRUE,
                                                                       index=TRUE)$ix[1]),
                                 restrict_ind_low = as.character(sort(irt_2pl@simul_data$true_person,
                                                                      decreasing=FALSE,
                                                                      index=TRUE)$ix[1]),
                                 fixtype='vb_partial',
                                 ncores=2,
                                 nchains=2,
                                 niters=500)
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1:    100     -4154172.311             1.000            1.000
## Chain 1:    200      -462000.315             4.496            7.992
## Chain 1:    300     -1605796.824             3.235            1.000
## Chain 1:    400     -9960842.290             2.636            1.000
## Chain 1:    500     -3660771.278             2.453            1.000
## Chain 1:    600  -208653028151.058             2.211            1.000
## Chain 1:    700   -871621239.323            35.950            1.000
## Chain 1:    800    -25862319.457            35.544            1.721
## Chain 1:    900    -47171459.169            31.645            1.000
## Chain 1:   1000    -13591951.336            28.727            1.721
## Chain 1:   1100     -1871355.503            29.254            2.471   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1200  -1104070282.330            28.554            1.721   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1300  -92502391119.340            28.582            1.721   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1400      -563994.875         16429.680            2.471   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1500    -49620069.571         16429.607            2.471   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1600        -8661.581         17002.282            6.263   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1700        -1810.391         16978.822            3.784   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1800        -2552.708         16975.581            2.471   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   1900        -1682.214         16975.588            2.471   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2000        -1659.393         16975.342            0.998   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2100        -1588.202         16974.720            0.989   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2200        -1558.193         16974.622            0.988   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2300        -1547.898         16974.524            0.517   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2400        -1546.471           573.342            0.291   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2500        -1547.006           573.243            0.045   MAY BE DIVERGING... INSPECT ELBO
## Chain 1:   2600        -1548.441             0.468            0.019
## Chain 1:   2700        -1547.860             0.090            0.014
## Chain 1:   2800        -1546.677             0.061            0.007   MEDIAN ELBO CONVERGED
## Chain 1: Drawing a sample of size 1000 from the approximate posterior...

The first thing we want to check with any MCMC model is convergence. An easy check is to look at the distribution of Rhat values, which the id_plot_rhats function plots: if all the values are below 1.1, we have good reason to believe that the model converged:

id_plot_rhats(irt_2pl_correct)
## stat_bin() using bins = 30. Pick better value with binwidth.

id_plot_rhats(irt_2pl_incorrect)
## stat_bin() using bins = 30. Pick better value with binwidth.
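To make the statistic behind these plots concrete, here is a toy base-R sketch of the classic (non-split) Gelman-Rubin Rhat for a single parameter; id_plot_rhats summarizes such values across all parameters. This is an illustration, not idealstan's or Stan's exact implementation:

```r
# Classic Gelman-Rubin R-hat for one parameter: compares between-chain
# to within-chain variance. Values near 1 indicate the chains agree.
rhat <- function(chains) {  # chains: draws in rows, one column per chain
  n <- nrow(chains)
  B <- n * var(colMeans(chains))        # between-chain variance
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(7)
mixed <- matrix(rnorm(1000), ncol = 2)        # two well-mixed chains
stuck <- cbind(rnorm(500, 0), rnorm(500, 3))  # chains in different places
rhat(mixed)  # close to 1
rhat(stuck)  # well above the 1.1 threshold
```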

Based on these plots, we have reason to trust the first model's convergence. If the second model fails the Rhat test, at this point we should go back and check whether the model is mis-specified or whether there is something wrong with the data. But for the sake of illustration, we will look at other diagnostics. We can also examine 1) whether the models are able to accurately replicate the data they were fitted on and 2) overall measures of model fit.

We can first look at how well the model reproduces the data, which is called the posterior predictive distribution. We can obtain these distributions using the id_post_pred function:

post_correct <- id_post_pred(irt_2pl_correct)
## [1] "Processing posterior replications for 1000 scores using 100 posterior samples out of a total of 500 samples."
post_incorrect <- id_post_pred(irt_2pl_incorrect)
## [1] "Processing posterior replications for 1000 scores using 100 posterior samples out of a total of 500 samples."

We can then use id_plot_ppc, a wrapper around the bayesplot package, to see how well these models replicate their own data. These plots show the original data as blue bars and the posterior predictive estimate as an uncertainty interval. If the interval overlaps with the observed data, then the model was able to replicate the observed data, which is a basic but important validity check for the model: could the model have generated the data that was used to fit it?

id_plot_ppc(irt_2pl_correct,ppc_pred=post_correct)

id_plot_ppc(irt_2pl_incorrect,ppc_pred=post_incorrect)
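The logic behind these plots can be sketched in base R with made-up data (the probabilities and sizes here are illustrative, not idealstan output): simulate replicated datasets, then check whether a summary of the observed data falls within the spread of the same summary across replications.

```r
# Posterior predictive check by hand: does the observed number of 1s
# look plausible under datasets replicated from the model?
set.seed(42)
y <- rbinom(200, 1, 0.6)                   # "observed" binary outcomes
yrep <- matrix(rbinom(200 * 500, 1, 0.6),  # 500 replicated datasets,
               nrow = 500)                 # one per posterior draw

obs_ones <- sum(y)              # summary statistic of the observed data
rep_ones <- rowSums(yrep)       # same statistic for each replication
quantile(rep_ones, c(0.05, 0.95))  # 90% posterior predictive interval
```

If the observed statistic lands inside the interval, the model can plausibly reproduce that feature of the data; id_plot_ppc performs this kind of comparison over the full outcome distribution.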

It is stupidly obvious from these plots that one should not use a Poisson distribution for a Bernoulli (binary) outcome, as it will predict values as high as 10 (although it still puts the majority of the density on 1 and 2). Only two observed values are shown on the Poisson predictive distribution because the missing-data model must be displayed separately when the outcome is unbounded. To view the missing-data model (i.e., a binary response), simply change the output option in id_post_pred to 'missing'. We can also look at particular persons or items to see how well the models predict them. For example, let's look at the first two persons in the simulated data for which we incorrectly used the Poisson model by passing their IDs as a character vector to the group option:

id_plot_ppc(irt_2pl_incorrect,ppc_pred=post_incorrect,group=c('1','2'))

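The mismatch is easy to see from the Poisson distribution itself: even with its mean matched to binary data, it places probability on counts the outcome can never take. A quick base-R check (the rate here is a made-up example):

```r
# A Poisson distribution with mean 0.5 (matching 0/1 data that is
# half ones) still assigns probability to impossible values >= 2.
lambda <- 0.5
round(dpois(0:5, lambda), 4)  # mass at 0, 1, 2, 3, ...
sum(dpois(2:100, lambda))     # total probability on impossible outcomes
```

Roughly 9% of the predictive mass lands on values a binary outcome can never produce, which is exactly the excess the posterior predictive plot reveals.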
Finally, we can turn to summary measures of model fit that also let us compare models directly to each other (provided they were fit on the same data). To do so, I first call id_post_pred with type='log_lik' to generate log-likelihood values for each of these models.

log_lik_irt_2pl_correct <- id_post_pred(irt_2pl_correct,type='log_lik')
## [1] "Processing posterior replications for 1000 scores using 500 posterior samples out of a total of 500 samples."
log_lik_irt_2pl_incorrect <- id_post_pred(irt_2pl_incorrect,type='log_lik')
## [1] "Processing posterior replications for 1000 scores using 500 posterior samples out of a total of 500 samples."

Note that we must also use the relative_eff function to calculate the relative effective sample sizes of the log-likelihood draws, which loo needs for its diagnostics. The first argument to relative_eff is the exponentiated log-likelihood matrix we calculated previously; the chain_id argument uses the helper function derive_chain, which also takes the log-likelihood matrix as its argument.

I use two cores here, but a larger number would speed up the calculations.

correct_loo <- loo(log_lik_irt_2pl_correct,
                   cores=2,
                   r_eff=relative_eff(exp(log_lik_irt_2pl_correct),
                                      chain_id=derive_chain(log_lik_irt_2pl_correct)))
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
incorrect_loo <- loo(log_lik_irt_2pl_incorrect,
                     cores=2,
                     r_eff=relative_eff(exp(log_lik_irt_2pl_incorrect),
                                        chain_id=derive_chain(log_lik_irt_2pl_incorrect)))
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
print(correct_loo)
##
## Computed from 500 by 1000 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo   -630.2 14.8
## p_loo       105.4  3.8
## looic      1260.4 29.6
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     936   93.6%   106
##  (0.5, 0.7]   (ok)        59    5.9%   73
##    (0.7, 1]   (bad)        5    0.5%   48
##    (1, Inf)   (very bad)   0    0.0%   <NA>
## See help('pareto-k-diagnostic') for details.
print(incorrect_loo)
##
## Computed from 500 by 1000 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo  -1042.5 12.2
## p_loo        39.3  2.2
## looic      2084.9 24.4
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     992   99.2%   172
##  (0.5, 0.7]   (ok)         7    0.7%   227
##    (0.7, 1]   (bad)        1    0.1%   379
##    (1, Inf)   (very bad)   0    0.0%   <NA>
## See help('pareto-k-diagnostic') for details.

With these calculations we can examine the models' loo values, which summarize each model's predictive performance on the data. Overall, performance seems quite good: the Pareto k values show that only a few dozen observations in the dataset aren't well predicted. The LOO-IC, or leave-one-out information criterion (think AIC or BIC), is much lower (i.e., better) for the correct model, as we would expect.

We can also compare the two models explicitly using loo's compare function, which also gives us an estimate of the standard error of the difference. If the difference is negative, then the first (correct) model has higher predictive accuracy:

compare(correct_loo,
        incorrect_loo)
## elpd_diff        se
##    -412.3      14.8
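Under the hood, this comparison sums the pointwise differences in elpd across observations and reports the standard error as sqrt(N) times the standard deviation of those differences. A toy base-R sketch with fabricated pointwise elpd values (these vectors are illustrative stand-ins, not taken from the models above):

```r
# How an elpd difference and its SE are computed from pointwise
# elpd values (fabricated numbers for illustration only).
set.seed(3)
n <- 1000
elpd_correct   <- rnorm(n, -0.63, 0.5)  # stand-in pointwise elpd values
elpd_incorrect <- rnorm(n, -1.04, 0.5)

diffs <- elpd_incorrect - elpd_correct
c(elpd_diff = sum(diffs),            # summed pointwise difference
  se        = sqrt(n) * sd(diffs))   # standard error of the difference
```

Because the difference is a sum over observations, a model that is only slightly better per observation can still be decisively better overall on a large dataset.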

The first (correct) model is clearly preferred, as we would certainly hope in this extreme example. In less obvious cases, the elpd values can help arbitrate between model choices when a theoretical reason for choosing a model does not exist.