## Note that, since v0.1.6.2, DHARMa includes support for glmmTMB, but there are still a few minor limitations associated with this package. Please see https://github.com/florianhartig/DHARMa/issues/16 for details, in particular if you use this for production.

Motivation

Residual interpretation for generalized linear mixed models (GLMMs) is often problematic. As an example, here are two Poisson GLMMs, one that is lacking a quadratic effect, and one that fits the data perfectly. I show three standard residual diagnostics for each. Which is the misspecified model?

[Figure: three standard residual diagnostics for each of the two Poisson GLMMs]

Just for completeness - it was the first one. But don't get too excited if you got it right. Either you were lucky, or you noted that the first model seems a bit overdispersed (range of the Pearson residuals). But even when noting that, would you have added a quadratic effect, instead of adding an overdispersion correction? The point here is that misspecifications in GL(M)Ms cannot reliably be diagnosed with standard residual plots, and GLMMs are thus often not as thoroughly checked as LMs.

One reason why GL(M)M residuals are harder to interpret is that the expected distribution of the data changes with the fitted values. Reweighting with the expected variance, as done in Pearson residuals, or using deviance residuals, helps a bit, but does not lead to visually homogeneous residuals even if the model is correctly specified. As a result, standard residual plots, when interpreted in the same way as for linear models, seem to show all kinds of problems, such as non-normality or heteroscedasticity, even if the model is correctly specified. Questions on the R mailing lists and forums show that practitioners are regularly confused about whether such patterns in GL(M)M residuals are a problem or not.

But even experienced statistical analysts currently have few options to diagnose misspecification problems in GLMMs. In my experience, the current standard practice is to eyeball the residual plots for major misspecifications, potentially have a look at the random effect distribution, and then run a test for overdispersion, which is usually positive, after which the model is modified towards an overdispersed / zero-inflated distribution. This approach, however, has a number of problems, notably:

DHARMa aims at solving these problems by creating readily interpretable residuals for generalized linear (mixed) models that are standardized to values between 0 and 1, and that can be interpreted as intuitively as residuals for the linear model. This is achieved by a simulation-based approach, similar to the Bayesian p-value or the parametric bootstrap, that transforms the residuals to a standardized scale. The basic steps are:

  1. Simulate new data from the fitted model for each observation.

  2. For each observation, calculate the empirical cumulative distribution function for the simulated observations, which describes the possible values (and their probability) at the predictor combination of the observed value, assuming the fitted model is correct.

  3. The residual is then defined as the value of the empirical cumulative distribution function at the value of the observed data, so a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means half of the simulated values are larger than the observed value.

These steps are visualized in the following figure

The key idea for this definition is that, if the model is correctly specified, then the observed data should look as if it had been created from the fitted model. Hence, for a correctly specified model, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on).
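
The three steps can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions (a plain Poisson GLM without random effects, made-up data, and no special treatment of ties), not the code that DHARMa runs internally. Because the counts are integers and ties are not randomized here, the resulting residuals are only approximately flat.

```r
# Illustrative base R sketch of the three steps (not DHARMa's own code)
set.seed(1)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + x))      # data created from the model we fit
fit <- glm(y ~ x, family = poisson)

# Step 1: simulate new data from the fitted model (one column per simulation)
nSim <- 250
sims <- as.matrix(simulate(fit, nsim = nSim))

# Steps 2 and 3: for each observation, build the empirical CDF of its
# simulated values and evaluate it at the observed value
scaled <- vapply(seq_len(n),
                 function(i) ecdf(sims[i, ])(y[i]),
                 numeric(1))

# For a correctly specified model the quartiles should be near 0.25, 0.5, 0.75
print(quantile(scaled, probs = c(0.25, 0.5, 0.75)))
```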

I am currently preparing a more exact statistical justification for the approach in an accompanying paper, but if you must provide a reference in the meantime I would suggest citing

p.s.: DHARMa stands for “Diagnostics for HierArchical Regression Models” – which, strictly speaking, would make DHARM. But in German, Darm means intestines; plus, the meaning of DHARMa in Hinduism makes the current abbreviation so much more suitable for a package that tests whether your model is in harmony with your data:

From Wikipedia, 28/08/16: In Hinduism, dharma signifies behaviours that are considered to be in accord with rta, the order that makes life and universe possible, and includes duties, rights, laws, conduct, virtues and ‘‘right way of living’’.

Workflow in DHARMa

Installing, loading and citing the package

If you haven't installed the package yet, either run

install.packages("DHARMa")

Or follow the instructions on https://github.com/florianhartig/DHARMa to install a development version.

Loading and citation

library(DHARMa)
citation("DHARMa")
## 
## To cite package 'DHARMa' in publications use:
## 
##   Florian Hartig (2018). DHARMa: Residual Diagnostics for
##   Hierarchical (Multi-Level / Mixed) Regression Models. R package
##   version 0.2.0. http://florianhartig.github.io/DHARMa/
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models},
##     author = {Florian Hartig},
##     year = {2018},
##     note = {R package version 0.2.0},
##     url = {http://florianhartig.github.io/DHARMa/},
##   }

Calculating scaled residuals

The scaled (quantile) residuals are calculated with the simulateResiduals() function. The default number of simulations to run is 250, which proved to be a reasonable compromise between computation time and precision, but if high precision is desired, n should be raised to 1000 at least.

simulationOutput <- simulateResiduals(fittedModel = fittedModel, n = 250)

What the function does is a) create n new synthetic datasets by simulating from the fitted model, b) calculate the cumulative distribution of simulated values for each observed value, and c) return the quantile value that corresponds to the observed value.

For example, a scaled residual value of 0.5 means that half of the simulated data are higher than the observed value, and half of them lower. A value of 0.99 would mean that nearly all simulated data are lower than the observed value. The minimum/maximum values for the residuals are 0 and 1.

The calculated residuals are stored in

simulationOutput$scaledResiduals

As discussed above, for a correctly specified model we would expect

Note: the expected uniform distribution is the only difference from linear regression that one has to keep in mind when interpreting DHARMa residuals. If you cannot get used to this and you must have residuals that behave exactly like those of a linear regression, you can access a normal transformation of the residuals via

simulationOutput$scaledResidualsNormal

These normal residuals will behave exactly like the residuals of a linear regression. However, for reasons of a) numeric stability with a low number of simulations and b) my conviction that it is much easier to visually detect deviations from uniformity than from normality, I would STRONGLY advise against using this transformation.
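
The numeric stability concern can be illustrated in base R. The sketch assumes that the normal transformation is simply the standard normal quantile function applied to the uniform residuals; how DHARMa handles the boundary cases internally is not shown here. With few simulations, an observation can easily fall below or above all simulated values, giving a scaled residual of exactly 0 or 1.

```r
# Uniform residuals map to the normal scale via the normal quantile function
set.seed(1)
u <- runif(100)      # stand-in for scaled residuals of a correct model
z <- qnorm(u)        # approximately standard normal

# Residuals of exactly 0 or 1 become infinite on the normal scale
zEdge <- qnorm(c(0, 0.5, 1))
print(zEdge)
```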

Plotting the scaled residuals

We can get a visual impression of these properties with the plot.DHARMa() function

plot(simulationOutput)

which creates a qq-plot to detect overall deviations from the expected distribution, and a plot of the residuals against the predicted value.

To provide a visual aid in detecting deviations from uniformity in y-direction, the plot of the residuals against the predicted values also performs an (optional) quantile regression, which provides 0.25, 0.5 and 0.75 quantile lines across the plots. These lines should be straight, horizontal, and at y-values of 0.25, 0.5 and 0.75. Note, however, that some deviations from this are to be expected by chance, even for a perfect model, especially if the sample size is small.

The quantile regression can be very slow for large datasets. You can choose to use a simpler method with the option quantreg = F.

If you want to plot the residuals against other predictors (highly recommended), you can use the function

plotResiduals(YOURPREDICTOR, simulationOutput$scaledResiduals)

which does the same quantile plot as the main plotting function.
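
If you want a quick numerical check of the same idea without quantile regression, one crude possibility is to bin the predictor and compare the empirical quartiles of the residuals per bin. This is a hypothetical base R sketch with stand-in data, and it is not the method used by the plotting functions.

```r
# Crude binned check: residual quartiles within bins of a predictor
set.seed(1)
n <- 1000
predictor <- runif(n)     # stand-in for a predictor or the model predictions
scaled <- runif(n)        # stand-in for scaled residuals of a correct model

# Five bins with roughly equal numbers of observations
bins <- cut(predictor,
            breaks = quantile(predictor, probs = seq(0, 1, by = 0.2)),
            include.lowest = TRUE)

# One row per bin; columns should be close to 0.25, 0.5 and 0.75
binQuantiles <- t(sapply(split(scaled, bins), quantile,
                         probs = c(0.25, 0.5, 0.75)))
print(round(binQuantiles, 2))
```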

Formal goodness-of-fit tests on the scaled residuals

To support the visual inspection of the residuals, the DHARMa package provides a number of specialized goodness-of-fit tests on the simulated residuals. For example, the function

testUniformity(simulationOutput = simulationOutput)

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.052, p-value = 0.5085
## alternative hypothesis: two-sided

runs a KS test to test for overall uniformity of the residuals. There are a number of further tests

that basically do what they say. See the help of the functions and further comments below for a more detailed description.
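
The uniformity check itself is easy to reproduce in base R with ks.test(). The sketch below uses made-up residual vectors (not the output of a fitted model) to contrast a flat distribution with one that has too much mass near 0 and 1.

```r
# One-sample KS test against the uniform distribution
set.seed(1)
flat <- runif(500)              # what a well-specified model should give
tails <- rbeta(500, 0.5, 0.5)   # too many residuals near 0 and 1

ksFlat <- ks.test(flat, "punif")
ksTails <- ks.test(tails, "punif")
print(c(flat = ksFlat$p.value, tails = ksTails$p.value))
```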

Simulation options

There are a few important technical details regarding how the simulations are performed, in particular regarding the treatment of random effects and integer responses. I would therefore strongly recommend reading the help of

?simulateResiduals

The short summary is this: apart from the number of simulations, there are three important options in the simulateResiduals function

Refit

simulationOutput <- simulateResiduals(fittedModel = fittedModel, refit = T)

The second option is much much slower, and also seemed to have lower power in some tests I ran. It is therefore not recommended for standard residual diagnostics! I only recommend using it in two situations

  1. For running tests that rely on comparing observed to simulated residuals, e.g. the testOverdispersion function (see below),

  2. Or, and this was my original motivation for introducing this option, if one expects that the tested model is biased. A bias could, for example, arise in small data situations, or when estimating models with shrinkage estimators that include a purposeful bias, such as ridge/lasso, random effects or the splines in GAMs. My idea was then that simulated data would not fit the observations, but that residuals for model fits on simulated data would have the same patterns/bias as model fits on the observed data.

Note also that refit = T can sometimes run into numerical problems, if the fitted model does not converge on the newly simulated data.

Random effect simulations

The second option is the treatment of the stochastic hierarchy. In a hierarchical model, several layers of stochasticity are placed on top of each other. Specifically, in a GLMM, we have a lower level stochastic process (random effect), whose result enters into a higher level (e.g. Poisson distribution). For other hierarchical models, such as state-space models, similar considerations apply, but the hierarchy can be more complex. When simulating, we have to decide if we want to re-simulate all stochastic levels, or only a subset of those. For example, in a GLMM, it is common to only simulate the last stochastic level (e.g. Poisson) conditional on the fitted random effects, meaning that the random effects are held at their fitted values.

For controlling how many levels should be re-simulated, the simulateResiduals function allows passing parameters on to the simulate function of the fitted model object. Please refer to the help of the different simulate functions (e.g. ?simulate.merMod) for details. For merMod (lme4) model objects, the relevant parameters are “use.u”, and “re.form”, as, e.g., in

simulationOutput <- simulateResiduals(fittedModel = fittedModel, n = 250, use.u = T)

If the model is correctly specified and the fitting procedure is unbiased (disclaimer: GLMM estimators are not always unbiased), the simulated residuals should be flat regardless of how many hierarchical levels we re-simulate. The most thorough procedure would therefore be to test all possible options. If testing only one option, I would recommend re-simulating all levels, because this essentially tests the model structure as a whole. This is the default setting in the DHARMa package. A potential drawback is that re-simulating the random effects creates more variability, which may reduce power for detecting problems in the upper-level stochastic processes.
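
The extra variability from re-simulating the random effects can be made visible with a small hand-written simulation. The sketch below is base R only and uses hypothetical values (a Poisson model with a random intercept whose standard deviation is assumed to be 0.5, and randomly drawn stand-ins for the fitted random effects); it does not call lme4 or DHARMa.

```r
# Conditional vs. unconditional simulation from a random-intercept Poisson model
set.seed(1)
nGroups <- 10
nPerGroup <- 50
group <- rep(seq_len(nGroups), each = nPerGroup)
sdGroup <- 0.5                                  # assumed random-intercept sd
fittedEffects <- rnorm(nGroups, 0, sdGroup)     # stand-in for fitted random effects
intercept <- 1

simulateOnce <- function(resimulateGroups) {
  # Either draw new group effects or condition on the fitted ones
  effects <- if (resimulateGroups) rnorm(nGroups, 0, sdGroup) else fittedEffects
  rpois(length(group), lambda = exp(intercept + effects[group]))
}

# Spread of the simulated mean of group 1 under both schemes
meanGroup1 <- function(resimulateGroups) {
  replicate(200, mean(simulateOnce(resimulateGroups)[group == 1]))
}
sdConditional <- sd(meanGroup1(FALSE))
sdUnconditional <- sd(meanGroup1(TRUE))
print(c(conditional = sdConditional, unconditional = sdUnconditional))
```

The simulated group means vary much more when the group effects are redrawn, which is the price paid for testing the model structure as a whole.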

Integer treatment / randomization

A third option is the treatment of integer responses. The background of this option is that, for integer-valued variables, some additional steps are necessary to make sure that the residual distribution becomes flat (essentially, we have to smooth away the integer nature of the data). The idea is explained in

The simulateResiduals function will automatically check if the family is integer valued, and apply randomization if that is the case. I see no reason why one would not want to randomize for an integer-valued function, so the parameter should usually not be changed.
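
To see why randomization matters, consider the following base R sketch. It uses the exact Poisson CDF instead of simulations and follows the general idea of randomized quantile residuals; the exact scheme implemented in simulateResiduals may differ.

```r
# Quantile residuals for integer data, with and without randomization
set.seed(1)
n <- 1000
lambda <- 1
y <- rpois(n, lambda)

# Without randomization: the CDF at the observed count takes only a few
# distinct values, so the residuals cannot be uniform
plain <- ppois(y, lambda)

# With randomization: draw uniformly between P(Y < y) and P(Y <= y)
randomized <- runif(n, min = ppois(y - 1, lambda), max = ppois(y, lambda))

print(c(plain = length(unique(plain)),
        randomized = length(unique(randomized))))
print(ks.test(randomized, "punif")$p.value)
```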

Calculating residuals per group

In many situations, it can be useful to look at residuals per group, e.g. to see how much the model over / underpredicts per plot, year or subject. To do this, use the recalculateResiduals() function, together with a grouping variable

simulationOutput = recalculateResiduals(simulationOutput, group = testData$group)

You can keep using the simulation output as before. Note, however, that items such as simulationOutput$scaledResiduals now have as many entries as you have groups, so if you perform plots by hand, you have to aggregate predictors in the same way. For the latter purpose, recalculateResiduals adds a function aggregateByGroup to the output.
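
The principle behind grouped residuals can be sketched in base R. I assume here that observed and simulated responses are summed within each group before the quantile residual is recomputed; please check the help of recalculateResiduals for what the package actually does. All data and names below are made up for the illustration.

```r
# Grouped quantile residuals: aggregate first, then compare to simulations
set.seed(1)
n <- 200
group <- rep(1:10, each = 20)
x <- runif(n)
y <- rpois(n, exp(0.5 + x))
fit <- glm(y ~ x, family = poisson)
sims <- as.matrix(simulate(fit, nsim = 250))

# Sum observed and simulated responses within each group
obsByGroup <- tapply(y, group, sum)
simByGroup <- rowsum(sims, group)    # one row per group, one column per simulation

# Recompute the quantile residual from the aggregated values
groupResiduals <- vapply(seq_along(obsByGroup),
                         function(g) ecdf(simByGroup[g, ])(obsByGroup[g]),
                         numeric(1))

# Predictors have to be aggregated in the same way before plotting
xByGroup <- tapply(x, group, mean)
print(round(cbind(x = xByGroup, residual = groupResiduals), 2))
```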

Reproducibility notes, random seed and random state

As DHARMa uses simulations to calculate the residuals, a naive implementation of the algorithm would mean that residuals would look slightly different each time a DHARMa calculation is executed. This might not only be confusing to a user, but it also bears the danger that one might run the simulation several times and take the result that looks better (which would amount to multiple testing / p-hacking).

By default, DHARMa therefore fixes the random seed to the same value every time a simulation is run, and afterwards restores the random state to the old value. This means that you will get exactly the same residual plot each time. If you want to avoid this behavior, for example for simulation experiments on DHARMa, use seed = NULL -> no seed set, but random state will be restored, or seed = F -> no seed set, and random state will not be restored. Whether or not you fix the seed, the setting for the random seed and the random state are stored in

simulationOutput$randomState

If you want to reproduce simulations for such a run, set the variable .Random.seed by hand, and simulate with seed = NULL.

Moreover (general advice), to ensure reproducibility, it's advisable to add a set.seed() at the beginning, and a sessionInfo() at the end of your script. The latter will list the version number of R and all loaded packages.
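
The pattern of fixing the seed for one calculation and then restoring the previous random state can be written in base R as follows. The helper withFixedSeed is a hypothetical name made up for this sketch, and it is not a DHARMa function.

```r
# Run an expression with a fixed seed, then restore the previous random state
withFixedSeed <- function(expr, seed = 123) {
  hadState <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  if (hadState) oldState <- get(".Random.seed", envir = .GlobalEnv)
  on.exit({
    if (hadState) {
      assign(".Random.seed", oldState, envir = .GlobalEnv)
    } else {
      rm(".Random.seed", envir = .GlobalEnv)
    }
  })
  set.seed(seed)
  expr   # lazy evaluation: the expression is only evaluated here, after set.seed()
}

set.seed(42)
stateBefore <- get(".Random.seed", envir = .GlobalEnv)
a <- withFixedSeed(runif(3))
b <- withFixedSeed(runif(3))
stateAfter <- get(".Random.seed", envir = .GlobalEnv)

print(identical(a, b))                      # same draws on every call
print(identical(stateBefore, stateAfter))   # outer random state is untouched
```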

Visual diagnostics and tests of common misspecification problems

In all plots / tests that were shown so far, the model was correctly specified, resulting in “perfect” residual plots. In this section, we discuss how to recognize and interpret model misspecifications in the scaled residuals.

Overdispersion / underdispersion

The most common concerns for GLMMs are overdispersion, underdispersion and zero-inflation.

Over/underdispersion refers to the phenomenon that residual variance is larger/smaller than expected under the fitted model. Over/underdispersion can appear for any distributional family with fixed variance, in particular for Poisson and binomial models.

A few general rules of thumb

An example of overdispersion

This is what overdispersion looks like in the DHARMa residuals

library(lme4)  # provides glmer()
testData = createData(sampleSize = 500, overdispersion = 2, family = poisson())
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , family = "poisson", data = testData)

simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)

Note that we get more residuals around 0 and 1, which means that more residuals are in the tails of the distribution than would be expected under the fitted model.
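
The same pattern can be reproduced without DHARMa or lme4. In the following base R sketch, negative binomial data (an arbitrary choice for creating overdispersion) are fitted with a Poisson GLM, and the share of randomized scaled residuals in the outer 10% of the unit interval is counted. For a correct model, this share should be about 0.1.

```r
# Overdispersed data fitted with a Poisson GLM: residuals pile up at 0 and 1
set.seed(1)
n <- 500
x <- runif(n)
y <- rnbinom(n, mu = exp(1 + x), size = 1)   # more variable than Poisson
fit <- glm(y ~ x, family = poisson)

sims <- as.matrix(simulate(fit, nsim = 250))

# Randomized quantile residual: P(sim < obs) + U * P(sim == obs)
scaled <- vapply(seq_len(n), function(i) {
  mean(sims[i, ] < y[i]) + runif(1) * mean(sims[i, ] == y[i])
}, numeric(1))

tailShare <- mean(scaled < 0.05 | scaled > 0.95)
print(tailShare)
```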

An example of underdispersion

This is an example of underdispersion

testData = createData(sampleSize = 500, intercept=0, fixedEffects = 2, overdispersion = 0, family = poisson(), roundPoissonVariance = 0.001, randomEffectVariance = 0)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , family = "poisson", data = testData)

summary(fittedModel)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: observedResponse ~ Environment1 + (1 | group)
##    Data: testData
## 
##      AIC      BIC   logLik deviance df.resid 
##   1031.1   1043.8   -512.6   1025.1      497 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -0.64083 -0.35390 -0.05813  0.22834  0.91703 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 0        0       
## Number of obs: 500, groups:  group, 10
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.13024    0.05831  -2.233   0.0255 *  
## Environment1  2.19567    0.08519  25.772   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Environmnt1 -0.818
# plotConventionalResiduals(fittedModel)

simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)

testUniformity(simulationOutput = simulationOutput)

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.22, p-value < 2.2e-16
## alternative hypothesis: two-sided

Here, we get too many residuals around 0.5, which means that we get fewer residuals in the tails of the distribution than would be expected under the fitted model.

Testing for over/underdispersion

Although, as discussed above, over/underdispersion will show up in the residuals, and it's possible to detect it with the testUniformity function, simulations show that this test is less powerful than more targeted tests.

DHARMa therefore contains two overdispersion tests that compare the dispersion of simulated residuals to the observed residuals.

  1. A non-parametric test on the simulated residuals
  2. A non-parametric overdispersion test on the re-fitted residuals.

You can call these tests as follows:

# Option 1: test on the simulated residuals
testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted
##  vs. simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.24135, p-value < 2.2e-16
## alternative hypothesis: two.sided

# Option 2: test on the re-fitted residuals
simulationOutput2 <- simulateResiduals(fittedModel = fittedModel, refit = T, n = 20)
testDispersion(simulationOutput2)

## 
##  DHARMa nonparametric dispersion test via mean deviance residual
##  fitted vs. simulated-refitted
## 
## data:  simulationOutput2
## dispersion = 0.15184, p-value < 2.2e-16
## alternative hypothesis: two.sided
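
To make the logic of a simulation-based dispersion test concrete, here is a base R sketch that compares the standard deviation of the observed raw residuals with the same statistic in data simulated from the fitted model. It is my own simplified reading of "sd of residuals fitted vs. simulated" and not the testDispersion implementation; the data are made up and overdispersed on purpose.

```r
# Simulation-based dispersion check: observed vs. simulated residual spread
set.seed(1)
n <- 500
x <- runif(n)
y <- rnbinom(n, mu = exp(1 + x), size = 2)   # overdispersed relative to Poisson
fit <- glm(y ~ x, family = poisson)

nSim <- 250
sims <- as.matrix(simulate(fit, nsim = nSim))
pred <- fitted(fit)

# Test statistic for the observed data and for each simulated dataset
obsSpread <- sd(y - pred)
simSpread <- apply(sims, 2, function(s) sd(s - pred))

# Ratio > 1 suggests overdispersion, ratio < 1 underdispersion
ratioObsSim <- obsSpread / mean(simSpread)

# Two-sided simulation p-value (with the usual +1 correction)
pUpper <- (sum(simSpread >= obsSpread) + 1) / (nSim + 1)
pLower <- (sum(simSpread <= obsSpread) + 1) / (nSim + 1)
pValue <- min(1, 2 * min(pUpper, pLower))
print(c(ratioObsSim = ratioObsSim, p = pValue))
```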

Note: previous versions of DHARMa (< 0.2.0) discouraged the simulated overdispersion test in favor of the refitted and parametric tests. I have since changed the test function, and simulations show that it is as powerful as the refitted or parametric test. Because of the generality and speed of this option, I see no good reason for either refitting or running parametric tests. Therefore

  1. My recommendation for testing dispersion is to simply use the standard dispersion test, based on the simulated residuals

  2. It's not clear to me if the refitted test is better … but it's available.

  3. In my simulations, parametric tests, such as AER::dispersiontest, didn't provide higher power. Because of that, and because of the higher generality of the simulated tests, I no longer provide parametric tests in DHARMa. However, you can see various implementations of the parametric tests in the DHARMa GitHub repo (under Code/DHARMaPerformance/Power).

Below is an example from there, which compares the four options to test for overdispersion (2 options to use DHARMa::testDispersion, AER::dispersiontest, and DHARMa::testUniformity) for a Poisson glm

Comparison of power from simulation studies

A word of warning that applies also to all other tests that follow: significance in hypothesis tests depends on at least 2 ingredients: strength of the signal, and number of data points. Hence, the p-value alone is not a good indicator of the extent to which your residuals deviate from assumptions. Specifically, if you have a lot of data points, residual diagnostics will nearly inevitably become significant, because having a perfectly fitting model is very unlikely. That, however, doesn't necessarily mean that you need to change your model. The p-values confirm that there is a deviation from your null hypothesis. It is, however, at your discretion to decide whether this deviation is worth worrying about. If you see a dispersion parameter of 1.01, I would not worry, even if the test is significant. A significant value of 5, however, is clearly a reason to move to a model that accounts for overdispersion.
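
The dependence of the p-value on the sample size is easy to demonstrate. In this base R sketch, the same mild deviation from uniformity (residuals drawn from a Beta(1.2, 1.2) distribution, an arbitrary choice) is tested once with 100 and once with 100000 data points.

```r
# Same mild deviation from uniformity, very different p-values
set.seed(1)
drawResiduals <- function(n) rbeta(n, 1.2, 1.2)   # slightly too few values in the tails

pSmall <- ks.test(drawResiduals(100), "punif")$p.value
pLarge <- ks.test(drawResiduals(100000), "punif")$p.value
print(c(n100 = pSmall, n100000 = pLarge))
```

The size of the deviation is identical in both cases; only the amount of data changes. Whether such a deviation matters is a question about the effect size, not about the p-value.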

Zero-inflation / k-inflation or deficits

A common special case of overdispersion is zero-inflation, which is the situation when more zeros appear in the observation than expected under the fitted model. Zero-inflation requires special correction steps.

More generally, we can also have too few zeros, or too many or too few of any other value. We'll discuss that at the end of this section.
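
A direct way to check for zero-inflation is to compare the number of observed zeros with the number of zeros in data simulated from the fitted model. The following base R sketch does this for made-up data with 30% added zeros; it illustrates the principle only and does not use DHARMa. Replacing the condition == 0 by another value gives the corresponding check for that value.

```r
# Compare observed zeros with zeros expected under the fitted model
set.seed(1)
n <- 500
x <- runif(n)
y <- rpois(n, exp(1 + x))
y[runif(n) < 0.3] <- 0                  # add structural zeros
fit <- glm(y ~ x, family = poisson)

sims <- as.matrix(simulate(fit, nsim = 250))
obsZeros <- sum(y == 0)
simZeros <- colSums(sims == 0)          # zeros per simulated dataset

ratioZeros <- obsZeros / mean(simZeros)
# One-sided simulation p-value for "more zeros than expected"
pValue <- (sum(simZeros >= obsZeros) + 1) / (length(simZeros) + 1)
print(c(observed = obsZeros, expected = mean(simZeros), p = pValue))
```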

An example of zero-inflation

Here is an example of a typical zero-inflated count dataset, plotted against the environmental predictor

testData = createData(sampleSize = 500, intercept = 2, fixedEffects = c(1), overdispersion = 0, family = poisson(), quadraticFixedEffects = c(-3), randomEffectVariance = 0, pZeroInflation = 0.6)

par(mfrow = c(1,2))
plot(testData$Environment1, testData$observedResponse, xlab = "Environmental Predictor", ylab = "Response")
hist(testData$observedResponse, xlab = "Response", main = "")

We see a hump-shaped dependence on the environment, but with too many zeros.

Zero-inflation in the scaled residuals

In the normal DHARMa residual plots, zero-inflation will look pretty much like overdispersion

fittedModel <- glmer(observedResponse ~ Environment1 + I(Environment1^2) + (1|group) , family = "poisson", data = testData)

simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)