Estimating Distributional Models with brms

Paul Bürkner



This vignette provides an introduction on how to fit distributional regression models with brms. We use the term distributional model to refer to a model in which we can specify predictor terms for all parameters of the assumed response distribution. In the vast majority of regression model implementations, only the location parameter (usually the mean) of the response distribution depends on the predictors and corresponding regression parameters. Other parameters (e.g., scale or shape parameters) are estimated as auxiliary parameters assumed to be constant across observations. This assumption is so common that many researchers applying regression models are (in my experience) not even aware of the possibility of relaxing it. This is understandable insofar as relaxing the assumption drastically increases model complexity and thus makes models harder to fit. Fortunately, brms uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models, so that model complexity is much less of an issue.

Suppose we have a normally distributed response variable. Then, in basic linear regression, we specify a predictor term \(\eta_{\mu}\) for the mean parameter \(\mu\) of the normal distribution. The second parameter of the normal distribution – the residual standard deviation \(\sigma\) – is assumed to be constant across observations. We estimate \(\sigma\) but do not try to predict it. In a distributional model, however, we do exactly this by specifying a predictor term \(\eta_{\sigma}\) for \(\sigma\) in addition to the predictor term \(\eta_{\mu}\). Ignoring group-level effects for the moment, the linear predictor of a parameter \(\theta\) for observation \(n\) has the form

\[\eta_{\theta n} = \sum_{i = 1}^{K_{\theta}} b_{\theta i} x_{\theta i n}\] where \(x_{\theta i n}\) denotes the value of the \(i\)th predictor of parameter \(\theta\) for observation \(n\) and \(b_{\theta i}\) is the \(i\)th regression coefficient of parameter \(\theta\). A distributional normal model with response variable \(y\) can then be written as

\[y_n \sim \mathcal{N}\left(\eta_{\mu n}, \, \exp(\eta_{\sigma n}) \right)\] We used the exponential function around \(\eta_{\sigma}\) to reflect that \(\sigma\) constitutes a standard deviation and thus only takes on positive values, while a linear predictor can be any real number.
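To see the log link in action, consider the following minimal simulation sketch; the predictor x and all coefficient values are made up purely for illustration.

# simulate from a distributional normal model in which both the mean
# and the (log) residual standard deviation depend on a predictor x
n <- 100
x <- rnorm(n)
eta_mu <- 1 + 2 * x        # linear predictor for mu
eta_sigma <- -1 + 0.5 * x  # linear predictor for log(sigma)
y <- rnorm(n, mean = eta_mu, sd = exp(eta_sigma))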

Throughout the vignette, R code output is not shown to reduce installation time and size of the package.

A simple distributional model

Unequal variance models are perhaps the simplest, but nevertheless very important, application of distributional models. Suppose we have two groups of patients: one group receives a treatment (e.g., an antidepressive drug) and the other group receives a placebo. Since the treatment may not work equally well for all patients, the symptom variance of the treatment group may be larger than that of the placebo group after some weeks of treatment. For simplicity, assume that we only investigate the post-treatment values.

# simulate data: larger mean and SD in the treatment group
group <- rep(c("treat", "placebo"), each = 30)
symptom_post <- c(rnorm(30, mean = 1, sd = 2), rnorm(30, mean = 0, sd = 1))
dat1 <- data.frame(group, symptom_post)

The following model estimates the effect of group on both the mean and the residual standard deviation of the normal response distribution.

library(brms)
fit1 <- brm(bf(symptom_post ~ group, sigma ~ group),
            data = dat1, family = gaussian())

Useful summary statistics and plots can be obtained via

summary(fit1)
plot(marginal_effects(fit1), points = TRUE)

The population-level effect sigma_grouptreat, which is the contrast of the two residual standard deviations on the log-scale, reveals that the variances of both groups are indeed different. This impression is confirmed when looking at the marginal_effects of group. Going one step further, we can compute the residual standard deviations on the original scale using the hypothesis method.

hyp <- c("exp(sigma_Intercept) = 0",
         "exp(sigma_Intercept + sigma_grouptreat) = 0")
hypothesis(fit1, hyp)
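Each hypothesis string defines a derived quantity whose posterior estimate and credible interval are then reported; here, these quantities are the residual standard deviations of the placebo and treatment group, respectively.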

We may also directly compare them and plot the posterior distribution of their difference.

hyp <- "exp(sigma_Intercept + sigma_grouptreat) > exp(sigma_Intercept)"
(hyp <- hypothesis(fit1, hyp))
plot(hyp, chars = NULL)
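The same comparison can also be carried out by hand on the posterior draws. Below is a minimal sketch; the parameter names b_sigma_Intercept and b_sigma_grouptreat follow brms' naming scheme for this model.

draws <- posterior_samples(fit1, pars = "^b_sigma_")
sigma_placebo <- exp(draws$b_sigma_Intercept)
sigma_treat <- exp(draws$b_sigma_Intercept + draws$b_sigma_grouptreat)
# posterior probability that the treatment group has the larger SD
mean(sigma_treat > sigma_placebo)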

Indeed, the residual standard deviation of the treatment group seems to be larger than that of the placebo group. Moreover, the magnitude of the difference closely matches what we would expect given the values used in the data simulation.

Zero-Inflated Models

Another important application of the distributional regression framework is the class of so-called zero-inflated models. These models are helpful whenever there are more zeros in the response variable than one would naturally expect. For example, if one seeks to predict the number of cigarettes people smoke per day and also includes non-smokers, there will be an excess of zeros which, when not modeled appropriately, can seriously distort parameter estimates. Here, we consider another example dealing with the number of fish caught by various groups of people.

# read in the fish data; the file path is omitted in this vignette
zinb <- read.csv("")

As predictors we choose the number of people per group, the number of children, as well as whether the group consists of campers. Many groups may not catch any fish and so we directly fit a zero-inflated Poisson model. For now, we assume a constant zero-inflation probability across observations.

fit_zinb1 <- brm(count ~ persons + child + camper, 
                 data = zinb, family = zero_inflated_poisson())

Again, we summarize the results using the usual methods.

summary(fit_zinb1)
plot(marginal_effects(fit_zinb1), ask = FALSE)

According to the parameter estimates, larger groups catch more fish, campers catch more fish than non-campers, and children are really bad at catching fish. The zero-inflation probability zi is pretty large with a mean of 41%. Please note that the probability of catching no fish is actually higher than 41%, but part of this probability is already accounted for by the Poisson distribution itself (hence the name zero-inflation). If you want to treat all zeros as originating from a separate process, you can use hurdle models instead (not shown here).
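To illustrate why the total probability of a zero exceeds zi, suppose, purely hypothetically, a Poisson mean of 2 for a given observation:

zi <- 0.41   # estimated zero-inflation probability
lambda <- 2  # hypothetical Poisson mean, for illustration only
# total probability of a zero: inflation part plus Poisson zeros
zi + (1 - zi) * dpois(0, lambda)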

Now, we try to additionally predict the zero-inflation probability by the number of children. But why should we predict the zero-inflation probability at all? One line of reasoning goes as follows. We assume two processes behind our response: the first is the basic Poisson process and the second is the zero-inflation process. In the smoker example, it is easy to interpret the zero-inflation process as the decision between smoking and non-smoking in general, which is definitely worth investigating using regression techniques. In other applications of zero-inflated models, for instance the fishing example, the zero-inflation process may not be as easy to interpret (or maybe it is, I have no idea of fishing), but the statistical idea remains the same: each of the two processes comes with its own implied distribution, and the zero-inflated Poisson distribution is a mixture of the two. From this perspective, predicting both parts of the model is natural and reasonable if we want to fully understand the data.
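Formally, the zero-inflated Poisson distribution with zero-inflation probability \(zi\) and Poisson mean \(\lambda\) assigns the probabilities

\[P(y = 0) = zi + (1 - zi) \, e^{-\lambda}, \qquad P(y = k) = (1 - zi) \, \frac{\lambda^k e^{-\lambda}}{k!} \;\; \text{for } k = 1, 2, \ldots\]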

fit_zinb2 <- brm(bf(count ~ persons + child + camper, zi ~ child), 
                 data = zinb, family = zero_inflated_poisson())
plot(fit_zinb2, N = 6)
plot(marginal_effects(fit_zinb2), ask = FALSE)

To transform the linear predictor of zi into a probability, brms implicitly applies the logit-link:

\[\text{logit}(zi) = \log\left(\frac{zi}{1-zi}\right) = \eta_{zi}\]

The logit-link takes values within \((0, 1)\) and returns values on the real line. Thus, it allows the transition between probabilities and linear predictors.
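In R, the logit and its inverse are available as qlogis() and plogis(), which makes it easy to move between the two scales:

qlogis(0.41)  # logit of a 41% zero-inflation probability
plogis(0)     # a linear predictor of 0 corresponds to a probability of 0.5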

According to the model, trying to fish with children not only decreases the overall number of fish caught (as implied by the Poisson part of the model), but on top of that also drastically increases your chance of catching no fish at all (as implied by the zero-inflation part).

Additive Distributional Models

In the examples so far, we did not have multilevel data and thus did not use the full capabilities of the distributional regression framework of brms. In the example presented below, we will not only show how to deal with multilevel data in distributional models, but also how to incorporate smooth terms (i.e., splines) into the model. In many applications, we have no or only a very vague idea of what the relationship between a predictor and the response looks like. A very flexible approach to tackle this problem is to use splines and let them figure out the form of the relationship. For illustration purposes, we simulate some data with the mgcv package, which is also used in brms to prepare smooth terms.

library(mgcv)
dat_smooth <- gamSim(eg = 6, n = 200, scale = 2)

The data contains the predictors x0 to x3 as well as the grouping factor fac indicating the nested structure of the data. We predict the response variable y using smooth terms of x1 and x2 and a varying intercept of fac. In addition, we assume the residual standard deviation sigma to vary by a smooth term of x0 and a varying intercept of fac.

fit_smooth1 <- brm(bf(y ~ s(x1) + s(x2) + (1|fac), sigma ~ s(x0) + (1|fac)),
                   data = dat_smooth, family = gaussian(),
                   chains = 2, control = list(adapt_delta = 0.95))
plot(fit_smooth1, ask = FALSE)
plot(marginal_effects(fit_smooth1), points = TRUE, ask = FALSE)
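To check whether predicting sigma actually pays off for these data, one could compare the model against a constant-sigma counterpart, for instance via approximate leave-one-out cross-validation. A sketch:

# hypothetical comparison model with constant residual standard deviation
fit_smooth2 <- brm(y ~ s(x1) + s(x2) + (1|fac),
                   data = dat_smooth, family = gaussian(),
                   chains = 2, control = list(adapt_delta = 0.95))
loo(fit_smooth1, fit_smooth2)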

This model is likely overkill for the data at hand, but it nicely demonstrates the ease with which one can specify complex models with brms and fit them using Stan on the backend.