Consider the following probabilistic model \[\begin{align*} X & \sim Negbin(r,p) \notag \\ p & \sim Beta(a_0,b_0) \notag \\ r & \sim Unif(\{1,\dotsc,X\}) \notag \end{align*}\] \(X\) represents the number of trials necessary to obtain \(r\) successes in a sequence of independent and identically distributed Bernoulli trials, each with success probability \(p\). We can generate an arbitrary value for \(X\) and set the hyperparameters \(a_0\) and \(b_0\) as desired:
#observed data
X <- 50
#hyperparameters
a0 <- 10
b0 <- 10
#list of arguments
arglist <- list(data = X, hyp = c(a0,b0))
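To make the generative process explicit, \(X\) could also be simulated from the model rather than fixed; a minimal sketch follows, where p_true and r_true are purely illustrative values and X_sim is not used in the rest of the analysis.
#optional: simulate X from the generative model (illustrative only)
p_true <- rbeta(1, a0, b0)  #draw a success probability from the prior
r_true <- 5                 #an arbitrary number of successes
#rnbinom returns the number of failures, so add r_true to obtain the total trials
X_sim <- rnbinom(1, size = r_true, prob = p_true) + r_true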
Since \(p\) is a continuous parameter and \(r\) is a positive discrete one, classic Hamiltonian Monte Carlo algorithms do not apply directly, due to the lack of differentiability. To address this, we can use the method described by Nishimura, Dunson, and Lu (2020). First, we specify a continuous version of \(r\), denoted \(\tilde{r}\), defined on the open interval \((1,X+1)\). Next, consider the trivial sequence \(\{a_n\}_{n=1}^{X+1}: a_n = n\). To transfer the probability masses \(\mathbb{P}(r) = \frac{1}{X} \,,r = 1,\dotsc,X\), to the continuous interval from \(a_{r}\) to \(a_{r+1}\), we divide by the length of that interval. Using this approach and the indicator function, the probability density function of \(\tilde{r}\) is given by: \[ \pi_{\tilde{R}}(\tilde{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1}) \] Since the support of this parameter is bounded, it is preferable to apply a transformation that maps \(\tilde{r}\) to \(\mathbb{R}\). Any bijective function can be used for this purpose; for example, a scaled logit (inverse logistic) transformation serves this need: \[ \hat{r} = \log\left(\frac{\tilde{r} - 1}{ (X+1) - \tilde{r}}\right) \] The resulting discontinuous probability density can be derived from the cumulative distribution function; equivalently, it is obtained by multiplying by the Jacobian determinant of the inverse transformation. \[ \pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1}) \cdot \left( (X+1) - 1\right) \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right) \] where \(\tilde{r} = 1 + \frac{(X+1) - 1}{1 + e^{-\hat{r}}}\), and since \(\pi_R(r) = \frac{1}{X}\) the final result is: \[ \pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\mathcal{I}(a_r < \tilde{r} \leq a_{r+1})}{a_{r+1} - a_r} \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right) \]
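As a quick sanity check, the map from \(\hat{r}\) back to the discrete value \(r\) can be coded in a few lines; the helper below is only illustrative, and the same conversion appears inside the posterior function defined later.
#map hat_r (on the real line) to tilde_r in (1, X+1), then to the discrete r
r_from_rhat <- function(r_hat, X) {
  r_tilde <- 1 + X*plogis(r_hat)  #scaled inverse logit
  floor(r_tilde)                  #index of the interval containing tilde_r
}
r_from_rhat(c(-2, 0, 2), X)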
For the same reason, we can apply a similar transformation to map
the support of \(p\) from \((0,1)\) to \(\mathbb{R}\). Define: \[
\omega = \log\left(\frac{p}{1-p}\right)
\] which yields the following prior distribution for \(\omega\): \[
\pi(\omega) =
\left(\frac{1}{1+e^{-\omega}}\right)^{a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{b_0}
\] The resulting posterior distribution in this parameterization
is given by: \[
\pi(\omega,\hat{r}|X) \propto \binom{X-1}{r-1} \cdot
\left(\frac{1}{1+e^{-\omega}}\right)^{r +
a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{X - r + b_0 }
\cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)
\] Finally, it can easily be shown that the negative logarithm
of the posterior distribution and its derivative with respect to \(\omega\) are as follows: \[\begin{align*}
-\log\pi(\omega,\hat{r}|X) \propto -\displaystyle\sum_{i = X - r +
1}^{X - 1} \log(i) &+ \displaystyle\sum_{i = 1}^{r-1}\log(i) + (X +
a_0+b_0)\log\left(1+e^{-\omega}\right) \notag \\
&+ (X - r + b_0)\omega + \hat{r} + 2\log\left(1+e^{-\hat{r}}\right)
\notag
\end{align*}\] \[
\frac{\partial-\log\pi(\omega,\hat{r}|X)}{\partial\omega} = (X - r +
b_0) - (X+a_0+b_0)\frac{e^{-\omega}}{1+e^{-\omega}}
\] We can define an R function to evaluate these quantities. To be compatible with the package, the first argument should be the parameter values, the second should be the list of arguments necessary to evaluate the posterior, and the third should be a logical value: if TRUE, the function must return the negative logarithm of the posterior distribution; otherwise, it should return the gradient with respect to the continuous components, which in this case refers to the first element of the parameter vector.
#function for the negative log posterior and its gradient
#with respect to the continuous components
nlp <- function(par, args, eval_nlp = TRUE){
  if(eval_nlp){
    #only the negative log posterior
    #guard against overflow
    if(any(par > 30)) return(Inf)
    #conversion of the r value
    r <- floor(1 + args$data*plogis(par[2]))
    #output
    out <- sum(log(seq_len(r-1))) +
      (args$data + args$hyp[1] + args$hyp[2])*log(1+exp(-par[1])) +
      par[1]*(args$data - r + args$hyp[2]) + par[2] + 2*log(1+exp(-par[2]))
    #log numerator of the binomial coefficient (the sum is empty when r = 1)
    if(r > 1) out <- out - sum(log(seq(args$data - r + 1, args$data - 1)))
    return(out)
  }else{
    #only the gradient
    #guard against overflow
    if(any(par > 30)) return(Inf)
    #conversion of the r value
    r <- floor(1 + args$data*plogis(par[2]))
    #output
    return( (args$data - r + args$hyp[2]) -
              (args$data + args$hyp[1] + args$hyp[2])*(1 - plogis(par[1])) )
  }
}
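Before sampling, it may be useful to evaluate the function once at an arbitrary point to verify that it returns finite values; a brief check:
#quick check at an arbitrary point
par0 <- c(omega = 0, r_hat = 0)
nlp(par0, arglist, eval_nlp = TRUE)   #negative log posterior, a finite scalar
nlp(par0, arglist, eval_nlp = FALSE)  #gradient w.r.t. the continuous component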
To run the algorithm, use the xdnuts function. For instance, if you want to run 4 chains, you need to provide a list of 4 initial vectors as the first argument.
#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4, function(x) c(omega = rnorm(1), r_hat = rnorm(1))),
                 nlp = nlp,
                 args = arglist,
                 k = 1,
                 thin = 1,
                 parallel = FALSE,
                 method = "NUTS",
                 hide = TRUE)
After sampling, you can examine the results using the plot function. By default, it displays the trace plots for each marginal chain, which is useful for checking whether the chains are mixing well. To view the marginal and bivariate densities, specify type = 2.
Examining the marginal densities can help diagnose poor convergence of the chains, though this does not seem to be an issue here.
Another useful plot can be generated with the type = 3
argument. This option overlays the energy level Markov chains. While it
is less sensitive to convergence issues, it provides a summary of the
global Markov process in a single chain, making it especially valuable
for large models.
With type = 4, a stick plot of the trajectory lengths is
displayed, with data from different chains overlaid. This plot serves as
a good proxy for the computational cost of the procedure, as each step
requires evaluating both the negative logarithmic posterior and its
gradient multiple times.
With type = 5, the histogram of the marginal energy levels
(in gray) is overlaid with the histogram of the corresponding proposal
values (in red).
As with the classic Metropolis scheme, when the proposal distribution matches the target distribution well, there is effective exploration of the parameter space where the probability mass is highest. A poor match would indicate a suboptimal choice of the transition kernel, which in this case is determined solely by the choice of the momenta distribution. This would be reflected in a more leptokurtic red histogram compared to the gray histogram. For more details on this topic, see Betancourt (2016a).
With type = 6, the function plots the autocorrelation
function for each chain and parameter. This is an effective way to
diagnose slow exploration of the posterior distribution.
With type = 7, the function plots the empirical Metropolis
acceptance rate and the refraction rate of the discontinuous component
for each chain. This plot helps diagnose suboptimal adaptation of the
step-size.
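For reference, the plots described above can be produced with calls of the following form, where the default corresponds to the trace plots:
#diagnostic plots
plot(chains)           #trace plots (default)
plot(chains, type = 2) #marginal and bivariate densities
plot(chains, type = 3) #energy level chains
plot(chains, type = 4) #stick plot of the trajectory lengths
plot(chains, type = 5) #histograms of marginal energy levels and proposals
plot(chains, type = 6) #autocorrelation functions
plot(chains, type = 7) #acceptance and refraction rates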
Last but not least, we can summarize the chain’s output using the
summary function.
summary(chains)
#> mean sd 5% 25% 50% 75% 95% ESS
#> omega 0.09927 0.45341 -0.64615 -0.20095 0.09966 0.40515 0.83205 2812.841
#> r_hat 0.09384 0.55231 -0.79911 -0.26481 0.09345 0.45826 0.98466 2600.518
#> R_hat R_hat_upper_CI
#> omega 1.00257 1.00808
#> r_hat 1.00245 1.00712
#>
#> Multivariate Gelman Test: 1.00212
#> Estimated Bayesian Fraction of Missing Information: 0.79836
Quantities of interest for each marginal distribution are returned, with the last two columns referring to the Potential Scale Reduction Factor statistic from Gelman and Rubin (1992). Values greater than 1.1 indicate non-convergence of the chains. Below the table, the multivariate version of this test is printed to the console, along with the Bayesian Fraction of Missing Information from Betancourt (2016a). If the latter is less than 0.2, it suggests slow exploration of the energy distribution.
It is possible to reuse the same model adopted previously while treating the first parameter as inducing a discontinuity as well. In this case, you should set k = 2 in the xdnuts function. For teaching purposes, this time the algorithm with the exhaustion termination criterion from Betancourt (2016b) will be used, specified by setting method = "XHMC". In this scenario, you must specify a threshold value for this method. A reasonable but sub-optimal value for this threshold seems to be tau = 1.
#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4, function(x) rnorm(2)),
                 nlp = nlp,
                 args = arglist,
                 k = 2,
                 thin = 1,
                 parallel = FALSE,
                 method = "XHMC",
                 hide = TRUE,
                 tau = 1)
This time, we transform the chains back to their original parameterization using the xdtransform function:
#define the function to be applied to each sample
f <- function(x, XX) {
  c(
    plogis(x[1]),               #back-transformation for the probability
    floor(1 + XX*plogis(x[2]))  #number of successes
  )
}
original_chains <- xdtransform(X = chains, which = NULL,
                               FUN = f, XX = arglist$data,
                               new.names = c("p","r"))
To avoid repetitions, not all plots are displayed.
The imperfect overlap of the marginal densities suggests poorer chain convergence, primarily due to the increased autocorrelation caused by giving up the pure Hamiltonian treatment of the first parameter. Increasing the sample size to N = 5e3 or using a larger thinning interval, such as thin = 5, can help address this issue.
This behavior is typical of this type of termination criterion, especially when applied to a discontinuous Hamiltonian system like this one. For some trajectories, the algorithm may never reach termination, which significantly increases the computational cost and can lead the sampler to propose extreme values that might cause numerical issues. To avoid such problems, you can limit the depth of the binary-tree trajectory by setting control = set_parameters(max_treedepth = 5).
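A possible remedy combining these suggestions is sketched below (not run here); it assumes that the number of posterior draws is controlled by the N argument mentioned above.
#MCMC with more draws, heavier thinning, and a capped tree depth (sketch)
chains_long <- xdnuts(theta0 = lapply(1:4, function(x) rnorm(2)),
                      nlp = nlp,
                      args = arglist,
                      k = 2,
                      N = 5e3,  #assumed argument for the number of draws
                      thin = 5,
                      parallel = FALSE,
                      method = "XHMC",
                      hide = TRUE,
                      tau = 1,
                      control = set_parameters(max_treedepth = 5))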
As suggested by the first graph, the autocorrelations have increased, causing a slower convergence of the estimates.
summary(original_chains)
#> mean sd 5% 25% 50% 75% 95% ESS R_hat
#> p 0.56329 0.12356 0.34886 0.47866 0.56852 0.65195 0.7586 1256.005 1.00353
#> r 27.94750 7.35655 16.00000 23.00000 28.00000 33.00000 39.0000 1454.517 1.00320
#> R_hat_upper_CI
#> p 1.01146
#> r 1.01097
#>
#> Multivariate Gelman Test: 1.00372
#> Estimated Bayesian Fraction of Missing Information: 1.25657
It is noteworthy that the Effective Sample Size has decreased, as discussed by Geyer (2011). Additionally, the function can alert us to premature termination of trajectories, as the second graph suggests.
For this example, we can use a Gaussian hierarchical model applied to the blood viscosity dataset, which is available in the package:
data(viscosity)
viscosity
#> id Time.1 Time.2 Time.3 Time.4 Time.5 Time.6 Time.7
#> 1 1 68 42 69 64 39 66 29
#> 2 2 49 52 41 56 40 43 20
#> 3 3 41 40 26 33 42 27 35
#> 4 4 33 27 48 54 42 56 19
#> 5 5 40 45 50 41 37 34 42
#> 6 6 30 42 35 44 49 25 45
#create the argument list
arglist <- list(data = as.matrix(viscosity[,-1]),
                hyp = c(0,       #prior mean for mu
                        1000,    #prior variance for mu
                        0.5, 1,  #inverse gamma hyperparameters for sigma2
                        0.5, 1   #inverse gamma hyperparameters for sigma2_a
                )
)
This dataset contains 7 blood viscosity measurements for each of 6 different subjects. The model considered is as follows:
\[\begin{align*} y_{ij} &\sim N(a_j,\sigma^2) \quad, i = 1,\dotsc,I\,,j = 1,\dotsc, J \notag \\ a_j &\sim N(\mu,\sigma_a^2) \notag \\ \mu &\sim N(m_0,s_0^2) \notag \\ \sigma^2 &\sim InvGamma(a_0,b_0) \notag \\ \sigma_a^2 &\sim InvGamma(a_1,b_1) \notag \end{align*}\]
with \(I = 7\), \(J = 6\) and \(n = I\times J\).
The variance parameters are transformed using a logarithmic scale to facilitate the algorithm’s exploration of the parameter space: \(\omega = \log\sigma^2\) and \(\omega_a = \log\sigma^2_a\). In this case, only the final results are presented, as the intermediate calculations are not necessary.
\[\begin{align*} -\log\pi(\theta|y) \propto &\quad\omega(n + a_0) + 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2 \right] \notag \\ &+ \omega_a(J + a_1) + 0.5e^{-\omega_a}\left[2b_1 + \sum_{j=1}^J(a_j - \mu)^2\right] + \frac{\mu^2 - 2\mu m_0}{2s_0^2} \notag \end{align*}\]
\[\begin{align*} \frac{\partial-\log\pi(\theta|y)}{\partial \mu} &= \frac{\mu - m_0}{s_0^2} - e^{-\omega_a}\displaystyle\sum_{j=1}^J (a_j - \mu) \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial a_j} &= -e^{-\omega}\displaystyle\sum_{i=1}^I (y_{i,j} - a_j) + e^{-\omega_a}(a_j - \mu) \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial \omega} &= (n+a_0) - 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2\right] \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial \omega_a} &= (J+a_1) - 0.5e^{-\omega_a}\left[ 2b_1 + \displaystyle\sum_{j=1}^J (a_j - \mu)^2 \right] \end{align*}\]
These computations are summarized in the following R function:
nlp <- function(par, args, eval_nlp = TRUE){
  if(eval_nlp){
    #only the negative log posterior
    #guard against overflow
    if(any(abs(par[2:3]) > 30)) return(Inf)
    return(par[2] * ( prod(dim(args$data)) + args$hyp[3] ) +
             (sum( (args$data - par[-(1:3)])^2 ) +
                2*args$hyp[4])*exp(-par[2])/2 +
             par[3] * (length(par[-(1:3)]) + args$hyp[5]) +
             (sum( (par[-(1:3)] - par[1])^2 ) +
                2*args$hyp[6])*exp(-par[3])/2 +
             (par[1] - args$hyp[1])^2/2/args$hyp[2])
  }else{
    #only the gradient
    #guard against overflow
    if(any(abs(par[2:3]) > 30)) return(rep(Inf,9))
    c(
      #mu derivative
      -sum( par[-(1:3)] - par[1] )*exp(-par[3]) +
        (par[1] - args$hyp[1])/args$hyp[2],
      #omega derivative
      prod(dim(args$data)) + args$hyp[3] -
        (sum( (args$data - par[-(1:3)])^2 ) +
           2*args$hyp[4])*exp(-par[2])/2,
      #omega_a derivative
      length(par[-(1:3)]) + args$hyp[5] -
        (sum( (par[-(1:3)] - par[1])^2 ) +
           2*args$hyp[6])*exp(-par[3])/2,
      #random effects gradient
      -apply(args$data - par[-(1:3)], 1, sum)*exp(-par[2]) +
        (par[-(1:3)] - par[1])*exp(-par[3])
    )
  }
}
The derivation of the gradient is recommended for computational
efficiency; however, it is always possible to calculate it numerically
using methods like the numDeriv::grad
function.
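For instance, a numerical cross-check of the analytic gradient at an arbitrary point could look like the following sketch, assuming the numDeriv package is installed:
#compare the analytic gradient with a numerical approximation
par0 <- c(0, 0, 0, rnorm(6))  #mu, omega, omega_a and the 6 random effects
num_grad <- numDeriv::grad(function(p) nlp(p, arglist, eval_nlp = TRUE), par0)
max(abs(num_grad - nlp(par0, arglist, eval_nlp = FALSE)))  #should be near zero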
Since this is the only algorithm not yet explored, we will use the standard Hamiltonian Monte Carlo with a fixed trajectory length L. In practice, the trajectory length is sampled uniformly from the interval [max(1, L - L_jitter), L + L_jitter]. By default, L_jitter is set to 1, while L must be specified by the user. A sub-optimal but reasonable value for L is 20.
#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4, function(x) {
                   out <- rnorm(3 + 6)
                   names(out) <- c("mu","log_sigma2","log_sigma2_a",
                                   paste0("mu",1:6))
                   out}),
                 nlp = nlp,
                 args = arglist,
                 k = 0, #no discontinuous components
                 thin = 1,
                 parallel = FALSE,
                 method = "HMC",
                 hide = TRUE,
                 L = 20,
                 control = set_parameters(L_jitter = 5))
We can transform the variance components back from their logarithmic parameterization using the xdtransform function. This time, only these components should be selected.
original_chains <- xdtransform(X = chains, which = 2:3,
                               FUN = exp, new.names = c("sigma2","sigma2_a"))
Since the parameter dimension is relatively high, we can examine the chain of energy level sets for a mixing diagnostic.
Although in a Bayesian context the only difference between the two concerns the informativeness of their priors, we can plot the densities of the fixed effects and of the random effects separately.
Next, it is useful to check the autocorrelation plots for each chain:
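These diagnostics can be obtained with the same plot types introduced earlier; a brief sketch:
#energy levels, marginal densities, and autocorrelation functions
plot(original_chains, type = 3)
plot(original_chains, type = 2)
plot(original_chains, type = 6)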
As expected from theory (Gelfand, Sahu, and Carlin 1995), the centered parameterization of the model makes exploring the variance component of the random effects challenging, which is reflected in the strong autocorrelation of their chains.
Finally, the summary function:
summary(original_chains)
#> mean sd 5% 25% 50% 75% 95%
#> mu 41.82778 1.33343 39.61992 40.94063 41.81027 42.70707 44.06480
#> sigma2 71.10762 12.36267 53.97771 62.69590 69.50171 77.68585 93.14380
#> sigma2_a 0.77694 0.69979 0.22426 0.36124 0.53629 0.93777 2.15117
#> mu1 42.68049 1.60977 40.24659 41.54516 42.63272 43.67899 45.45473
#> mu2 41.92201 1.47945 39.47879 40.94731 41.92428 42.89715 44.38296
#> mu3 41.34194 1.52379 38.87303 40.33515 41.33641 42.37289 43.78504
#> mu4 41.62059 1.48213 39.14401 40.68174 41.61924 42.61911 44.02651
#> mu5 41.77944 1.49325 39.42050 40.76230 41.74429 42.77138 44.29616
#> mu6 41.54661 1.49635 39.08980 40.53589 41.54691 42.54118 44.05494
#> ESS R_hat R_hat_upper_CI
#> mu 1072.9918 1.02082 1.04927
#> sigma2 412.5059 1.02488 1.06126
#> sigma2_a 216.0952 1.23888 1.70282
#> mu1 439.9831 1.03900 1.10723
#> mu2 940.9661 1.01616 1.03492
#> mu3 708.7164 1.03525 1.07290
#> mu4 841.0502 1.01340 1.03287
#> mu5 993.2040 1.02011 1.04982
#> mu6 855.8656 1.01754 1.04256
#>
#> Multivariate Gelman Test: 1.12237
#> Estimated Bayesian Fraction of Missing Information: 1.5032
By using the xdextract
function, you can rearrange the
chains into a more convenient format, such as an array or a matrix. This
is useful for computing posterior quantities, including model
predictions.
#extract samples into a matrix
theta <- xdextract(original_chains, collapse = TRUE)
#compute the posterior predictive distribution for each subject
y_hat <- sapply(1:6, function(i){
  rnorm(NROW(theta), theta[,3+i], sqrt(theta[,2]))
})
#plot predictions against the observed data
op <- par(no.readonly = TRUE)
par(mar = c(5.1, 4.1, 2.1, 4.1), xpd = TRUE)
plot(NULL, xlim = c(1,6), ylim = c(15,85), xlab = "Subject",
     ylab = "Viscosity", bty = "L")
for(i in 1:6){
  #observed data
  points(rep(i,7), as.numeric(viscosity[i,-1]), pch = 16)
  #95% credible intervals for the prediction
  lines(rep(i,2), quantile(y_hat[,i], c(0.025,0.975)), col = 3, lwd = 3)
  #95% credible intervals for the random effects
  lines(rep(i,2), quantile(theta[,3+i], c(0.025,0.975)), col = 4, lwd = 4)
}
legend("topright", inset = c(-0.2,-0.2), lty = 1, lwd = 2, col = c(3,4),
       title = "95% Credible Intervals",
       legend = c("prediction","random effects"),
       bty = "n", bg = "transparent", cex = 0.8)
#restore the original graphical parameters
par(op)
The investigation of the first subject, which appears to be an outlier, is beyond the scope of this analysis. However, the hierarchical model’s property of borrowing strength across subjects seems to be effective.