# N-mixtures in mvgam

#### 2024-04-16

The purpose of this vignette is to show how the mvgam package can be used to fit and interrogate N-mixture models for population abundance counts made with imperfect detection.

## N-mixture models

An N-mixture model is a fairly recent addition to the ecological modeller’s toolkit that is designed to make inferences about variation in the abundance of species when observations are imperfect (Royle 2004). Briefly, assume $$\boldsymbol{Y_{i,r}}$$ is the number of individuals recorded at site $$i$$ during replicate sampling observation $$r$$ (recorded as a non-negative integer). If multiple replicate surveys are done within a short enough period to satisfy the assumption that the population remained closed (i.e. there was no substantial change in true population size between replicate surveys), we can account for the fact that observations aren’t perfect. This is done by assuming that these replicate observations are Binomial random variables that are parameterized by the true “latent” abundance $$N$$ and a detection probability $$p$$:

\begin{align*} \boldsymbol{Y_{i,r}} & \sim \text{Binomial}(N_i, p_r) \\ N_{i} & \sim \text{Poisson}(\lambda_i) \end{align*}

Using a set of linear predictors, we can estimate effects of covariates $$\boldsymbol{X}$$ on the expected latent abundance (with a log link for $$\lambda$$) and, jointly, effects of possibly different covariates (call them $$\boldsymbol{Q}$$) on detection probability (with a logit link for $$p$$):

\begin{align*} log(\lambda) & = \beta \boldsymbol{X} \\ logit(p) & = \gamma \boldsymbol{Q}\end{align*}

mvgam can handle this type of model because it is designed to propagate unobserved temporal processes that evolve independently of the observation process in a State-space format. This setup adapts well to N-mixture models because they can be thought of as State-space models in which the latent state is a discrete variable representing the “true” but unknown population size. This is very convenient because we can incorporate any of the package’s diverse effect types (i.e. multidimensional splines, time-varying effects, monotonic effects, random effects etc…) into the linear predictors. All that is required for this to work is a marginalization trick that allows Stan’s sampling algorithms to handle discrete parameters (see more about how this method of “integrating out” discrete parameters works in this nice blog post by Maxwell Joseph).

The family nmix() is used to set up N-mixture models in mvgam, but we still need to do a little bit of data wrangling to ensure the data are set up in the correct format (this is especially true when we have more than one replicate survey per time period). The most important aspects are: (1) how we set up the observation series and trend_map arguments to ensure replicate surveys are mapped to the correct latent abundance model and (2) the inclusion of a cap variable that defines the maximum possible integer value to use for each observation when estimating latent abundance. The two examples below give a reasonable overview of how this can be done.

## Example 2: a larger survey with possible nonlinear effects

Now for another example with a larger dataset. We will use data from Jeff Doser’s simulation example from the wonderful spAbundance package. The simulated data include one continuous site-level covariate, one factor site-level covariate and two continuous sample-level covariates. This example will allow us to examine how we can include possibly nonlinear effects in the latent process and detection probability models.

Download the data and grab observations / covariate measurements for one species

# Date link
data.one.sp <- dataNMixSim

# Pull out observations for one species
data.one.sp$y <- data.one.sp$y[1, , ]

# Abundance covariates that don't change across repeat sampling observations
abund.cov <- dataNMixSim$abund.covs[, 1] abund.factor <- as.factor(dataNMixSim$abund.covs[, 2])

# Detection covariates that can change across repeat sampling observations
# Note that NAs are not allowed for covariates in mvgam, so we randomly
# impute them here
det.cov <- dataNMixSim$det.covs$det.cov.1[,]
det.cov[is.na(det.cov)] <- rnorm(length(which(is.na(det.cov))))
det.cov2 <- dataNMixSim$det.covs$det.cov.2
det.cov2[is.na(det.cov2)] <- rnorm(length(which(is.na(det.cov2))))

Next we wrangle into the appropriate ‘long’ data format, adding indicators of time and series for working in mvgam. We also add the cap variable to represent the maximum latent N to marginalize over for each observation

mod_data <- do.call(rbind,
lapply(1:NROW(data.one.sp$y), function(x){ data.frame(y = data.one.sp$y[x,],
abund_cov = abund.cov[x],
abund_fac = abund.factor[x],
det_cov = det.cov[x,],
det_cov2 = det.cov2[x,],
replicate = 1:NCOL(data.one.sp$y), site = paste0('site', x)) })) %>% dplyr::mutate(species = 'sp_1', series = as.factor(paste0(site, '_', species, '_', replicate))) %>% dplyr::mutate(site = factor(site, levels = unique(site)), species = factor(species, levels = unique(species)), time = 1, cap = max(data.one.sp$y, na.rm = TRUE) + 20)

The data include observations for 225 sites with three replicates per site, though some observations are missing

NROW(mod_data)
#> [1] 675
dplyr::glimpse(mod_data)
#> Rows: 675
#> Columns: 11
#> $y <int> 1, NA, NA, NA, 2, 2, NA, 1, NA, NA, 0, 1, 0, 0, 0, 0, NA, NA… #>$ abund_cov <dbl> -0.3734384, -0.3734384, -0.3734384, 0.7064305, 0.7064305, 0.…
#> $abund_fac <fct> 3, 3, 3, 4, 4, 4, 9, 9, 9, 2, 2, 2, 3, 3, 3, 2, 2, 2, 1, 1, … #>$ det_cov   <dbl> -1.28279990, -0.08474811, 0.44789392, 1.71731815, 0.19548086…
#> $det_cov2 <dbl> 2.03047314, -1.42459158, 1.68497337, 0.75026787, 1.04555361,… #>$ replicate <int> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, …
#> $site <fct> site1, site1, site1, site2, site2, site2, site3, site3, site… #>$ species   <fct> sp_1, sp_1, sp_1, sp_1, sp_1, sp_1, sp_1, sp_1, sp_1, sp_1, …
#> $series <fct> site1_sp_1_1, site1_sp_1_2, site1_sp_1_3, site2_sp_1_1, site… #>$ time      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> \$ cap       <dbl> 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, …
#>    y  abund_cov abund_fac     det_cov   det_cov2 replicate  site species
#> 1  1 -0.3734384         3 -1.28279990  2.0304731         1 site1    sp_1
#> 2 NA -0.3734384         3 -0.08474811 -1.4245916         2 site1    sp_1
#> 3 NA -0.3734384         3  0.44789392  1.6849734         3 site1    sp_1
#> 4 NA  0.7064305         4  1.71731815  0.7502679         1 site2    sp_1
#> 5  2  0.7064305         4  0.19548086  1.0455536         2 site2    sp_1
#> 6  2  0.7064305         4  0.96730338  1.9197118         3 site2    sp_1
#>         series time cap
#> 1 site1_sp_1_1    1  33
#> 2 site1_sp_1_2    1  33
#> 3 site1_sp_1_3    1  33
#> 4 site2_sp_1_1    1  33
#> 5 site2_sp_1_2    1  33
#> 6 site2_sp_1_3    1  33

The final step for data preparation is of course the trend_map, which sets up the mapping between observation replicates and the latent abundance models. This is done in the same way as in the example above

mod_data %>%
# each unique combination of site*species is a separate process
dplyr::mutate(trend = as.numeric(factor(paste0(site, species)))) %>%
dplyr::select(trend, series) %>%
dplyr::distinct() -> trend_map

trend_map %>%
dplyr::arrange(trend) %>%
#>    trend         series
#> 1      1 site100_sp_1_1
#> 2      1 site100_sp_1_2
#> 3      1 site100_sp_1_3
#> 4      2 site101_sp_1_1
#> 5      2 site101_sp_1_2
#> 6      2 site101_sp_1_3
#> 7      3 site102_sp_1_1
#> 8      3 site102_sp_1_2
#> 9      3 site102_sp_1_3
#> 10     4 site103_sp_1_1
#> 11     4 site103_sp_1_2
#> 12     4 site103_sp_1_3

Now we are ready to fit a model using mvgam(). Here we will use penalized splines for each of the continuous covariate effects to detect possible nonlinear associations. We also showcase how mvgam can make use of the different approximation algorithms available in Stan by using the meanfield variational Bayes approximator (this reduces computation time to around 12 seconds for this example)

mod <- mvgam(
# effects of covariates on detection probability;
# here we use penalized splines for both continuous covariates
formula = y ~ s(det_cov, k = 4) + s(det_cov2, k = 4),

# effects of the covariates on latent abundance;
# here we use a penalized spline for the continuous covariate and
# hierarchical intercepts for the factor covariate
trend_formula = ~ s(abund_cov, k = 4) +
s(abund_fac, bs = 're'),

# link multiple observations to each site
trend_map = trend_map,

# nmix() family and supplied data
family = nmix(),
data = mod_data,

# standard normal priors on key regression parameters
priors = c(prior(std_normal(), class = 'b'),
prior(std_normal(), class = 'Intercept'),
prior(std_normal(), class = 'Intercept_trend'),
prior(std_normal(), class = 'sigma_raw_trend')),

# use Stan's variational inference for quicker results
algorithm = 'meanfield',

# no need to compute "series-level" residuals
residuals = FALSE,
samples = 1000)

Inspect the model summary but don’t bother looking at estimates for all individual spline coefficients. Notice how we no longer receive information on convergence because we did not use MCMC sampling for this model

summary(mod, include_betas = FALSE)
#> GAM observation formula:
#> y ~ s(det_cov, k = 3) + s(det_cov2, k = 3)
#>
#> GAM process formula:
#> ~s(abund_cov, k = 3) + s(abund_fac, bs = "re")
#>
#> Family:
#> nmix
#>
#> log
#>
#> Trend model:
#> None
#>
#> N process models:
#> 225
#>
#> N series:
#> 675
#>
#> N timepoints:
#> 1
#>
#> Status:
#> Fitted using Stan
#> 1 chains, each with iter = 1000; warmup = ; thin = 1
#> Total post-warmup draws = 1000
#>
#>
#> GAM observation model coefficient (beta) estimates:
#>              2.5% 50% 97.5% Rhat n.eff
#> (Intercept) 0.052 0.4  0.71  NaN   NaN
#>
#> Approximate significance of GAM observation smooths:
#>              edf Ref.df Chi.sq p-value
#> s(det_cov)  1.22      2   52.3  0.0011 **
#> s(det_cov2) 1.07      2  307.1  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> GAM process model coefficient (beta) estimates:
#>                    2.5%    50% 97.5% Rhat n.eff
#> (Intercept)_trend -0.25 -0.081 0.079  NaN   NaN
#>
#> GAM process model group-level estimates:
#>                           2.5%    50% 97.5% Rhat n.eff
#> mean(s(abund_fac))_trend -0.18 0.0038  0.19  NaN   NaN
#> sd(s(abund_fac))_trend    0.26 0.3900  0.56  NaN   NaN
#>
#> Approximate significance of GAM process smooths:
#>               edf Ref.df    F p-value
#> s(abund_cov) 1.19      2 2.38   0.299
#> s(abund_fac) 8.82     10 2.79   0.025 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Posterior approximation used: no diagnostics to compute

Again we can make use of marginaleffects support for interrogating the model through targeted predictions. First, we can inspect the estimated average detection probability

avg_predictions(mod, type = 'detection')
#>
#>  Estimate 2.5 % 97.5 %
#>     0.579  0.51  0.644
#>
#> Columns: estimate, conf.low, conf.high
#> Type:  detection

Next investigate estimated effects of covariates on latent abundance using the conditional_effects() function and specifying type = 'link'; this will return plots on the expectation scale

abund_plots <- plot(conditional_effects(mod,
effects = c('abund_cov',
'abund_fac')),
plot = FALSE)

The effect of the continuous covariate on expected latent abundance

abund_plots[[1]] +
ylab('Expected latent abundance')

The effect of the factor covariate on expected latent abundance, estimated as a hierarchical random effect

abund_plots[[2]] +
ylab('Expected latent abundance')

Now we can investigate estimated effects of covariates on detection probability using type = 'detection'

det_plots <- plot(conditional_effects(mod,
type = 'detection',
effects = c('det_cov',
'det_cov2')),
plot = FALSE)

The covariate smooths were estimated to be somewhat nonlinear on the logit scale according to the model summary (based on their approximate significances). But inspecting conditional effects of each covariate on the probability scale is more intuitive and useful

det_plots[[1]] +
ylab('Pr(detection)')

det_plots[[2]] +
ylab('Pr(detection)')

More targeted predictions are also easy with marginaleffects support. For example, we can ask: How does detection probability change as we change both detection covariates?

fivenum_round = function(x)round(fivenum(x, na.rm = TRUE), 2)

plot_predictions(mod,
newdata = datagrid(det_cov = unique,
det_cov2 = fivenum_round),
by = c('det_cov', 'det_cov2'),
type = 'detection') +
theme_classic() +
ylab('Pr(detection)')

The model has found support for some important covariate effects, but of course we’d want to interrogate how well the model predicts and think about possible spatial effects to capture unmodelled variation in latent abundance (which can easily be incorporated into both linear predictors using spatial smooths).

I’m actively seeking PhD students and other researchers to work in the areas of ecological forecasting, multivariate model evaluation and development of mvgam. Please reach out if you are interested (n.clark’at’uq.edu.au)