The Chain Binomial model for infectious disease spread is especially suitable for modelling of small outbreaks, such as outbreaks in households. This package contains tools for analyzing data using the Chain Binomial model.

The chain Binomial model has a single parameter, which is the the secondary attack rate (SAR). The household secondary attack rate is defined as the probability that an infected household member infects a susceptible household member.

This package contains functions related to the Chain Binomial probability distribution, as well as functions for estimating the SAR parameter and regression modelling relating the SAR to predictive factors. To get started the package need to be loaded as usual:

The Chain Binomial model is a simple model for the spread of a infectious disease in a closed population, such as a household. An infectious disease is introduced into the household at time point 0 by one or more primary cases, which is denoted as \(I_0\). The disease then spreads among the remaining \(S_0\) household members, who are all susceptible to the disease, in discrete time steps called generations. In each generation 0 or more of the remaining susceptibles become infected, and the infected individuals from the previous generation is considered recovered and immune, and does not anymore contribute to the spread of the disease. The number of new infections in generation \(g+1\) is modeled as a binomial model that depends on the number of infected \(I_g\) and the number of remaining susceptibles \(S_g\), in addition to the SAR.

\[ P(I_{g+1} | I_{g}, S_g. \theta) = \binom{S_g}{I_{g+1}} \pi_{g+1}^{I_{g+1}} (1-\pi_{g+1})^{S_g - I_{g+1}} \]

where

\[ \pi_{g+1} = 1 - (1-\theta)^{I_{g}} \] is the per-person risk for getting infected, and depends on the the secondary attack rate (denoted \(\theta\)) and the number of infected individuals in generation \(g\). This is the binomial probability of getting at least one “success” in \(I_g\) binomial trials with probability \(\theta\). This model is also referred to as the Reed-Frost model.

The point of the `chainbinomial`

package is not to analyze
the number of new infections in each generation since that is already
possible using the `glm`

function already included in R. A
tutorial for doing this can be found by typing
`vignette('chain_glm', package = "chainbinomial")`

. Instead,
the goal of this package is to analyze the final size of the outbreaks,
that is, the final counts of the number of infected in each
outbreak.

The Chain binomial probability of getting an outbreak of size \(I = x\) in a household of \(S_0\) initially susceptible individuals, with \(I_0\) introductory cases is

\[ P(I = x; S_0, I_0, \theta) = \binom{S_0}{x} P(x,x) (1-\theta)^{S_0 - x + I_0} (1-\theta)^{x(S_0 - x)} \] Notice that this formula is recursive, as it also depends on \(P(x,x)\), which is given as

\[ P(I = x, S_0 = x) = 1 - \sum_{j=1}^{x-1} P(I = j, S_0 = x) \] The recursion bottoms out at

\[ P(I=1, S_0 = 1; I_0) = 1 - P(I=0, S_0 = 1; I_0) = 1 - (1-\theta)^{I_0} \]

This formula was derived by Ludwig (1975). The
`dchainbinom`

function can be used to compute these
probabilities, and it works similarly to other discrete probability mass
functions in R such as the `dbinom`

and
`dpois`

.

Consider a household with 4 persons. A single household member becomes infected by a contagious disease outside of the household, and the other 3 household members are susceptible to the disease. Assuming a secondary attack rate of 0.23, we can compute the probability that 2 of the 3 susceptible household members becomes infected as follows:

We can also compute the entire final size distribution

Suppose instead that 2 of the 4 household members were infected simultaneously outside of the household. Then \(I_0 = 2\). We can again compute the final size distribution. Note that the number of initial susceptible household members \(S_0\) is now 2.

Now suppose that we don’t have observed the entire outbreak, but have observed the outbreak for a time corresponding to two generations. There is no simple formula for the probability of an outbreak after a given number of generations, but it is possible to compute it by considering all possible scenarios that lead to the desired number of cases, see Lindstrøm et al. (2024).

The entire probability distribution after 2 generations can be
computed using the `generations`

argument. By default the
`generations`

argument is Inf, meaning that the outbreak is
assumed to be completely observed.

The data we will look at comes from a study of of the common cold in
66 families, all of which consisted of a mother, father, and three
children (Brimblecombe et al., 1958). In total 664 outbreaks were
recorded in these families over a period of 1 and a half year. The data
was analyzed by Heasman and Reid in a 1961 paper, where each infection
in an outbreak was classified according to who the index case was. The
data is included in the package as
`heasman_reid_1961_intro_case_status`

and contains the data
from Table II in the 1961 paper.

Lets take a look at the data:

```
heasman_reid_1961_intro_case_status
#> furter_cases father mother school_child pre_school_child
#> 1 0 53 75 148 147
#> 2 1 31 25 77 66
#> 3 2 4 4 22 9
#> 4 3 0 1 2 0
#> 5 4 0 0 0 0
```

The table counts the number of outbreaks that falls into each category (type of index case by number of infected). For analysis we need to make the data into a suitable long format, with one row for each outbreak.

```
library(dplyr)
library(tidyr)
heasman_reid_1961_intro_case_status %>%
pivot_longer(cols = -1,
names_to = 'intro_case',
values_to = 'N') %>%
uncount(weights = N) -> intro_case_status_long
head(intro_case_status_long)
#> # A tibble: 6 × 2
#> furter_cases intro_case
#> <int> <chr>
#> 1 0 father
#> 2 0 father
#> 3 0 father
#> 4 0 father
#> 5 0 father
#> 6 0 father
```

For the purpose of illustration, we will only estimate the SAR for
outbreaks where the fathers were the index case. We can estimate the SAR
using the `estimate_sar`

function. We need to give it the
number of infected and `s0`

as input. `s0`

will be
4 in this case, since number of index cases is always 1 and all families
are of size 5. We can also give the arguments `i0`

and
`generations`

as in the `dchainbinom`

function,
but the default values are the correct ones in this case
(`i0 = 1`

and `generations = Inf`

).

```
intro_case_status_long %>%
filter(intro_case == 'father') -> intro_case_status_long_fathers
sar_est <- estimate_sar(infected = intro_case_status_long_fathers$furter_cases, s0 = 4)
sar_est$sar_hat
#> [1] 0.08412651
```

Now let us take a look at the point estimate:

We can also compute the confidence intervals:

With an estimate of the SAR in families where the primary case was
the father, a natural question would be what the SAR is when other
family members are the primary case. The `cbmod`

function let
us do a regression analysis similar to the `glm`

function,
with predictors for the SAR. The `cbmod`

function does not
implement the formula interface as the `glm`

function does,
but you can use the `model.matrix`

function instead.

Note that `s0`

, `i0`

, and
`generations`

should not be thought of as predictor variables
in the traditional sense and should in general not be included in the X
matrix.

You can also specify the link function to be used. Here we specify
the `identity`

link, which gives coefficients that are easy
to interpret. Other options are `log`

, `logit`

,
and `cloglog`

. The `identity`

link might not be
suitable if there are more than one predictor or the predictor is
numerical instead of categorical.

```
xmat <- model.matrix(~ intro_case, data = intro_case_status_long)
cbmod_res <- cbmod(y = intro_case_status_long$furter_cases,
s0 = rep(4, nrow(intro_case_status_long)),
x = xmat,
i0 = 1,
link = 'identity')
summary(cbmod_res)
#> Chain Binomial model with identity link.
#> Model successfully fitted in 2.37 seconds
#>
#> Model log-likelihood: -582.5
#> Null log-likelihood: -584.9
#> Chisq (df = 3): 4.808
#> p-value: 0.186
#>
#> Coefficients:
#> Estimate Std. Error P-value
#> (Intercept) 0.084 0.013 0.000
#> intro_casemother -0.015 0.017 0.388
#> intro_casepre_school_child -0.010 0.015 0.522
#> intro_caseschool_child 0.011 0.015 0.480
```

The output from the summary function gives you the coefficients and the associated standard errors and p-values of the null hypothesis that the coefficient is 0. Above the table of coefficients there are also an omnibus test of the entire model that tests if the model has better fit than a model with intercept-only (the null model).

Confidence intervals for all the coefficients can also be computed:

```
confint(cbmod_res)
#> 2.5 % 97.5 %
#> (Intercept) 0.05886205 0.10939728
#> intro_casemother -0.04812925 0.01871199
#> intro_casepre_school_child -0.03919720 0.01990321
#> intro_caseschool_child -0.01903459 0.04045655
```

`predict`

, `vcov`

, `coef`

,
`tidy`

, and `glance`

methods are also available
for `cbmod`

objects.

Ludwig, D. (1975) Final Size Distributions for Epidemics. Mathematical Biosciences, 23, 33-46. https://doi.org/10.1016/0025-5564(75)90119-4

Lindstrøm JC, et al (2024) Estimating the household secondary attack rate with the Incomplete Chain Binomial model. https://doi.org/10.48550/arXiv.2403.03948

Brimblecombe et al. (1958) Family Studies of Respiratory Infections

Heasman, M.A. and Reid, D.D. (1961) Theory and Observation in Family Epidemics of the Common Cold