R/mbmixture is an R package for evaluating whether a microbiome sample is the mixture of two source samples. We are thinking of shotgun sequencing data on the microbiome sample plus dense SNP genotype data on the two potential source samples. We assume that the data has been reduced to a three-dimensional array of read counts: the 3 possible SNP genotypes for the first sample × the 3 possible SNP genotypes of the second sample × the 2 possible SNP alleles on the reads.

The data are very specific, and it fits a very specific model.

It comes with a single example data set, `mbmixdata`

.
Let’s load the package and the data.

```
library(mbmixture)
data(mbmixdata)
```

The data set is a three-dimensional array, 3×3×2, with read counts at
SNPs broken down by the SNP genotype for the first sample, the SNP
genotype for the second sample, and the SNP allele in the read. It seeks
to fit a multinomial model with two parameters: *p* is the
“contaminant probability” (the proportion of the microbiome reads coming
from the second sample), and *e* is the sequencing error
rate.

Here’s a view of the data:

` mbmixdata`

```
## , , A
##
## AA AB BB
## AA 3684249 977787 49243
## AB 1293728 545460 36227
## BB 219492 146964 1058
##
## , , B
##
## AA AB BB
## AA 10159 554641 134408
## AB 185128 519155 232498
## BB 72721 231074 202412
```

The function `mbmix_loglik`

is not generally called by the
user, but calculates the log likelihood for particular values of
*p* and *e*.

`mbmix_loglik(mbmixdata, p=0.74, e=0.002)`

`## [1] -3006664`

The function `mle_pe`

is the key one. It obtains the MLEs
of *p* and *e*.

`<- mle_pe(mbmixdata)) (mle `

```
## p e loglik lrt_p0
## 0.744841169 0.002842999 -3005967.168629832 3611308.879459433
## lrt_p1
## 820483.428998603
```

So for these data, the estimate of *p* is 0.74, meaning that
74% of the microbiome sample came from the second sample (and 26% came
from the first sample). The estimate of the sequencing error rate is
0.003.

The `mle_pe()`

function also returns likelihood ratio test
statistics (twice the natural log of the likelihood ratio) for tests of
*p=0* and *p=1*. Here, they’re both extremely large.

If you use the argument `SE=TRUE`

, you’ll get estimated
standard errors as an attribute.

`<- mle_pe(mbmixdata, SE=TRUE)) (mle_w_SE `

```
## p e loglik lrt_p0
## 0.744841169 0.002842999 -3005967.168629832 3611308.879459433
## lrt_p1
## 820483.428998603
## attr(,"SE")
## p e
## 0.00034927110 0.00002675341
```

You can grab the SEs using the `attr()`

function.

`attr(mle_w_SE, "SE")`

```
## p e
## 0.00034927110 0.00002675341
```

There’s also a function to derive estimated standard errors via a parametric bootstrap.

`<- bootstrapSE(mbmixdata, 1000) bootstrap_se `

```
## p err
## 0.00035136095 0.00002723048
```

There are also functions to calculate the MLE of *p* for a
fixed value for *e*, and vice versa. They return the estimate,
with the log likelihood as an attribute.

`mle_p(mbmixdata, e=0.002)`

```
## [1] 0.7442465
## attr(,"loglik")
## [1] -3006590
```

`mle_e(mbmixdata, p=0.74)`

```
## [1] 0.002823757
## attr(,"loglik")
## [1] -3006063
```