Abstract
The (m,n)-mer is a new statistical feature based upon conditional frequencies (conditional probability density distributions). Here, we introduce the readNumFASTA and mnmer function, demonstrating its practical application in supervised classification.
The (m,n)-mer is a k-mer variant that expresses conditional probabilities instead of unconditional ones (Figure 1). Our proposal suggests using the relative frequency of an n-mer conditioned to its preceding m-mer, rather than the unconditional k-mer frequency. When m+n=k, this new feature has an identical size and range to k-mer.
Here, we consider k-mers as an particular case of (m,n)-mers, where k-mers are specifically (0,k)-mers (Figure 2). All other (m,n)-mers that share the same size and range as a k-mer can be categorized as follows: (1,k-1), (2,k-2), up to (k-1,1). For instance, the following three m-n mers: (1,3)-mer, (2,2)-mer and (3, 1)-mer all have the same characteristics as their related k-mer, which is in this case the (0,4)-mer.
For more methodological details and feature performance comparison in binary, multiclass and clustering classifications, please see Andrade et al., 2022 (in press).
The R package is used to summarize biological data by generating a table that shows the conditional frequency distribution of selected (m,n)-mers in FASTA files. This is an alternative to using k-mers for numerical characterization. By combining this output with class information, a feature matrix suitable for classification is created.
The output table (Figure 4) includes the FASTA file accession numbers as an ID column, the relative frequency of (m,n)-mers up to \(4.k\) columns, and class information.
The package needs R(>= 4.2.0), Biostrings(>= 3.1) and Utils(>= 2.0.0).
The user should install the package from the GitHub repository. It can be done by using the package.
library(devtools)
install_github("labinfo-lncc/mnmer", ref="main")
This function employs Biostrings for reading FASTA files into the R system. It enables users to limit the number of sequences loaded, select sequences at random, and set a non-ACTG base cutoff percentage.
The parameters are:
FASTAfile
= It could be a multiFASTA.
size
= Number of sequences to be loaded.
rand
= Select sequences randomly or not. Set TRUE or
FALSE
pni
= Percentage of non-ACTG (default = 0.20)
As default, all sequences more than 20% of N + IUPAC bases will be
removed from further analysis given the little informative nature of
those bases. In that case, the mnmer
function prints the
following warning:
## [1] "Warning: Sequence has a proportion of N + IUPAC bases = 30%"
If the user would like to accept these sequences, the
pni
parameter should be set to 0.00
The readNumFASTA
function returns an DNAStringSet data
structure, used by the mnmer
in further analysis. To learn
more about DNAStringSet please check Biostrings documentation.
The main function of this package is the mnmer
function.
It creates dataframes with the conditional probability. By invoking the
function cmmer
from the C++ script, this function can
create both k-mers and (m,n)-mers.
The parameters receives:
seqs
= DNAStringSet object
m
= Value of k for k-mer generation. Needs to be
different from zero.
n
= Value of m for (m,n)-mer generation in the format of
(m, k-m). In case of k-mer generation, m should be zero as (0,k).
Assume we need to classify viruses that infect human from viruses that infect plants. The corresponding FASTA files can be found in the extdata folder.
After package installation, the user should run:
library("mnmer")
dir <-system.file("extdata", package="mnmer")
From the readNumFASTA
function we load biological
sequences into R system as an DNAStringSet structure. In our practical
example, we would like to load 1,000 random sequences per FASTA. These
sequences must have less than 50% of non-ACTG bases. We should run:
human <-readNumFASTA((file.path(dir, "human_vir.fasta")), 10,TRUE,0.50)
plant <-readNumFASTA((file.path(dir, "plant_vir.fasta")), 10,TRUE,0.50)
For k-mer generation, the parameter k is set to choice, while the parameter m is set to zero. Given that the k-mers have been conditioned to zero bases.
human_02mer <- mnmer(human,2,0)
plant_02mer <- mnmer(plant,2,0)
The user specifies the k and m parameters for (mn)-mer generation.
For example, k = 2 and m = 1 produce the (1,1)-mer, in which one base is conditioned on the frequency of the base before it. Bases other than A, C, T, and G were disregarded.
human_21mer <- mnmer(human,2,1)
plant_21mer <- mnmer(plant,2,1)
Here, we utilize the (1,1)-mer feature matrices generated by the mnmer to run an classification using Caret.
Caret (https://topepo.github.io/caret/) is a library of functions for building predictive models from R systems. We utilized the createDataPartition, trainControl, and train functions in this example. The createDataPartition method splits the feature matrix and creates the train and test datasets. The trainControl function generates parameters that further regulate how the train function creates models.
To execute the example, enter the following code:
library(caret)
# Add class information
classes <- replicate(nrow(human_21mer), "human.vir")
featureMatrix_human_21mer <- cbind(human_21mer,classes)
classes <- replicate(nrow(plant_21mer), "plant.vir")
featureMatrix_plant_21mer <- cbind(plant_21mer,classes)
featureMatrix <- rbind(featureMatrix_human_21mer, featureMatrix_plant_21mer)
featureMatrix <- subset(featureMatrix, select = -c(seqid))
# Machine Learning
train_index <- caret::createDataPartition(featureMatrix$classes, p=0.8, list=FALSE)
train <- featureMatrix[train_index, ]
test <- featureMatrix[-train_index, ]
control <- caret::trainControl(method="cv",
summaryFunction=twoClassSummary,
classProbs=TRUE,
savePredictions = TRUE)
roc <- caret::train(classes ~ .,
data=train,
method="rf",
preProc=c("center"),
trControl=control)
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so; LAPACK version 3.8.0
##
## locale:
## [1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=pt_BR.UTF-8
## [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Sao_Paulo
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 fastmap_1.1.1 xfun_0.39
## [5] cachem_1.0.7 knitr_1.42 htmltools_0.5.5 rmarkdown_2.21
## [9] cli_3.6.1 sass_0.4.5 jquerylib_0.1.4 compiler_4.3.0
## [13] tools_4.3.0 evaluate_0.20 bslib_0.4.2 yaml_2.3.7
## [17] rlang_1.1.1 jsonlite_1.8.4