## Introduction

The protr package offers a unique and comprehensive toolkit for generating various numerical representation schemes of protein sequences. The descriptors included are extensively utilized in bioinformatics and chemogenomics research.

The commonly used descriptors listed in protr include amino acid composition, autocorrelation, CTD, conjoint traid, quasi-sequence order, pseudo amino acid composition, and profile-based descriptors derived by Position-Specific Scoring Matrix (PSSM).

The descriptors for proteochemometric (PCM) modeling, includes the scales-based descriptors derived by principal components analysis, factor analysis, multidimensional scaling, amino acid properties (AAindex), 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.), and BLOSUM/PAM matrix-derived descriptors.

The protr package also implemented parallelized similarity computation derived by pairwise protein sequence alignment and Gene Ontology (GO) semantic similarity measures. ProtrWeb, the web application built on protr, can be accessed from http://protr.org.

If you find protr useful in your research, please feel free to cite our paper:

Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.

BibTeX entry:

@article{Xiao2015,
author  = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song.},
title   = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences},
journal = {Bioinformatics},
year    = {2015},
volume  = {31},
number  = {11},
pages   = {1857--1859},
doi     = {10.1093/bioinformatics/btv042}
}

## An example predictive modeling workflow

Here we use the subcellular localization dataset of human proteins presented in Chou and Shen (2008) to demonstrate the basic usage of protr.

The complete dataset includes 3,134 protein sequences (2,750 different proteins), classified into 14 human subcellular locations. We selected two classes of proteins as our benchmark dataset. Class 1 contains 325 extracell proteins, and class 2 includes 307 mitochondrion proteins. Here we aim to build a random forest classification model to classify these two types of proteins.

First, we load the protr package, then read the protein sequences stored in two separated FASTA files with readFASTA():

library("protr")

# (system.file is for accessing example file in protr package,
# replace it with your path)
"protseq/extracell.fasta",
package = "protr"
))
"protseq/mitochondrion.fasta",
package = "protr"
))

To read protein sequences stored in PDB format files, use readPDB() instead. The loaded sequences will be stored as two lists in R, and each component in the list is a character string representing one protein sequence. In this case, there are 325 extracell protein sequences and 306 mitonchon protein sequences:

length(extracell)
## [1] 325
length(mitonchon)
## [1] 306

To ensure that the protein sequences only have the 20 standard amino acid types which is usually required for the descriptor computation, we use the protcheck() function to do the amino acid type sanity check and remove the non-standard sequences:

extracell <- extracell[(sapply(extracell, protcheck))]
mitonchon <- mitonchon[(sapply(mitonchon, protcheck))]
length(extracell)
## [1] 323
length(mitonchon)
## [1] 304

Two protein sequences were removed from each class. For the remaining sequences, we calculate the Type II PseAAC descriptor, i.e., the amphiphilic pseudo amino acid composition (APseAAC) descriptor (Chou, 2005) and make class labels for classification modeling.

# calculate APseAAC descriptors
x1 <- t(sapply(extracell, extractAPAAC))
x2 <- t(sapply(mitonchon, extractAPAAC))
x <- rbind(x1, x2)

# make class labels
labels <- as.factor(c(rep(0, length(extracell)), rep(1, length(mitonchon))))

In protr, the functions of commonly used descriptors for protein sequences and proteochemometric (PCM) modeling descriptors are named after extract...().

Next, we will split the data into a 75% training set and a 25% test set.

set.seed(1001)

# split training and test set
tr.idx <- c(
sample(1:nrow(x1), round(nrow(x1) * 0.75)),
sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75))
)
te.idx <- setdiff(1:nrow(x), tr.idx)

x.tr <- x[tr.idx, ]
x.te <- x[te.idx, ]
y.tr <- labels[tr.idx]
y.te <- labels[te.idx]

We will train a random forest classification model on the training set with 5-fold cross-validation, using the randomForest package.

library("randomForest")
rf.fit <- randomForest(x.tr, y.tr, cv.fold = 5)
rf.fit

The training result is:

## Call:
##  randomForest(x = x.tr, y = y.tr, cv.fold = 5)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
##
##         OOB estimate of  error rate: 25.11%
## Confusion matrix:
##     0   1 class.error
## 0 196  46   0.1900826
## 1  72 156   0.3157895

With the model trained on the training set, we predict on the test set and plot the ROC curve with the pROC package, as is shown in Figure 1.

# predict on test set
rf.pred <- predict(rf.fit, newdata = x.te, type = "prob")[, 1]

# plot ROC curve
library("pROC")
plot.roc(y.te, rf.pred, grid = TRUE, print.auc = TRUE)

The area under the ROC curve (AUC) is:

## Call:
## plot.roc.default(x = y.te, predictor = rf.pred, col = "#0080ff",
##                  grid = TRUE, print.auc = TRUE)
##
## Data: rf.pred in 81 controls (y.te 0) > 76 cases (y.te 1).
## Area under the curve: 0.8697

Figure 1: ROC curve for the test set of protein subcellular localization data

## Package overview

The protr package (Xiao et al., 2015) implemented most of the state-of-the-art protein sequence descriptors with R. Generally, each type of the descriptors (features) can be calculated with a function named extractX() in the protr package, where X stands for the abbrevation of the descriptor name. The descriptors are:

• Amino acid composition
• extractAAC() - Amino acid composition
• extractDC() - Dipeptide composition
• extractTC() - Tripeptide composition
• Autocorrelation
• extractMoreauBroto() - Normalized Moreau-Broto autocorrelation
• extractMoran() - Moran autocorrelation
• extractGeary() - Geary autocorrelation
• CTD descriptors
• extractCTDC() - Composition
• extractCTDT() - Transition
• extractCTDD() - Distribution
• extractCTriad() - Conjoint triad descriptors
• Quasi-sequence-order descriptors
• extractSOCN() - Sequence-order-coupling number
• extractQSO() - Quasi-sequence-order descriptors
• Pseudo-amino acid composition
• extractPAAC() - Pseudo-amino acid composition (PseAAC)
• extractAPAAC() - Amphiphilic pseudo-amino acid composition (APseAAC)
• Profile-based descriptors
• extractPSSM()
• extractPSSMAcc()
• extractPSSMFeature()

The descriptors commonly used in Proteochemometric Modeling (PCM) implemented in protr include:

• extractScales(), extractScalesGap() - Scales-based descriptors derived by Principal Components Analysis
• extractProtFP(), extractProtFPGap() - Scales-based descriptors derived by amino acid properties from AAindex (a.k.a. Protein Fingerprint)
• extractDescScales() - Scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.)
• extractFAScales() - Scales-based descriptors derived by Factor Analysis
• extractMDSScales() - Scales-based descriptors derived by Multidimensional Scaling
• extractBLOSUM() - BLOSUM and PAM matrix-derived descriptors

The protr package integrates the function of parallelized similarity score computation derived by local or global protein sequence alignment between a list of protein sequences, the sequence alignment computation is provided by Biostrings, the corresponding functions listed in the protr package include:

• twoSeqSim() - Similarity calculation derived by sequence alignment between two protein sequences
• parSeqSim() - Parallelized pairwise similarity calculation with a list of protein sequences (in-memory version)
• parSeqSimDisk() - Parallelized pairwise similarity calculation with a list of protein sequences (disk-based version)

The protr package also integrates the function of parallelized similarity score computation derived by Gene Ontology (GO) semantic similarity measures between a list of GO terms / Entrez Gene IDs, the GO similarity computation is provided by GOSemSim, the corresponding functions listed in the protr package include:

• twoGOSim() - Similarity calculation derived by GO-terms semantic similarity measures between two GO terms / Entrez Gene IDs;
• parGOSim() - Pairwise similarity calculation with a list of GO terms / Entrez Gene IDs.

To use the parSeqSim() function, we suggest the users to install the packages foreach and doParallel first, in order to make the parallelized pairwise similarity computation available.

In the following sections, we will introduce the descriptors and function usage in this order.

## Commonly used descriptors

Note: Users need to intelligently evaluate the underlying details of the descriptors provided, instead of using protr with their data blindly, especially for the descriptor types with more flexibility. It would be wise for the users to use some negative and positive control comparisons where relevant to help guide interpretation of the results.

A protein or peptide sequence with $$N$$ amino acid residues can be generally represented as $$\{\,R_1, R_2, \ldots, R_n\,\}$$, where $$R_i$$ represents the residue at the $$i$$-th position in the sequence. The labels $$i$$ and $$j$$ are used to index amino acid position in a sequence, and $$r$$, $$s$$, $$t$$ are used to represent the amino acid type. The computed descriptors are roughly categorized into 4 groups according to their major applications.

A protein sequence can be partitioned equally into segments. The descriptors designed for the complete sequence, can be often applied to each individual segment.

### Amino acid composition descriptor

The amino acid composition describes the fraction of each amino acid type within a protein sequence. The fractions of all 20 natural amino acids are calculated as:

$f(r) = \frac{N_r}{N} \quad r = 1, 2, \ldots, 20.$ where $$N_r$$ is the number of the amino acid type $$r$$ and $$N$$ is the length of the sequence.

As was described above, we can use the function extractAAC() to extract the descriptors (features) from protein sequences:

library("protr")

"protseq/P00750.fasta",
package = "protr"
))[[1]]

extractAAC(x)
#>          A          R          N          D          C          E          Q
#> 0.06405694 0.07117438 0.03914591 0.05160142 0.06761566 0.04804270 0.04804270
#>          G          H          I          L          K          M          F
#> 0.08185053 0.03024911 0.03558719 0.07651246 0.03914591 0.01245552 0.03202847
#>          P          S          T          W          Y          V
#> 0.05338078 0.08896797 0.04448399 0.02313167 0.04270463 0.04982206

Here, with the function readFASTA() we loaded a single protein sequence (P00750, Tissue-type plasminogen activator) from a FASTA format file. Then extracted the AAC descriptors with extractAAC(). The result returned is a named vector, whose elements are tagged with the name of each amino acid.

### Dipeptide composition descriptor

Dipeptide composition gives a 400-dimensional descriptor, defined as:

$f(r, s) = \frac{N_{rs}}{N - 1} \quad r, s = 1, 2, \ldots, 20.$

where $$N_{rs}$$ is the number of dipeptide represented by amino acid type $$r$$ and type $$s$$. Similar to extractAAC(), here we use extractDC() to compute the descriptors:

dc <- extractDC(x)
#>          AA          RA          NA          DA          CA          EA
#> 0.003565062 0.003565062 0.000000000 0.007130125 0.003565062 0.003565062
#>          QA          GA          HA          IA          LA          KA
#> 0.007130125 0.007130125 0.001782531 0.003565062 0.001782531 0.001782531
#>          MA          FA          PA          SA          TA          WA
#> 0.000000000 0.005347594 0.003565062 0.007130125 0.003565062 0.000000000
#>          YA          VA          AR          RR          NR          DR
#> 0.000000000 0.000000000 0.003565062 0.007130125 0.005347594 0.001782531
#>          CR          ER          QR          GR          HR          IR
#> 0.005347594 0.005347594 0.000000000 0.007130125 0.001782531 0.003565062

Here we only showed the first 30 elements of the result vector and omitted the rest of the result. The element names of the returned vector are self-explanatory as before.

### Tripeptide composition descriptor

Tripeptide composition gives a 8000-dimensional descriptor, defined as:

$f(r, s, t) = \frac{N_{rst}}{N - 2} \quad r, s, t = 1, 2, \ldots, 20$

where $$N_{rst}$$ is the number of tripeptides represented by amino acid type $$r$$, $$s$$, and $$t$$. With function extractTC(), we can easily obtain the length-8000 descriptor, to save some space, here we also omitted the long outputs:

tc <- extractTC(x)
#>         AAA         RAA         NAA         DAA         CAA         EAA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#>         QAA         GAA         HAA         IAA         LAA         KAA
#> 0.001785714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#>         MAA         FAA         PAA         SAA         TAA         WAA
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000
#>         YAA         VAA         ARA         RRA         NRA         DRA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#>         CRA         ERA         QRA         GRA         HRA         IRA
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000
#>         LRA         KRA         MRA         FRA         PRA         SRA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000

### Autocorrelation descriptors

Autocorrelation descriptors are defined based on the distribution of amino acid properties along the sequence. The amino acid properties used here are various types of amino acids index (Retrieved from the AAindex database, see Kawashima et al. (1999), Kawashima and Kanehisa (2000), and Kawashima et al. (2008); see Figure 2 for an illustrated example). Three types of autocorrelation descriptors are defined here and described below.

All the amino acid indices are centralized and standardized before the calculation, i.e.

$P_r = \frac{P_r - \bar{P}}{\sigma}$

where $$\bar{P}$$ is the average of the property of the 20 amino acids:

$\bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma = \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2}$

Figure 2: An illustrated example in the AAIndex database

#### Normalized Moreau-Broto autocorrelation descriptors

For protein sequences, the Moreau-Broto autocorrelation descriptors can be defined as:

$AC(d) = \sum_{i=1}^{N-d} P_i P_{i + d} \quad d = 1, 2, \ldots, \textrm{nlag}$

where $$d$$ is called the lag of the autocorrelation; $$P_i$$ and $$P_{i+d}$$ are the properties of the amino acids at position $$i$$ and $$i+d$$; $$\textrm{nlag}$$ is the maximum value of the lag.

The normalized Moreau-Broto autocorrelation descriptors are defined as:

$ATS(d) = \frac{AC(d)}{N-d} \quad d = 1, 2, \ldots, \textrm{nlag}$

The corresponding function for this descriptor is extractMoreauBroto(). A typical call would be:

moreau <- extractMoreauBroto(x)
#>  CIDH920105.lag1  CIDH920105.lag2  CIDH920105.lag3  CIDH920105.lag4
#>      0.081573213     -0.016064817     -0.015982990     -0.025739038
#>  CIDH920105.lag5  CIDH920105.lag6  CIDH920105.lag7  CIDH920105.lag8
#>      0.079058632     -0.042771564     -0.036320847      0.024087298
#>  CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#>     -0.005273958      0.052274763      0.082170073      0.005419919
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#>      0.083292042      0.004810584      0.001872446     -0.001531495
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#>     -0.011917230      0.071161551      0.033473197      0.026882737
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#>      0.073075402      0.115272790      0.041517897     -0.027025993
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#>      0.033477388     -0.003245255      0.078117010     -0.028177304
#> CIDH920105.lag29 CIDH920105.lag30  BHAR880101.lag1  BHAR880101.lag2
#>      0.046695832      0.020584423      0.052740185      0.030804784
#>  BHAR880101.lag3  BHAR880101.lag4  BHAR880101.lag5  BHAR880101.lag6
#>      0.037170476     -0.058993771      0.070641780     -0.089192490

The eight default properties used here are:

• AccNo. CIDH920105 - Normalized Average Hydrophobicity Scales
• AccNo. BHAR880101 - Average Flexibility Indices
• AccNo. CHAM820101 - Polarizability Parameter
• AccNo. CHAM820102 - Free Energy of Solution in Water, kcal/mole
• AccNo. CHOC760101 - Residue Accessible Surface Area in Tripeptide
• AccNo. BIGC670101 - Residue Volume
• AccNo. CHAM810101 - Steric Parameter
• AccNo. DAYM780201 - Relative Mutability

Users can change the property names of AAindex database with the argument props. The AAindex data shipped with protr can be loaded by data(AAindex), which has the detailed information of each property. With the argument customprops and nlag, users can specify their own properties and lag value to calculate with. For example:

# Define 3 custom properties
myprops <- data.frame(
AccNo = c("MyProp1", "MyProp2", "MyProp3"),
A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43)
)

# Use 4 properties in the AAindex database, and 3 cutomized properties
moreau2 <- extractMoreauBroto(
x,
customprops = myprops,
props = c(
"CIDH920105", "BHAR880101",
"CHAM820101", "CHAM820102",
"MyProp1", "MyProp2", "MyProp3"
)
)

#>  CIDH920105.lag1  CIDH920105.lag2  CIDH920105.lag3  CIDH920105.lag4
#>      0.081573213     -0.016064817     -0.015982990     -0.025739038
#>  CIDH920105.lag5  CIDH920105.lag6  CIDH920105.lag7  CIDH920105.lag8
#>      0.079058632     -0.042771564     -0.036320847      0.024087298
#>  CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#>     -0.005273958      0.052274763      0.082170073      0.005419919
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#>      0.083292042      0.004810584      0.001872446     -0.001531495
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#>     -0.011917230      0.071161551      0.033473197      0.026882737
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#>      0.073075402      0.115272790      0.041517897     -0.027025993
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#>      0.033477388     -0.003245255      0.078117010     -0.028177304
#> CIDH920105.lag29 CIDH920105.lag30  BHAR880101.lag1  BHAR880101.lag2
#>      0.046695832      0.020584423      0.052740185      0.030804784
#>  BHAR880101.lag3  BHAR880101.lag4  BHAR880101.lag5  BHAR880101.lag6
#>      0.037170476     -0.058993771      0.070641780     -0.089192490

About the standard input format of props and customprops, see ?extractMoreauBroto for details.

#### Moran autocorrelation descriptors

For protein sequences, the Moran autocorrelation descriptors can be defined as:

$I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} (P_i - \bar{P}') (P_{i+d} - \bar{P}')}{\frac{1}{N} \sum_{i=1}^{N} (P_i - \bar{P}')^2} \quad d = 1, 2, \ldots, 30$

where $$d$$, $$P_i$$, and $$P_{i+d}$$ are defined in the same way as in the first place; $$\bar{P}'$$ is the considered property $$P$$ along the sequence, i.e.,

$\bar{P}' = \frac{\sum_{i=1}^N P_i}{N}$

$$d$$, $$P$$, $$P_i$$ and $$P_{i+d}$$, $$\textrm{nlag}$$ have the same meaning as above.

With extractMoran() (which has the identical parameters as extractMoreauBroto()), we can compute the Moran autocorrelation descriptors (only print out the first 36 elements):

# Use the 3 custom properties defined before
# and 4 properties in the AAindex database
moran <- extractMoran(
x,
customprops = myprops,
props = c(
"CIDH920105", "BHAR880101",
"CHAM820101", "CHAM820102",
"MyProp1", "MyProp2", "MyProp3"
)
)

#>  CIDH920105.lag1  CIDH920105.lag2  CIDH920105.lag3  CIDH920105.lag4
#>      0.062895724     -0.044827681     -0.045065117     -0.055955678
#>  CIDH920105.lag5  CIDH920105.lag6  CIDH920105.lag7  CIDH920105.lag8
#>      0.060586377     -0.074128412     -0.067308852     -0.001293384
#>  CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#>     -0.033747588      0.029392193      0.061789800     -0.023368437
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#>      0.062769417     -0.024912264     -0.028298043     -0.031584063
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#>     -0.043466730      0.047830694      0.005883901     -0.001769769
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#>      0.049334048      0.096427969      0.015147594     -0.060092509
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#>      0.007549152     -0.033987885      0.056307675     -0.061844453
#> CIDH920105.lag29 CIDH920105.lag30  BHAR880101.lag1  BHAR880101.lag2
#>      0.021484780     -0.008461776      0.014229951     -0.009142419
#>  BHAR880101.lag3  BHAR880101.lag4  BHAR880101.lag5  BHAR880101.lag6
#>     -0.003272262     -0.109613332      0.033346233     -0.141538598

#### Geary autocorrelation descriptors

Geary autocorrelation descriptors for protein sequences can be defined as:

$C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i - P_{i+d})^2}{\frac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P}')^2} \quad d = 1, 2, \ldots, 30$

where $$d$$, $$P$$, $$P_i$$, $$P_{i+d}$$, and $$\textrm{nlag}$$ have the same meaning as above.

For each amino acid index, there will be $$3 \times \textrm{nlag}$$ autocorrelation descriptors. The usage of extractGeary() is exactly the same as extractMoreauBroto() and extractMoran():

# Use the 3 custom properties defined before
# and 4 properties in the AAindex database
geary <- extractGeary(
x,
customprops = myprops,
props = c(
"CIDH920105", "BHAR880101",
"CHAM820101", "CHAM820102",
"MyProp1", "MyProp2", "MyProp3"
)
)

#>  CIDH920105.lag1  CIDH920105.lag2  CIDH920105.lag3  CIDH920105.lag4
#>        0.9361830        1.0442920        1.0452843        1.0563467
#>  CIDH920105.lag5  CIDH920105.lag6  CIDH920105.lag7  CIDH920105.lag8
#>        0.9406031        1.0765517        1.0675786        0.9991363
#>  CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#>        1.0316555        0.9684585        0.9353130        1.0201990
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#>        0.9340933        1.0207373        1.0251486        1.0290464
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#>        1.0414375        0.9494403        0.9905987        0.9987183
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#>        0.9472542        0.9010009        0.9828848        1.0574098
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#>        0.9897955        1.0290018        0.9400066        1.0584150
#> CIDH920105.lag29 CIDH920105.lag30  BHAR880101.lag1  BHAR880101.lag2
#>        0.9762904        1.0029734        0.9818711        1.0051730
#>  BHAR880101.lag3  BHAR880101.lag4  BHAR880101.lag5  BHAR880101.lag6
#>        0.9967069        1.1012905        0.9595859        1.1337056

### Composition/Transition/Distribution

The CTD descriptors are developed by Dubchak et al. (1995) and Dubchak et al. (1999).

Figure 3: The sequence of a hypothetic protein indicating the construction of composition, transition, and distribution descriptors of a protein. Sequence index indicates the position of an amino acid in the sequence. The index for each type of amino acids in the sequence (1, 2 or 3) indicates the position of the first, second, third, … of that type of amino acid. 1/2 transition indicates the position of 12 or 21 pairs in the sequence (1/3 and 2/3 are defined in the same way).

Step 1: Sequence Encoding

The amino acids are categorized into three classes according to its attribute, and each amino acid is encoded by one of the indices 1, 2, 3 according to which class it belongs. The attributes used here include hydrophobicity, normalized van der Waals volume, polarity, and polarizability. The corresponding classification for each attribute is listed in Table 1.

Table 1: Amino acid attributes, and the three-group classification of the 20 amino acids by each attribute
Group 1 Group 2 Group 3
Hydrophobicity Polar Neutral Hydrophobicity
R, K, E, D, Q, N G, A, S, T, P, H, Y C, L, V, I, M, F, W
Normalized van der Waals Volume 0-2.78 2.95-4.0 4.03-8.08
G, A, S, T, P, D, C N, V, E, Q, I, L M, H, K, F, R, Y, W
Polarity 4.9-6.2 8.0-9.2 10.4-13.0
L, I, F, W, C, M, V, Y P, A, T, G, S H, Q, R, K, N, E, D
Polarizability 0-1.08 0.128-0.186 0.219-0.409
G, A, S, D, T C, P, N, V, E, Q, I, L K, M, H, F, R, Y, W
Charge Positive Neutral Negative
K, R A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V D, E
Secondary Structure Helix Strand Coil
E, A, L, M, Q, K, R, H V, I, Y, C, W, F, T G, N, P, S, D
Solvent Accessibility Buried Exposed Intermediate
A, L, F, C, G, I, V, W R, K, Q, E, N, D M, S,