Yalan He,Qian Wang,Junwei Han



Single-sample pathway enrichment analysis is an effectively approach for identifying cancer subtypes and pathway biomarkers, which promotes the development of precision medicine. However, the existing approaches focused on gene expression data but neglected gene mutation information. Here, we proposed a novel single-sample mutation-based pathway analysis approach (ssMutPA) to infer individualized pathway activities by integrating somatic mutation data and the protein-protein interaction (PPI) network.For each sample, ssMutPA first uses local and global weighted strategies to evaluate the effects of network genes from mutations according to the network topology, and then calculates mutation-based pathway enrichment score (ssMutPES) to reflect the accumulated effect of mutations of each pathway.

MAF field requirements

MAF files contain many fields ranging from chromosome names to cosmic annotations. However, most of the analysis in ssMutPA uses the following fields.

Complete specification of MAF files can be found on NCI GDC documentation page.



Overview of the package

The ssMutPA package is a Bioinformatics tool to perform single-sample mutation-based pathway analysis. And ssMutPA functions can be categorized into mainly Analysis and Visualization modules. Each of these functions and a short description is summarized as shown below:

1.Obtain the gene mutation status matrix.
2.Calculate single-sample mutation-based pathway enrichment score.
3.Identification of cancer subtypes based on unsupervised spectral clustering algorithm.
4.Visualization of results:
4.1 Plotting a heatmap with subtype labels.
4.2 Plotting the density ridges plot.
4.3 Plotting the Dot plot.
4.4 Plotting a waterfall plot of a particular pathway.

Obtain the gene mutation status matrix

We downloaded patients’ somatic mutation data from the The Cancer Genome Atlas Program(TCGA) or other open-source databases in MAF format and converted MAF format data into a gene mutation status matrix, where each element indicates whether a gene is mutated or not in a sample. In this study, we only extract the non-silent somatic mutations (Missense_Mutation,Frame_Shift_Del,Frame_Shift_Ins,In_Frame_Del,Nonsense_Mutation,In_Frame_Ins,Splice_Site,Nonstop_Mutation,Translation_Start_Site) in protein-coding regions. The function get_mut_status in the ssMutPA package can implement the above process. Simulation data as an example, the command lines are as follows:

#load the mutation annotation file
mut_path <- system.file("extdata","mutation_data.Rdata",package = "ssMutPA")
#perform the function 'get_mut_status'
#view the first five lines of mut_status matrix
#>       TCGA-32-1979-01A TCGA-4W-AA9S-01A TCGA-06-5858-01A TCGA-19-4068-01A
#> GRHL3                1                0                0                0
#> BRDT                 1                0                0                0
#> APH1A                1                0                0                0
#> OR6N2                1                0                0                0
#> APOB                 1                0                0                0
#>       TCGA-06-0644-01A
#> GRHL3                0
#> BRDT                 0
#> APH1A                0
#> OR6N2                0
#> APOB                 0

Calculate the single-sample mutation-based pathway enrichment score.

Firstly, we performed local weighted on the seed nodes. For a given seed node ($G_{i}$), there are two variables to characterize if any \(G_{i}\) may be reinforced by the mutations of its neighbors: the number of mutation genes in the neighbors of \(G_{i}\), designated as \(X_{i}\), and the number of neighbors of \(G_{i}\), designated as \(M_{i}\). For a seed \(G_{i}\) in the sample, if it play more important function in the disease progression, the number of mutation genes in its neighbors will be significantly larger than the expectation \(E(X_{i})\), which can be calculated as:

$$E(X_{i})=M_{i}K/N\tag{1}$$ Thus, a rescaled form \(X_{i}-E(X_{i})\) is proposed to quantify the important strength of \(G_{i}\) in a sample,which was defined as local weight \(W_{i}\) :


$$W_{i}=log_{α}(w_{i}I(w_{i})+α)\tag{3}$$ where \(I\) is an indicator function,if \(w_{i}\) greater than 0,its value is equal to 1,otherwise equal to 0;and \(α\) is a scalar base to guarantee the weight has a minimum value of 1, here we set \(α\) as 2.

Subsequently, we normalized the initial weights of all seeds in a sample to ensure that they sum to 1 and used the global propagation algorithm(Random Walk with Restart(RWR)) to predict probable influence of nodes in the network by seed nodes(mutation genes).The RWR formula is as follows:

###Random walk with restart $$p_{t+1}=(1-c)Ap_{t}+cp_{0}\tag{4}$$ where$c$ is a certain probability of continuing the random walk or restarting from the restart set; \(A\) is the column-normalized adjacency matrix of the PPI network; \(p_{t}\) is a vector containing visiting probabilities of all nodes in the network at time point t; \(p_{0}\) is the initial probability vector.

Finally, we constructed a gene list by ranking the genes according to the normalized global weight and calculated the degree of enrichment of genes contained in each pathway on the gene list based on the Kolmogorov-Smirnov statistic. The function get_RWR_ES in the ssMutPA package can accomplish all the processes mentioned above.

#Method of obtaining data
net_path <- system.file("extdata","ppi_network.Rdata",package = "ssMutPA")
pathway_path<-system.file("extdata","kegg_323_gmt.Rdata",package = "ssMutPA")
#perform the function 'get_RWR_ES'
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 1.2 GiB
#view the first six lines of pathway enrichment profile
#>                                                      TCGA-32-1979-01A
#> ABC transporters                                              0.00000
#> AGE-RAGE signaling pathway in diabetic complications          0.65015
#> AMPK signaling pathway                                        0.56116
#> Acute myeloid leukemia                                        0.69988
#> Adherens junction                                             0.71432
#> Adipocytokine signaling pathway                               0.69112
#>                                                      TCGA-32-2494-01A
#> ABC transporters                                              0.00000
#> AGE-RAGE signaling pathway in diabetic complications          0.67965
#> AMPK signaling pathway                                        0.59142
#> Acute myeloid leukemia                                        0.72841
#> Adherens junction                                             0.66251
#> Adipocytokine signaling pathway                               0.63442

Identification of cancer subtypes based on unsupervised spectral clustering algorithm.

The function get_samp_class can perform spectral clustering based on the ssMutPESs of characteristic pathways.

#Load sample mutation data
surv_path <- system.file("extdata","sur.Rdata",package = "ssMutPA")
#Perform the function `get_samp_class`
res<-get_samp_class(Path_ES,sur,seed_num=5,cox_pval=0.05, = 2, =5)
#view the label of samples
#> TCGA-32-1979-01A TCGA-4W-AA9S-01A TCGA-06-5858-01A TCGA-19-4068-01A 
#>                1                1                2                1 
#> TCGA-06-0644-01A TCGA-12-3649-01A TCGA-06-0875-01A TCGA-12-0691-01A 
#>                1                1                1                1 
#> TCGA-19-5955-01A TCGA-12-5295-01A 
#>                2                1

Visualization results

(1) The function get_heatmap is used to draw a heatmap based on ssMutPES profiles.The command lines are as follows:

#Load the data
#perform the function `get_heatmap`.

plot of chunk unnamed-chunk-6

(2) The function mountain_plot is used to plot a graph to reflect the distribution of the ssMutPESs in different subtypes.

#Get the data of ROC curve
#perform the function `mountain_plot`
#> Picking joint bandwidth of 0.0119

plot of chunk unnamed-chunk-7

(3) The function dotplot was used to draw a graph to reflect the univariate HRs and P-values of pathways in different cancer types.

#load the data
#perform the function `dotplot`.

plot of chunk unnamed-chunk-8

(4) The function Oncoplot will generate a waterfall plot.

#load the data
mut_path <- system.file("extdata","maffile.txt",package = "ssMutPA")
maf<-maftools::read.maf(mut_path ,isTCGA = FALSE)
pathway_path <- system.file("extdata","kegg_323_gmt.Rdata",package = "ssMutPA")
#draw a waterfall plot
Oncoplot(maf,samp_class_onco,sur_onco,mut_onco,kegg_323_gmt,"IL-17 signaling pathway")

plot of chunk unnamed-chunk-9

#> -Reading
#> -Validating
#> --Removed 1121 duplicated variants
#> -Silent variants: 607 
#> -Summarizing
#> --Possible FLAGS among top ten genes:
#>   MUC16
#>   TTN
#> -Processing clinical data
#> --Missing clinical data
#> -Finished in 0.880s elapsed (0.290s cpu) 
#> --Possible FLAGS among top ten genes:
#>   MUC16
#>   TTN
#> -Processing clinical data