netgsa: Network-based Gene Set Analysis

Jing Ma, Michael Hellstern



In this vignette, we demonstrate the NetGSA workflow using a subset of the breast cancer data example downloaded from the Cancer Genome Atlas (TCGA 2012). In particular, we illustrate how to incorporate existing network information (e.g. curated edges from KEGG) to improve the power of pathway enrichment analysis with NetGSA. Details of the method are avaialble in Ma, Shojaie, and Michailidis (2016).

netgsa provides simple functions that automatically search known gene databases using graphite for network information and integrate seamlessly with NetGSA pathway enrichment andigraph and Cytoscape plotting. In this vignette, we demonstrate the NetGSA workflow explaining proper data types and usage of the netgsa package.

Data: Breast cancer study (2012)

Our example data set comes from a breast cancer study (TCGA 2012), which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.

Step 1: Data set-up

NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. Rows of the data matrix correspond to genes and columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways.

## [1] "edgelist"     "group"        "nonedgelist"  "pathways"     "pathways_mat"
## [6] "x"

The objects loaded which are necessary for this vignette are:

Data matrix, x

The data matrix, x must have rownames that are named as "GENE_ID:GENE_VALUE". Since netgsa allows the user to search for edges in multiple databases in graphite, each of which may use different identifiers (e.g. KEGG, Reactome, Biocarta, etc.), netgsa needs to know the identifier for a given gene so we can convert it properly. For example, if you have two genes “ENTREZID: 7186” and “ENTREZID: 329” and want to search for potential edges between these two within Reactome, netgsa must first convert these genes to their “UNIPROT” IDs and use those to search for edges.

netgsa uses the AnnotationDbi package to convert genes. The valid list of "GENE_ID"’s is:

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
## [26] "UNIPROT"

An example of what the rownames for the data matrix, x, should look like is:

## [1] "ENTREZID:127550" "ENTREZID:53947"  "ENTREZID:65985"  "ENTREZID:51166" 
## [5] "ENTREZID:15"     "ENTREZID:60496"

The columns of the data matrix do not need to be named, but it is useful for keeping track of groups.

Group vector

The group object is a vector (numeric or character) and must be the same length as ncol(x). Each element of the group tells us which group each column of x corresponds to. For example, if group[1] = "Positive" this says that the first column of x corresponds to the "Positive" condition.

We can find out the ER status by looking at the group labels.

## group
##   1   2 
## 403 117

Edgelist / Non-edgelist

The edgelist and non-edgelist are strings representing file locations and are read in using data.table’s fread() command. These are where users can specify known edges/non-edges of their own. Each observation is assumed to be a directed edge (for edgelist) or a directed non-edge (for non-edgelist). They both must have 4 columns in the following order. The columns do not necessarily need to be named properly, they simply must be in this specific order:

  • 1st column - Source gene (base_gene_src), e.g. “7534”"
  • 2nd column - Gene identifier of the source gene (base_id_src), e.g. “ENTREZID”
  • 3rd column - Destination gene (base_gene_dest), e.g. “8607”
  • 4th column - Gene identifier of the destination gene (base_id_dest) e.g. “UNIPROT”

Note it is assumed that each edge/non-edge is directed so if you want an undirected edge/non-edge you should put in two observations as in:

##    base_gene_src base_id_src base_gene_dest base_id_dest
## 1:          7534    ENTREZID           8607     ENTREZID
## 2:          8607    ENTREZID           7534     ENTREZID

Pathway matrix

Lastly, as documented in ?NetGSA we need an indicator matrix of pathways. The rows of this matrix should correspond to pathways and the columns to the genes included in the analysis. The dimension is then npath x p. Values in this matrix should be 1 for genes that are in a given pathway and 0 otherwise.

We consider pathways from KEGG (Kanehisa and Goto 2000). KEGG pathways can be accessed in R using the graphite package.

paths <- graphite::pathways('hsapiens','kegg')
## "Glycolysis / Gluconeogenesis" pathway
## Native ID       = hsa:00010
## Database        = KEGG
## Species         = hsapiens
## Number of nodes = 94
## Number of edges = 1191
## Retrieved on    = 22-04-2020
## URL             =
## [1] "ENTREZID:10327" "ENTREZID:124"   "ENTREZID:125"   "ENTREZID:126"  
## [5] "ENTREZID:127"   "ENTREZID:128"

For this vignette, we’ll use pathways_mat:

pathways_mat[1:5,7, drop = FALSE]
##                                                      ENTREZID:18
## Adipocytokine signaling pathway                                0
## Adrenergic signaling in cardiomyocytes                         0
## AGE-RAGE signaling pathway in diabetic complications           0
## Alanine, aspartate and glutamate metabolism                    1
## AMPK signaling pathway                                         0

The 1 shown indicates that "ENTREZID: 18" belongs to the Alanine, aspartate and glutamate metabolism pathway.

In summary, the data that we’ll need to run our pathway enrichment analysis is:

  • the data matrix (x), with rows for genes and columns for samples,
  • the group labels (group) for ER status,
  • the path to the file consisting of known edges (edgelist) to be read in with fread(),
  • the path to the file consisting of (a subset of) the non-edges (nonedgelist) to be read in with fread(),
  • pathway identifiers in matrix form (pathways_mat).

Step 2: Pathway enrichment analysis

Now that we have our data set-up properly we can begin with the pathway enrichment analysis. netgsa has three main functions. The first is prepareAdjMat which searches for edges in public databases and uses this information to estimate the adjacency matrices needed for NetGSA. The second is NetGSA which performs network-based gene set analysis. We will go into further details about the usage of these functions below.


After having the data set-up, the first step in pathway enrichment analysis with netgsa is to estimate the adjacency matrices. Remember, the rownames of the data matrix X must be named as "GENE_ID:GENE_VALUE" as in "ENTREZID:7534". The group vector is defined as above.

The databases argument can be either (1) the result of obtainEdgeList or (2) a character vector defining the databases to search. These are essentially the same thing, the only difference is that for the character vector method, obtainEdgeList is called inside prepareAdjMat and cannot be saved. Since prepareAdjMat queries the graphite databases when it is called and graphite databases can change overtime, this may not be desirable for reproducibility. Using the obtainEdgeList method, one can save the edgelist to ensure the same network information is used across iterations or in the future. In both methods, one must specify the databases to search. The options are the databases for homo spaiens available in graphite or NDEx (only for development version on Github). The graphite databases are:

## [1] "biocarta" "kegg"     "nci"      "panther"  "pathbank" "pharmgkb" "reactome"
## [8] "smpdb"

Note the databases argument is case sensitive so make sure to pass "reactome" and not "Reactome".

The cluster argument controls whether or not clustering is used when estimating the adjacency matrix. The default behavior is to use clustering if p > 2,500. However, the user can override this behavior by setting the cluster argument. prepareAdjMat chooses the best clustering method from 6 possible methods in the igraph package. More details are provided in ?prepareAdjMat.

User specified edge and non-edge files are specified with the file_e and file_ne arguments respectively. For more information on the assumptions and how network information is incorporated from the database edgelists, the user edges, and the user non-edges see the Details section and file_e and file_ne parameters in ?prepareAdjMat.

It is also important to note that prepareAdjMat will automatically choose the correct network estimation technique based on whether or not the graph is directed so no additional work is needed to determine undirected vs directed graphs. This is documented in ?prepareAdjMat.

Now let’s get to some example code. Suppose we wanted to estimate the network for our example data using our known edges/non-edges and searching for edges in Reactome, KEGG, and BioCarta. Let’s also use clustering to speed up computation. We could do this simply by:

database_search <- obtainEdgeList(rownames(x), c("kegg", "reactome"))
network_info <- prepareAdjMat(x, group, database_search,
                                         cluster = TRUE, file_e = "edgelist.txt", 
                                         file_ne = "nonedgelist.txt")

The main value of interest is network_info[["Adj"]], a list of lists. The top level list indicates the condition while the next level is a list of adjacency matrices (one for each cluster). If cluster = FALSE there is an element for each connected component in the network. Also note that in this example, the object database_search was created. If desired, this can be saved and used later to ensure the same network information was used. However, the results may not be exactly the same. For example, some of the clustering algorithms are not deterministic so while the network information may be the same the clusters may be different resulting in different adjacency matrices. The adjacency matrix for the first condition and first cluster can be accessed with:

##             ENTREZID:18 ENTREZID:27 ENTREZID:28
## ENTREZID:18           0           0           0
## ENTREZID:27           0           0           0
## ENTREZID:28           0           0           0

The total number of clusters can be identified with:

## [1] 16

Note we cannot control cluster size which is why we try several methods and implement rules on which to choose. For more information see ?prepareAdjMat.


Now that the adjacency matrices are assembled we can perform pathway inference using NetGSA. The returned value from prepareAdjMat will always be in the correct format regardless of whether or not clustering is used, so one should always be able to pass network_info[["Adj"]] directly to NetGSA. The default is to use restricted Haseman-Elston regression (REHE) with sampling to estimate the variance parameters. This allows for significant computational speed up and little to no loss in power. One can also use REML for variance parameter estimation but this can be quite slow for problems with moderate or large dimension. For explanatory purposes, we explicitly write out the function:

pathway_tests_rehe <- NetGSA(network_info[["Adj"]], x, group, pathways_mat, 
                             lklMethod = "REHE", sampling = TRUE, 
                             sample_n = 0.25, sample_p = 0.25)

The sample_n and sample_p parameters control the ratio for subsampling observations (columns of data matrix x) and variables (rows of data matrix x) respectively. More information on REHE and sampling can be found in the ?NetGSA Details section.

Note that inference using REML can take several hours depending on the complexity of the problem, while inference using REHE can take minutes. Roughly, REML becomes quite slow for p > 2,000. In this example, with sample_n = 0.25, sample_p = 0.25 only took about 2 minutes on a 2 CPU personal computer with 16 GB of RAM.

The main inference results are stored in pathway_tests[["results"]] and contain the pathway names as well as their significance (p-values and q-values are reported) and test statistics.

Step 3: Visualization

netgsa provides several options for visualizing the results from NetGSA. Cytoscape or igraph is used to display the main results depending on whether the user has the Cytoscape app open.


If the user has Cytoscape open on their computer, calling plot.NetGSA will create several plots:

  1. Cytoscape plots - the first place plot.NetGSA generates plots is within Cytoscape. A nested network is created where the main network (Pathway Network) displays pathways as nodes. Edges in this network represent edges between genes contained within those pathways. In this network, the results object from NetGSA() is loaded as the node data and the number of edges between genes of separate pathways are loaded as edge data in a variable called “weight”. Node color is mapped to the test statistic. Negative values are orange, values around 0 are white, and positive values are blue. Node border color is mapped to q-values going from red (0) to white (1).

    In addition to the main network, a network is generated for each pathway. These are the gene networks underlying each pathway and are referred to as the within pathway networks. The within pathway networks are linked to the nodes in the Pathway Network using Cytoscape’s nested network format. An example will make this more clear. Calling:


    The Pathway Network might look something like:

    The most interesting part of the Cytoscape nested network framework is the ability to view the within pathway networks. Suppose you were looking at the Pathway Network and noticed the ErbB signaling network was highly significant and you wanted to investigate it by looking at the underlying gene network. To do this, one could simply right click on the ErbB signaling pathway node in the Pathway Network, and go to Nested Networks -> Go To Nested Network. Alternatively, the nested networks have the same name as the pathway they represent so one could simply navigate to the ErbB signaling pathway using the Network tab in the Control Panel on the left side of the Cytoscape GUI. To save time, the nested networks are not formatted. However, we can use the formatPathways function to format one or multiple using the default style.

    formatPathways(pathway_tests_rehe, "ErbB signaling pathway")

    The formatPathways function colors gene nodes based on FDR adjusted q-value. For more information see ?formatPathways.

    To export images from Cytoscape one can either use the GUI or the RCy3::exportImage() function.

  2. Cytoscape legend generated in R - Legends are difficult to incorporate on the plot within Cytoscape so we generate a legend in R. Alternatively it is also possible to generate an image directly in Cytoscape. The Cytoscape manual page has more information on that:

  3. Igraph plot - Lastly, for users that prefer R, plot.NetGSA will also generate a plot using igraph with the same layout and node color/node border color mapping as in Cytoscape.