Introduction to the msigdbr package

Overview

Performing pathway analysis is a common task in genomics and there are many available software tools, many of which are R-based. Depending on the tool, it may be necessary to import the pathways into R, translate genes to the appropriate species, convert between symbols and IDs, and format the object in the required way.

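The msigdbr R package provides the Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software as a data frame with one gene per row, for multiple species, with both gene symbols and Entrez Gene IDs.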

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

Check the available species.

msigdbr_show_species()
#> [1] "Bos taurus" "Caenorhabditis elegans" "Canis lupus familiaris"
#> [4] "Danio rerio" "Drosophila melanogaster" "Gallus gallus"
#> [7] "Homo sapiens" "Mus musculus" "Rattus norvegicus"
#> [10] "Saccharomyces cerevisiae" "Sus scrofa"

Retrieve all human gene sets.

m_df = msigdbr(species = "Homo sapiens")
head(m_df)
#> # A tibble: 6 x 9
#> gs_name gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 AAACCAC_MIR140 M12609 C3 MIR ABCC4 Homo sap… 10257 ABCC4 <NA>
#> 2 AAACCAC_MIR140 M12609 C3 MIR ACTN4 Homo sap… 81 ACTN4 <NA>
#> 3 AAACCAC_MIR140 M12609 C3 MIR ACVR1 Homo sap… 90 ACVR1 <NA>
#> 4 AAACCAC_MIR140 M12609 C3 MIR ADAM9 Homo sap… 8754 ADAM9 <NA>
#> 5 AAACCAC_MIR140 M12609 C3 MIR ADAMTS5 Homo sap… 11096 ADAMTS5 <NA>
#> 6 AAACCAC_MIR140 M12609 C3 MIR AGER Homo sap… 177 AGER <NA>

Retrieve mouse hallmark collection gene sets.

m_df = msigdbr(species = "Mus musculus", category = "H")
head(m_df)
#> # A tibble: 6 x 9
#> gs_name gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 HALLMARK… M5905 H "" ABCA1 Mus musc… 11303 Abca1 Inparanoid,H…
#> 2 HALLMARK… M5905 H "" ABCB8 Mus musc… 74610 Abcb8 Inparanoid,P…
#> 3 HALLMARK… M5905 H "" ACAA2 Mus musc… 52538 Acaa2 Inparanoid,O…
#> 4 HALLMARK… M5905 H "" ACADL Mus musc… 11363 Acadl Inparanoid,O…
#> 5 HALLMARK… M5905 H "" ACADM Mus musc… 11364 Acadm Inparanoid,P…
#> 6 HALLMARK… M5905 H "" ACADS Mus musc… 11409 Acads Inparanoid,O…

Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.

m_df = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
head(m_df)
#> # A tibble: 6 x 9
#> gs_name gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 ABBUD_LI… M1423 C2 CGP AHNAK Mus musc… 66395 Ahnak Inparanoid,H…
#> 2 ABBUD_LI… M1423 C2 CGP ALCAM Mus musc… 11658 Alcam Inparanoid,P…
#> 3 ABBUD_LI… M1423 C2 CGP ANKRD40 Mus musc… 71452 Ankrd40 Inparanoid,P…
#> 4 ABBUD_LI… M1423 C2 CGP BCKDHB Mus musc… 12040 Bckdhb Inparanoid,P…
#> 5 ABBUD_LI… M1423 C2 CGP C16orf89 Mus musc… 239691 AU021092 Inparanoid,P…
#> 6 ABBUD_LI… M1423 C2 CGP CAPN9 Mus musc… 73647 Capn9 Inparanoid,O…

The msigdbr() function output can also be manipulated as a standard data frame.

library(dplyr) # provides the %>% pipe used here and below
m_df = msigdbr(species = "Mus musculus") %>% dplyr::filter(gs_cat == "H")
head(m_df)
#> # A tibble: 6 x 9
#> gs_name gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 HALLMARK… M5905 H "" ABCA1 Mus musc… 11303 Abca1 Inparanoid,H…
#> 2 HALLMARK… M5905 H "" ABCB8 Mus musc… 74610 Abcb8 Inparanoid,P…
#> 3 HALLMARK… M5905 H "" ACAA2 Mus musc… 52538 Acaa2 Inparanoid,O…
#> 4 HALLMARK… M5905 H "" ACADL Mus musc… 11363 Acadl Inparanoid,O…
#> 5 HALLMARK… M5905 H "" ACADM Mus musc… 11364 Acadm Inparanoid,P…
#> 6 HALLMARK… M5905 H "" ACADS Mus musc… 11409 Acads Inparanoid,O…

Integrating with Pathway Analysis Packages

Use the gene sets data frame for clusterProfiler (for genes as Entrez Gene IDs).

m_t2g = m_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = genes_entrez, TERM2GENE = m_t2g, ...)

Use the gene sets data frame for clusterProfiler (for genes as gene symbols).

m_t2g = m_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = genes_symbols, TERM2GENE = m_t2g, ...)

Use the gene sets data frame for fgsea.

m_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
fgsea(pathways = m_list, ...)
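To make the shape of the resulting object concrete, here is a minimal base R sketch of the same split() step on a made-up data frame. The gene set names and the assignment of symbols to sets are toy values for illustration, not MSigDB contents.

```r
# Toy stand-in for the msigdbr() output (hypothetical values, not real MSigDB data).
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Acadl", "Acadl", "Acadm", "Acads"),
  stringsAsFactors = FALSE
)

# split() groups the gene symbols by gene set name, giving a named list
# with one character vector per gene set.
toy_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)

str(toy_list)
```

A named list of character vectors like this is the kind of object passed as the pathways argument in the fgsea() call.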

Questions and Concerns

Which version of MSigDB was used?

This package was generated with MSigDB v6.2 (released July 2018). The MSigDB version is used as the base of the package version. You can check the installed version with packageVersion("msigdbr").

Can’t I just download the gene sets from MSigDB?

Yes. You can then import the GMT files with getGmt() from the GSEABase package. The GMTs only include the human genes, even for gene sets generated from mouse data. If you are not working with human data, you then have to convert the MSigDB genes to your organism or your genes to human.
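For readers unfamiliar with the GMT layout, the following base R sketch writes a tiny GMT file and parses it by hand. It is only an illustration of the file structure (set name, description, then member genes, all tab-separated), not a replacement for getGmt(). The set names and descriptions are made up.

```r
# Write a tiny GMT file to a temporary location (hypothetical gene sets).
# Each line is: set name, description, then the member genes, tab-separated.
gmt_lines <- c(
  "SET_A\tdescription_a\tABCA1\tACADL",
  "SET_B\tdescription_b\tACADL\tACADM\tACADS"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Parse it with base R: split each line on tabs, use field 1 as the name,
# skip field 2 (the description), and keep the rest as genes.
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
gmt_list <- lapply(fields, function(x) x[-(1:2)])
names(gmt_list) <- vapply(fields, function(x) x[1], character(1))

# Stack into a long data frame with one gene per row.
gmt_df <- data.frame(
  gs_name = rep(names(gmt_list), lengths(gmt_list)),
  gene_symbol = unlist(gmt_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(gmt_df)
```

The symbols in such a file are human, so any conversion to another organism would still have to be done separately.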

Can’t I just convert between human and mouse genes by adjusting gene capitalization?

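That will work for most genes, but not all. For example, in the mouse CGP output above, the human gene C16orf89 corresponds to the mouse gene AU021092.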
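A small base R sketch of the naive capitalization rule shows where it breaks down. The helper human_to_mouse_case() is a hypothetical name introduced here for illustration. The human and mouse symbol pairs are taken from the mouse CGP output shown earlier.

```r
# Naive conversion: uppercase first letter, lowercase the rest.
# (Hypothetical helper, not part of msigdbr.)
human_to_mouse_case <- function(symbols) {
  paste0(toupper(substr(symbols, 1, 1)), tolower(substring(symbols, 2)))
}

# Human symbols and their mouse orthologs, copied from the CGP output above.
human <- c("AHNAK", "ALCAM", "C16orf89")
mouse <- c("Ahnak", "Alcam", "AU021092")

# Compare the guessed symbols with the actual mouse symbols.
guessed <- human_to_mouse_case(human)
print(data.frame(human, mouse, guessed, matches = guessed == mouse))
```

The first two symbols are recovered correctly, while the third is not, since the mouse gene has an unrelated name.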

Can’t I just convert human genes to any organism myself?

Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren’t there already other similar tools?

There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file. MSigDF is based on the WEHI resource, so it provides the same data, but converted to a more tidyverse-friendly data frame. When msigdbr was initially released, all of them were multiple releases behind the latest version of MSigDB, so they are possibly no longer maintained.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
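The selection rule in the last sentence can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's actual implementation: the candidate table, its column names other than sources, and all of its values are made up, and ties are broken by taking the first row, which is an arbitrary choice.

```r
# Hypothetical candidate orthologs: each row is one human-to-mouse candidate,
# with the comma-separated databases that support it (made-up values).
candidates <- data.frame(
  human_gene = c("GENE1", "GENE1", "GENE2"),
  ortholog   = c("Gene1a", "Gene1b", "Gene2"),
  sources    = c("Inparanoid,OMA,Panther", "OrthoDB", "Inparanoid,HomoloGene"),
  stringsAsFactors = FALSE
)

# Count the supporting databases for each candidate.
candidates$num_sources <- lengths(strsplit(candidates$sources, ",", fixed = TRUE))

# For each human gene, keep the candidate with the most support.
# Ties go to the first row (an assumption made for this sketch).
best <- do.call(rbind, lapply(split(candidates, candidates$human_gene), function(d) {
  d[which.max(d$num_sources), , drop = FALSE]
}))
rownames(best) <- NULL
print(best)
```

Here GENE1 keeps Gene1a (three supporting databases) over Gene1b (one), and GENE2 keeps its only candidate.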