Rentrez Tutorial

David winter

2016-10-26

Introduction: The NCBI, entrez and rentrez.

The NCBI shares a lot of data. At the time this document was compiled, there were 26.6 million papers in PubMed, including 4.2 million full-text records available in PubMed Central. The NCBI Nucleotide Database (which includes GenBank) has data for 219.2 million different sequences, and dbSNP describes 819.3 million different genetic variants. All of these records can be cross-referenced with the 1.2 million species in the NCBI taxonomy or 24.9 thousand disease-associated records in OMIM.

The NCBI makes this data available through a web interface, an FTP server and through a REST API called the Entrez Utilities (Eutils for short). This package provides functions to use that API, allowing users to gather and combine data from multiple NCBI databases in the comfort of an R session or script.

Getting started with the rentrez

To make the most of all the data the NCBI shares you need to know a little about their databases, the records they contain and the ways you can find those records. The NCBI provides extensive documentation for each of their databases and for the EUtils API that rentrez takes advantage of. There are also some helper functions in rentrez that help users learn their way around the NCBI’s databases.

First, you can use entrez_dbs() to find the list of available databases:

entrez_dbs()
##  [1] "pubmed"          "protein"         "nuccore"        
##  [4] "nucleotide"      "nucgss"          "nucest"         
##  [7] "structure"       "sparcle"         "genome"         
## [10] "annotinfo"       "assembly"        "bioproject"     
## [13] "biosample"       "blastdbinfo"     "books"          
## [16] "cdd"             "clinvar"         "clone"          
## [19] "gap"             "gapplus"         "grasp"          
## [22] "dbvar"           "gene"            "gds"            
## [25] "geoprofiles"     "homologene"      "medgen"         
## [28] "mesh"            "ncbisearch"      "nlmcatalog"     
## [31] "omim"            "orgtrack"        "pmc"            
## [34] "popset"          "probe"           "proteinclusters"
## [37] "pcassay"         "biosystems"      "pccompound"     
## [40] "pcsubstance"     "pubmedhealth"    "seqannot"       
## [43] "snp"             "sra"             "taxonomy"       
## [46] "unigene"         "gencoll"         "gtr"

There is a set of functions with names starting entrez_db_ that can be used to gather more information about each of these databases:

Functions that help you learn about NCBI databases

Function name Return
entrez_db_summary() Brief description of what the database is
entrez_db_searchable() Set of search terms that can used with this database
entrez_db_links() Set of databases that might contain linked records

For instance, we can get a description of the somewhat cryptically named database ‘cdd’…

entrez_db_summary("cdd")
##  DbName: cdd
##  MenuName: Conserved Domains
##  Description: Conserved Domain Database
##  DbBuild: Build160706-0602.1
##  Count: 52411
##  LastUpdate: 2016/07/06 12:08

… or find out which search terms can be used with the Sequence Read Archive (SRA) database (which contains raw data from sequencing projects):

entrez_db_searchable("sra")
## Searchable fields for database 'sra'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to publication 
##   FILT    Limits the records 
##   ACCN    Accession number of sequence 
##   TITL    Words in definition line 
##   PROP    Classification by source qualifiers and molecule type 
##   WORD    Free text associated with record 
##   ORGN    Scientific and common names of organism, and all higher levels of taxonomy 
##   AUTH    Author(s) of publication 
##   PDAT    Date sequence added to GenBank 
##   MDAT    Date of last update 
##   GPRJ    BioProject 
##   BSPL    BioSample 
##   PLAT    Platform 
##   STRA    Strategy 
##   SRC     Source 
##   SEL     Selection 
##   LAY     Layout 
##   RLEN    Percent of aligned reads 
##   ACS     Access is public or controlled 
##   ALN     Percent of aligned reads 
##   MBS     Size in megabases

Just how these ‘helper’ functions might be useful will become clearer once you’ve started using rentrez, so let’s get started.

Getting summary data: entrez_summary()

Having found the unique IDs for some records via entrez_search or entrez_link(), you are probably going to want to learn something about them. The Eutils API has two ways to get information about a record. entrez_fetch() returns ‘full’ records in varying formats and entrez_summary() returns less information about each record, but in relatively simple format. Very often the summary records have the information you are after, so rentrez provides functions to parse and summarise summary records.

The summary record

entrez_summary() takes a vector of unique IDs for the samples you want to get summary information from. Let’s start by finding out something about the paper describing Taxize, using its PubMed ID:

taxize_summ <- entrez_summary(db="pubmed", id=24555091)
taxize_summ
## esummary result with 42 items:
##  [1] uid               pubdate           epubdate         
##  [4] source            authors           lastauthor       
##  [7] title             sorttitle         volume           
## [10] issue             pages             lang             
## [13] nlmuniqueid       issn              essn             
## [16] pubtype           recordstatus      pubstatus        
## [19] articleids        history           references       
## [22] attributes        pmcrefcount       fulljournalname  
## [25] elocationid       doctype           srccontriblist   
## [28] booktitle         medium            edition          
## [31] publisherlocation publishername     srcdate          
## [34] reportnumber      availablefromurl  locationlabel    
## [37] doccontriblist    docdate           bookname         
## [40] chapter           sortpubdate       sortfirstauthor

Once again, the object returned by entrez_summary behaves like a list, so you can extract elements using $. For instance, we could convert our PubMed ID to another article identifier…

taxize_summ$articleids
##       idtype idtypen                           value
## 1     pubmed       1                        24555091
## 2        doi       3 10.12688/f1000research.2-191.v2
## 3        pmc       8                      PMC3901538
## 4        rid       8                        24563765
## 5        eid       8                        24555091
## 6    version       8                               2
## 7 version-id       8                               2
## 8      pmcid       5             pmc-id: PMC3901538;

…or see how many times the article has been cited in PubMed Central papers

taxize_summ$pmcrefcount
## [1] 7

Dealing with many records

If you give entrez_summary() a vector with more than one ID you’ll get a list of summary records back. Let’s get those Plasmodium vivax papers we found in the entrez_search() section back, and fetch some summary data on each paper:

vivax_search <- entrez_search(db = "pubmed",
                              term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)

rentrez provides a helper function, extract_from_esummary() that takes one or more elements from every summary record in one of these lists. Here it is working with one…

extract_from_esummary(multi_summs, "fulljournalname")
##                                                                                                                 24861816 
## "Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases" 
##                                                                                                                 24145518 
##                                                                                  "Antimicrobial agents and chemotherapy" 
##                                                                                                                 24007534 
##                                                                                                        "Malaria journal" 
##                                                                                                                 23230341 
##                                                                                     "The Korean journal of parasitology" 
##                                                                                                                 23043980 
##                                                                                              "Experimental parasitology" 
##                                                                                                                 20810806 
##                                                                  "The American journal of tropical medicine and hygiene" 
##                                                                                                                 20412783 
##                                                                                                           "Acta tropica" 
##                                                                                                                 19597012 
##                                                                                          "Clinical microbiology reviews" 
##                                                                                                                 17556611 
##                                                                  "The American journal of tropical medicine and hygiene" 
##                                                                                                                 17519409 
##                                                                                                                   "JAMA" 
##                                                                                                                 17368986 
##                                                                                                 "Trends in parasitology" 
##                                                                                                                 12374849 
##                                        "Proceedings of the National Academy of Sciences of the United States of America"

… and several elements:

date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount",  "title"))
knitr::kable(head(t(date_and_cite)), row.names=FALSE)
pubdate pmcrefcount title
2014 Aug Prevalence of mutations in the antifolates resistance-associated genes (dhfr and dhps) in Plasmodium vivax parasites from Eastern and Central Sudan.
2014 5 Prevalence of polymorphisms in antifolate drug resistance molecular marker genes pvdhfr and pvdhps in clinical isolates of Plasmodium vivax from Kolkata, India.
2013 Sep 5 2 Prevalence and patterns of antifolate and chloroquine drug resistance markers in Plasmodium vivax across Pakistan.
2012 Dec 11 Prevalence of drug resistance-associated gene mutations in Plasmodium vivax in Central China.
2012 Dec 7 Novel mutations in the antifolate drug resistance marker genes among Plasmodium vivax isolates exhibiting severe manifestations.
2010 Sep 15 Mutations in the antifolate-resistance-associated genes dihydrofolate reductase and dihydropteroate synthase in Plasmodium vivax isolates from malaria-endemic countries.

Fetching full records: entrez_fetch()

As useful as the summary records are, sometimes they just don’t have the information that you need. If you want a complete representation of a record you can use entrez_fetch, using the argument rettype to specify the format you’d like the record in.

Fetch DNA sequences in fasta format

Let’s extend the example given in the entrez_link() section about finding transcript for a given gene. This time we will fetch cDNA sequences of those transcripts.We can start by repeating the steps in the earlier example to get nucleotide IDs for refseq transcripts of two genes:

gene_ids <- c(351, 11647)
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transripts)
## [1] "1039766414" "1039766413" "1039766411" "1039766410" "1039766409"
## [6] "563317856"

Now we can get our sequences with entrez_fetch, setting rettype to “fasta” (the list of formats available for each database is give in this table):

all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
class(all_recs)
## [1] "character"
nchar(all_recs)
## [1] 55183

Congratulations, now you have a really huge character vector! Rather than printing all those thousands of bases we can take a peak at the top of the file:

cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
## >XM_006538500.2 PREDICTED: Mus musculus alkaline phosphatase,
## liver/bone/kidney (Alpl), transcript variant X5, mRNA
## GCGCCCGTGGCTTGCGCGACTCCCACGCGCGCGCTCCGCCGGTCCCGCAGTGACTGTCCCAGCCACGGTG
## GGGACACGTGGAAGGTCAGGCTCCCTGGGGACCCACGACCTCCCGCTCCGGACTCCGCGCGCATCTCTTG
## TGGCCTGGCAGGATGATGGACGTGGCGCCCGCTGAGCCGCTACCCAGGACCTCACCCTCGTGCTAAGCAC
## CTGCTCCCGGTGCCCACGCGCCTCCGTAGTCCACAGCTGCGCCCTTCGTGGTCCCTTGGCACTCTGTCCC
## GTTGGTGTCTAAAGTAGTTGGGGAGCAGCAGGAAGAAGGCACGTGCTGCGATCTTTGGCGGGAGAGATCG
## GAGACCGCGTGCTAGTGTCTGTCTGAGAG

If we wanted to use these sequences in some other application we could write them to file:

write(all_recs, file="my_transcripts.fasta")

Alternatively, if you want to use them within an R session
we could write them to a temporary file then read that. In this case I’m using read.dna() from the pylogenetics package ape (but not executing the code block in this vignette, so you don’t have to install that package):

temp <- tempfile()
write(all_recs, temp)
parsed_recs <- ape::read.dna(all_recs, temp)

Fetch a parsed XML document

Most of the NCBI’s databases can return records in XML format. In additional to downloading the text-representation of these files, entrez_fetch() can return objects parsed by the XML package. As an example, we can check out the Taxonomy database’s record for (did I mention they are amazing….) Tetrahymena thermophila, specifying we want the result to be parsed by setting parsed=TRUE:

Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
class(tax_rec)
## [1] "XMLInternalDocument" "XMLAbstractDocument"

The package XML (which you have if you have installed rentrez) provides functions to get information from these files. For relatively simple records like this one you can use XML::xmlToList:

tax_list <- XML::xmlToList(tax_rec)
tax_list$Taxon$GeneticCode
## $GCId
## [1] "6"
## 
## $GCName
## [1] "Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear"

For more complex records, which generate deeply-nested lists, you can use XPath expressions along with the function XML::xpathSApply or the extraction operatord [ and [[ to extract specific parts of the file. For instance, we can get the scientific name of each taxon in T. thermophila’s lineage by specifying a path through the XML

tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
tt_lineage[1:4]
## [[1]]
## <ScientificName>cellular organisms</ScientificName> 
## 
## [[2]]
## <ScientificName>Eukaryota</ScientificName> 
## 
## [[3]]
## <ScientificName>Alveolata</ScientificName> 
## 
## [[4]]
## <ScientificName>Ciliophora</ScientificName>

As the name suggests, XML::xpathSApply() is a counterpart of base R’s sapply, and can be used to apply a function to nodes in an XML object. A particularly useful function to apply is XML::xmlValue, which returns the content of the node:

XML::xpathSApply(tax_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
##  [1] "cellular organisms" "Eukaryota"          "Alveolata"         
##  [4] "Ciliophora"         "Intramacronucleata" "Oligohymenophorea" 
##  [7] "Hymenostomatida"    "Tetrahymenina"      "Tetrahymenidae"    
## [10] "Tetrahymena"

There are a few more complex examples of using XPath on the rentrez wiki

Using NCBI’s Web History features

When you are dealing with very large queries it can be time consuming to pass long vectors of unique IDs to and from the NCBI. To avoid this problem, the NCBI provides a feature called “web history” which allows users to store IDs on the NCBI servers then refer to them in future calls.

Post a set of IDs to the NCBI for later use: entrez_post()

If you have a list of many NCBI IDs that you want to use later on, you can post them to the NCBI’s severs. In order to provide a brief example, I’m going to post just one ID, the omim identifier for asthma:

upload <- entrez_post(db="omim", id=600807)
upload
## Web history object (QueryKey = 1, WebEnv = NCID_1_54492...)

The NCBI sends you back some information you can use to refer to the posted IDs. In rentrez, that information is represented as a web_history object.

Use a web_history object

Once you have those IDs stored on the NCBI’s servers, you are going to want to do something with them. The functions entrez_fetch() entrez_summary() and entrez_link() can all use web_history objects in exactly the same way they use IDs.

So, we could repeat the last example (finding variants linked to asthma), but this time using the ID we uploaded earlier

asthma_variants <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", web_history=upload)
asthma_variants
## elink object with contents:
##  $web_histories: Objects containing web history information

… if we want to get some genetic information about these variants we need to map our clinvar IDs to SNP IDs:

snp_links <- entrez_link(dbfrom="clinvar", db="snp", 
                         web_history=asthma_variants$web_histories$omim_clinvar,
                         cmd="neighbor_history")
snp_summ <- entrez_summary(db="snp", web_history=snp_links$web_histories$clinvar_snp)
knitr::kable(extract_from_esummary(snp_summ, c("chr", "fxn_class", "global_maf")))
41364547 11558538 2303067 20541
chr 11 2 5 5
fxn_class intron-variant,utr-variant-5-prime missense,reference,utr-variant-5-prime missense,reference missense,reference
global_maf A=0.0036/18 T=0.0595/298 G=0.4331/2169 A=0.2700/1352

If you really wanted to you could also use web_history objects to download all those thousands of COI sequences. When downloading large sets of data, it is a good idea to take advantage of the arguments retmax and restart to split the request up into smaller chunks. For instance, we could get the first 200 sequences in 50-sequence chunks:

(note: this code block is not executed as part of the vignette to save time and bandwidth):

for( seq_start in seq(1,200,50)){
    recs <- entrez_fetch(db="nuccore", web_history=snail_coi$web_history,
                         rettype="fasta", retmax=50, retstart=seq_start)
    cat(recs, file="snail_coi.fasta", append=TRUE)
    cat(seq_start+49, "sequences downloaded\r")
}

What next ?

This tutorial has introduced you to the core functions of rentrez, there are almost limitless ways that you could put them together. Check out the wiki for more specific examples, and be sure to read the inline-documentation for each function. If you run into problem with rentrez, or just need help with the package and Eutils please contact us by opening an issue at the github repository