biomartr


Genomic Data Retrieval with R

Motivation:

This package was born out of my own frustration when trying to automate genomic data retrieval and to create computationally reproducible scripts for large-scale genomics studies. Since I couldn't find easy-to-use and fully reproducible software libraries, I sat down and tried to implement a framework that would enable anyone to automate and standardize the genomic data retrieval process. I hope that this package is useful to others as well and that it helps to promote reproducible research in genomics studies.

I happily welcome anyone who wishes to contribute to this project :) Just drop me an email.

Short package description:

The rapidly growing number of sequenced genomes allows us to perform a new type of biological research. Using a comparative approach, these genomes provide us with new insights into how biological information is encoded on the molecular level and how this information changes over evolutionary time.

The first step of any genome-based study, however, is to retrieve genomes and their annotation from databases. To automate the retrieval of this information on a meta-genomic scale, the biomartr package provides interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses. In addition, biomartr aims to address the genome version crisis. With biomartr, users can control, and are automatically informed about, the genome versions they retrieve. Many large-scale genomics studies lack this information, and reproducibility and data interpretation become nearly impossible when the genome version is not documented.

In detail, biomartr automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases.

Furthermore, an interface to the Ensembl Biomart database allows users to retrieve functional annotation for genomic loci using a novel, organism-centric search strategy. In addition, users can download entire databases such as NCBI RefSeq, NCBI nr, NCBI nt, and NCBI Genbank, as well as ENSEMBL and ENSEMBLGENOMES, with only one command.

Citation

I would be very grateful if you could cite the following paper in case biomartr was useful for your own research. I plan on vastly extending the biomartr functionality and usability in the coming years. Many thanks in advance :)

Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821.

Feedback

I truly value your opinion and suggestions for improvement. Hence, I would be extremely grateful if you could take this one-minute, three-question survey (https://goo.gl/forms/Qaoxxjb1EnNSLpM02) so that I can learn how to improve biomartr in the best possible way. Many thanks in advance.

Installation

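# install biomartr (version 0.8.0 at the time of writing)
# biocLite() has been retired; BiocManager is its replacement on current R versions
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomartr")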

Example

Collection Retrieval

The automated retrieval of collections (= Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) makes sure that the genome file of an organism matches the CDS, proteome, RNA, GFF, etc. files and that all of them were generated from the same genome assembly version. One reason why genomics studies fail in computational and biological reproducibility is that it is often unclear whether the CDS, proteome, RNA, GFF, etc. files used in an analysis were generated from the same genome assembly file, i.e. the same genome assembly version. To avoid this seemingly trivial mistake, we encourage users to retrieve genome file collections using the biomartr function getCollection() and to attach the corresponding output as Supplementary Data to the respective genomics study to ensure computational and biological reproducibility.

# download collection for Saccharomyces cerevisiae
getCollection(db = "refseq",
              organism = "Saccharomyces cerevisiae",
              path = file.path("refseq", "Collections"))

Internally, the getCollection() function will now generate a folder named refseq/Collections/Saccharomyces_cerevisiae and will store all genome and annotation files for Saccharomyces cerevisiae in that folder. In addition, the exact genome and annotation version will be logged in the doc folder.

For this purpose, a text file named doc_Saccharomyces_cerevisiae_db_refseq.txt is generated. The information stored in this log file is structured as follows:

File Name: Saccharomyces_cerevisiae_assembly_stats_refseq.txt
Organism Name: Saccharomyces_cerevisiae
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt
Download_Date: Wed Jun 27 15:21:51 2018
refseq_category: reference genome
assembly_accession: GCF_000146045.2
bioproject: PRJNA128
biosample: NA
taxid: 559292
infraspecific_name: strain=S288C
version_status: latest
release_type: Major
genome_rep: Full
seq_rel_date: 2014-12-17
submitter: Saccharomyces Genome Database

In an ideal world this reference file could then be included as supplementary information in any life science publication that relies on genomic information so that reproducibility of experiments and analyses becomes achievable.
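Because the log consists of plain "key: value" lines, it can also be read back into R, for example to check which assembly accession a downstream analysis was based on. The following is a minimal sketch using base R only. parse_doc_log() is a hypothetical helper written for illustration and is not part of biomartr, and the input lines are copied from the log shown above rather than read from disk.

```r
# Hypothetical helper (not part of biomartr): parse "key: value" lines
# of a documentation log into a named character vector.
parse_doc_log <- function(lines) {
  lines <- trimws(lines)
  # Split at the first ": " only, because URLs and dates contain further colons.
  sep <- regexpr(": ", lines, fixed = TRUE)
  ok <- sep > 0  # skip lines that contain no ": " separator
  keys <- substr(lines[ok], 1, sep[ok] - 1)
  values <- substr(lines[ok], sep[ok] + 2, nchar(lines[ok]))
  stats::setNames(values, keys)
}

# Example input: a few lines copied from the log shown above.
# For a real log, pass the result of readLines() on the log file instead.
log_lines <- c(
  "Organism Name: Saccharomyces_cerevisiae",
  "Database: NCBI refseq",
  "URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt",
  "Download_Date: Wed Jun 27 15:21:51 2018",
  "assembly_accession: GCF_000146045.2",
  "seq_rel_date: 2014-12-17"
)

doc <- parse_doc_log(log_lines)
doc[["assembly_accession"]]
doc[["seq_rel_date"]]
```

Comparing the assembly_accession entries of several such logs is one simple way to confirm that all files of a collection refer to the same assembly version.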

Genome retrieval of hundreds of genomes using only one command

Download all mammalian vertebrate genomes from NCBI RefSeq via:

# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")

All genomes are stored in a folder named after the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.
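After a bulk download it can be useful to record which files actually arrived in the output folder. The sketch below uses base R only. inventory_downloads() is a hypothetical helper that is not part of biomartr, and the demo folder and file name are placeholders created in a temporary directory so that the example runs without downloading anything.

```r
# Hypothetical helper (not part of biomartr): list every file below `folder`
# together with its size in megabytes.
inventory_downloads <- function(folder) {
  files <- list.files(folder, full.names = TRUE, recursive = TRUE)
  data.frame(
    file = basename(files),
    size_mb = round(file.size(files) / 1024^2, 2),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary stand-in for the "vertebrate_mammalian" folder.
# The file name below is a placeholder, not a name produced by biomartr.
demo_folder <- file.path(tempdir(), "vertebrate_mammalian_demo")
dir.create(demo_folder, showWarnings = FALSE)
writeLines(c(">seq1", "ACGT"), file.path(demo_folder, "example_genome.fna"))

inventory <- inventory_downloads(demo_folder)
print(inventory)
```

For a real download, the folder passed to the helper would be the kingdom folder (or the custom output folder) described above.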

Platforms

Find biomartr also at OmicTools.

Frequently Asked Questions (FAQs)

Please find all FAQs here.

Discussions and Bug Reports

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:

twitter: HajkDrost or email

For Bug Reports: Please open an issue.

Tutorials

Getting Started with biomartr:

Users can also read the tutorials within R (e.g. in RStudio):

# load the biomartr package
library(biomartr)

# look for all tutorials (vignettes) available in the biomartr package
# this will open your web browser
browseVignettes("biomartr")

NEWS

The current status of the package as well as a detailed history of the functionality of each version of biomartr can be found in the NEWS section.

Install Developer Version

Some bug fixes or new functionality may not be available on CRAN yet, but are already included in the developer version here on GitHub. To download and install the most recent version of biomartr run:

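# install the developer version of biomartr on your system
# biocLite() has been retired; BiocManager::install() also accepts GitHub
# "user/repo" names, provided the 'remotes' package is installed
install.packages(c("BiocManager", "remotes"))
BiocManager::install("ropensci/biomartr")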

Genomic Data Retrieval

Meta-Genome Retrieval

Genome Retrieval

Import Downloaded Files

Database Retrieval

BioMart Queries

Performing Gene Ontology queries

Gene Ontology

Download Developer Version On Windows Systems

# On Windows, this won't work - see ?build_github_devtools
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

# When working with Windows, first you need to install Rtools.
# Rtools is not an R package; it is a separate installer available from CRAN.

# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:

devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

# and then call it from the library
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

Troubleshooting on Windows Machines

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.