# pubtatordb

## Overview

PubTator is an NCBI product that contains detailed annotations of abstracts found on PubMed. This makes it a very useful research tool. While PubTator does provide an API, the use of an API is inconvenient for high-throughput analyses and also requires a guaranteed internet connection. Querying a local PubTator database is better suited for high-throughput analyses. The package pubtatordb makes it easy to quickly start using a local copy of PubTator’s data.

## Installation

You can install the released version of pubtatordb from CRAN with:

install.packages("pubtatordb")

The version on GitHub can be downloaded using the devtools package with:

install.packages("devtools")
devtools::install_github("mamc-dci/pubtatordb")

## Example

# Load the package.
library(pubtatordb)

After loading the package, database setup and querying can be accomplished in four steps.

• Database setup
• Connect to the database
• Query the database

After the user manually creates a folder to store the data, the user can define the path to that folder and then download the data to that location:

# Download the data.
# Use the full path. Writing to the temp directory is not recommended.
download_pt(download_dir)

After defining the path to the download directory created above, the database can be created with:

# Define the data directory, a subdirectory of the above directory.

# Create the database.
pt_to_sql(
pubtator_path,
skip_behavior = FALSE,
remove_behavior = TRUE
)

If the .gz files from PubTator have already been extracted, their extraction can be skipped with the skip_behavior argument. After their insertion into the database, both the .gz and uncompressed files can be removed using the remove_behavior argument.

A connection can be created to the database using pt_connector. Note that this is a wrapper for the dbConnect function of the DBI package.

# Create a connection to the database.
db_con <- pt_connector(pubtator_path)

Querying the data is accomplished using the pt_select function. The first five rows of the gene table can be selected with:

# Query the data.
pt_select(
db_con,
"gene",
columns = NULL,
keys = NULL,
keytype = NULL,
limit = 5
)

The first five results for PMIDs in which the genes with ENTREZ IDs 7356 or 4199 were mentioned can be selected with:

# Query the data.
pt_select(
db_con,
"gene",
columns = c("PMID", "ENTREZID"),
keys = c("7356", "4199"),
keytype = "ENTREZID",
limit = 5
)

## Other tables

PubTator has several datasets. The names of tables in the database can be obtained with:

pt_tables(db_con)

The column names for a particular table can be accessed with:

pt_columns(db_con, "species")

## Note

The citation information for PubTator can be found on the PubTator website or with:

pubtator_citations()
#> Please cite PubTator in any publications:
#> 1. Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44
#> 2. Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
#> 3. Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012

## Disclaimer

The views expressed are those of the author(s) and do not reflect the official policy of the Department of the Army, the Department of Defense or the U.S. Government.