An overview of executing searches and generating citations using occCite

Hannah L. Owens

Cory Merow

Brian Maitner

Jamie M. Kass

Vijay Barve

Robert Guralnick



We have entered the age of data-intensive scientific discovery. As data sets increase in complexity and heterogeneity, we must preserve the cycle of data citation from primary data sources to aggregating databases to research products and back to primary data sources. The citation cycle keeps science transparent, but it is also key to supporting primary providers by documenting the use of their data. The Global Biodiversity Information Facility (GBIF), Botanical Information and Ecology Network (BIEN), and other data aggregators have made great strides in harvesting citation data from research products and linking them back to primary data providers. However, this only works if those who publish research products cite primary data sources in the first place. We developed occCite, a set of R-based tools for downloading, managing, and citing biodiversity data, to help close the data provenance cycle. These tools preserve links between occurrence data and primary providers once researchers download aggregated data, and facilitate the citation of primary data providers in research papers.

The occCite workflow follows a three-step process. First, the user inputs one or more taxonomic names (or a phylogeny). occCite then rectifies these names by checking them against one or more taxonomic databases, which can be specified by the user (see the Global Names List). The results of the taxonomic rectification are then kept in an occCiteData object in local memory. Next, occCite takes the occCiteData object and user-defined search parameters to query BIEN (through rbien) and/or GBIF (through rgbif) for records. The results are appended to the occCiteData object, along with metadata on the search. Finally, the user can pass the occCiteData object to occCitation, which compiles citations for the primary providers, database aggregators, and R packages used to build the dataset.
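
The three steps above map onto three function calls. The wrapper below is only a minimal sketch of how they chain together, not part of occCite: `run_occcite_workflow` and its argument names are hypothetical, while `studyTaxonList()`, `occQuery()`, and `occCitation()` are the occCite functions used elsewhere in this vignette. It assumes occCite is installed and that a valid login object is supplied when GBIF is queried.

```r
# Hypothetical convenience wrapper around the three occCite steps.
# Nothing is queried until the function is actually called.
run_occcite_workflow <- function(taxa,
                                 taxonomy_sources = "NCBI",
                                 occurrence_sources = c("gbif", "bien"),
                                 gbif_login = NULL,
                                 download_dir = tempdir()) {
  if (!requireNamespace("occCite", quietly = TRUE)) {
    stop("The occCite package must be installed to run this workflow.")
  }
  # Step 1: rectify the input names against a taxonomic database.
  taxon_list <- occCite::studyTaxonList(x = taxa,
                                        datasources = taxonomy_sources)
  # Step 2: query the aggregators; results are stored in the occCiteData object.
  occ_data <- occCite::occQuery(x = taxon_list,
                                datasources = occurrence_sources,
                                GBIFLogin = gbif_login,
                                GBIFDownloadDirectory = download_dir)
  # Step 3: compile citations for providers, aggregators, and packages.
  citations <- occCite::occCitation(occ_data)
  list(data = occ_data, citations = citations)
}

# Usage (requires occCite, network access, and a real GBIF login):
# result <- run_occcite_workflow("Protea cynaroides", gbif_login = GBIFLogin)
```

Returning both the occCiteData object and the citations keeps the data and their provenance together, which is the point of the workflow.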

Future iterations of occCite will track citation data through the data cleaning process and provide a series of visualizations on raw query results and final data sets. They will also provide data citations in a format congruent with best-practice recommendations for large biodiversity data sets. Based on these data citation tools, we will also propose a new set of standards for citing primary biodiversity data in published research articles that provides due credit to contributors and allows them to track the use of their work. Keep checking back!


If you plan to query GBIF, you will need to provide your GBIF user login information. We have provided a dummy login below to show you the format; you will need to supply actual account information. This is because you will actually be downloading all of the records available for the species using occ_download(), instead of getting results from occ_search(), which has a hard limit of 200,000 occurrences.

#Creating a GBIF login
GBIFLogin <- GBIFLoginManager(user = "occCiteTester",
                              email = "****",
                              pwd = "12345")
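
If you would rather not hard-code a real password in a script, one option is to read the credentials from environment variables. The sketch below is an assumption about how you might organize this, not an occCite feature: the helper `gbif_credentials_from_env` is hypothetical, and the variable names `GBIF_USER`, `GBIF_EMAIL`, and `GBIF_PWD` follow the convention used by rgbif.

```r
# Hypothetical helper: collect GBIF credentials from environment variables
# and return them as a list whose names match GBIFLoginManager()'s arguments.
gbif_credentials_from_env <- function() {
  vars <- c(user = "GBIF_USER", email = "GBIF_EMAIL", pwd = "GBIF_PWD")
  values <- Sys.getenv(vars, unset = NA)
  names(values) <- names(vars)
  # Treat unset and empty variables the same way.
  absent <- vars[is.na(values) | values == ""]
  if (length(absent) > 0) {
    stop("Missing environment variable(s): ", paste(absent, collapse = ", "))
  }
  as.list(values)
}

# Usage (requires occCite and a real GBIF account):
# creds <- gbif_credentials_from_env()
# GBIFLogin <- do.call(GBIFLoginManager, creds)
```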

Advanced features

Loading data from previous GBIF searches

Querying GBIF can take quite a bit of time, especially for multiple species and/or well-known species. In this case, you may wish to access previously downloaded data sets from your computer by specifying the general location of your downloaded .zip files. occQuery will crawl through your specified GBIFDownloadDirectory to collect all the .zip files contained in that folder and its subfolders. It will then import the most recent downloads that match your taxon list. If you chose BIEN as well as GBIF, these GBIF data will be appended to the BIEN search results just as they are in a simple real-time search, as was shown above. checkPreviousGBIFDownload is TRUE by default, but if loadLocalGBIFDownload is TRUE, occQuery will ignore checkPreviousGBIFDownload. It is also worth noting that occCite does not currently support mixed data download sources. That is, you cannot do GBIF queries for some taxa, download previously prepared data sets for others, and load the rest from local data sets on your computer.
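
To make the directory crawl concrete, here is a minimal base R sketch of collecting .zip files from a folder and its subfolders and ordering them from newest to oldest. This is not occCite's implementation: `find_latest_zips` is a hypothetical helper, it uses file modification time as a stand-in for "most recent", and it does not attempt the taxon matching that occQuery performs.

```r
# Hypothetical helper: list every .zip under download_dir (recursively),
# newest first according to file modification time.
find_latest_zips <- function(download_dir) {
  zips <- list.files(download_dir, pattern = "\\.zip$",
                     recursive = TRUE, full.names = TRUE)
  if (length(zips) == 0) {
    # Nothing found (or the directory does not exist): return an empty table.
    return(data.frame(path = character(0), modified = Sys.time()[0],
                      stringsAsFactors = FALSE))
  }
  info <- data.frame(path = zips, modified = file.mtime(zips),
                     stringsAsFactors = FALSE)
  info[order(info$modified, decreasing = TRUE), , drop = FALSE]
}

# Usage: inspect what a local download directory contains.
# find_latest_zips(system.file("extdata/", package = "occCite"))
```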

# Simple load
myOldOccCiteObject <- occQuery(x = "Protea cynaroides",
                               datasources = c("gbif", "bien"),
                               GBIFLogin = NULL,
                               GBIFDownloadDirectory = system.file('extdata/', package='occCite'),
                               loadLocalGBIFDownload = TRUE,
                               checkPreviousGBIFDownload = FALSE)

Here is the result. Look familiar?

# GBIF search results
head(myOldOccCiteObject@occResults$`Protea cynaroides`$GBIF$OccurrenceTable)
# The full summary
summary(myOldOccCiteObject)

Getting citation data works the exact same way with previously-downloaded data as it does from a fresh data set.

# Get citations
myOldOccCitations <- occCitation(myOldOccCiteObject)
# Print citations
print(myOldOccCitations)

Note that you can also load multiple species using either a vector of species names or a phylogeny (provided you have previously downloaded data for all of the species of interest), and you can load occurrences from non-GBIF data sources (e.g. BIEN) in the same query.

occCite with a Phylogeny

Here is an example of how such a search is structured, using an unpublished phylogeny of billfishes.

#Get tree
treeFile <- system.file("extdata/Fish_12Tax_time_calibrated.tre", package='occCite')
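# Read the tree; this assumes the file is in NEXUS format (use ape::read.tree() for Newick)
phylogeny <- ape::read.nexus(treeFile)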
tree <- ape::extract.clade(phylogeny, 18)
#Query databases for names
myPhyOccCiteObject <- studyTaxonList(x = tree, datasources = "NCBI")
# Query GBIF for occurrence data
myPhyOccCiteObject <- occQuery(x = myPhyOccCiteObject,
                               datasources = "gbif",
                               GBIFDownloadDirectory = system.file('extdata/', package='occCite'),
                               loadLocalGBIFDownload = TRUE,
                               checkPreviousGBIFDownload = FALSE)
# What does a multispecies query look like?
summary(myPhyOccCiteObject)
##  User query type: User-supplied phylogeny.
##  Sources for taxonomic rectification: NCBI
##  Taxonomic cleaning results:     
##                   Input Name                 Best Match
## 1           Istiompax_indica           Istiompax indica
## 2             Kajikia_albida             Kajikia albida
## 3              Kajikia_audax              Kajikia audax
## 4 Tetrapturus_angustirostris Tetrapturus angustirostris
## 5         Tetrapturus_belone         Tetrapturus belone
## 6        Tetrapturus_georgii        Tetrapturus georgii
## 7      Tetrapturus_pfluegeri      Tetrapturus pfluegeri
##   Taxonomic Databases w/ Matches
## 1                           NCBI
## 2                           NCBI
## 3                           NCBI
## 4                           NCBI
## 5                           NCBI
## 6                           NCBI
## 7                           NCBI
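
In the table above, the "Input Name" column holds the tip labels as they appear in the tree (with underscores), and "Best Match" holds the rectified binomials. The base R sketch below illustrates only the simplest part of that mapping, replacing underscores with spaces; the actual rectification checks names against NCBI. `tip_labels_to_names` is a hypothetical helper, not an occCite function.

```r
# Hypothetical helper: turn phylogeny tip labels into space-separated names.
tip_labels_to_names <- function(tip_labels) {
  trimws(gsub("_", " ", tip_labels, fixed = TRUE))
}

# Tip labels taken from the summary output above.
labels <- c("Istiompax_indica", "Kajikia_albida", "Tetrapturus_belone")
tip_labels_to_names(labels)
# returns "Istiompax indica" "Kajikia albida" "Tetrapturus belone"
```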

When you have results for multiple species, as in this case, you can also plot the summary figures either for the whole search…

plot(myPhyOccCiteObject)

or you can plot the results by species!

plot(myPhyOccCiteObject, bySpecies = TRUE)

And then you can print out the citations, separated by species (or not, but in this example, they’re separate).

# Get citations
myPhyOccCitations <- occCitation(myPhyOccCiteObject)

# Print citations as text with accession dates.
print(myPhyOccCitations, bySpecies = TRUE)