BeeBDC BeeBDC logo of a cuckoo bee sweeping up occurrence records in South America

BeeBDC: an occurrence data cleaning package


Overview

The consistent implementation of biodiversity data continues to be a challenge for researchers. We present the BeeBDC package, which provides novel and updated functions for flagging, cleaning, and visualising occurrence datasets. Our functions are mostly general with regard to taxon; however, we also provide some functions and data that are specific to bee occurrence data. We build upon functions and conventions in other fantastic R packages, especially bdc and CoordinateCleaner, while also removing many dependencies on sp-related packages. Hence, our package name is Bee Biodiversity Data Cleaning (BeeBDC).

On our Articles page, we provide a full workflow that uses BeeBDC, bdc, and CoordinateCleaner to clean occurrence data. We encourage users to read and cite our primary publication.

Structure of BeeBDC

The BeeBDC toolkit is organized using conventions similar to those of bdc and CoordinateCleaner.

Like in the bdc package, we provide a suggested workflow here. While our functions can mostly be run out of order, there are a few exceptions mentioned throughout the documentation. Additionally, many functions require the database_id column that is generated early on in the BeeBDC or bdc workflows. When running very large datasets (e.g., the global bee occurrence dataset), you may require a machine with at least ~32 GB of RAM. However, we do try to provide work-arounds, especially by allowing some functions to be run in smaller, more manageable chunks.

Paper DOI - https://doi.org/10.1101/2023.06.30.547152; Package GitHub - https://github.com/jbdorey/BeeBDC/
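The idea of breaking a large dataset into chunks can be pictured with a minimal base R sketch. This is not how BeeBDC implements chunking internally; runInChunks() and flagLatitude() are hypothetical helpers, the data are invented, and only the database_id column name comes from the text above.

```r
# Hypothetical helper (not a BeeBDC function): apply a flagging function to a
# data frame one chunk of rows at a time, then bind the results back together.
runInChunks <- function(data, chunkSize, flagFun) {
  # Assign each row to a chunk index: 1, 1, ..., 2, 2, ...
  chunkIndex <- ceiling(seq_len(nrow(data)) / chunkSize)
  chunks <- split(data, chunkIndex)
  # Process each chunk independently
  processed <- lapply(chunks, flagFun)
  out <- do.call(rbind, processed)
  rownames(out) <- NULL
  out
}

# Invented occurrence data; database_id stands in for the identifier column
occurrences <- data.frame(
  database_id = paste0("example_", 1:10),
  decimalLatitude = c(-35.1, 95.0, 12.3, NA, 0, -91.2, 45.5, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Hypothetical flag: TRUE when latitude is present and within [-90, 90]
flagLatitude <- function(chunk) {
  chunk$.latInRange <- !is.na(chunk$decimalLatitude) &
    abs(chunk$decimalLatitude) <= 90
  chunk
}

flagged <- runInChunks(occurrences, chunkSize = 4, flagFun = flagLatitude)
```

Because each chunk is processed on its own, this pattern only suits checks that look at one record at a time; checks that compare records with each other (such as de-duplication) need the full dataset or a different strategy.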

Workflow figure from Dorey et al. 2023

Installation

You can install BeeBDC from CRAN or GitHub.

  # Install BeeBDC from CRAN
install.packages("BeeBDC")

  # Or using the development version from GitHub (keeping in mind this may not be as stable)
remotes::install_github("https://github.com/jbdorey/BeeBDC.git", 
                          # To use the development version use "devel"; otherwise choose "main"
                        ref = "devel", force = TRUE)

Two optional packages can also be installed before starting your workflow, although neither is essential. The packages BiocManager and devtools may be required to install them.

  1. The first package, rnaturalearthhires, is a data package that allows the usage of higher-resolution country maps and is very useful for multiple BeeBDC functions.
  2. The second package, ComplexHeatmap, is only used for one BeeBDC function (chordDiagramR()) and is less critical.

When either of these packages is called, the user will be prompted to install it. However, installing ComplexHeatmap may try to restart your R session.

  # These two packages may need to be installed in order to install the actual required packages 
    # below.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

  # Install ComplexHeatmap and rnaturalearthhires
devtools::install_github("ropensci/rnaturalearthhires")
BiocManager::install("ComplexHeatmap")
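To see in advance whether the two optional packages are already available, a small check like the one below may help. It uses only base R; the variable names are illustrative and this check is not part of BeeBDC.

```r
# Names of the two optional packages described above
optionalPackages <- c("rnaturalearthhires", "ComplexHeatmap")

# requireNamespace() returns TRUE or FALSE without attaching the package
isInstalled <- vapply(
  optionalPackages,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

# Report anything missing so it can be installed before the workflow starts
if (any(!isInstalled)) {
  message("Optional packages not installed: ",
          paste(optionalPackages[!isInstalled], collapse = ", "))
}
```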

Load the package with:

library(BeeBDC)

1. Data merge

Integrate and merge different datasets from the major data repositories: GBIF, SCAN, iDigBio, the USGS, and ALA.

2. Data preparation
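As a rough illustration of what merging sources involves, the base R sketch below row-binds two invented tables that share only some columns. mergeSources() and the dataSource column are hypothetical and the data are made up; the BeeBDC merge functions handle far more (attribution, column typing, and so on) and should be used for real data.

```r
# Invented downloads from two sources with partly different columns
gbifLike <- data.frame(
  scientificName = c("Apis mellifera", "Bombus terrestris"),
  decimalLatitude = c(-34.9, 51.5),
  basisOfRecord = c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"),
  stringsAsFactors = FALSE
)
scanLike <- data.frame(
  scientificName = "Megachile rotundata",
  decimalLatitude = 40.7,
  recordedBy = "A. Collector",
  stringsAsFactors = FALSE
)

# Hypothetical helper: row-bind a named list of data frames, filling columns
# that a source lacks with NA and recording where each row came from
mergeSources <- function(sources) {
  allColumns <- unique(unlist(lapply(sources, names)))
  aligned <- lapply(names(sources), function(sourceName) {
    df <- sources[[sourceName]]
    for (col in setdiff(allColumns, names(df))) {
      df[[col]] <- NA
    }
    df <- df[allColumns]
    df$dataSource <- sourceName
    df
  })
  do.call(rbind, aligned)
}

merged <- mergeSources(list(GBIF = gbifLike, SCAN = scanLike))
```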

The reading in and formatting of the major and minor [bee] occurrence repositories as well as some data modifications. This section is mostly, but not entirely, related to bee occurrence data.

3. Initial flags

Flagging and carpentry of several, mostly general, data issues. See bdc’s pre-filter for more related functions.

4. Taxonomy

Harmonisation of scientific names against a custom taxonomy or the provided Discover Life website’s taxonomic reference.
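The core of name harmonisation, looking up each recorded name in a table of accepted names and synonyms, might be sketched as follows. The table layout, the column names, and cleanName() are assumptions made for illustration; they are not the structure of the Discover Life taxonomy or the BeeBDC implementation.

```r
# Invented taxonomy: each row maps a name (accepted or synonym) to its
# accepted name. These column names are hypothetical.
taxonomy <- data.frame(
  name = c("Apis mellifera", "Apis mellifica", "Bombus terrestris"),
  acceptedName = c("Apis mellifera", "Apis mellifera", "Bombus terrestris"),
  stringsAsFactors = FALSE
)

occurrences <- data.frame(
  scientificName = c("Apis mellifica", "bombus  terrestris ", "Unknownus beeus"),
  stringsAsFactors = FALSE
)

# Collapse repeated whitespace, trim, and capitalise only the first letter
cleanName <- function(x) {
  x <- tolower(trimws(gsub("\\s+", " ", x)))
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}

# Look up each cleaned name; names with no match return NA and are flagged
matchIndex <- match(cleanName(occurrences$scientificName), taxonomy$name)
occurrences$acceptedName <- taxonomy$acceptedName[matchIndex]
occurrences$.nameMatched <- !is.na(matchIndex)
```

Exact matching like this misses misspellings and names with authorship appended, which is one reason to rely on a dedicated harmonisation function rather than a hand-rolled lookup.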

5. Space

Flagging of erroneous, suspicious, and low-precision geographic coordinates.
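A few of the simplest coordinate checks can be written in base R, as in the sketch below. It assumes the bdc-style convention of dot-prefixed logical columns in which FALSE marks a flagged record; the column names and data are invented, and the package's own spatial functions cover many more cases (country mismatches, outliers, precision).

```r
# Invented coordinates, including common problem cases
occurrences <- data.frame(
  decimalLatitude  = c(-34.92, 0, 95.1, 12.5, NA),
  decimalLongitude = c(138.60, 0, 10.0, 200.3, 20.1)
)

lat <- occurrences$decimalLatitude
lon <- occurrences$decimalLongitude

# In every flag column below, TRUE = passed and FALSE = flagged (assumed)
hasCoords <- !is.na(lat) & !is.na(lon)
occurrences$.coordsPresent <- hasCoords
# Latitude must lie in [-90, 90] and longitude in [-180, 180]
occurrences$.coordsInRange <- hasCoords & abs(lat) <= 90 & abs(lon) <= 180
# Points at exactly 0, 0 are usually placeholders rather than real localities
occurrences$.notZeroZero <- !(hasCoords & lat == 0 & lon == 0)

# A record passes overall only if it passes every individual check
occurrences$.summary <- occurrences$.coordsPresent &
  occurrences$.coordsInRange &
  occurrences$.notZeroZero
```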

6. Time

Flagging and, whenever possible, correction of inconsistent collection dates.
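Correcting and flagging dates can be pictured with the base R sketch below: parse what can be parsed, then flag what remains missing or implausible. The two accepted formats, the 1950 cut-off, and the column names are assumptions chosen for the example, not BeeBDC defaults.

```r
# Invented event dates in mixed formats
occurrences <- data.frame(
  eventDate = c("2019-03-14", "14/03/2019", "1850-01-01", "not recorded", NA),
  stringsAsFactors = FALSE
)

# Try ISO 8601 first; strings that do not fit the format become NA
parsed <- as.Date(occurrences$eventDate, format = "%Y-%m-%d")

# Retry the failures as day/month/year (an assumed second format)
needsRetry <- is.na(parsed)
parsed[needsRetry] <- as.Date(occurrences$eventDate[needsRetry],
                              format = "%d/%m/%Y")
occurrences$eventDateParsed <- parsed

# Flag records with no usable date, a date before the (hypothetical) minimum,
# or a date in the future; TRUE = passed, FALSE = flagged
minDate <- as.Date("1950-01-01")
occurrences$.dateValid <- !is.na(parsed) &
  parsed >= minDate &
  parsed <= Sys.Date()
```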

7. De-duplication

8. Filtering

9. Figures and tables

10. Datasets

We provide two full datasets, a bee taxonomy and a bee country checklist, that are downloadable using the beesTaxonomy() and beesChecklist() functions.

We further provide five test datasets that are available with BeeBDC, for example:

  # Access the test taxonomy file
system.file("extdata", "testTaxonomy.rda", package="BeeBDC") |> load()
  # View the file
View(testTaxonomy)
  # Access the test checklist file
system.file("extdata", "testChecklist.rda", package="BeeBDC") |> load()
  # View the file
View(testChecklist)

Package website

See the BeeBDC package website (https://jbdorey.github.io/BeeBDC/reference/index.html) for a detailed explanation of each module.

Getting help

If you encounter a clear bug, please file an issue on the package GitHub page. For questions or suggestions, flick us an email (james.dorey@flinders.edu.au).

Citation

Paper, dataset, and package citation: Dorey, J. B., Chesshire, P. R., Bolaños, A. N., O’Reilly, R. L., Bossert, S., Collins, S. M., Lichtenberg, E. M., Tucker, E., Smith-Pardo, A., Falcon-Brindis, A., Guevara, D. A., Ribeiro, B. R., De Pedro, D., Fischer, E., Hung, J. K.-L., Parys, K. A., Rogan, M. S., Minckley, R. L., Velazco, S. J. E., Griswold, T., Zarrillo, T. A., Sica, Y., Orr, M. C., Guzman, L. M., Ascher, J., Hughes, A. C. & Cobb, N. S. Accepted. A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Scientific Data.

Package citation: Dorey, J. B., O’Reilly, R. L., Bossert, S., Fischer, E. (2023). BeeBDC: an occurrence data cleaning package. R package version 1.0.2. url: https://github.com/jbdorey/BeeBDC

This package and its datasets were created with the support of, and as a part of, the iDigBees project.

The iDigBees logo with a colourful bee and the iDigBees text on the right