MarineSPEED quickstart guide

Samuel Bosch

2017-02-17

The goal of MarineSPEED is to provide a benchmark data set for presence-only species distribution modeling (SDM) in order to facilitate reproducible and comparable SDM research. It contains species occurrences (coordinates) from a wide diversity of marine species and associated environmental data from Bio-ORACLE and MARSPEC. Some additional information about MarineSPEED can be found in the R Shiny viewer at http://marinespeed.org.

Exploring the data

Three functions help with exploring the data.

library(marinespeed)

# set a data directory, preferably something different from tempdir to avoid 
# unnecessary downloads for every R session
options(marinespeed_datadir = tempdir())

# list all species
species <- list_species()

The first 5 species and their aphia_id (WoRMS species id) are:

species                     aphia_id
Laternula elliptica         197217
Pseudosagitta gazellae      266258
Parasagitta elegans         105440
Parasagitta setosa          105443
Branchiostoma lanceolatum   104906

The species information consists of species identifiers, taxonomic information from the World Register of Marine Species (WoRMS), a visual assessment score for the amount of sampling bias and the covered latitudinal zones.

# all species information
info <- species_info()
colnames(info)
##  [1] "species"        "aphia_id"       "kingdom"        "phylum"        
##  [5] "class"          "order"          "family"         "genus"         
##  [9] "sampling_bias"  "eco_polar"      "eco_temperate"  "eco_tropical"  
## [13] "eco_open_ocean"

Looping over all species data

To loop over the occurrence data of all species, call the lapply_species function. For instance, to count the total number of records in MarineSPEED you would use the following code. The function passed to lapply_species expects two parameters: one for the species name and one for the occurrence records of that species.

get_occ_count <- function(speciesname, occ) {
  nrow(occ)
}
record_counts <- lapply_species(get_occ_count)
sum(unlist(record_counts))
## [1] 868151
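The two-parameter callback contract can be tried out without downloading any data. The sketch below is only an illustration: `lapply_toy_species` and `toy_occurrences` are hypothetical stand-ins written for this example (they are not part of marinespeed), and the coordinates are invented. It assumes the occurrences arrive as a data.frame with `longitude` and `latitude` columns, as in the plotting code later in this guide.

```r
# Invented illustration data, NOT MarineSPEED records
toy_occurrences <- list(
  "Species A" = data.frame(longitude = c(1.5, 2.0, 3.2),
                           latitude  = c(50.1, 51.0, 52.3)),
  "Species B" = data.frame(longitude = c(-70.4, -68.9),
                           latitude  = c(-60.2, -61.7))
)

# Hypothetical stand-in for lapply_species(): calls fn(speciesname, occ)
# once per species and returns the results as a named list
lapply_toy_species <- function(fn) {
  result <- lapply(names(toy_occurrences), function(name) {
    fn(name, toy_occurrences[[name]])
  })
  names(result) <- names(toy_occurrences)
  result
}

# A callback with the same signature as get_occ_count above
get_lat_range <- function(speciesname, occ) {
  range(occ$latitude)
}

lat_ranges <- lapply_toy_species(get_lat_range)
record_counts <- lapply_toy_species(function(speciesname, occ) nrow(occ))
total_records <- sum(unlist(record_counts))
```

A callback written and checked this way should be usable with lapply_species unchanged, provided the real occurrence data has the columns it relies on.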

Cross-validation

To enable reuse of the same k-fold cross-validation datasets, I split the species occurrence data up front into 5 folds (or 4 and 9 for grid) in 3 different ways.

The code below plots the training (blue) and test (red) occurrences for the first two disc folds of the first two species.

## plot first 2 disc folds for the first 2 species (blue=training, red=test)
plot_occurrences <- function(speciesname, data, fold) {
  title <- paste0(speciesname, " (fold = ", fold, ")")
  plot(data$occurrence_train[,c("longitude", "latitude")], pch=20, col="blue",
       main = title)
  points(data$occurrence_test[,c("longitude", "latitude")], pch=20, col="red")
}

lapply_kfold_species(plot_occurrences, species=species[1:2,],
                     fold_type = "disc", k = 1:2)
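The callback passed to lapply_kfold_species does not have to plot anything. As a rough sketch, it can also return a per-fold summary. The example below runs on invented data: `make_toy_fold`, `toy_occ` and `fold_ids` are hypothetical helpers for this illustration, and the round-robin fold assignment is arbitrary rather than one of the MarineSPEED fold types. The only thing taken from the code above is that `data` is a list with `occurrence_train` and `occurrence_test` elements.

```r
# Invented occurrence records, NOT MarineSPEED data
toy_occ <- data.frame(longitude = seq(0, 9), latitude = seq(40, 49))

# Arbitrary round-robin assignment of each record to one of 5 folds
fold_ids <- rep(1:5, times = 2)

# Hypothetical helper: builds a list shaped like the `data` argument,
# holding out the records of one fold as the test set
make_toy_fold <- function(occ, fold_ids, fold) {
  list(occurrence_train = occ[fold_ids != fold, ],
       occurrence_test  = occ[fold_ids == fold, ])
}

# Callback with the same signature as plot_occurrences above
summarize_fold <- function(speciesname, data, fold) {
  data.frame(species = speciesname,
             fold    = fold,
             n_train = nrow(data$occurrence_train),
             n_test  = nrow(data$occurrence_test))
}

# Apply the callback to the first 2 folds and stack the results
fold_summary <- do.call(rbind, lapply(1:2, function(fold) {
  summarize_fold("Toy species", make_toy_fold(toy_occ, fold_ids, fold), fold)
}))
```

With 10 records spread evenly over 5 folds, each fold holds out 2 records for testing and keeps 8 for training.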

Lower level functions