galah

Matilda Stevenson

2021-08-06

About

galah is an R interface to biodiversity data hosted by the Atlas of Living Australia (ALA). The ALA is a repository of biodiversity data, focussed primarily on observations of individual life forms. Like the Global Biodiversity Information Facility (GBIF), the basic unit of data at ALA is an occurrence record, based on the ‘Darwin Core’ data standard.

galah enables users to locate and download species observations, taxonomic information, or associated media such as images or sounds, and to restrict their queries to particular taxa or locations. Users can specify which columns are returned by a query, or restrict their results to observations that meet particular quality-control criteria. All functions return a data.frame as their standard format.

Functions in galah are designed according to a nested architecture. Users who require data should begin by locating the relevant ala_ function (see the Downloading data section); the arguments of that function are built by correspondingly-named select_ functions; and finally the specific values that those select_ functions can interpret are given by find_ functions.
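As a rough illustration of this nesting, the sketch below works from the innermost layer outwards, using only functions that appear later in this vignette. It assumes the package is installed, that an internet connection is available, and that the taxa and filters arguments behave as shown in the Downloading data section.

```r
library(galah)

# find_ layer: list the values that a select_ function will accept
profiles <- find_profiles()

# select_ layer: build the individual pieces of a query
taxa <- select_taxa("Eolophus")
filters <- select_filters(year >= 2010, profile = "ALA")

# ala_ layer: send the assembled query to the ALA (here, a record count)
n_records <- ala_counts(taxa = taxa, filters = filters)
n_records
```

The same taxa and filters objects can be reused across the other ala_ functions, which is the main benefit of building them separately.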

Installation

Install from CRAN:

install.packages("galah")

Install the development version from GitHub:

install.packages("remotes")
remotes::install_github("AtlasOfLivingAustralia/galah")

See the README for system requirements.

Load the package

library(galah)
galah_config(atlas = "Australia")

Filtering data

Each occurrence record contains taxonomic information, and also some information about the observation itself, such as its location and the date of the observation. Each piece of information associated with a given occurrence is stored in a field, which corresponds to a column when imported to a data.frame.

Data fields are important because they provide a means to filter occurrence records; i.e. to return only the information that you need, and no more. Consequently, much of the architecture of galah has been designed to make filtering as simple as possible, by using functions with the select_ prefix.

Taxonomic filtering

select_taxa() enables users to search for taxonomic names and check that the results are ‘correct’ before using them to download data. The function allows both free-text searches and searches where the rank(s) are specified. Specifying the rank can be useful when names are ambiguous.

# free text search
taxa_filter <- select_taxa("Eolophus")

# specifying ranks
select_taxa(query = list(genus = "Eolophus", class = "Aves"))
##     search_term scientific_name scientific_name_authorship
## 1 Eolophus_Aves        Eolophus            Bonaparte, 1854
##                                                              taxon_concept_id  rank match_type  kingdom   phylum class
## 1 urn:lsid:biodiversity.org.au:afd.taxon:b2de5e40-df8f-4339-827d-25e63454a4a2 genus exactMatch Animalia Chordata  Aves
##            order     family    genus  issues
## 1 Psittaciformes Cacatuidae Eolophus noIssue

For more detailed taxonomic information, use search_taxonomy(), as outlined in vignette("taxonomic_information").

Location-based filtering

Users can provide an sf object or a Well-Known Text (WKT) string for location-based filtering.

# st_read() comes from the sf package, which must be installed
locations <- select_locations(query = sf::st_read("act_rect.shp"))
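When no shapefile is at hand, a WKT string can be assembled in base R instead. The sketch below builds a rectangular polygon from a bounding box; the coordinates are illustrative values only, chosen to sit roughly around the ACT, and are not an authoritative boundary.

```r
library(galah)

# Illustrative bounding box (decimal degrees); replace with your own region
xmin <- 148.7
xmax <- 149.4
ymin <- -36.0
ymax <- -35.1

# WKT lists longitude then latitude, and the first vertex is repeated to close the ring
corners <- rbind(
  c(xmin, ymin),
  c(xmax, ymin),
  c(xmax, ymax),
  c(xmin, ymax),
  c(xmin, ymin)
)
wkt <- paste0("POLYGON((", paste(corners[, 1], corners[, 2], collapse = ","), "))")
wkt

# Pass the string to select_locations() in the same way as an sf object
locations <- select_locations(query = wkt)
```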

Field-based filtering

As mentioned above, all occurrence records in the ALA contain additional information about the record, stored in fields. Field-based filters are specified with select_filters(), which takes individual filters, in the form field = value, and/or a data quality profile.

To find available fields and their valid values, field lookup functions are provided. To find field names, use search_fields(); to find the valid values for a field, use find_field_values().

search_fields("basis")
##                                                  id
## 11                                    basisOfRecord
## 186                               raw_basisOfRecord
## 661                         BASIS_OF_RECORD_INVALID
## 726 OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD
##                                                                                                      description
## 11  What this is a record of e.g. specimen, human observation, fossil http://rs.tdwg.org/dwc/terms/basisOfRecord
## 186     The basis of record as supplied by the data publisher http://rs.tdwg.org/dwc/terms/verbatimBasisOfRecord
## 661                                                                                 Basis of record badly formed
## 726                                                              Occurrence status inferred from basis of record
##           type link
## 11      fields <NA>
## 186     fields <NA>
## 661 assertions <NA>
## 726 assertions <NA>
field_values <- find_field_values("basisOfRecord")

Build a field filter

filters <- select_filters(basisOfRecord = "HumanObservation")

It is also possible to pass other kinds of logical statement to select_filters().

filters <- select_filters(basisOfRecord = "HumanObservation",
                          year >= 2010,
                          occurrenceStatus != "absent")

Data quality profiles

A notable extension of the filtering approach is to remove records with low ‘quality’. ALA performs quality control checks on all records that it stores. These checks are used to generate new fields that can then be used to filter out records that are unsuitable for particular applications. However, there are many possible data quality checks, and it is not always clear which are most appropriate in a given instance. Therefore, galah supports ALA data quality profiles, which can be passed to select_filters() to quickly remove undesirable records. A full list of data quality profiles is returned by find_profiles().

profiles <- find_profiles()

View filters included in a profile

find_profile_attributes("ALA")
##                                                                                                                                                                                                                                                 description
##  1:                                                                                                                                                                                                   Exclude all records where spatial validity is "false"
##  2: Exclude all records with an assertion that the scientific name provided does not match any of the names lists used by the ALA.  For a full explanation of the ALA name matching process see https://github.com/AtlasOfLivingAustralia/ala-name-matching
##  3:                                                                              Exclude all records with an assertion that the scientific name provided is not structured as a valid scientific name. Also catches rank values or values such as "UNKNOWN"
##  4:                                                                                                                              Exclude all records with an assertion that the name and classification supplied can't be used to choose between 2 homonyms
##  5:                                                                                                                                        Exclude all records with an assertion that kingdom provided doesn't match a known kingdom e.g. Animalia, Plantae
##  6:                                                    Exclude all records with an assertion that the scientific name provided in the record does not match the expected taxonomic scope of the resource e.g. Mammal records attributed to bird watch group
##  7:                                                                                                                                                          Exclude all records with an assertion of the occurence is cultivated or escaped from captivity
##  8:                                                                                                                                                                                Exclude all records with an assertion of latitude value provided is zero
##  9:                                                                                                                                                                               Exclude all records with an assertion of longitude value provided is zero
## 10:                                                                                                                                                                   Exclude all records with an assertion of  latitude and longitude have been transposed
## 11:                                                                                                                                                     Exclude all records with an assertion of coordinates are the exact centre of the state or territory
## 12:                                                                                                                                                               Exclude all records with an assertion of  coordinates are the exact centre of the country
## 13:                                                                                                                                                                                               Exclude all records where duplicate status is "duplicate"
## 14:                                                                                                                                                                       Exclude all records where coordinate uncertainty (in metres) is greater than 10km
## 15:                                                                                                                                                                                                    Exclude all records with unresolved user  assertions
## 16:                                                                                                                                                                                                   Exclude all records with unconfirmed  user assertions
## 17:                                                                                                                                                                                              Exclude all records where outlier layer count is 3 or more
## 18:                                                                                                                                                                                              Exclude all records where Record type is "Fossil specimen"
## 19:                                                                                                                                                                                             Exclude all records where Record type is "EnvironmentalDNA"
## 20:                                                                                                                                                                                                  Exclude all records where Presence/Absence is "absent"
## 21:                                                                                                                                                                                                         Exclude all records where year is prior to 1700
##                                                                                                                                                                                                                                                 description
##                                                                     filter
##  1:                                                -spatiallyValid:"false"
##  2:                                           -assertions:TAXON_MATCH_NONE
##  3:                                    -assertions:INVALID_SCIENTIFIC_NAME
##  4:                                              -assertions:TAXON_HOMONYM
##  5:                                            -assertions:UNKNOWN_KINGDOM
##  6:                                       -assertions:TAXON_SCOPE_MISMATCH
##  7:                                          -establishmentMeans:"MANAGED"
##  8:                                                     -decimalLatitude:0
##  9:                                                    -decimalLongitude:0
## 10:                              -assertions:"PRESUMED_SWAPPED_COORDINATE"
## 11:                      -assertions:"COORDINATES_CENTRE_OF_STATEPROVINCE"
## 12:                            -assertions:"COORDINATES_CENTRE_OF_COUNTRY"
## 13:                                          -duplicateStatus:"ASSOCIATED"
## 14:                            -coordinateUncertaintyInMeters:[10001 TO *]
## 15:                                                  -userAssertions:50001
## 16:                                                  -userAssertions:50005
## 17:                                            -outlierLayerCount:[3 TO *]
## 18:                                       -basisOfRecord:"FOSSIL_SPECIMEN"
## 19: -(basisOfRecord:"MATERIAL_SAMPLE" AND contentTypes:"EnvironmentalDNA")
## 20:                                               -occurrenceStatus:ABSENT
## 21:                                                      -year:[* TO 1700]
##                                                                     filter

Include a profile in the filters

filters <- select_filters(basisOfRecord = "HumanObservation",
                          profile = "ALA")
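To get a feel for how many records a profile excludes, one option is to compare record counts with and without it. This is only a sketch: it uses ala_counts(), described in the Record counts section, requires an internet connection, and the resulting numbers will change as the ALA is updated.

```r
library(galah)

# Count records for the same field filter, without and with the "ALA" profile
n_all <- ala_counts(filters = select_filters(year >= 2010))
n_clean <- ala_counts(filters = select_filters(year >= 2010, profile = "ALA"))

# Number of records that the profile removes
n_all - n_clean
```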

Downloading data

Functions that return data from ALA are named with the prefix ala_, followed by a suffix describing the information that they provide.

By combining different filter functions, it is possible to build complex queries to return only the most valuable information for a given problem. Once you have retrieved taxon information, you can use this to search for occurrence records with ala_occurrences(). However, it is also possible to download data on species via ala_species(), or media content (largely images) via ala_media(). Alternatively, users can retrieve record counts using ala_counts().

Occurrence data

In addition to the filter functions above, when downloading occurrence data users can specify which columns are returned using select_columns(). Individual column names and/or column groups can be specified. To view the fields for each group, see the documentation for select_columns(). To view the list of available fields, run search_fields().

cols <- select_columns("institutionID", group = "basic")

To download occurrence data you will need to specify your email in galah_config(). This email must be associated with an active ALA account. See the Config section for more information.

galah_config(email = "your_email_here", atlas = "Australia")

Download occurrence records for Eolophus roseicapilla

occ <- ala_occurrences(taxa = select_taxa("Eolophus roseicapilla"),
                       filters = select_filters(stateProvince = "Australian Capital Territory",
                                                year >= 2010,
                                                profile = "ALA"),
                       columns = select_columns("institutionID", group = "basic"))
head(occ)
##   decimalLatitude decimalLongitude            eventDate        scientificName
## 1       -35.88717         148.9713                      Eolophus roseicapilla
## 2       -35.86784         149.0101                      Eolophus roseicapilla
## 3       -35.86556         149.0106 2012-01-18T13:00:00Z Eolophus roseicapilla
## 4       -35.86429         149.0052                      Eolophus roseicapilla
## 5       -35.77517         148.9591                      Eolophus roseicapilla
## 6       -35.76652         148.9654                      Eolophus roseicapilla
##                                                                taxonConceptID                             recordID
## 1 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 17f46d49-7db0-4929-89f4-b29323f3fcc5
## 2 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ef2b9066-c078-4660-b9a4-c31192aa8bf7
## 3 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 4f7cd714-2997-45d6-adaf-f7dfc80adfe1
## 4 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 3236c470-a144-4300-ae9b-782d0e5e4dd1
## 5 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 e340c423-3a19-4293-aa72-e4045bb7f702
## 6 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 f25793f3-2704-43ff-80ee-c2a1787490e7
##              dataResourceName institutionID
## 1             eBird Australia              
## 2             eBird Australia              
## 3 BirdLife Australia, Birdata              
## 4             eBird Australia              
## 5             eBird Australia              
## 6             eBird Australia
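The output above shows that eventDate can be empty for some records, so it is worth handling missing dates before any time-based analysis. The sketch below uses a small data.frame hand-copied from the rows printed above rather than a fresh download, and the object names occ_example and dated are purely illustrative.

```r
# Hand-copied subset of the rows printed above (not a fresh download)
occ_example <- data.frame(
  eventDate = c("", "", "2012-01-18T13:00:00Z"),
  dataResourceName = c("eBird Australia", "eBird Australia",
                       "BirdLife Australia, Birdata"),
  stringsAsFactors = FALSE
)

# With an explicit format, empty strings parse to NA instead of raising an error
occ_example$date <- as.Date(occ_example$eventDate, format = "%Y-%m-%dT%H:%M:%SZ")

# Keep only the records that carry a date
dated <- occ_example[!is.na(occ_example$date), ]
dated
```

The same two steps can be applied to the full occ object returned by ala_occurrences().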

Species data

A common use case of the ALA is to identify which species occur in a specified region, time period, or taxonomic group. ala_species() enables the user to look up this information, using the common set of filter functions.

# List rodent species in the NT
species <- ala_species(taxa = select_taxa("Rodentia"),
            filters = select_filters(stateProvince = "Northern Territory"))
head(species)
##    kingdom   phylum    class    order  family        genus                     species            author
## 1 Animalia Chordata Mammalia Rodentia Muridae Mesembriomys        Mesembriomys gouldii (J.E. Gray, 1843)
## 2 Animalia Chordata Mammalia Rodentia Muridae      Zyzomys             Zyzomys argurus    (Thomas, 1889)
## 3 Animalia Chordata Mammalia Rodentia Muridae    Pseudomys Pseudomys hermannsburgensis     (Waite, 1896)
## 4 Animalia Chordata Mammalia Rodentia Muridae      Notomys              Notomys alexis      Thomas, 1922
## 5 Animalia Chordata Mammalia Rodentia Muridae      Melomys             Melomys burtoni    (Ramsay, 1887)
## 6 Animalia Chordata Mammalia Rodentia Muridae          Mus                Mus musculus    Linnaeus, 1758
##                                                                  species_guid        vernacular_name
## 1 urn:lsid:biodiversity.org.au:afd.taxon:f38bcd7e-ae6a-4734-bd64-06995bc230eb  Black-footed Tree-rat
## 2 urn:lsid:biodiversity.org.au:afd.taxon:46611113-a1e3-45b1-b58c-7aef088a9da7        Common Rock-rat
## 3 urn:lsid:biodiversity.org.au:afd.taxon:5d73fc2f-3caa-4b44-aa40-3711e8304f80     Sandy Inland Mouse
## 4 urn:lsid:biodiversity.org.au:afd.taxon:49001532-929e-4b78-97d3-c885e97d671b Spinifex Hopping-mouse
## 5 urn:lsid:biodiversity.org.au:afd.taxon:89dfa41e-2c5a-44d1-80bf-8d4cd3c73089      Grassland Melomys
## 6 urn:lsid:biodiversity.org.au:afd.taxon:107696b5-063c-4c09-a015-6edfdb6f4d52            House Mouse

Record counts

ala_counts() provides summary counts on records in the ALA, without needing to download all the records. In addition to the filter arguments, it has an optional group_by argument, which provides counts binned by the requested field.

# Total number of records in the ALA
ala_counts()
## [1] 100871912
# Total number of records, broken down by kingdom
ala_counts(group_by = "kingdom")
##      kingdom    count
## 1   Animalia 75649166
## 2    Plantae 21472247
## 3      Fungi  1877219
## 4  Chromista   914334
## 5   Protista    67279
## 6   Bacteria    58081
## 7   Protozoa    22681
## 8    Archaea     1103
## 9  Eukaryota      735
## 10     Virus      421
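Counts can also be restricted with the same taxa and filters arguments used elsewhere, which is a cheap way to check the size of a query before downloading it. The following is a sketch that assumes an internet connection; the object names are illustrative, and no output is shown because the counts change over time.

```r
library(galah)

# Build the query pieces once so they can be reused for the download itself
galahs <- select_taxa("Eolophus roseicapilla")
recent <- select_filters(year >= 2010)

# Number of matching records, binned by state or territory
counts_by_state <- ala_counts(taxa = galahs,
                              filters = recent,
                              group_by = "stateProvince")
counts_by_state
```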

Media downloads

In addition to text data describing individual occurrences and their attributes, ALA stores images, sounds and videos associated with a given record. These can be downloaded to R using ala_media() and the same set of filters as the other data download functions.

# Download media attached to Eolophus roseicapilla records from 2020
media_data <- ala_media(
     taxa = select_taxa("Eolophus roseicapilla"),
     filters = select_filters(year = 2020),
     download_dir = "media")

Config

Various aspects of the galah package can be customized. To preserve configuration for future sessions, set profile_path to the location of a .Rprofile file.

Email

To download occurrence records, you will need to provide an email address registered with the ALA. If you do not yet have an account, you can create one on the ALA website. Once an email is registered with the ALA, it should be stored in the config:

galah_config(email="myemail@gmail.com")

Caching

galah can cache most results to local files. This means that if the same code is run multiple times, the second and subsequent iterations will be faster.

By default, this caching is session-based, meaning that the local files are stored in a temporary directory that is automatically deleted when the R session is ended. This behaviour can be altered so that caching is permanent, by setting the caching directory to a non-temporary location.

galah_config(cache_directory="example/dir")

By default, caching is turned off. To turn caching on, run

galah_config(caching=TRUE)

Debugging

If things aren’t working as expected, more detail (particularly about web requests and caching behaviour) can be obtained by setting the verbose configuration option:

galah_config(verbose=TRUE)

Setting the download reason

ALA requires that you provide a reason when downloading occurrence data (via the galah ala_occurrences() function). The reason is set as “scientific research” by default, but you can change this using galah_config(). See find_reasons() for valid download reasons.

galah_config(download_reason_id=your_reason_id)
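A minimal sketch of that lookup follows. It assumes that find_reasons() returns a data.frame with an id column, which should be checked against the printed table; picking the first row is an arbitrary choice for illustration only.

```r
library(galah)

# List the valid download reasons and their ids
reasons <- find_reasons()
reasons

# Placeholder choice: replace with the id of the reason that fits your use
# (assumes the lookup table has a column named "id")
my_reason_id <- reasons$id[1]
galah_config(download_reason_id = my_reason_id)
```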