Introduction

The gdalcubes R package aims at making the work with large collections of Earth observation (EO) imagery (e.g. from Sentinel 2) easier and faster. Typical challenges with these data such as overlapping images, different spatial resolutions of spectral bands, irregular temporal sampling, and different spatial reference systems become invisible to users by reading the data as a raster data cube and letting users define the shape of the cube (spatiotemporal extent, resolution, and spatial reference system). Working with EO imagery then becomes more interactive: going from “try my method on low resolution and get the result asap” to “apply my final method to the full resolution dataset over night” becomes straightforward.

This brief vignette illustrates basic ideas of the package. We will use satellite imagary from the Moderate Resolution Imaging Spectroradiometer (MODIS) that is small enough to process even on older machines. The imagery comes as a set of HDF4 files. We assume that you have successfully installed the gdalcubes R package. Please also make sure that your GDAL installation supports the HDF4 driver (e.g. with gdalcubes_gdalformats()).

In the following, we will follow a simple workflow by

  1. indexing image files in an image collection,
  2. creating data cubes at various spatiotemporal resolutions and extents, and
  3. applying simple operations on data cubes including band selection, application of pixel-wise functions, reduction over time, and time series filtering.

Downloading the sample data

We will use 8-daily land surface temperature from the MODIS product MOD11A2, covering western Europe (tiles v=13,14, h=03,04) from January to September 2018. The zip archive has approximately 600 megabytes. The code below downloads and unzips the data to the current working directory.

dest_dir = tempdir()
download.file("https://uni-muenster.sciebo.de/s/eP9E6OIkQbXrmsY/download", destfile=file.path(dest_dir, "MOD11A2.zip"),mode = "wb")
unzip(file.path(dest_dir, "MOD11A2.zip"), exdir = file.path(dest_dir,"MOD11A2"))
unlink(file.path(dest_dir, "MOD11A2.zip"))

Creating an image collection

As a first step, we must combine the set of files for the first MODIS product (MOD11A2) in a single image collection. The image collection is a simple index, pointing to files and storing the spatial extent, spatial reference system, aquisition date/time of images and how files relate to image bands. The package comes with a set of predefined rules (called image collection formats), how this information can be extracted from filenames for selected EO products. A list of available collection formats including a short description can be printed with:

library(gdalcubes)
## Using gdalcubes library version 0.2.3
collection_formats()
##    CHIRPS_v2_0_daily_p05_tif | Image collection format for CHIRPS v 2.0
##                              | daily global precipitation dataset (0.05
##                              | degrees resolution) from GeoTIFFs, expects
##                              | list of .tif or .tif.gz files as input.
##                              | [TAGS: CHIRPS, precipitation]
##  CHIRPS_v2_0_monthly_p05_tif | Image collection format for CHIRPS v 2.0
##                              | monthly global precipitation dataset (0.05
##                              | degrees resolution) from GeoTIFFs, expects
##                              | list of .tif or .tif.gz files as input.
##                              | [TAGS: CHIRPS, precipitation]
##                      L8_L1TP | Collection format for Landsat 8 Level 1 TP
##                              | product [TAGS: Landsat, USGS, Level 1,
##                              | NASA]
##                        L8_SR | Collection format for Landsat 8 surface
##                              | reflectance product [TAGS: Landsat, USGS,
##                              | Level 2, NASA, surface reflectance]
##                      MxD10A2 | Collection format for selected bands from
##                              | the MODIS MxD10A2 (Aqua and Terra) v006
##                              | Snow Cover product [TAGS: MODIS, Snow
##                              | Cover]
##                      MxD11A1 | Collection format for selected bands from
##                              | the MODIS MxD11A2 (Aqua and Terra) v006
##                              | Land Surface Temperature product [TAGS:
##                              | MODIS, LST]
##                      MxD11A2 | Collection format for selected bands from
##                              | the MODIS MxD11A2 (Aqua and Terra) v006
##                              | Land Surface Temperature product [TAGS:
##                              | MODIS, LST]
##                      MxD13A2 | Collection format for selected bands from
##                              | the MODIS MxD13A2 (Aqua and Terra) product
##                              | [TAGS: MODIS, VI, NDVI, EVI]
##                      MxD13A3 | Collection format for selected bands from
##                              | the MODIS MxD13A3 (Aqua and Terra) product
##                              | [TAGS: MODIS, VI, NDVI, EVI]
##                      MxD13Q1 | Collection format for selected bands from
##                              | the MODIS MxD13Q1 (Aqua and Terra) product
##                              | [TAGS: MODIS, VI, NDVI, EVI]
##                      MxD14A2 | Collection format for the MODIS MxD14A2
##                              | (Aqua and Terra) product [TAGS: MODIS,
##                              | Fire]
## PlanetScope_3B_AnalyticMS_SR | Image collection format for PlanetScope
##                              | 4-band scenes [TAGS: PlanetScope, BOA,
##                              | Surface Reflectance]
##             Sentinel1_IW_GRD | Image collection format for Sentinel 1
##                              | Level 1 GRD data as downloaded from the
##                              | Copernicus Open Access Hub, expects a list
##                              | of file paths as input. The format works on
##                              | original ZIP compressed as well as
##                              | uncompressed imagery. [TAGS: Sentinel,
##                              | Copernicus, ESA, SAR]
##                Sentinel2_L1C | Image collection format for Sentinel 2
##                              | Level 1C data as downloaded from the
##                              | Copernicus Open Access Hub, expects a list
##                              | of file paths as input. The format works on
##                              | original ZIP compressed as well as
##                              | uncompressed imagery. [TAGS: Sentinel,
##                              | Copernicus, ESA, TOA]
##            Sentinel2_L1C_AWS | Image collection format for Sentinel 2
##                              | Level 1C data in AWS [TAGS: Sentinel,
##                              | Copernicus, ESA, TOA]
##                Sentinel2_L2A | Image collection format for Sentinel 2
##                              | Level 2A data as downloaded from the
##                              | Copernicus Open Access Hub, expects a list
##                              | of file paths as input. The format should
##                              | work on original ZIP compressed as well as
##                              | uncompressed imagery. [TAGS: Sentinel,
##                              | Copernicus, ESA, BOA, Surface Reflectance]
##          Sentinel2_L2A_THEIA | Image collection format for Sentinel 2
##                              | Level 2A data as downloaded from Theia.
##                              | [TAGS: Sentinel, ESA, Flat Reflectance,
##                              | Theia]

In this case, MxD11A2 is the correct format for our datasets. Internally, collection formats are defined in relatively simple JSON files, presets for other products will be added continuously.

To create the image collection, we must pass a list of our files and the collection format to the create_image_collection() function. The code below finds all files as character vector of GDAL datasets, which then can be passed as first argument to create_image_collection(). The second argument here referes to the image collection format and the third argument provides the name of the output image collection file (which is simply an SQLite database). The collection format also knows that bands are stored as subdatasets in the HDF files.

img_col_file = tempfile(fileext=".db")
files = list.files(file.path(dest_dir,"MOD11A2"), pattern=".hdf$", full.names = TRUE)
create_image_collection(files, "MxD11A2", img_col_file)
## A GDAL image collection object, referencing 140 images with 8  bands
## Images:
##                                                                name
## 1 /tmp/RtmpNCepL7/MOD11A2/MOD11A2.A2018001.h17v03.006.2018011145329
## 2 /tmp/RtmpNCepL7/MOD11A2/MOD11A2.A2018001.h17v04.006.2018011145438
## 3 /tmp/RtmpNCepL7/MOD11A2/MOD11A2.A2018001.h18v03.006.2018011145428
## 4 /tmp/RtmpNCepL7/MOD11A2/MOD11A2.A2018001.h18v04.006.2018011145326
## 5 /tmp/RtmpNCepL7/MOD11A2/MOD11A2.A2018009.h17v03.006.2018018034330
## 6 /tmp/RtmpNCepL7/MOD11A2/MOD11A2.A2018009.h17v04.006.2018018034246
##        left top bottom    right            datetime
## 1 -20.00000  60     50  0.00000 2018-01-01T00:00:00
## 2 -15.55724  50     40  0.00000 2018-01-01T00:00:00
## 3   0.00000  60     50 20.00000 2018-01-01T00:00:00
## 4   0.00000  50     40 15.55724 2018-01-01T00:00:00
## 5 -20.00000  60     50  0.00000 2018-01-09T00:00:00
## 6 -15.55724  50     40  0.00000 2018-01-09T00:00:00
##                                                                                  srs
## 1 +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs 
## 2 +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs 
## 3 +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs 
## 4 +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs 
## 5 +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs 
## 6 +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs 
## [ omitted 134 images ] 
## 
## Bands:
##              name offset scale unit     nodata image_count
## 1   DAY_VIEW_TIME   0.00 0.100  hrs 255.000000         140
## 2         EMIS_31   0.49 0.002        0.000000         140
## 3         EMIS_32   0.49 0.002        0.000000         140
## 4         LST_DAY   0.00 0.020    K   0.000000         140
## 5       LST_NIGHT   0.00 0.020    K   0.000000         140
## 6 NIGHT_VIEW_TIME   0.00 0.100  hrs 255.000000         140
## 7          QC_DAY   0.00 1.000                         140
## 8        QC_NIGHT   0.00 1.000                         140

Creating a data cube

The created image collection has references to original images on disk and knows about the datetime, spatial extent, and coordinate reference system of images. The raster_cube() function creates a data cube from an image collection. This function expects up to three arguments:

  1. an image collection object as created above,
  2. a data cube view defining the spatiotemporal resolution and extent and the spatial reference system of the target cube, and
  3. the internal size of chunks how the cube is partitioned in memory

If no data cube view is provided, a default cube that covers the whole extent of the collection at low resolution is created. The chunk size defaults to (16 x 256 x 256) pixels in time, y, and x directions. Below, we create a data cube with a default cube view and print some basic information about its dimensions and bands.

x = raster_cube(image_collection(img_col_file))
x
## A GDAL data cube proxy object
## 
## Dimensions:
##                 low             high count       pixel_size chunk_size
## t           2018-01          2018-10     4              P3M         16
## y  4447802.07906609 6671703.11859914   413 5384.74827974103        256
## x -1703607.03338081 1703607.03338081   633 5382.64465523162        256
## 
## Bands:
##              name offset scale nodata unit
## 1   DAY_VIEW_TIME   0.00 0.100    NaN  hrs
## 2         EMIS_31   0.49 0.002    NaN     
## 3         EMIS_32   0.49 0.002    NaN     
## 4         LST_DAY   0.00 0.020    NaN    K
## 5       LST_NIGHT   0.00 0.020    NaN    K
## 6 NIGHT_VIEW_TIME   0.00 0.100    NaN  hrs
## 7          QC_DAY   0.00 1.000    NaN     
## 8        QC_NIGHT   0.00 1.000    NaN

bands(x)
##              name offset scale nodata unit
## 1   DAY_VIEW_TIME   0.00 0.100    NaN  hrs
## 2         EMIS_31   0.49 0.002    NaN     
## 3         EMIS_32   0.49 0.002    NaN     
## 4         LST_DAY   0.00 0.020    NaN    K
## 5       LST_NIGHT   0.00 0.020    NaN    K
## 6 NIGHT_VIEW_TIME   0.00 0.100    NaN  hrs
## 7          QC_DAY   0.00 1.000    NaN     
## 8        QC_NIGHT   0.00 1.000    NaN
dimensions(x)
## $t
## $t$low
## [1] "2018-01"
## 
## $t$high
## [1] "2018-10"
## 
## $t$count
## [1] 4
## 
## $t$pixel_size
## [1] "P3M"
## 
## $t$chunk_size
## [1] 16
## 
## 
## $y
## $y$low
## [1] 4447802
## 
## $y$high
## [1] 6671703
## 
## $y$count
## [1] 413
## 
## $y$pixel_size
## [1] 5384.748
## 
## $y$chunk_size
## [1] 256
## 
## 
## $x
## $x$low
## [1] -1703607
## 
## $x$high
## [1] 1703607
## 
## $x$count
## [1] 633
## 
## $x$pixel_size
## [1] 5382.645
## 
## $x$chunk_size
## [1] 256
srs(x)
## [1] "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs "

Notice that raster_cube() will not run any computations besides deriving the shape of the output cube. Instead, it will return a proxy object that will not be evaluated until data must be actually read (e.g. when calling plot). This not only applies to data cubes from image collections but also for derived cubes (see further below). In most cases, however, users want to specify the extent and resolution manually. Above, the temporal resolution of the cube was 3 months whereas below, we define a custom data cube view with temporal resolution of one month.

srs="+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs"
MOD11A2.col = image_collection(img_col_file)
v = cube_view(srs=srs, extent=MOD11A2.col, nx = 400, dt="P1M", aggregation = "mean")
v
## A data cube view object
## 
## Dimensions:
##                 low             high count       pixel_size
## t           2018-01          2018-09     9              P1M
## y  4447802.07906609 6671703.11859914   261 8520.69363805765
## x -1703607.03338081 1703607.03338081   400 8518.03516690405
## 
## SRS: "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs"
## Temporal aggregation method: "mean"
## Spatial resampling method: "near"
MOD11A2.cube = raster_cube(MOD11A2.col, v)
MOD11A2.cube
## A GDAL data cube proxy object
## 
## Dimensions:
##                 low             high count       pixel_size chunk_size
## t           2018-01          2018-09     9              P1M         16
## y  4447802.07906609 6671703.11859914   261 8520.69363805765        256
## x -1703607.03338081 1703607.03338081   400 8518.03516690405        256
## 
## Bands:
##              name offset scale nodata unit
## 1   DAY_VIEW_TIME   0.00 0.100    NaN  hrs
## 2         EMIS_31   0.49 0.002    NaN     
## 3         EMIS_32   0.49 0.002    NaN     
## 4         LST_DAY   0.00 0.020    NaN    K
## 5       LST_NIGHT   0.00 0.020    NaN    K
## 6 NIGHT_VIEW_TIME   0.00 0.100    NaN  hrs
## 7          QC_DAY   0.00 1.000    NaN     
## 8        QC_NIGHT   0.00 1.000    NaN

Aggregation and resampling

Besides the spatiotemporal extent, the resolution and the spatial reference system, the data cube view contains the two important parameters aggregation and resampling. Resampling here refers to how images are resampled in space during the warp operation. The temporal aggregation method defines how values for the same cell from different images are combined in the target cube. For example, a data cube with monthly temporal resolution will contain values from multiple images. Currently, possible values are first, last, min, max, mean, and median. These functions are evaluated per data cube pixel.

Data cube operations

The package comes with a few operations on data cubes to (i) select bands (select_bands()), (ii) apply pixel-wise arithmetic expressions (apply_pixel()), (iii) reduce data cubes over space and time (reduce_time(), reduce_space()), apply a moving-window function over time (window_time()), (iv) apply an R function over chunks of a data cube (chunk_apply()), and (v) join bands of identically shaped data cubes (join_bands()). All of these functions produce a proxy data cube, storing the shape of the result cube but not any data. They all take a (proxy) data cube as first argument and can be chained. The following code demonstrates some of the operations and how data cubes can be plotted.

MOD11A2.bandselect = select_bands(MOD11A2.cube, c("LST_DAY","LST_NIGHT"))
MOD11A2.daynight_difference = apply_pixel(MOD11A2.bandselect, "0.02*(LST_DAY-LST_NIGHT)",names = "LST_difference")
MOD11A2.reduce = reduce_time(MOD11A2.daynight_difference, "median(LST_difference)")
plot(MOD11A2.reduce, col=heat.colors, key.pos=1)

The result is the median day night surface temperature difference for all pixels between Jan and December 2018. Notice that no data is actually read until we call plot(), i.e. all operations again return proxy objects. Replacing the plot function with write_ncdf(MOD11A2.reduce, "test.nc") would write the result as a NetCDF file to disk.

Operations on data cubes are designed such that they can be used with the pipe operator. The following code for example derives monthly differences of land surface temperature, from June to September 2018. The window_time function here applies the simple time series kernel filter \(T_t - T_{t-1}\).

library(magrittr) # get the pipe
v1 = cube_view(view=v, extent=list(t0="2018-06", t1="2018-09"))
raster_cube(image_collection(img_col_file), v1)  %>%
  select_bands(c("LST_DAY")) %>%
  apply_pixel("LST_DAY * 0.02") %>% # apply scale
  window_time(kernel=c(-1,1), window=c(1,0)) %>%
  plot(col=terrain.colors, key.pos=1, zlim=c(-25,25))

Multithreading

gdalcubes supports multithreaded evaluation of data cube operations. Calling gdalcubes_set_threads(n) will tell all following data cube operations to use up to n threads. Since worker threads are assigned to chunks of the data cube, the actual number of threads used is lower for cubes with less than n chunks.

Future work

This vignette presents a very simple example with a small dataset. Future vignettes will focus on (i) how user-defined R functions can be applied on data cubes, (ii) how to process larger datasets (Sentinel 2) with overlapping images from different spatial reference systems, and (iii) how to run gdalcubes in the cloud.