Memory and Parallelization

Victor Granda (Sapfluxnet Team)

2019-04-30

Memory

In order to be able to work with the whole database at the sapwood or plant level it is recommended at least \(16GB\) of RAM memory. This is because loading all data objects already consumes \(4GB\) and any operation like aggregation or metric calculation results in extra memory needed:

library(sapfluxnetr)
library(tidyverse)

# This will need at least 5GB of memory during the process
folder <- 'RData/plant'
sfn_metadata <- read_sfn_metadata(folder)

daily_results <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Mediterranean'),
    sites = sites, metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results, file = 'daily_results.RData')

To circumvent this in less powerful systems, we recommend to work in small subsets of sites (25-30) and join the tidy results afterwards:

library(sapfluxnetr)
library(tidyverse)

folder <- 'RData/plant'
metadata <- read_sfn_metadata(folder)
sites <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Mediterranean'),
    sites = sites, metadata = sfn_metadata
  )

daily_results_1 <- read_sfn_data(sites[1:30], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
daily_results_2 <- read_sfn_data(sites[31:60], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
daily_results_3 <- read_sfn_data(sites[61:90], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)
daily_results_4 <- read_sfn_data(sites[91:110], folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

daily_results_steps <- bind_rows(
  daily_results_1, daily_results_2,
  daily_results_3, daily_results_4
)

rm(daily_results_1, daily_results_2, daily_results_3, daily_results_4)
save(daily_results_steps, file = 'daily_results_steps.RData')

Parallelization

sapfluxnetr includes tha capability to parallelize the metrics calculation when performed on a sfn_data_multi object. This is made thanks to the furrr package (which uses the future package behind the scenes). By default, the code will run in a sequential process, which is the usual way the R code runs. But setting the future::plan to multicore (in Linux), multisession (in Windows) or multiprocess (automatically choose between the previous plans depending on the system) will run the code in parallel, dividing the sites between the available cores.

> Be advised, parallelization usually means more RAM used, so in systems
  with less than 16GB maybe is not a good idea.
  Also, the time benefits start to show when analysing 10 sites or above.
# loading future package
library(future)

# setting the plan
plan('multiprocess')

# metrics!!
daily_results_parallel <- sfn_sites_in_folder(folder) %>%
  filter_sites_by_md(
    si_biome %in% c("Temperate forest", 'Mediterranean'),
    sites = sites, metadata = sfn_metadata
  ) %>%
  read_sfn_data(folder) %>%
  daily_metrics(tidy = TRUE, metadata = sfn_metadata)

# Important to save, this way you will have access to the object in the future
save(daily_results_parallel, file = 'daily_results_parallel.RData')

Memory limit

When using furrr, even in the sequential plan, the future package sets a limit of \(500MB\) for each core. With sapfluxnet data this limit is easily exceeded, causing an error. To avoid this we may want to set the future.globals.maxSize limit to a higher value (\(1GB\) for example, but the limit wanted really depend on the plan and the number of sites):