Parallel Processing

Local Machine

When the “parallel_processing” input within “forecast_time_series” is set to “local_machine”, each time series (including training models on the entire data set) is run in parallel on the user's local machine, with each time series on a separate core. Hyperparameter tuning, model refitting, and model averaging are run sequentially, because a parallel process is already running on the machine for each time series and another one cannot be nested inside it. This works well for data that contains many time series where you might only want to run a few simpler models, and in scenarios where cloud computing is not available.

If “parallel_processing” is set to NULL and “run_model_parallel” is set to TRUE within the “forecast_time_series” function, then each time series is run sequentially but the hyperparameter tuning, model refitting, and model averaging are run in parallel. This works well for data that has a limited number of time series where you want to run a lot of back testing and build dozens of models within Finn.
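
As a rough sketch of the two local options described above, the calls might look like the following. The wrapper function names and run names are hypothetical, the remaining arguments are copied from the Azure Batch example later in this document, and `hist_data` is assumed to be a data frame prepared the same way as in that example.

```r
# Hypothetical wrappers; finnts is only needed once a wrapper is called.

# Many time series, a few simpler models: one time series per core.
forecast_many_series <- function(hist_data) {
  finnts::forecast_time_series(
    input_data = hist_data,
    combo_variables = c("id"),
    target_variable = "value",
    date_type = "month",
    forecast_horizon = 3,
    parallel_processing = "local_machine",
    run_name = "local_machine_forecast_test" # hypothetical run name
  )
}

# Few time series, many models: time series run one at a time, while
# hyperparameter tuning, model refitting, and model averaging run in parallel.
forecast_few_series <- function(hist_data) {
  finnts::forecast_time_series(
    input_data = hist_data,
    combo_variables = c("id"),
    target_variable = "value",
    date_type = "month",
    forecast_horizon = 3,
    parallel_processing = NULL,
    run_model_parallel = TRUE,
    run_name = "run_model_parallel_forecast_test" # hypothetical run name
  )
}
```

Only the “parallel_processing” and “run_model_parallel” arguments differ between the two wrappers, so switching between the options is a small change.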

Within Azure using Azure Batch

To leverage the full power of Finn, running within Azure is the best choice for building production-ready forecasts that can easily scale. The most efficient way to run Finn is to set “parallel_processing” to “azure_batch” and set “run_model_parallel” to TRUE within the “forecast_time_series” function. This will run each time series on a separate virtual machine (VM) in Azure. Within each VM, hyperparameter tuning, model refitting, and model averaging are all done in parallel across the cores available on the machine.

Azure Batch is a powerful resource from Microsoft Azure that allows for easily scalable parallel compute. Finn leverages the doAzureParallel and rAzureBatch packages built by Microsoft to connect to Azure Batch. Refer to their GitHub site for more information about how it works under the hood and how to set up your own Azure Batch resource to use with Finn.

In order to have Finn live on CRAN, it cannot contain any package dependencies that live outside of CRAN. The doAzureParallel and rAzureBatch packages are only on GitHub, so they will have to be installed and called outside of Finn.

Reference the example below to understand how to connect to an Azure Batch compute cluster and submit forecasts to run in the cloud. Make sure to enter your specific Azure account information.

# load CRAN libraries
library(finnts)
library(devtools)
library(dplyr) # provides the %>% pipe used when preparing hist_data

# install GitHub libraries (not available on CRAN)
devtools::install_github("Azure/rAzureBatch")
devtools::install_github("Azure/doAzureParallel")

# load GitHub libraries
library(rAzureBatch)
library(doAzureParallel)

# create azure batch cluster info
azure_batch_credentials <- list(
  "sharedKey" = list(
    "batchAccount" = list(
      "name" = "<insert resource name>",
      "key" = "<insert compute key>",
      "url" = "<insert resource URL>"
    ),
    "storageAccount" = list(
      "name" = "<insert resource name>",
      "key" = "<insert compute key>",
      "endpointSuffix" = "core.windows.net"
    )
  ),
  "githubAuthenticationToken" = "",
  "dockerAuthentication" = list("username" = "",
                                "password" = "",
                                "registry" = "")
)

azure_batch_cluster_config <- list(
  "name" = "<insert compute cluster name>",
  "vmSize" = "Standard_D5_v2", # solid VM size that has worked well in the past with Finn forecasts
  "maxTasksPerNode" = 1, # tells the cluster to only run one unique time series for each VM. That enables us to then run another layer of parallel processing within the VM
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 1,
      "max" = 100
    ),
    "autoscaleFormula" = "QUEUE" # automatically scales up VMs as more jobs get sent to the cluster
  ),
  "containerImage" = "mftokic/finn-azure-batch-dev", # docker image you can use that automatically downloads software needed for Finn to run in cloud
  "rPackages" = list(
    "cran" = c('Rcpp', 'modeltime', 'modeltime.resample', 'parsnip', 'tune', 'recipes', 'rsample', 'workflows', 'dials', 'lubridate', 'rules', 'Cubist', 'earth', 'kernlab', 'doParallel', 'dplyr', 'tibble', 'tidyr', 'purrr', 'stringr', 'prophet', 'glmnet', 'gtools'), # finnts package dependencies
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

# create or connect to existing Azure Batch cluster
doAzureParallel::setCredentials(azure_batch_credentials)
cluster <- doAzureParallel::makeCluster(azure_batch_cluster_config)
doAzureParallel::registerDoAzureParallel(cluster)

# call Finn with Azure Batch parallel processing
hist_data <- timetk::m4_monthly %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

finn_output <- finnts::forecast_time_series(
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  parallel_processing = "azure_batch",
  run_model_parallel = TRUE, # run tuning, refitting, and averaging in parallel within each VM
  run_name = "azure_batch_forecast_test"
)

# optional code to delete compute cluster
doAzureParallel::stopCluster(cluster)

The best part of Azure Batch is how easily it can scale to more compute as needed. In the above example, the lowest number of VMs running at any time will be 2 (1 dedicated plus 1 low-priority node), and the cluster can scale to 300 (200 dedicated plus 100 low-priority nodes) when needed. This allows you to pay for extra compute only when you need it, and allows forecasts to run that much quicker. You can have separate Finn forecasts (different data sets or inputs) submitted to the same Azure Batch cluster to all run in parallel. How cool is that?!
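
The 2 and 300 figures come from adding the dedicated and low-priority node limits in the cluster configuration. A small sketch of that arithmetic, using only the “poolSize” values from the example above:

```r
# Node limits copied from the "poolSize" entry of the cluster configuration
pool_size <- list(
  "dedicatedNodes" = list("min" = 1, "max" = 200),
  "lowPriorityNodes" = list("min" = 1, "max" = 100)
)

# Total VMs is the sum across both node types
min_vms <- sum(sapply(pool_size, function(nodes) nodes$min))
max_vms <- sum(sapply(pool_size, function(nodes) nodes$max))

cat("VMs running:", min_vms, "to", max_vms, "\n")
# prints: VMs running: 2 to 300
```

Changing any of the four limits in the configuration changes the range the cluster can scale across.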

To keep your Azure resource keys secure without hard coding them into an R script, check out the Azure Key Vault package to safely retrieve and leverage keys/secrets when using Finn.
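
As one possible sketch of that approach, the credentials list can be built from a lookup function so that no key appears in the script. The helper names and the secret names below are hypothetical, and the `key_vault()` and `vault$secrets$get()` calls reflect how the AzureKeyVault package is commonly used, so check its documentation before relying on them.

```r
# Build the credentials list from a secret-lookup function.
# `get_secret` is any function(name) that returns a string.
# The secret names are hypothetical; use the names stored in your vault.
build_batch_credentials <- function(get_secret) {
  list(
    "sharedKey" = list(
      "batchAccount" = list(
        "name" = get_secret("batch-account-name"),
        "key" = get_secret("batch-account-key"),
        "url" = get_secret("batch-account-url")
      ),
      "storageAccount" = list(
        "name" = get_secret("storage-account-name"),
        "key" = get_secret("storage-account-key"),
        "endpointSuffix" = "core.windows.net"
      )
    ),
    "githubAuthenticationToken" = "",
    "dockerAuthentication" = list("username" = "",
                                  "password" = "",
                                  "registry" = "")
  )
}

# Lookup backed by Azure Key Vault (assumed AzureKeyVault usage).
# AzureKeyVault is only needed once key_vault_lookup() is called.
key_vault_lookup <- function(vault_url) {
  vault <- AzureKeyVault::key_vault(vault_url)
  function(name) vault$secrets$get(name)$value
}
```

With a vault in place, `build_batch_credentials(key_vault_lookup("<insert vault URL>"))` would return a list with the same shape as the hard-coded credentials passed to `doAzureParallel::setCredentials`. Taking the lookup as an argument also makes it easy to swap in a stub function while testing.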

Important Note: The Finn team is working on new parallel compute options in Azure since Azure Batch in R is deprecated and will eventually stop being supported. New options we are exploring include Spark on Azure Synapse and Azure Functions. Stay tuned!