Forecasting with Multiple Time Series

Nickalus Redell

2019-11-22

Purpose

The purpose of this vignette is to provide a quick overview of forecasting with multiple time-series in forecastML. The benefits to modeling multiple time-series in one go with a single model or ensemble of models include (a) modeling simplicity, (b) potentially more robust results from pooling data across time-series, and (c) solving the cold-start problem when few data points are available for a given time-series.

We won’t concern ourselves with error metrics or hyperparameter evaluation which can be assessed with the return_error() and return_hyper() functions. This functionality is covered in the package overview vignette.

Setup

To forecast with multiple/grouped/hierarchical time-series in forecastML, your data needs the following characteristics:

Example

To illustrate forecasting with multiple time-series, we’ll use the data_buoy dataset that comes with the forecastML package. This dataset consists of daily sensor measurements of several environmental conditions collected by 14 buoys in Lake Michigan from 2012 through 2018. The data were obtained from NOAA’s National Buoy Data Center available at https://www.ndbc.noaa.gov/ using the rnoaa package.

Load packages and data

data_buoy_gaps consists of:

forecastML::fill_gaps

## [[1]]
## [1] "The original dataset with gaps in data collection is 23646 rows."
## 
## [[2]]
## [1] "The modified dataset with no gaps in data collection from fill_gaps() is 31225 rows."

Dynamic features

Plot wind speed outcome

Modeling setup



Model training with nested CV

Training dataset

  • We have 3 datasets for training models that forecast 1, 1 to 7, and 1 to 30 days into the future. We’ll view the 1-day-ahead training data below.
## [1] "The class of `data_train` is grouped_lagged_df"
## [2] "The class of `data_train` is lagged_df"        
## [3] "The class of `data_train` is list"


  • The plot below shows the feature map across forecast horizons. Here, we set all non-dynamic/static features to have the same lags (refer to the custom lags vignette to see how this could be modified). Notice that features that don’t support direct forecasting to the given horizon are silently dropped.

CV setup

  • We’ll model with 3 validation datasets. Given that our measurements are taken daily, we’ll set the skip = 730 argument to skip 2 years between validation datasets.

  • Now we’ll use the group_filter = "buoy_id == 1" argument to get a closer look at 1 of our 14 time-series. The user-supplied filter is passed to dplyr::filter() internally.

User-defined modeling function

  • A user-defined wrapper function for model training that takes the following arguments:
    • 1: A horizon-specific data.frame made with create_lagged_df(..., type = "train") (e.g., my_lagged_df$horizon_h),
    • 2: optionally, any number of additional named arguments which can be passed as ‘…’ in train_model()
    • and returns a model object that will be passed into the user-defined predict() function.

Any data transformations, hyperparameter tuning, or inner loop cross-validation procedures should take place within this function, with the limitation that it ultimately needs to return() a model suitable for the user-defined predict() function; a list can be returned to capture meta-data such as hyperparameter results.

  • Notice that the xgboost-specific input datasets are created within this wrapper function.


Model training

  • This should take ~1 minute to train our ‘3 forecast horizons’ * ‘3 validation datasets’ = 9 models.

  • The user-defined modeling wrapper function could be much more elaborate, in which case many more models could potentially be trained here.

  • These models could be trained in parallel on any OS with the very flexible future package by uncommenting the code below and setting use_future = TRUE. To avoid nested parallelization, models are either trained in parallel across forecast horizons or validation windows, whichever is longer (when equal, the default is parallel across forecast horizons).

## [1] "The class of `model_results_cv` is forecast_model"
## [2] "The class of `model_results_cv` is list"


  • We can access the xgboost model for any horizon or validation window. Here, we show a summary() of the model for the first validation window which is 2012.
##                 Length Class              Mode       
## handle               1 xgb.Booster.handle externalptr
## raw             308719 -none-             raw        
## best_iteration       1 -none-             numeric    
## best_ntreelimit      1 -none-             numeric    
## best_score           1 -none-             numeric    
## niter                1 -none-             numeric    
## evaluation_log       3 data.table         list       
## call                10 -none-             call       
## params               5 -none-             list       
## callbacks            2 -none-             list       
## feature_names      128 -none-             character  
## nfeatures            1 -none-             numeric

Forecasting with nested models

User-defined prediction function

The following user-defined prediction function is needed for each model:

  • A wrapper function that takes the following 2 positional arguments:
    • 1: The model returned from the user-defined modeling function.
    • 2: A data.frame of the model features from forecastML::create_lagged_df(..., type = "train").
  • and returns a data.frame of predictions with 1 or 3 columns. A 1-column data.frame will produce point forecasts, and a 3-column data.frame can be used to return point, lower, and upper forecasts (column names and order do not matter).


Historical model fit

  • Herer, we’re predicting on our validation datasets.
## [1] "The class of `data_pred_cv` is training_results"
## [2] "The class of `data_pred_cv` is forecast_model"  
## [3] "The class of `data_pred_cv` is data.frame"
  • The predictions are solid lines; the actuals are dashed lines.

  • We’ll filter this plot for closer inspection below.


  • It’s somewhat difficult to see how we’ve done here, so we’ll use the group_filter argument again to focus on specific results. Note that the plot() function returns a ggplot object so that we can easily modify our plot.

  • Notice that during the first half of 2015 there were no wind speed measurements recorded; however, we’re getting some interesting variability in predictions here because the day of the month and year dynamic features have captured some seasonality.

Forecasting

  • We have 3 datasets that support forecasting 1, 1 to 7, and 1 to 30 days into the future. We’ll view the 1-day-ahead forecasting data below.
## [1] "The class of `data_forecast` is grouped_lagged_df"
## [2] "The class of `data_forecast` is lagged_df"        
## [3] "The class of `data_forecast` is list"

Dynamic features and forecasting

  • Our dynamic features ‘day’ and ‘year’ were not lagged in our modeling dataset. This was the right choice from a modeling perspective; however, in order to forecast ‘h’ steps ahead, we need to know their future values for each forecast horizon. At present, there’s no function in forecastML to autofill the future values of dynamic, non-lagged features so we’ll simply do it manually below.

Forecast

  • Now we’ll forecast 1, 1:7, and 1:30 days into the future with predict(..., data = data_forecast).

  • The first time step into the future is max(dates) + 1 * frequency. Here, this is 12-31-2018 + 1 * ‘1 day’ or 1-1-2019.

## [1] "The class of `data_forecasts` is forecast_results"
## [2] "The class of `data_forecasts` is forecast_model"  
## [3] "The class of `data_forecasts` is data.frame"
  • Plots for each model–just xgboost here–and time horizon. The separate lines are for the combination of buoy_id * validation window.

  • We’ll filter results to look at the forecasts across validation windows for buoy_id == 1. In this case, a validation window shows how the model would forecast if it were not trained on data in a given time frame.

Model training all data

## [1] "The class of `model_results_no_cv` is forecast_model"
## [2] "The class of `model_results_no_cv` is list"
## [1] "The class of `data_forecasts` is forecast_results"
## [2] "The class of `data_forecasts` is forecast_model"  
## [3] "The class of `data_forecasts` is data.frame"