When there is a need to select the most appropriate forecasting model or method for the data, the forecasters usually split the available sample into two parts: in-sample (aka “training set”) and holdout sample (or out-sample, or “test set”). The model is then estimated on in-sample and its forecasting performance is evaluated using some error measure on the holdout sample.

If such a procedure done only once, then this is called “fixed origin” evaluation. However, the time series might contain outliers or level shifts and a poor model might perform better than the more appropriate one only because of that. In order to robustify the evaluation of models, something called “rolling origin” is used.

Rolling origin is an evaluation technique according to which the forecasting origin is updated successively and the forecasts are produced from each origin (Tashman 2000). This technique allows obtaining several forecast errors for time series, which gives a better understanding of how the models perform. There are different options of how this can be done.

The figure below (from (Svetunkov and Petropoulos 2018)) depicts the basic idea of rolling origin. White cells correspond to the in-sample data, while the light grey cells correspond to the three-steps-ahead forecasts. Time series has 25 observations in that figure, and the forecasts are produced from 8 origins, starting from the origin 15. The model is re-estimated on each iteration, and the forecasts are produced. After that a new observation is added at the end of the series and the procedure continues. The process stops when there is no more data to add. This could be considered as a rolling origin with a **constant holdout** sample size. As a result of this procedure 8 one to three steps ahead forecasts are produced. Based on them we can calculate the preferred error measures and choose the best performing model.

Another option of producing forecasts from 8 origins would be to start from the origin 17 instead of 15 (see Figure below). In this case the procedure continues until origin 22, when the last three-steps-ahead forecast is produced, and then continues with the decreasing forecasting horizon. So the two-steps-ahead forecast is produced from the origin 23 and only one-step-ahead forecast is produced from the origin 24. As a result we obtain 8 one-step-ahead forecasts, 7 two-steps-ahead forecasts and 6 three-steps-ahead forecasts. This can be considered as a rolling origin with a **non-constant holdout** sample size. This can be useful in cases of small samples, when we don’t have any observations to spare.

Finally, in both of the cases above we had the **increasing in-sample** size. However for some research purposes we might need a **constant in-sample**. The figure below demonstrates such a situation. In this case on each iteration we add an observation at the end of the series and remove one from the beginning of the series (dark grey cells).

The function `ro()`

from `greybox`

package (written by Yves Sagaert and Ivan Svetunkov in 2016 on the way to the International Symposium on Forecasting) implements the rolling origin evaluation for any function you like with a predefined `call`

and returns the desired `value`

. It heavily relies on the two variables: `call`

and `value`

- so it is quite important to understand how to formulate them in order to get the desired results. Overall, `ro()`

is a very flexible function, but, as a result, it is not very simple. Let’s see how it works.

We start with a simple example, generating series from normal distribution:

We use ARIMA(0,1,1) for this example:

The call that we specify includes two important elements: `data`

and `h`

. `data`

specifies where the in-sample values are located in the function that we want to use, and it needs to be called “data” in the call. `h`

will tell our function, where the forecasting horizon is specified in the selected function. Note that in this example we use `arima(x=data,order=c(0,1,1))`

, which produces a desired ARIMA(0,1,1) model and then we use `predict(...,n.ahead=h)`

, which produces a forecast from that model. The part `arima(x=data,order=c(0,1,1))`

can also be simplified to `arima(data,order=c(0,1,1))`

according to the general rules of R.

Having the call, we need also to specify what the function should return. This can be the conditional mean (point forecasts), prediction intervals, the parameters of a model, or, in fact, anything that the model returns (e.g. name of the fitted model and its likelihood). However, there are some differences in what the `ro()`

returns depending on what the function you use returns. If it is a vector, then `ro()`

will produce a matrix (with values for each origin in columns). If it is a matrix, then an array is returned. Finally, if it is a list, then a list of lists is returned.

In order not to overcomplicate things, let’s start with collecting the conditional mean from the `predict()`

function:

Now that we have specified `ourCall`

and `ourValue`

, we can produce forecasts from the model using rolling origin. Let’s say that we want three-steps-ahead forecasts and 8 origins with the default values of all the other parameters:

The function returns a list with all the values that we asked for plus the actual values from the holdout sample. We can calculate some basic error measure based on those values, for example, scaled Mean Absolute Error (Petropoulos and Kourentzes 2015):

```
apply(abs(returnedValues1$holdout - returnedValues1$pred),1,mean,na.rm=TRUE) / mean(returnedValues1$actuals)
#> h1 h2 h3
#> 0.06984269 0.05315813 0.05700642
```

In this example we use `apply()`

function in order to distinguish between the different forecasting horizons and have an idea of how the model performs for each of them. In a similar manner we could evaluate the performance of some other model and compare the errors with the ones produced by the first one. These numbers do not tell us much on their own, but if we compared the performance of this model with another one, then we could infer if one model is more appropriate to the data than the other one.

We can also plot the forecasts from the rolling origin, which shows how the selected model behaves:

In this example the forecasts from different origins are close to each other. This is because the data is stationary and the model is quite stable.

If we have a look at the returned matrices, we will notice that they contain NAs:

```
returnedValues1$holdout
#> origin93 origin94 origin95 origin96 origin97 origin98 origin99
#> h1 77.91140 101.65422 85.88601 97.05167 97.66526 103.8089 88.7737
#> h2 101.65422 85.88601 97.05167 97.66526 103.80892 88.7737 101.9843
#> h3 85.88601 97.05167 97.66526 103.80892 88.77370 101.9843 NA
#> origin100
#> h1 101.9843
#> h2 NA
#> h3 NA
```

This is because by default the holdout sample is set to be non-constant: `co=FALSE`

(constant out-sample = FALSE). The in-sample is also set to non-constant, that is why the model will be re-estimated on each iteration on the increasing sample. If we want to change that we might call `ro()`

differently:

Note that the values from the `returnedValues2`

are not directly comparable with the ones from `returnedValues1`

, because they are produced from the different origins. This becomes more apparent when we plot things:

If you decide to use functions from `forecast`

package, the call and returned values can be modified the following way:

Note that we ask here for only one level. If we don’t then the `forecast`

function will return matrices for upper and lower values instead of vectors, and those cannot be adequately transformed into the desired format, so the ro() will return an error.

As you see, `ro()`

is a convenient function, when you have one model and one time series. But what if you need to apply different models to different time series? We would need a loop, right? Yes. And there is a simple way to use `ro()`

in this case. Let’s continue our example but now introduce several time series:

We would need an array of the returned values for this example:

Here we will have 3 time series, 2 models and 3-steps-ahead forecasts from 8 origins. Our models will be saved in a separate lists. We will have ARIMA(0,1,1) and ARIMA(1,1,0) in this example:

We will return the same `pred`

value from the function, as we did before, but we need to change the call, as now we will have to take these two different models into account:

As you see, instead of specifying the model directly, we use the i-th element of the list.

We also want to save the actual values from the holdout in order to be able to calculate the error measures:

This array has dimensions for 3 time series and 3-steps-ahead forecast from 8 origins.

Finally, we can write a loop and produce the forecasts:

```
for(j in 1:3){
for(i in 1:2){
ourdata <- x[,j]
ourROReturn <- ro(data=ourdata, h=3, origins=8, call=ourCall,
value=ourValue, co=TRUE)
ourForecasts[j,i,,] <- ourROReturn$pred
}
ourHoldoutValues[j,,] <- ourROReturn$holdout
}
```

Although we do not specify `i`

explicitly anywhere in the call of `ro()`

above, it is used in `ourCall`

and as a result the different models will be estimated in the inner loop. Comparing the performance of the two on the different time series we have (this is RelMAE from (Davydenko and Fildes 2013)):

```
exp(mean(log(apply(abs(ourHoldoutValues - ourForecasts[,1,,]),1,mean,na.rm=TRUE) / apply(abs(ourHoldoutValues - ourForecasts[,2,,]),1,mean,na.rm=TRUE))))
#> [1] 0.7926177
```

So based on these results, it can be concluded, that ARIMA(0,1,1) is on average more accurate than ARIMA(1,1,0) on our three time series.

For our last examples we create the data frame and try fitting linear regression:

```
xreg <- matrix(rnorm(120*3,c(100,50,150),c(10,5,15)), 120, 3, byrow=TRUE)
y <- 0.5*xreg[,1] + 0.2*xreg[,2] + 0.75*xreg[,3] + rnorm(120,0,10)
xreg <- cbind(y,xreg)
colnames(xreg) <- c("y",paste0("x",c(1:3)))
xreg <- as.data.frame(xreg)
```

Note that in this example we cheat `ro()`

function and make a call, that does not contain either `data`

or `h`

, because the regression implemented in `lm()`

function relies on data frames and does not use forecasting horizon:

In this case we just need to make sure that the Global Environment contains the xreg data frame. `counti`

variable in the call is needed for the internal loop - it determines the length of the in-sample. Similarly `counto`

specifies the length of the holdout. Both of them are defined inside the `ro()`

function.

In addition, we don’t need to specify `ourValue`

, because the function `predict.lm()`

returns a matrix with values (or a vector if we don’t ask for intervals), not a list. The final call to `ro()`

is as usual pretty simple:

As a final example, we consider ARIMAX model for the following data:

```
xreg <- matrix(rnorm(120*3,c(100,50,150),c(10,5,15)), 120, 3, byrow=TRUE)
y <- 0.5*xreg[,1] + 0.2*xreg[,2] + 0.75*xreg[,3] + rnorm(120,0,10)
colnames(xreg) <- paste0("x",c(1:3))
xreg <- as.data.frame(xreg)
```

and modify the call accordingly:

Taking that now we deal with ARIMA, we need to specify both `data`

and `h`

. Furthermore, `xreg`

is different than in the previous example, as it now should not contain the response variable.

As usual, we need our function to return the specific predicted values:

And then we call for the function and get the results:

As a side note, if you use `smooth`

package for R and want to do rolling origin evaluation, for example, for ETSX model using `es()`

function, the call can be simplified to:

Here we use `countf`

variable, which specifies that we need to use both in-sample and the holdout (the full sample), as the function `es()`

does not rely on `forecast()`

or `predict()`

and produces forecasts internally.

Finally, all of the examples mentioned above can be done in parallel, which is triggered by `parallel`

parameter. Note that this might be useful for cases, when number of origins is very high and the sample size is large. Otherwise this might be less efficient than the serial calculation. If you want to do anything in parallel, then you either need either `doMC`

(Linux and Mac OS) or `doParallel`

(for Windows) package in R. The number of cores used in this case is equal to the number of cores of your CPU minus one.

Davydenko, Andrey, and Robert Fildes. 2013. “Measuring Forecasting Accuracy: The Case of Judgmental Adjustments to Sku-Level Demand Forecasts.” *International Journal of Forecasting* 29 (3). Elsevier B.V.: 510–22. https://doi.org/10.1016/j.ijforecast.2012.09.002.

Petropoulos, Fotios, and Nikolaos Kourentzes. 2015. “Forecast combinations for intermittent demand.” *Journal of the Operational Research Society* 66 (6). Nature Publishing Group: 914–24. https://doi.org/10.1057/jors.2014.62.

Svetunkov, Ivan, and Fotios Petropoulos. 2018. “Old dog, new tricks: a modelling view of simple moving averages.” *International Journal of Production Research* 56 (18). Taylor & Francis: 6034–47. https://doi.org/10.1080/00207543.2017.1380326.

Tashman, Leonard J. 2000. “Out-of-sample tests of forecasting accuracy: An analysis and review.” *International Journal of Forecasting* 16 (4): 437–50. https://doi.org/10.1016/S0169-2070(00)00065-0.