Introduction to gratis

Bocong Zhao

About gratis

The gratis package generates time series with diverse and controllable characteristics. It provides an efficient and general approach, based on Gaussian mixture autoregressive (MAR) models, for generating a wide range of non-Gaussian and nonlinear time series.

The generated datasets can serve as diverse and controllable benchmarking data in the time series domain. They can also be used to evaluate algorithms for tasks such as time series forecasting and classification, with minimal human effort and computational resources.

Introduction to the gratis mechanism

By simulating time series from mixture autoregressive models, gratis can cover a wide range of time series and explore the diversity of a time series feature space.

Furthermore, by tuning the parameters of the mixture autoregressive model, gratis can also efficiently generate new time series with controllable features.
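
To make the idea of a mixture autoregressive model concrete, here is a minimal base R sketch of a two-component Gaussian MAR process of order 1. It is not the code gratis runs internally: the function name simulate_mar and all parameter values are assumptions chosen for illustration. At each time step one component is picked according to the mixing weights, and the next value is drawn from that component's autoregression.

```r
# Minimal sketch of a Gaussian mixture autoregressive (MAR) process of order 1.
# simulate_mar and every parameter value below are illustrative assumptions,
# not gratis internals.
simulate_mar <- function(n, weights, phi0, phi1, sigma, burn_in = 100) {
  total <- n + burn_in
  x <- numeric(total)
  for (i in 2:total) {
    # Pick the mixing component for this time step according to the weights
    k <- sample(seq_along(weights), size = 1, prob = weights)
    # Draw the next value from the chosen component's Gaussian AR(1) model
    x[i] <- phi0[k] + phi1[k] * x[i - 1] + rnorm(1, mean = 0, sd = sigma[k])
  }
  # Drop the burn-in so the result does not depend on the starting value
  x[(burn_in + 1):total]
}

set.seed(1)
y <- simulate_mar(
  n = 120,
  weights = c(0.7, 0.3),   # mixing weights, must sum to 1
  phi0 = c(0, 1),          # component intercepts
  phi1 = c(0.5, -0.3),     # component AR(1) coefficients
  sigma = c(1, 2)          # component standard deviations
)
y_ts <- ts(y, frequency = 12)
```

Because the component changes from step to step, the resulting series can be non-Gaussian and nonlinear even though each component is a simple Gaussian autoregression.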

# load package
library(gratis)

Generate diverse time series

We use the function generate_ts() to generate diverse time series.

Our generation process uses distributions instead of fixed parameter values in the underlying models, which allows it to generate diverse time series instances. The diversity of the generated time series should not rely on the parameter settings.
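
As a rough illustration of drawing parameters from distributions rather than fixing them, the sketch below samples three independent parameter sets for a two-component model. The helper draw_mar_params and the specific distributions (uniform weights, normal coefficients, uniform standard deviations) are assumptions made for this example and are not necessarily the ones gratis uses.

```r
# Draw one random parameter set for a MAR model with n_comp components and
# AR order p. The distributions are illustrative assumptions, not gratis internals.
draw_mar_params <- function(n_comp, p) {
  raw <- runif(n_comp)
  list(
    # Mixing weights, normalised so that they sum to 1
    weights = raw / sum(raw),
    # One row of AR coefficients per component
    phi = matrix(rnorm(n_comp * p, sd = 0.5), nrow = n_comp),
    # One standard deviation per component
    sigma = runif(n_comp, min = 0.5, max = 2)
  )
}

set.seed(42)
params <- replicate(3, draw_mar_params(n_comp = 2, p = 2), simplify = FALSE)

# Each draw gives a different model, so each simulated series differs in character
weight_sums <- vapply(params, function(ps) sum(ps$weights), numeric(1))
```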

Definitions

Here are the definitions of parameter settings in function generate_ts():

| Parameter | Definition |
|-----------|------------|
| n.ts | number of time series to be generated |
| freq | seasonal period of the time series to be generated |
| nComp | number of mixing components when simulating time series using MAR models |
| n | length of the generated time series |

Example

Suppose we want to use a MAR model to generate 3 time series from random parameter spaces. Each time series has a seasonal period of 12, 2 mixing components and length 120.

generate_ts() can return its output in tsibble format when the parameter output_format is set to "tsibble". If the parameter is not set, the output stays in the default list format.

1. Generate diverse time series with ‘tsibble’ output format

generate_ts(n.ts = 3, freq = 12, nComp = 2, n = 120, output_format = "tsibble")
#> $N1
#> # A tsibble: 120 x 2 [1M]
#>       index value
#>       <mth> <dbl>
#>  1 0001 Jan  2.00
#>  2 0001 Feb  6.59
#>  3 0001 Mar  3.71
#>  4 0001 Apr  5.78
#>  5 0001 May  3.68
#>  6 0001 Jun  8.48
#>  7 0001 Jul  6.02
#>  8 0001 Aug 10.7 
#>  9 0001 Sep 10.6 
#> 10 0001 Oct 12.1 
#> # ... with 110 more rows
#> 
#> $N2
#> # A tsibble: 120 x 2 [1M]
#>       index  value
#>       <mth>  <dbl>
#>  1 0001 Jan -2.19 
#>  2 0001 Feb -4.79 
#>  3 0001 Mar -5.54 
#>  4 0001 Apr -5.74 
#>  5 0001 May -2.01 
#>  6 0001 Jun  2.51 
#>  7 0001 Jul -0.510
#>  8 0001 Aug  0.411
#>  9 0001 Sep -0.282
#> 10 0001 Oct -4.61 
#> # ... with 110 more rows
#> 
#> $N3
#> # A tsibble: 120 x 2 [1M]
#>       index value
#>       <mth> <dbl>
#>  1 0001 Jan  58.9
#>  2 0001 Feb  63.8
#>  3 0001 Mar  66.7
#>  4 0001 Apr  71.6
#>  5 0001 May  76.0
#>  6 0001 Jun  77.6
#>  7 0001 Jul  83.3
#>  8 0001 Aug  81.9
#>  9 0001 Sep  90.5
#> 10 0001 Oct  94.0
#> # ... with 110 more rows

Output

Three different time series are simulated: N1, N2 and N3. In this example we use time series N1 for further analysis.

As requested, each time series is simulated from a MAR model with 2 mixing components. In the list output, their parameters appear as pars1 and pars2.

Each component has its own mixing weight, stored in weights.

2. Generate diverse time series with ‘list’ output format

x <- generate_ts(n.ts = 3, freq = 12, nComp = 2, n = 120, output_format = "list")
# N1 time series
x$N1$pars
#> $pars1
#> [1]  0.2199556  1.0722465  0.3962213 -0.3677716
#> 
#> $pars2
#> [1] -0.5187467  0.4471215  1.0660018 -0.8475915
#> 
#> $weights
#> [1] 0.7335474 0.2664526
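
As a small check on how the weights can be read, the sketch below copies the two weights printed above and treats them as mixing probabilities. The sampling step is an illustration of that interpretation, not a reproduction of how gratis simulated N1.

```r
# Mixing weights copied from the x$N1$pars output above
weights <- c(0.7335474, 0.2664526)

# The weights form a probability distribution over the two components
total_weight <- sum(weights)

# Illustration: draw which component would drive each of 120 time steps
set.seed(123)
component <- sample(c("pars1", "pars2"), size = 120, replace = TRUE, prob = weights)
share <- table(component) / length(component)
```

The observed shares in share fluctuate around the weights because the components are drawn at random.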

Plot time series

# plot N1 time series
autoplot(x$N1$x)

Generate multiple seasonal time series

Time series can exhibit multiple seasonal patterns of different lengths, especially when the series is observed at a high frequency, such as daily or hourly data.
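
To show what multiple seasonality means in practice, here is a small base R sketch of a daily series that combines a weekly and a yearly cycle. It is a deterministic toy construction with added noise, not a MAR simulation, and all amplitudes are arbitrary assumptions.

```r
# Illustrative daily series with a weekly (7) and a yearly (365) seasonal pattern
set.seed(2024)
n <- 800
time_index <- seq_len(n)

weekly <- 2 * sin(2 * pi * time_index / 7)    # repeats every 7 observations
yearly <- 5 * sin(2 * pi * time_index / 365)  # repeats every 365 observations

# The observed series is the sum of both seasonal patterns plus noise
y <- weekly + yearly + rnorm(n, sd = 0.5)
```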

We use the function generate_msts() to generate multiple seasonal time series.

Definitions

Here are the definitions of parameter settings in function generate_msts():

| Parameter | Definition |
|-----------|------------|
| seasonal.periods | a vector of seasonal periods of the time series to be generated |
| nComp | number of mixing components when simulating time series using MAR models |
| n | length of the generated time series |

Example

Suppose we want to use a MAR model to generate a time series of length 800 with 2 mixing components from random parameter spaces. In particular, this time series has two seasonal periods, 7 and 365.

generate_msts() can return its output in tsibble format when the parameter output_format is set to "tsibble". If the parameter is not set, the output stays in the default list format.

1. Generate multiple seasonal time series with ‘tsibble’ output format

generate_msts(seasonal.periods = c(7, 365), n = 800, nComp = 2, output_format = "tsibble")
#> # A tsibble: 800 x 2 [9.99999999999699e-07]
#>    index  value
#>    <dbl>  <dbl>
#>  1  1    -1.32 
#>  2  1.00 -0.891
#>  3  1.01 -0.800
#>  4  1.01 -1.21 
#>  5  1.01 -0.894
#>  6  1.01 -1.48 
#>  7  1.02 -1.18 
#>  8  1.02 -1.24 
#>  9  1.02 -1.07 
#> 10  1.02 -1.14 
#> # ... with 790 more rows

2. Generate multiple seasonal time series with ‘list’ output format

x <- generate_msts(seasonal.periods = c(7, 365), n = 800, nComp = 2, output_format = "list")

Plot time series

autoplot(x)

Generate time series with controllable features

A time series analysis with a particular focus may only be interested in a certain area of the feature space or in a subset of features.

Our function generate_ts_with_target() can efficiently generate time series with target features.

The underlying principle is to use a genetic algorithm to tune the MAR parameters until the distance between the target feature vector and the feature vector of a sample of time series simulated from the MAR model is approximately 0.
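
The tuning idea can be sketched with a much simpler stand-in. The example below uses a basic evolutionary search (selection plus random mutation, not the genetic algorithm implementation that gratis relies on) to tune a single AR(1) coefficient until one feature, the lag-1 autocorrelation, is close to a target. The model, the feature, and all names are assumptions chosen to keep the example short.

```r
# Feature to control: lag-1 autocorrelation of a series
lag1_acf <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Simulate an AR(1) series from fixed innovations so that fitness is deterministic
set.seed(7)
innov <- rnorm(200)
simulate_ar1 <- function(phi) {
  as.numeric(stats::filter(innov, phi, method = "recursive"))
}

# Fitness is the negative distance to the target, so the best possible value is 0
target <- 0.6
fitness <- function(phi) -abs(lag1_acf(simulate_ar1(phi)) - target)

# Simple evolutionary loop: keep the best candidates, mutate them, repeat
population <- runif(20, min = -0.9, max = 0.9)
for (iter in 1:30) {
  scores <- vapply(population, fitness, numeric(1))
  parents <- population[order(scores, decreasing = TRUE)][1:5]
  children <- rep(parents, each = 3) + rnorm(15, sd = 0.05)
  # Keep the candidates inside the stationary region of the AR(1) model
  children <- pmin(pmax(children, -0.95), 0.95)
  population <- c(parents, children)
}

final_scores <- vapply(population, fitness, numeric(1))
best <- population[which.max(final_scores)]
```

The fitness is defined as a negative distance, which matches the sign convention of the Mean and Best values in the GA progress lines printed by generate_ts_with_target(): values closer to 0 are better.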

Definitions

Here are the definitions of parameter settings in function generate_ts_with_target():

| Parameter | Definition |
|-----------|------------|
| n | number of time series to be generated |
| ts.length | length of the time series to be generated |
| freq | frequency of the time series to be generated |
| seasonal | 0 for non-seasonal data, 1 for single-seasonal data, and 2 for multiple seasonal data |
| features | a vector of function names |
| selected.features | selected features to be controlled |
| target | target feature values |
| parallel | an optional argument that specifies whether the genetic algorithm should be run sequentially or in parallel |

Example

Suppose we want to use a MAR model to generate one non-seasonal time series with frequency 1 and length 60. We control two selected features, entropy and trend, with target values 0.6 and 0.9 respectively.

generate_ts_with_target() can return its output in tsibble format when the parameter output_format is set to "tsibble". If the parameter is not set, the output stays in the default list format.

1. Generate time series with target features with ‘tsibble’ output format

generate_ts_with_target(
  n = 1, ts.length = 60, freq = 1, seasonal = 0,
  features = c('entropy', 'stl_features'),
  selected.features = c('entropy', 'trend'),
  target = c(0.6, 0.9),
  parallel = FALSE,
  output_format = "tsibble"
)
#> GA | iter = 1 | Mean = -23.54402483 | Best =  -0.04530665
#> # A tsibble: 60 x 2 [1]
#>    index   value
#>    <dbl>   <dbl>
#>  1     1 -0.875 
#>  2     2 -0.0970
#>  3     3 -0.485 
#>  4     4 -2.50  
#>  5     5 -3.00  
#>  6     6 -1.88  
#>  7     7 -0.995 
#>  8     8  0.303 
#>  9     9 -0.350 
#> 10    10 -0.518 
#> # ... with 50 more rows

2. Generate time series with target features with ‘list’ output format

x <- generate_ts_with_target(
  n = 1, ts.length = 60, freq = 1, seasonal = 0,
  features = c('entropy', 'stl_features'),
  selected.features = c('entropy', 'trend'),
  target = c(0.6, 0.9),
  parallel = FALSE,
  output_format = "list"
)
#> GA | iter = 1 | Mean = -16.92329368 | Best =  -0.08841775

Plot time series

autoplot(x)