CoSMoS - A Complete Stochatic Modelling Solution

Filip Strnad, Simon Papalexiou, Yannis Markonis

2019-05-08


Vignette with figures can be found here.


Hydroclimatic processes come in all “shapes and sizes”. They are characterized by different spatiotemporal correlation structures and probability distributions that can be continuous, mixed-type, discrete or even binary. Simulating such processes by reproducing precisely their marginal distribution and linear correlation structure, including features like intermittency, can greatly improve hydrological analysis and design. Traditionally, modelling schemes are case specific and typically attempt to preserve few statistical moments providing inadequate and potentially risky distribution approximations.

Here, a single framework is proposed that unifies, extends, and improves a general-purpose modelling strategy, based on the assumption that any process can emerge by transforming a specific “parent”" Gaussian process. A novel mathematical representation of this scheme, introducing parametric correlation transformation functions, enables straightforward estimation of the parent-Gaussian process yielding the target process after the marginal back transformation, while it provides a general description that supersedes previous specific parameterizations, offering a simple, fast and efficient simulation procedure for every stationary process at any spatiotemporal scale.

This framework, also applicable for cyclostationary and multivariate modelling, is augmented with flexible parametric correlation structures that parsimoniously describe observed correlations. Real-world simulations of various hydroclimatic processes with different correlation structures and marginals, such as precipitation, river discharge, wind speed, humidity, extreme events per year, etc., as well as a multivariate example, highlight the flexibility, advantages, and complete generality of the method.

From Papalexiou (2018).


1 Step by step guide

The implementation of the stochastic simulation framework follows a few simple steps. Our aim is to generate a time series from a given marginal distribution, autocorrelation function and probability zero. The latter is the probability that a sample value is zero and is commonly used in modelling variables with a substantial number of zero values, such as daily precipitation (intermittent processes).

library(CoSMoS)

1.1 Define marginal distribution

Marginal distributions have different number of parameters and different type of parameters (location, scale, shape). For example, the Generalized Gamma distribution has one scale and two shape parameters, e.g., margarg = list(scale = 2, shape1 = 0.9, shape2 = 0.8).

In general, CoSMoS supports all build-in distribution functions of R and of other packages, but also comes with some build-in distributions such as Generalized Gamma or Pareto type II. A list of the CoSMoS distributions with their parameters can be found in Section 3.

Hence, the first step of stochastic simulation with CoSMoS is to set the theoretical distribution that we will simulate, as well as its parameters:

marginaldist <- 'ggamma'
param <- list(scale = 1,
              shape1 = .8,
              shape2 = .8)

1.2 Generate uncorrelated time series

The simplest case is to generate a time series (1000 values) with zero autocorrelation.

This can be done using the random generation rggamma (similarly to rnorm).

no <- 1000
ggamma_sim <- rggamma(n = no, 
                      scale = 1, 
                      shape1 = 1, 
                      shape2 = .5)
plot.ts(ggamma_sim, main = "")
par(mfrow = c(1, 2))
plot(density(ggamma_sim), main = "")
acf(ggamma_sim, main = "")

1.3 Autocorrelation

We can introduce the autocorrelation structure in two ways. Either by using specific lag autocorrelations as a list starting from lag 0 up to a desired lag, or by using a pre-defined parametric autocorrelation structure.

For instance, a time series with lag-1 autocorrelation \(\rho_1 = 0.8\) can be generated by:

acf <- c(1, 0.8)
ggamma_sim <- generateTS(n = no,
                         margdist = marginaldist, 
                         margarg = param,
                         acsvalue = acf)
plot(ggamma_sim, main = "")
par(mfrow = c(1, 2))
plot(density(ggamma_sim[[1]]), main = "")
acf(ggamma_sim[[1]], main = "")

Similarly, we can add more autocorrelation coeficients for a higher number of lags.

acf <- c(1, 0.5, 0.5, 0.4, 0.4) #up to lag-4
ggamma_sim <- generateTS(n = no,
                         margdist = marginaldist, 
                         margarg = param,
                         acsvalue = acf)
plot(ggamma_sim, main = "")
par(mfrow = c(1, 2))
plot(density(ggamma_sim[[1]]), main = "")
acf(ggamma_sim[[1]], main = "")

Another, more sophisticated approach is to use a function that creates the autocorrelation structure. This can be done be the acs() function or the user can create his own function. In the case of acs(), four types of autocorrelation functions are provided:

The acs() can produce values of the linear autocorrelation according to this four structures up to a desired lag. Thus, instead of setting each autocorrelation coefficient explicitly, we can fit a model with one to four arguments depending on the selected structure.

## specify lag
lags <- 0:10

## get the ACS
f <- acs('fgn', t = lags, H = .75)
b <- acs('burrXII', t = lags, scale = 1, shape1 = .6, shape2 = .4)
w <- acs('weibull', t = lags, scale = 2, shape = 0.8)
p <- acs('paretoII', t = lags, scale = 3, shape = 0.3)

## visualize the ACS
dta <- data.frame(lags, f, b, w, p)

m.dta <- melt(dta, id.vars = 'lags')

ggplot(m.dta,
       aes(x = lags,
           y = value,
           group = variable,
           colour = variable)) +
  geom_point(size = 2.5) +
  geom_line(lwd = 1) +
  scale_color_manual(values = c('steelblue4', 'red4', 'green4', 'darkorange'),
                     labels = c('FGN', 'Burr XII', 'Weibull', 'Pareto II'),
                     name = '') +
  labs(x = bquote(lag ~ tau),
       y = 'Acf') +
  scale_x_continuous(breaks = t) +
  theme_grey()

Back to our example, we can implement a 30-lag autocorrelation structure as:

acf = acs(id = 'paretoII',
          t = 0:30,
          scale = 1,
          shape = .75)

ggamma_sim <- generateTS(n = no,
                         margdist = marginaldist, 
                         margarg = param,
                         acsvalue = acf)
par(mfrow = c(1, 1))
plot(ggamma_sim, main = "")

Note that in this case we use only 2 parameters for autocorrelation and not 30, which increases our simulation parsimony. More details about the parametric autocorrelation structures can be found in section 3.2 in Papalexiou (2018).

Except for the four autocorrelation functions provided by acs(), we can also create the our own:

my_acf <- exp(seq(0, -2, -0.1))
ggamma_sim <- generateTS(n = no,
                         margdist = marginaldist, 
                         margarg = param,
                         acsvalue = my_acf)
par(mfrow = c(1, 1))
plot(ggamma_sim, main = "")
par(mfrow = c(1, 2))
acf(ggamma_sim[[1]])

1.4 Intermittency

If we wish to simulate an intermittent process (e.g., precipitation), then we have to define the probability zero. For example, if p0 = 0.9, then generated data shall have 90% of zero values (dry days). In addition, we can simulate multiple processes at once by using argument TSn = 5.

prob_zero <- .9
ggamma_sim <- generateTS(n = no, 
                         margdist = marginaldist,
                         margarg = param,
                         acsvalue = acf,
                         p0 = prob_zero,
                         TSn = 5)
par(mfrow = c(1, 1))
plot(ggamma_sim, main = "")

1.5 Validating the results

The efficiency of the generated time series can be validated with checkTS(). The first three moments, probability zero and the first three autocorrelation coefficients are printed.

checkTS(ggamma_sim)

2 Stochastic Simulation of real-world processes

2.1 Hourly precipitation

In practice, we usually want to simulate a natural process using some sampled time series. A typical hydroclimatic variable, also available in CoSMoS package is daily precipitation.

data('precip')
plot(precip, type = 'l')
par(mfrow = c(1, 2))
plot(density(precip$value), main = "", xlim = c(0, 10)) #Does not plot extreme values
acf(precip$value, main = "", lag.max = 20)

To generate a synthetic time series with similar characteristics to the observed precipitation, we have to determine their marginal distribution, autocorrelation structure and probability zero for each individual month. This can be done by trying to fit various distributions and autocorrelation structures with analyzeTS() and then check the goodness-of-fit with reportTS().

precip_ggamma <- analyzeTS(TS = precip, 
                                   season = 'month',
                                   dist = 'ggamma',
                                   acsID = "weibull",
                                   lag.max = 12)
reportTS(precip_ggamma, 'dist')
reportTS(precip_ggamma, 'acs')
reportTS(precip_ggamma, 'stat')

In this example, the Generalized Gamma distribution, with a Weibull auto-correlation structure, appears to fit well to the observered time series. Other combinations might not give so good results. For instance:

precip_pareto <- analyzeTS(TS = precip, 
                                   season = 'month',
                                   dist = 'paretoII',
                                   acsID = 'fgn',
                                   lag.max = 12)
reportTS(precip_pareto, 'dist')
reportTS(precip_pareto, 'acs')

When the marginal distribution and the autocorrelation structure with the best fit are determined, we can produce a synthetic time series with the same statistical properties with simulateTS() for a given period.

sim_precip <- simulateTS(aTS = precip_ggamma, 
                         from = as.POSIXct('1978-12-01 00:00:00'),
                         to = as.POSIXct('2008-12-01 00:00:00'))
dta <- precip
dta[, id := 'observed']
sim_precip[, id := 'simulated']

dta <- rbind(dta, sim_precip)

ggplot(dta) +
  geom_line(aes(x = date, y = value)) +
  facet_wrap(~id, ncol = 1) +
  theme_classic()

2.2 Daily streamflow

We can also apply the same methods to a different dataset - daily streamflow! For this example we have used random dataset from the MOPEX database (https://www.nws.noaa.gov/oh/mopex/mo_datasets.htm).

id <- '02135000'
dta_raw <- as.data.table(read.fwf(sprintf(
  'ftp://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/Us_438_Daily/%s.dly', id), 
                                  widths = c(8,10,10,10,10,10),
                                  col.names = c('date', 'P', 'E', 'value', 'Tmax', 'Tmin')))

dta_raw[, date := as.POSIXct(gsub(' ','0', date), format = '%Y%m%d')]

daily_streamflow <- dta_raw[value >= 0, .(date, value)]

str_burr <- analyzeTS(daily_streamflow,
                      dist = 'ggamma',
                      norm = 'N4',
                      acsID = 'paretoII',
                      lag.max = 50)

sim_str <- simulateTS(str_burr)

dta <- daily_streamflow
dta[, id := 'observed']
sim_str[, id := 'simulated']

dta <- rbind(dta, sim_str)

ggplot(dta) +
  geom_line(aes(x = date, y = value)) +
  facet_wrap(~id, ncol = 1) +
  theme_classic()

Recomended distributions for variables:

3 Available distributions in CoSMoS package

Package CoSMoS supports majority of continuous distributions available in R with defined quantile and probability density functions. We also provide four widely used distributions - Burr type III and XII, Generalized gamma and Pareto type II distribution.

3.1 Burr type III distribution

distribution argument/name - burrIII
parameters - list(scale, shape1, shape2)

\[f_{\text{BrIII}}(x)=\frac{\left(\frac{x}{\beta }\right)^{-\frac{1}{\gamma _2}-1} \left(\frac{\left(\frac{x}{\beta }\right)^{-\frac{1}{\gamma _2}}}{\gamma _1}+1\right){}^{-\gamma _1 \gamma _2-1}}{\beta } \\ F_{\text{BrIII}}(x)=\left(\frac{\left(\frac{x}{\beta }\right)^{-\frac{1}{\gamma _2}}}{\gamma _1}+1\right){}^{-\gamma _1 \gamma _2} \\ Q_{\text{BrIII}}(u)=\beta \left(\gamma _1 \left(u^{-\frac{1}{\gamma _1 \gamma _2}}-1\right)\right){}^{-\gamma _2} \\ m_{\text{BrIII}}(q)=\frac{\beta ^q \gamma _1^{\gamma _2 (-q)} \Gamma \left(\left(q+\gamma _1\right) \gamma _2\right) \Gamma \left(1-q \gamma _2\right)}{\Gamma \left(\gamma _1 \gamma _2\right)}\]

3.2 Burr type XII distribution

distribution argument/name - burrXII
parameters - list(scale, shape1, shape2)

\[f_{\text{BrXII}}(x)=\frac{\left(\frac{x}{\beta }\right)^{\gamma _1-1} \left(\gamma _2 \left(\frac{x}{\beta }\right)^{\gamma _1}+1\right){}^{-\frac{1}{\gamma _1 \gamma _2}-1}}{\beta } \\ F_{\text{BrXII}}(x)=1-\left(\gamma _2 \left(\frac{x}{\beta }\right)^{\gamma _1}+1\right){}^{-\frac{1}{\gamma _1 \gamma _2}} \\ Q_{\text{BrXII}}(u)=\beta \left(-\frac{1-(1-u)^{-\gamma _1 \gamma _2}}{\gamma _2}\right){}^{\frac{1}{\gamma _1}} \\ m_{\text{BrXII}}(q)=\frac{\beta ^q \gamma _2^{-\frac{q}{\gamma _1}-1} B\left(\frac{q+\gamma _1}{\gamma _1},\frac{1-q \gamma _2}{\gamma _1 \gamma _2}\right)}{\gamma _1}\]

3.3 Generalized gamma distribution

distribution argument/name - ggamma
parameters - list(scale, shape1, shape2)

\[f_{\mathcal{G}\mathcal{G}}(x)=\frac{\gamma _2 e^{-\left(\frac{x}{\beta }\right)^{\gamma _2}} \left(\frac{x}{\beta }\right)^{\gamma _1-1}}{\beta \Gamma \left(\frac{\gamma _1}{\gamma _2}\right)} \\ F_{\mathcal{G}\mathcal{G}}(x)=Q\left(\frac{\gamma _1}{\gamma _2},0,\left(\frac{x}{\beta }\right)^{\gamma _2}\right) \\ Q_{\mathcal{G}\mathcal{G}}(u)=\beta Q^{-1}\left(\frac{\gamma _1}{\gamma _2},0,u\right){}^{\frac{1}{\gamma _2}} \\ m_{\mathcal{G}\mathcal{G}}(q)=\frac{\beta ^q \Gamma \left(\frac{q}{\gamma _2}+\frac{\gamma _1}{\gamma _2}\right)}{\Gamma \left(\frac{\gamma _1}{\gamma _2}\right)}\]

3.4 Paretto type II distribution

distribution argument/name - burrIII
parameters - list(scale, shape)

\[f_{\text{PII}}(x)=\frac{\left(\frac{\gamma x}{\beta }+1\right)^{-\frac{1}{\gamma }-1}}{\beta } \\ F_{\text{PII}}(x)=1-\left(\frac{\gamma x}{\beta }+1\right)^{-1/\gamma } \\ Q_{\text{PII}}(u)=\frac{\beta \left((1-u)^{-\gamma }-1\right)}{\gamma } \\ m_{\text{PII}}(q)=\frac{\Gamma (q+1) \left(\frac{\beta }{\gamma }\right)^q \Gamma \left(\frac{1}{\gamma }-q\right)}{\Gamma \left(\frac{1}{\gamma }\right)}\]

4 References

Papalexiou, S.M., 2018. Unified theory for stochastic modelling of hydroclimatic processes: Preserving marginal distributions, correlation structures, and intermittency. Advances in Water Resources 115, 234-252. link