rEDM: an R package for Empirical Dynamic Modeling and Convergent Cross-Mapping

Hao Ye

Adam Clark

Ethan Deyle

George Sugihara

2016-11-15

Introduction

Empirical dynamic modeling (EDM) is an emerging non-parametric framework for modeling nonlinear dynamic systems. In an ecological context, EDM has numerous applications, including forecasting population abundances, unraveling species interactions, and identifying causal drivers. In contrast to the conventional approach of fitting assumed model equations to data, EDM relies on the fact that ecosystems have dynamics that allows us to reconstruct attractors directly from time series. This approach (with minimal assumptions) makes EDM particularly suitable for studying ecosystems, which exhibit non-equilibrium dynamics (problematic for models that typically assume stable equilibria) and state-dependent behavior (interactions that change with system state). This guide is designed to introduce both the theory behind EDM, as well as providing practical examples for using the rEDM software package.

Logistics

In this workshop, we will be exclusively using the statistical language R.

There are numerous reasons for choosing a language, but the primary advantages of R are that it is freely available and open-source, and with much added functionality (by way of R’s package system) that facilitates data analysis, modeling, and statistics. These traits make it well-suited for Ecology, as evinced by several popular books on R aimed at Ecologists and at least 5 other R-based workshops at this conference.

Installing R

R can be obtained from the main CRAN website. In nearly all cases, you can simply download the appropriate installer for your operating system and run it to install R on your system. This installation includes 3 main components:

Installing RStudio

The R console is usable by itself, but many people find the interface of an Integrated Development Environment (IDE) to be more convenient. RStudio is one of the more popular ones for R, with many added features for project management, version control, etc. Although all of our code examples will run fine in base R, if you are new to R or do not yet have a preferred IDE, we recommend you give RStudio a try.

R packages and Installing Rcpp

R’s basic functionality is extended through packages. These extend the base functionality of R with additional functions, datasets, etc. Many packages are distributed through the Comprehensive R Archive Network (CRAN), though there are other, often more specialized, archives. For example, Bioconductor primarily focuses on high-throughput genomic data, but also distributes other packages. Many packages are also available independently on code-sharing sites, such as GitHub.

R packages can be installed manually by downloading the package file and using the R interface with the install.packages function. However, both the R console application and RStudio also have menu options to manage packages. Here, we will first install the Rcpp package, which allows R to interface with C++ code, and is required by rEDM, our R package for Empirical Dynamic Modeling.

Installing rEDM

rEDM is an Rcpp package, and contains both C++ and R code. Because the C++ code needs to be compiled prior to usage, we present several different options for obtaining and installing the rEDM package, depending on your preference.

Binary Version

Most users will want the precompiled binary version. This can be downloaded from Github here. Please be sure to download the latest version (0.3.1 as of October 17, 2015). Clicking on the “view the full file” link should initiate a downlaod of the .tgz or .zip file, which you can save anywhere you like. Note that you do not want to unpack the .tgz or .zip file, since R expects a single package file, rather than a folder.

To install the downloaded package, use the standard R command (install.packages) as below. This should open a dialog window where you can select the downloaded package file (.tgz or .zip) to install.

install.packages(file.choose(), type = "source", repos = NULL)

Source Version

For users who prefer to download and compile the source version of rEDM, a raw source version is available from Github here. Note that this approach requires a C++11 compiler, in addition to the Rcpp package. We have successfully tested this method of installation using both Rtools 3.1 for Windows and XCode 5.0+ for Macintosh.

An alternative option for users familiar with using Git with Rstudio projects is to clone the rEDM package directly.

A third option uses the install_github function from the devtools package to download and install directly from Github.

library(devtools)
install_github("ha0ye/rEDM")

Empirical Dynamic Modeling

Time Series as Observations of a Dynamic System

The essential concept behind Empirical Dynamic Modeling (EDM) is that time series can be viewed as projections of the behavior of a dynamic system. This framework requires only a few modest assumptions:

Projecting the system state to an axis then gives the value of the corresponding state variable, and sequential projections over time produce a time series. Different time series observed from the system can capture different state variables, but are more generally, some function of the system state and may convolve several different state variables.

Time Series Projection from Lorenz Attractor

Attractor Reconstruction / Takens’ Theorem

To reproduce this fundamental, geometric view of the system, one might suppose that time series of all the state variables are required. However, Takens’ Theorem (Takens 1981) states that a mathematically equivalent reconstruction can be created by substituting lags of a time series for the unknown or unobserved variables.

In other words, instead of representing a system state as being composed of multiple different state variables, we instead use a lagged-coordinate embedding: \[ \vec{x}_t = \langle x_t, x_{t-\tau}, \dots, x_{t-(E-1)\tau} \rangle \]

Attractor Reconstruction

If sufficient lags are used, the reconstruction preserves essential mathematical properties of the original system. For instance, the points will map one-to-one to actual system states, and nearest neighbors in the reconstruction correspond to similar system states and behave similarly in the near future.

Nearest Neighbor Forecasting using Simplex Projection

One application of the reconstructed attractor is prediction. This can be accomplished using nearest neighbor forecasting methods, because of the similar behavior of nearby points in the reconstruction. One such method is simplex projection (Sugihara and May 1990). Simplex Projection is implemented in rEDM as the function simplex, and can be used both for prediction or to identify the optimal embedding dimension by quantifying the forecast skill of reconstructions with different dimensionality.

Example

First, we load the data and look at the format. Because this dataset is part of the rEDM package, we need to first load the package into R, before we have access to its datasets and functions.

library(rEDM)
data(tentmap_del)
head(tentmap_del)
## [1] -0.0992003 -0.6012986  0.7998003 -0.7944096  0.7979992 -0.8195405

We can see that the data consists of just a single vector, containing the raw data (first-differences from a tentmap). Because the simplex function can accept a single vector as the input time series, we don’t need further processing of the data. Furthermore, because the data come from a discrete-time model, we can let many of the parameters be default values (e.g., \(\tau = 1\), \(\text{tp} = 1\)). The default values for the embedding dimension, \(E\), range from \(1\) to \(10\), and so the output will allow us to determine which embedding dimension best unfolds the attractor.

We need to specify what portions of the data to use for constructing the simplex projection model, and what portions to use for testing the forecast skill. By default, simplex will use leave-one-out cross-validation over the entire time series, but because the data contain no observational noise, and are particularly long, we’d like to be more conservative.

lib <- c(1, 100)
pred <- c(201, 500)

This will use the first 100 points (time = 1 to 100) in the time series as the “library” to construct the model, and 300 points (time = 201 to 500) as the “prediction set” to test the model.

Note that if the code detects any overlap in the lib and pred, it will enable leave-one-out cross-validation and return a warning message.

ts <- tentmap_del
simplex_output <- simplex(ts, lib, pred)

The results are a simple data.frame with columns for each of the model parameters and forecast statistics, and rows for each run of the model. In this case, there is one run for each value of \(E\), so we can simply plot \(E\) against \(\rho\), the correlation between observed and predicted values:

par(mar = c(4, 4, 1, 1), mgp = c(2.5, 1, 0))
plot(simplex_output$E, simplex_output$rho, type = "l", xlab = "Embedding Dimension (E)", 
    ylab = "Forecast Skill (rho)")

Prediction Decay

An important property of deterministic chaos is that nearby trajectories eventually diverge over time (the so-called “butterfly effect”). This means that prediction is primarily limited to short-term forecasts, because over the long-term, the system state may be viewed as essentially random. This property also differentiates nonlinear systems from the equilibrium view, where the system can be expected to settle around a stable point.

Example

We can test for this property by adjusting the tp parameter in the models, which determines how far into the future the model seeks to predict:

simplex_output <- simplex(ts, lib, pred, E = 2, tp = 1:10)

Here, we can simply plot \(tp\) against \(\rho\) to see how forecast accuracy changes as we predict further and further into the future:

par(mar = c(4, 4, 1, 1))
plot(simplex_output$tp, simplex_output$rho, type = "l", xlab = "Time to Prediction (tp)", 
    ylab = "Forecast Skill (rho)")

Identifying Nonlinearity

One concern is that many time series may show predictability even if they are purely stochastic, because they behave similarly to autocorrelated red noise. Luckily, there are additional tests that can be done to distinguish between red noise and deterministic behavior, by quantifying the degree of “nonlinearity” in the data.

Here, we use the S-map forecasting method, that is based on fitting local linear maps for prediction instead of the nearest-neighbor interpolation of simplex projection (Sugihara 1994). In addition to the parameters for simplex projection, S-map also contains a nonlinear tuning parameter, \(\theta\) that affects the weights associated with individual points when fitting the local linear map. When \(\theta = 0\), all weights are equal, and the S-map is identical to an autoregressive model; values of \(\theta\) above \(0\) give greater weight to nearby points in the state space, thereby accommodating nonlinear behavior by allowing the local linear map to vary in state-space. For autoregressive red noise, the linear model should perform better, because the S-map model can reduce observation error by averaging over many points instead of just the most nearby points.

Thus, varying \(\theta\) allows us to compare equivalent linear and nonlinear models as a test for nonlinear dynamics (after first using simplex projection to estimate the optimal embedding dimension for a time-series.)

Example

Following from the previous example, we set E = 2 based on the results from simplex projection. Again, note that we allow many of the parameters take on default values (e.g., \(\tau = 1\), \(\text{tp} = 1\)). If we had changed these for simplex projection, we would want to propagate them here. The default values for the nonlinear tuning parameter, \(\theta\), range from \(0\) to \(8\), and are suitable for our purposes.

Note also, that the default value for num_neighbors is 0. Typically, when using s_map to test for nonlinear behavior, we allow all points in the reconstruction to be used, subject only to the weighting based on distance. By using 0 for this parameter (an otherwise nonsensical value), the program will use all nearest neighbors.

smap_output <- s_map(ts, lib, pred, E = 2)

Again, the results are a simple data.frame with columns for each of the model parameters and forecast statistics, and rows for each run of the model. In this case, there is one run for each value of \(\theta\), so we can simply plot \(\theta\) against \(\rho\):

par(mar = c(4, 4, 1, 1), mgp = c(2.5, 1, 0))
plot(smap_output$theta, smap_output$rho, type = "l", xlab = "Nonlinearity (theta)", 
    ylab = "Forecast Skill (rho)")