roahd

Nicholas Tarabelloni

2021-11-03

1 Introduction

Package roahd (RObust Analysis of High dimensional Data) is an R package meant to gather recently proposed statistical methods to deal with the robust analysis of functional data (Francesca Ieva et al. 2019).

The package contains an implementation of quantitative methods, based on functional depths and indices, on top of which are built some graphical methods (functional boxplot and outliergram) useful to carry out an explorative analysis of a functional dataset and to robustify it by discarding shape and magnitude outliers.

Both univariate and multivariate functional data are supported in the package, whose functions have been implemented with a particular emphasis on computational efficiency, in order to allow the processing of high-dimensional dataset.

2 Representation of functional data

The package has been designed to work with a representation of functional data through dedicated, simple and handy S3 classes, namely fData for univariate functional data and mfData for multivariate functional data.

The use of S3 classes is exploited by suitable S3 methods implementing the statistical features of the package, that are able to dispatch the correct method call depending on the class of their functional data argument.

2.1 fData

S3 class fData implements a representation of univariate functional datasets. They are obtained once specifying, for each observation in the functional dataset, a set of measurements over a discrete grid, representing the dependent variable indexing the functional dataset (e.g. time).

In other words, if we denote by \(T = [t_0, t_1, \ldots, t_{P-1}]\) an evenly spaced grid (\(t_i - t_{i-1} = h > 0\)), and imagine to deal with a dataset \(D_{i,j} = X_i(t_j)\), \(\forall i = 1, \ldots, N\) and \(\forall j=0, \ldots, P-1\), the object fData is built starting from the grid and the values in the following way:

library( roahd )

# The number of observations in the functional dataset.
N = 5
# The number of points in the 1D grid where the functional data are measured.
P = 1e2
# The previous two variable names are used consistently throughout the tutorial
# and the package's documentation to indicate the number of observations and the
# grid size.

# The grid over which the functional dataset is defined
grid = seq( 0, 1, length.out = P )

# Creating the values of the functional dataset
Data = matrix( c( sin( 2 * pi * grid  ),
                  cos( 2 * pi * grid ),
                  4 * grid * ( 1 - grid ),
                  tan( grid ),
                  log( grid ) ),
              nrow = N, ncol = P, byrow = TRUE )

# Building an fData object
# The constructor takes a grid and a matrix-like structure for data values
# (see help for more details on how to use the constructor)
fD = fData( grid, Data )

# Inspecting the structure of an fData object
str( fD )
## List of 6
##  $ t0    : num 0
##  $ tP    : num 1
##  $ h     : num 0.0101
##  $ P     : int 100
##  $ N     : int 5
##  $ values: num [1:5, 1:100] 0 1 0 0 -Inf ...
##  - attr(*, "class")= chr "fData"
plot( fD, main = 'Univariate FD', xlab = 'time [s]', ylab = 'values', lwd = 2 )

Thus, an object fData is a list containing: the fields t0, tP, defining the starting and end point of the one dimensional grid of the fData object, the constant step size h and the number of grid points P and the field values defining the measurements of the dataset over the one dimensional grid.

2.2 mfData

An mfData object, instead, implements a multivariate functional dataset, i.e.  a collection of functions having more than one components, each one depending on the same variable. In practice, we deal with a discrete grid \([t_0, t_1, \ldots, t_{P-1}]\) and a dataset of \(N\) elements, each one having \(L\) components observed over the same discrete grid: \(D_{i,j,k} = X_{i,k}(t_{j})\), \(\forall i = 1, \ldots, N\), \(\forall j = 0, \ldots, P - 1\) and \(\forall k = 1, \ldots, L\).

# Creating some values for first component of the dataset
Data_1 = t( sapply( runif( 10, 0, 4 ), 
                    function( phase ) sin( 2 * pi * grid + phase ) ) )

# Creating some values of functions for  
Data_2 = t( sapply( runif( 10, 0, 4 ), 
                    function( phase ) log( grid + phase ) ) )

# Building an fData object
# The constructor takes a grid and a list of matrix-like structures for data values,
# each one representing the data values of a single component of the dataset 
# (i.e. D_{,,k}, k = 1, ... L ).
# (see help for more details on how to use the constructor)
mfD = mfData( grid, list( Data_1, Data_2 ) )

str( mfD )
## List of 6
##  $ N     : int 10
##  $ L     : int 2
##  $ P     : int 100
##  $ t0    : num 0
##  $ tP    : num 1
##  $ fDList:List of 2
##   ..$ :List of 6
##   .. ..$ t0    : num 0
##   .. ..$ tP    : num 1
##   .. ..$ h     : num 0.0101
##   .. ..$ P     : int 100
##   .. ..$ N     : int 10
##   .. ..$ values: num [1:10, 1:100] 0.852 0.843 0.172 0.374 0.881 ...
##   .. ..- attr(*, "class")= chr "fData"
##   ..$ :List of 6
##   .. ..$ t0    : num 0
##   .. ..$ tP    : num 1
##   .. ..$ h     : num 0.0101
##   .. ..$ P     : int 100
##   .. ..$ N     : int 10
##   .. ..$ values: num [1:10, 1:100] 0.32 -0.117 0.421 1.013 -0.581 ...
##   .. ..- attr(*, "class")= chr "fData"
##  - attr(*, "class")= chr "mfData"
# Each component of the mfData object is an fData object 
sapply( mfD$fDList, class )
## [1] "fData" "fData"
plot( mfD, lwd = 2, main = 'Multivariate FD',
      xlab = 'time', ylab = list( 'Values 1', 'Values 2' ))

The fact that mfData components are fData objects is indeed conceptually very natural, but also allows for a seamless application of S3 methods meant for fData on multivariate functional data components, making the exploration and manipulation of multivariate datasets rather easy:

plot( mfD$fDList[[ 1 ]], main = 'First component', 
      xlab = 'time', ylab = 'Values', lwd = 2 )

Moreover, mfData objects can be obtained also from a set of homogeneous fData objects, i.e. of equal sample size and defined on the same grid:

fD_1 = fData( grid, Data_1 )

fD_2 = fData( grid, Data_2 )

mfD = as.mfData( list( fD_1, fD_2 ) ) 

mfD = as.mfData( lapply( 1 : 10, function( i )( fD_1 ) ) )

2.3 Subsetting fData

fData objects can be subset using a suitably overloaded operator [.fData, that allows for the use of standard slices of matrix and array classes also for fData.

# Subsetting fData and returning result in matrix form
fD[ 1 , 1, as_fData = FALSE ]

fD[ 1, , as_fData = FALSE ]

fD[ 2, 10 : 20, as_fData = FALSE ]

fD[ , 10, as_fData = FALSE ]

# As default behavior the subset is returned in fData form
oldpar <- par(mfrow = c(1, 1))
par(mfrow = c(1, 2))
plot(fD, main = "Original dataset", lwd = 2)
plot(fD[, 1:20], main = "Zooming in", lwd = 2)
par(oldpar)

2.4 Algebra

An algebra of fData objects is also implemented, making it easy to sum, subtract, multiply and divide these objects by meaningful and compliant structures.

Sums and subtractions, available through + and - operators (see help at +-.fData), allow to sum an fData on the left hand side and a compliant structure on the right hand side. This can be either another fData of the same sample size and defined over the same grid, or a 1D/2D data structure with a number of columns equal fData’s grid length (i.e. P), and number of rows equal to fData’s sample size (i.e. N) or equal to one (in this case the only observation available is recycled N times). The operations are then performed element-wise between the lhs and rhs.

fD + fD

fD + matrix( 1, nrow = N, ncol = P )

fD + array( 2, dim = c( N, P ) )

fD + 1 : P 

Multiplication and division, instead, is implemented only for an fData left hand side and a numeric variable or numeric vector right hand side. In the first case, each function in the functional dataset is multiplied/divided by the specified quantity; in the second case, specifying a vector of length N, the multiplication/division of each functional observation is carried out by the corresponding quantity in the vector, in an element-wise way:

fD * 2

fD / 3

fD * ( 1 : N )

fD / ( 1 : N )

2.5 Visualization

fData and mfData objects can be visualized thanks to specific S3 plotting methods, plot.fData and plot.mfData.

The graphical parameters of these functions have been suitably customized in order to enhance the visualization of functions. In particular, elements are plotted by default with continuous lines and an ad-hoc palette that helps differentiating them. As default x- and y-axis labels, as well as titles, are dropped so that plot’s arguments calls are not displayed when no value is provided.

In case of mfData the graphical window is split into a rectangular lattice to plot single dimensions. The rectangular frame has floor( sqrt( L ) ) rows and ceiling( x$L / floor( sqrt( x$L ) ) ) columns. If custom labels/titles are desired, they must be provided in the following way: since the grid is the same for all the dimensions, just one string is expected for x-axis (e.g. xlab = 'grid'), while either a single string or a list of L strings (one for each dimension) is expected for both the y-axis label(s) and title(s). In case just one string is passed to plot.mfData, the same value is used for all the dimensions.

3 Statistics

3.1 Depths

roahd provides a number of depth definitions, that are exploited by the visualization functions but can also be used by themselves. These are based on the notion of Band Depth (López-Pintado and Romo 2007, 2009).

Band Depth and Modified Band Depth are implemented in the functions BD and MBD. Both of them can work with either an fData object, specifying the univariate functional dataset whose depths must be computed, or a matrix of data values (e.g. in the form of fData$values output). MBD can be called with the additional parameters manage_ties (defaulting to FALSE), specifying whether a check for the presence of tied data must be made prior to computing depths, and therefore a suitable computing strategy must be used. The implementation of MBD exploits the recommendations of (Sun, Genton, and Nychka 2012), but extends them in order to accommodate for the possible presence of ties.

  BD( fD )
  BD( fD$values )
  
  MBD( fD )
  MBD( fD$values )
  
  MBD( fD, manage_ties = TRUE )
  MBD( fD$values, manage_ties = TRUE )

Another definition available is the Half Region Depth (along with its modified version), that is built on top of the Epigraph and Hypograph indexes (see López-Pintado and Romo 2011):

HRD( fD )
HRD( fD$values )

MHRD( fD )
MHRD( fD$values )

A generalization of MBD to multivariate functional data, implementing the ideas of F. Ieva and Paganoni (2013), is also available through the functions multiBD and multiMBD. These functions accept either a mfData object, specifying the multivariate functional dataset whose depths have to be computed, or a list of matrices of data values (see example below).

The function computes the BD or MBD for each component of the multivariate functional dataset and then averages them according to a set of weights. These can be specified in two ways: either with the flag uniform, that is turned into rep( 1/L, L ) (where L stands for the number of components in the multivariate dataset), or by providing the actual set of weights to be used. The latter option allows to use ad-hoc set of weights, like in (Tarabelloni et al. 2015).

  multiBD( mfD, weights = 'uniform' )
  multiMBD( mfD, weights = 'uniform', manage_ties = FALSE )

  multiBD( mfD, weights = c( 0.6, 0.4) )
  multiMBD( mfD, weights = c( 0.7, 0.3 ), manage_ties = FALSE )

  multiBD( list( fD_1$values, fD_2$values ), weights = c( 0.6, 0.4) )
  multiMBD( list( fD_1$values, fD_2$values ), weights = c( 0.7, 0.3 ), manage_ties = FALSE )

3.2 Mean and median

Suitable S3 extensions to the functions for the computation of mean function and median of functional datasets are also present.

The sample mean of a functional dataset coincides with the cross-sectional mean function, i.e. the function obtained by computing the mean across the whole dataset point-by-point along the grid where functional data are defined:

\[ \widehat{\mu}( t_j ) = \dfrac{1}{N} \sum_{i=1}^{N} X_i(t_j), \qquad \forall j = 0, \ldots, P-1 \]

for a univariate functional dataset, while in the multivariate case is:

\[ \widehat{\mu}_k(t_j) = \dfrac{1}{N} \sum_{i=1}^{N} X_{i,k}(t_j), \qquad \forall j = 0, \ldots, P-1, \qquad \forall k = 1, \ldots, L. \]

The sample median of a functional dataset, instead, is defined as the element of the functional dataset fulfilling the maximum depths, given a certain definition of depth. For instance, for MBD:

\[ \widehat{m}( t_j ) = \left(\text{arg}\max_{i=1, \ldots, N} MBD( X_i )\right)( t_j), \qquad j = 0, \ldots, P-1. \]

The sample mean is implemented in the S3 methods mean.fData( x, ... ) and mean.mfData( x, ... ), that can be invoked directly on fData and mfData objects. These methods can be called directly without the .fData or .mfData suffix, due to their S3 nature and call, that allows them to be dispatched from the standard mean function. The sample median, instead, is implemented in the functions median_fData( fData, type = 'MBD', ... ) and median_mfData( mfData, type = 'multiMBD', ... ). Here, the type flag can take any name of functions (available in the caller’s environment) that can be used to compute the depths defining the sample median, taking respectively a 2D matrix of data values and a list of 2D matrix of data values as argument (plus, optionally, the dots arguments).

  # Exploiting the S3 nature of these functions and the dispatching from the 
  # standard `mean` function
  mean( fD )
  mean( mfD)
  
  median_fData( fD, type = 'MBD' )
  median_fData( fD, type = 'MHRD' )

  oldpar <- par(mfrow = c(1, 1))
  par(mfrow = c(1, 2))
  
  plot(fD, main = "Mean", lwd = 2)
  plot(mean(fD), add = TRUE, lwd = 2, col = "black", lty = 2)
  
  plot(fD, main = "Median", lwd = 2)
  plot(median_fData(fD, type = "MBD"), add = TRUE, lwd = 2, lty = 2, col = "black")
  
  par(oldpar)

3.3 Covariance and cross-covariance functions

The computation of covariance functions and cross-covariance functions for univariate and multivariate functional datasets is provided through the function cov_fun( X, Y = NULL ).

Given a univariate functional dataset, \(X_1, X_2, \ldots, X_N\), defined over the grid \([t_0, t_1, \ldots, t_P] \subset I\), its covariance function (evaluated over the grid) is \(C(t_i,t_j) = Cov(X(t_i),X(t_j))\) for \(i,j=1,\ldots,P\). Given another univariate functional dataset \(Y_1, Y_2, \ldots, Y_N\), the cross-covariance function is \(C_{X,Y}(t_i,t_j) = Cov(X(t_i),Y(t_j))\).

Given a multivariate functional dataset with observations of \(L\) components, \(X_1, X_2,\ldots, X_N\), the covariance function has the following block structure: \(C^{k,l} = [ Cov(X_k(t_i),X_l(t_j))]_{i,j=1}^{P}\), for \(k,l = 1, \ldots, L\). Of course, it is \(C^{k,l} = [C^{l,k}]^T\). Analogously, the cross-covariance between two multivariate datasets is given by the cross covariances of the components.

When X is a univariate functional dataset, and if Y is NULL, cov_fun returns the sample covariance function of the functional dataset, defined over the tensorized grid where X is defined. If Y is a univariate functional dataset (in form of fData object), the method returns the cross-covariance function of X and Y.

When X is a multivariate dataset, Y can be NULL, an fData or a mfData object. In the first case the method returns the covariance function of X, in form of a list of only the upper-triangular blocks. The blocks are sorted in the list by row, therefore the first is the covariance of the first component, then the cross-covariance of the first component with the second, etc. In the second case, the method returns the list of cross-covariances between X’s components and Y. In the third case, the method returns the list of upper-triangular blocks of cross-covariances between X’s and Y’s components.

    # Simple covariance function
    C1 = cov_fun( fD )

    # Cross-covariance function of first and second component of mfD
    CC = cov_fun( mfD$fDList[[1]], mfD$fDList[[2]] )

    # Block-covariance function of mfD
    BC = cov_fun( mfD )

Each covariance function estimate (the elements of the list in the multivariate case, too) is returned as an instance of the S3 class Cov, that stores the values of the covariance matrix as well as the grid parameters.

plot.Cov, an S3 specialization of plot is available as plotting method for Cov objects. It is built around graphics::image, hence all the additional parameters of image can be used to customize it.

  plot( C1, main = 'Covariance function', xlab = 'time', ylab = 'time' )

3.4 Indexes

roahd collects the implementation of some useful indexes that can be used to describe and summarize functional datasets.

EI and MEI implement the Epigraph Index and the Modified Epigraph Index, while HI and MHI implement the Hypograph Index and the Modified Hypograph Index (see López-Pintado and Romo 2011; Arribas-Gil and Romo 2014). These indexes can be used to sort data in a top-down and bottom-up fashion, and are used to define the HRD/MHRD and to build the outliergram.

These S3 methods can be called on univariate functional datasets, provided either in form of a fData object or a 2D matrix of values.

  # Calling on fData objects
  EI( fD )
  MEI( fD )
  
  HI( fD )
  MHI( fD )

  # Calling on 2D matrix type objects
  EI( fD$values )
  EI( matrix( rnorm( 20 ), nrow = 4, ncol = 5 ) )

3.5 Correlation coefficients

When dealing with multivariate functional data, in particular in case of bivariate data, it is possible to compute correlation coefficients between data components that generalize the Kendall’s tau and Spearman’s coefficients (Valencia, Romo, and Lillo 2015a, 2015b).

The function cor_kendall( mfD, ordering = 'max' ) allows to compute the Kendall’s tau correlation coefficient between components of a bivariate dataset. The function accepts a mfData object and a criterion to perform the ordering of functional data (this ordering is used to determine the concordances and discordances between pairs in the definition of the coefficient).

Two criteria are available so far, that directly reflect those proposed in the reference paper: max, for the ordering between maxima of functions, and area, for the ordering between area-under-the-curve of functions.

  N = 10
  P = 1e3

  grid = seq( 0, 1, length.out = P )
  
  Data_1 = t( sapply( 1 : N, function( i )( sin( 2 * pi * grid ) + i ) ) )
 
  # Monotone nonlinear transformation of data
  Data_2 = Data_1^3
  
  mfD = mfData( grid, list( Data_1, Data_2 ) )
  
  plot( mfD, main = list( 'Comp. 1', 'Comp. 2') )  

  # Kendall correlation of monotonically dependent data is exactly 1
  cor_kendall( mfD, ordering = 'max' )
## [1] 1
  cor_kendall( mfD, ordering = 'area' )
## [1] 1

The function cor_spearman( mfD, ordering = 'MEI', ... ) can be used to compute the Spearman correlation coefficient for a bivariate mfData object, tuning the ordering policy specified by ordering (defaulting to MEI) to rank univariate components and then compute the correlation coefficient. Besides MEI, also MHI can be used to rank univariate components.

# Spearman correlation of monotonically dependent data is exactly 1
  cor_spearman( mfD, ordering = 'MEI' )
## [1] 1
  cor_spearman( mfD, ordering = 'MHI' )
## [1] 1

4 Simulation of functional data

roahd contains also some functions that can be used to simulate artificial data sets of functional data, both univariate and multivariate. These are used in the adjustment procedure of the outliergram and functional boxplot, but can also be used to help the development of new methodologies and help their testing.

Artificial univariate data are obtained simulating realizations of a gaussian process over a discrete grid with a specific covariance function and center (e.g. mean or median). Given a covariance function, \(C(s,t)\) and a centerline \(m(t)\), the model generating data is: \[X_i(t) = m(t) + \epsilon(t), \quad Cov(\epsilon(s),\epsilon(t)) = C(s,t), \quad i = 1, \ldots, N.\]

4.1 Univariate functional data

The function generate_gauss_fdata(N, centerline, Cov = NULL, CholCov = NULL) can be used to simulate a population of such gaussian functional data. The required arguments are: N the number of elements to be generated; centerline, the center of the distribution (mean or median); Cov, a matrix representation of the desired covariance function, intended as the measurements of such function over a tensor grid whose marginal must be the grid where the functional data will be defined (i.e. the argument grid in fData); CholCov the Cholesky factor of the discrete representation of the covariance function over the tensor grid (optional and alternative to the argument Cov). The inner procedure to generate the synthetic population of gaussian functional data makes use of the Cholesky factor of Cov, hence by providing its Cholesky factor, if already present in the caller’s scope, can save computing time.

A built-in function can be used to generate exponential-like covariance functions, namely exp_cov_function( grid, alpha, beta ), generating the discretized version of a covariance of the form \(C(s,t) = \alpha e^{-\beta | s - t | }\) over a lattice given by the tensorization of grid in grid.

A comprehensive example is the following:

  N = 50
  P = 1e3

  grid = seq( 0, 1, length.out = P )
  
  Cov = exp_cov_function( grid, alpha = 0.2, beta  = 0.3 )
  
  Data = generate_gauss_fdata( N, centerline = sin( 2 * pi * grid ), Cov = Cov )
  
  fD = fData( grid, Data )
  
  plot( fD, main = 'Gaussian fData', xlab = 'grid', lwd = 2)

4.2 Multivariate functional data

The function generate_gauss_mfdata( N, L, centerline, correlations, listCov = NULL, listCholCov = NULL) can be used to generate a gaussian dataset of multivariate functional data. The model generating data is the following:

\[ X_{i,k} = m_k(t) + \epsilon_k(t), \quad Cov(\epsilon_k(s),\epsilon_k(t))=C(s,t), \quad \forall i = 1, \ldots, N, \quad \forall k = 1, \ldots, L,\]

where \(Cor( \epsilon_j(t),\epsilon_l(t))=\rho_{j,l}\) specifies a (synchronous) correlation structure among the components of the functional dataset.

In order to use the function one should provide: N, the number of elements to simulate; L, the number of components of the multivariate functional data; centerline, a matrix containing (by rows) the centerline for each component; correlations, a vector of length 1/2 * L * ( L - 1 ), containing all the correlation coefficients among the components; either listCov or listCholCov, a list containing either the discretized covariance functions over the tensorized grid where functional data will be defined), or their Cholesky factor.

A comprehensive example is the following:

  N = 10
  P = 1e3

  grid = seq( 0, 1, length.out = P )
  
  Cov_1 = exp_cov_function( grid, alpha = 0.1, beta  = 0.5 )
  Cov_2 = exp_cov_function( grid, alpha = 0.5, beta = 0.1)
  
  centerline = matrix( c( sin( 2 * pi * grid ),
                          cos( 2 * pi * grid ) ), nrow = 2, byrow = TRUE )
  
  Data = generate_gauss_mfdata( N, 2, centerline, 0.8, list( Cov, Cov ) )
  
  mfD = mfData( grid, Data )
  
  plot( mfD, main = list( 'Comp.1', 'Comp. 2'), xlab = 'grid', lwd = 2)

5 Robustification

5.1 Functional boxplot

An implementation of the functional boxplot, through the S3 method fbplot, allows for the detection of amplitude outliers in univariate and multivariate functional datasets (see Sun and Genton 2011).

fbplot can be used to compute the set of indices of observations marking outlying signals. If used in graphical way (default behavior), it also plots the functional boxplot of the dataset under study

The functional boxplot is obtained by ranking functions from the center of the distribution outwards thanks to a depth definition, computing the region of 50% most central functions and inflating such region by a factor F. Any function crossing these boundaries is flagged as outlier. The default value for F is 1.5, otherwise it can be set with the argument Fvalue.

The argument Depths can take either the name of the function to call in order to compute the depths (default is MBD), or a vector containing the depth values for the provided dataset.

An example is:

  set.seed(1618)

  N = 1e2
  P = 1e2
  
  grid = seq( 0, 1, length.out = P )
  
  Cov = exp_cov_function( grid, alpha = 0.2, beta = 0.3 )
  
  Data = generate_gauss_fdata( N, sin( 2 * pi * grid ), Cov )
  
  fD = fData( grid, Data )
  
  Data = generate_gauss_mfdata( N, 2, matrix( sin( 2 * pi * grid ), nrow = 2, ncol = P, byrow = TRUE ), 0.6, listCov = list( Cov, Cov ) )

  mfD = mfData( grid, Data )
  
  fbplot( fD, main = 'Fbplot', Fvalue = 3.5 )

  fbplot( mfD, main = list( 'Comp. 1', 'Comp. 2' ), Fvalue = 3.5 )