Abstract

The univariate Kolmogorov–Smirnov (KS) test is a non-parametric statistical test designed to assess whether two samples come from the same underlying distribution. The versatility of the KS test has made it a cornerstone of statistical analysis across the scientific disciplines. However, the test proposed by Kolmogorov and Smirnov does not naturally extend to multidimensional distributions. Here, we present the fasano.franceschini.test package, an R implementation of the 2-D KS two-sample test as defined by Fasano and Franceschini (1). The fasano.franceschini.test package provides three improvements over the current 2-D KS test on the Comprehensive R Archive Network (CRAN): (i) the Fasano and Franceschini test has been shown to run in \(O(n^2)\), versus the Peacock implementation, which runs in \(O(n^3)\); (ii) the package implements a procedure for handling ties in the data; and (iii) the package implements a parallelized permutation procedure for improved significance testing. Ultimately, the fasano.franceschini.test package provides a robust statistical test for analyzing random samples defined in two dimensions.

Introduction

The Kolmogorov–Smirnov (KS) test is a non-parametric, univariate statistical test designed to assess whether a set of data is consistent with a given probability distribution (or, in the two-sample case, whether the two samples come from the same underlying distribution). First derived by Kolmogorov and Smirnov in a series of papers (2–8), the one-sample KS test defines the distribution of the quantity \(D_{KS}\), the maximal absolute difference between the empirical cumulative distribution function (CDF) of a set of values and a reference probability distribution. Kolmogorov and Smirnov's key insight was proving that the distribution of \(D_{KS}\) is independent of the CDFs being tested. Thus, the test can effectively be used to compare any univariate empirical data distribution to any continuous univariate reference distribution. The two-sample KS test can further be used to compare any two univariate empirical data distributions against each other to determine whether they are drawn from the same underlying univariate distribution.
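Both forms of the univariate test are available in base R via stats::ks.test(); the following minimal example illustrates the one- and two-sample variants.

```r
# Univariate KS tests with base R's stats::ks.test()
set.seed(1)
x <- rnorm(100)                 # sample to test
y <- rnorm(100, mean = 0.5)     # second, shifted sample

# One-sample: compare x against a standard normal reference CDF
ks.test(x, "pnorm", mean = 0, sd = 1)

# Two-sample: test whether x and y share an underlying distribution
ks.test(x, y)
```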

The nonparametric versatility of the univariate KS test has made it a cornerstone of statistical analysis, and it is commonly used across the scientific disciplines (9–14). However, the KS test as proposed by Kolmogorov and Smirnov does not naturally extend to distributions in more than one dimension. Fortunately, a solution to the dimensionality issue was articulated by Peacock (15) and later extended by Fasano and Franceschini (1).

Currently, the only 2-D two-sample KS test available in R (16) is the Peacock implementation, provided by the Peacock.test package via the peacock2() function; however, this algorithm has been shown to be markedly slower than the Fasano and Franceschini algorithm (17). A C implementation of the Fasano–Franceschini test is available in (18); however, concerns have been raised regarding the validity of this implementation, as its test statistic is not distribution-free (19). Furthermore, in the C implementation, significance testing is based on a fit to Monte Carlo simulation that is only valid for significance levels \(\alpha \lessapprox 0.20\).
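For context, the Peacock implementation can be invoked as sketched below; this is a minimal sketch assuming that peacock2() accepts two matrices whose rows are 2-D observations and returns the test statistic.

```r
# Sketch of the Peacock 2-D two-sample test via the Peacock.test package
# (assumes peacock2() takes two two-column matrices and returns the statistic)
library(Peacock.test)

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)                # 100 points from N(0, I)
y <- matrix(rnorm(200, mean = 0.5), ncol = 2)    # 100 shifted points

# Note the O(n^3) runtime discussed above
peacock2(x, y)
```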

Here we present the fasano.franceschini.test package, an R implementation of the 2-D two-sample KS test described by Fasano and Franceschini (1). The fasano.franceschini.test package provides two improvements over the current 2-D KS test available on the Comprehensive R Archive Network (CRAN): (i) the Fasano and Franceschini test has been shown to run in \(O(n^2)\), versus the Peacock implementation, which runs in \(O(n^3)\); and (ii) the package implements a permutation procedure for improved significance testing, which mitigates the limitations of the test noted in (19).
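A minimal usage sketch follows; the argument names (nPermute for the number of permutations, threads for parallelization, seed for reproducibility) are assumptions based on the permutation and parallelization features described above, so the package documentation should be consulted for the authoritative interface.

```r
# Sketch of the package's two-sample interface (argument names such as
# nPermute, threads, and seed are assumptions; see the package docs)
library(fasano.franceschini.test)

set.seed(1)
S1 <- data.frame(x = rnorm(100), y = rnorm(100))
S2 <- data.frame(x = rnorm(100, mean = 0.5), y = rnorm(100))

# Permutation-based significance testing, parallelized across threads
fasano.franceschini.test(S1, S2, nPermute = 100, threads = 2, seed = 1)
```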

Models and software

1-D Kolmogorov–Smirnov Test

The Kolmogorov–Smirnov (KS) test is a non-parametric method for determining whether a sample is consistent with a given probability distribution (20). In one dimension, the Kolmogorov–Smirnov statistic (\(D_{KS}\)) is defined as the maximum absolute difference between the cumulative distribution functions of the data and the model (one-sample), or between those of the two data sets (two-sample), as illustrated in Figure 1.
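To make the definition concrete, the two-sample statistic can be computed directly from the empirical CDFs using base R's ecdf(): the difference of the two step functions changes only at pooled sample points, so the supremum is attained at one of them, and the result matches the statistic reported by ks.test().

```r
# Compute the two-sample D_KS by hand and check against stats::ks.test()
set.seed(1)
x <- rnorm(100)
y <- rnorm(100, mean = 0.5)

Fx <- ecdf(x)    # empirical CDF of the first sample
Fy <- ecdf(y)    # empirical CDF of the second sample

# The ECDF difference is a step function that changes only at the pooled
# sample points, so the maximum is attained at one of them
pooled <- sort(c(x, y))
D <- max(abs(Fx(pooled) - Fy(pooled)))

D                         # hand-computed statistic
ks.test(x, y)$statistic   # agrees with D
```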