Distance Measures

How to use distance()

The distance() function implemented in philentropy is able to compute 46 different distances/similarities between probability density functions (see ?philentropy::distance for details).

Simple Example

The distance() function follows the same logic as R’s base function stats::dist() and takes a matrix or data.frame as input. Each row of the matrix or data.frame should store one probability density function; distances are computed between these rows.

# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)

# combine P and Q as matrix object
x <- rbind(P,Q)

Please note that when defining a matrix from vectors, probability vectors should be combined as rows (rbind()).
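As a quick sanity check, the following sketch uses only base R (it does not call philentropy) to confirm that the combined object has one probability vector per row and that each row sums to one. The variable names rows_sum_to_one and all_non_negative are introduced here purely for illustration.

```r
# Rebuild the two example vectors and combine them row-wise.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
x <- rbind(P, Q)

# Two rows (one per probability vector) and ten columns (one per bin).
dim(x)

# Each row should be a valid probability vector:
# non-negative entries that sum to one (up to floating point tolerance).
rows_sum_to_one  <- isTRUE(all.equal(unname(rowSums(x)), c(1, 1)))
all_non_negative <- all(x >= 0)
rows_sum_to_one && all_non_negative
```

If cbind() were used instead, the object would have ten rows and two columns, and the rows would no longer be probability vectors.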

library(philentropy)

# compute the Euclidean Distance with default parameters
distance(x, method = "euclidean")
euclidean 
0.1280713

For this simple case you can compare the result with R’s base function stats::dist(), which also computes the Euclidean distance.

# compute the Euclidean Distance using R's base function
stats::dist(x, method = "euclidean")
          P
Q 0.1280713

However, the R base function stats::dist() only computes the following distance measures: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski", whereas distance() allows you to choose from 46 distance/similarity measures.

To find out which methods are implemented in distance() you can call the getDistMethods() function.

# names of implemented distance/similarity functions
getDistMethods()
 [1] "euclidean"         "manhattan"         "minkowski"         "chebyshev"        
 [5] "sorensen"          "gower"             "soergel"           "kulczynski_d"     
 [9] "canberra"          "lorentzian"        "intersection"      "non-intersection" 
[13] "wavehedges"        "czekanowski"       "motyka"            "kulczynski_s"     
[17] "tanimoto"          "ruzicka"           "inner_product"     "harmonic_mean"    
[21] "cosine"            "hassebrook"        "jaccard"           "dice"             
[25] "fidelity"          "bhattacharyya"     "hellinger"         "matusita"         
[29] "squared_chord"     "squared_euclidean" "pearson"           "neyman"           
[33] "squared_chi"       "prob_symm"         "divergence"        "clark"            
[37] "additive_symm"     "kullback-leibler"  "jeffreys"          "k_divergence"     
[41] "topsoe"            "jensen-shannon"    "jensen_difference" "taneja"           
[45] "kumar-johnson"     "avg"

Now you can choose whichever distance/similarity method suits your needs.

# compute the Jaccard Distance with default parameters
distance(x, method = "jaccard")
 jaccard 
0.133869

Analogously, when a probability matrix with more than two rows is passed, distance() returns a matrix of all pairwise distances.

# combine three probability vectors to a probability matrix
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))

# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean")
          vec.1      vec.2      vec.3
vec.1 0.0000000 0.12807130 0.13881717
vec.2 0.1280713 0.00000000 0.01074588
vec.3 0.1388172 0.01074588 0.00000000

This output differs from the output of stats::dist().

# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
# using stats::dist()
stats::dist(ProbMatrix, method = "euclidean")
           1          2
2 0.12807130           
3 0.13881717 0.01074588

Whereas distance() returns the full symmetric distance matrix, stats::dist() returns only its lower triangle.
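The two layouts carry the same information. As a minimal base R sketch (no philentropy call involved), the "dist" object can be expanded to a full symmetric matrix with as.matrix(), and a full matrix can be collapsed again with stats::as.dist(). The names d, full and lower are illustrative only.

```r
# Three probability vectors stored as rows.
ProbMatrix <- rbind(1:10 / sum(1:10), 20:29 / sum(20:29), 30:39 / sum(30:39))

# stats::dist() returns an object of class "dist" (lower triangle only).
d <- stats::dist(ProbMatrix, method = "euclidean")

# Expand it to a full symmetric matrix with a zero diagonal.
full <- as.matrix(d)
full
isSymmetric(full)

# Collapse a full symmetric matrix back to the lower-triangle form.
lower <- stats::as.dist(full)
lower
```

Row and column labels differ between the two functions, so compare the numeric values rather than the dimnames.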

Now let’s compare the run times of base R and philentropy. For this purpose you need to install the microbenchmark package.

# install.packages("microbenchmark")
library(microbenchmark)

microbenchmark(
  distance(x, method = "euclidean", test.na = FALSE),
  dist(x, method = "euclidean"),
  euclidean(x[1, ], x[2, ], FALSE)
)
Unit: microseconds
                                               expr    min      lq     mean  median      uq    max neval
 distance(x, method = "euclidean", test.na = FALSE) 26.518 28.3495 29.73174 29.2210 30.1025 62.096   100
                      dist(x, method = "euclidean") 11.073 12.9375 14.65223 14.3340 15.1710 65.130   100
                   euclidean(x[1, ], x[2, ], FALSE)  4.329  4.9605  5.72378  5.4815  6.1240 22.510   100

As you can see, although distance() is quite fast, its internal checks make it about 2x slower than the base dist() function (for the Euclidean example). If you need a faster computation of a particular distance measure, you can type philentropy:: and press TAB to select the underlying distance functions (written in C++) directly, e.g. philentropy::euclidean(), which is almost 3x faster than the base dist() function.

The advantage of distance() is that it implements 46 distance measures based on C++ functions that can each be accessed individually by typing philentropy:: and then TAB. In future versions of philentropy I will optimize distance() so that, including its internal checks for data type correctness and correct input data, it takes less computation time than the base dist() function.

Detailed assessment of individual similarity and distance metrics

The large number of available similarity metrics immediately raises the question of which metric should be used for which application. Here, I will review the origin of each individual metric and discuss the most recent literature that aims to compare these measures. I hope that users will find valuable insights and might be stimulated to conduct their own comparative research, since this is a field of ongoing research.

\(L_p\) Minkowski Family

Euclidean distance

The Euclidean distance (named after Euclid) is the straight-line distance between two points. Euclid argued that the shortest distance between two points is always a line.

\(d = \sqrt{\sum_{i = 1}^N | P_i - Q_i |^2}\)
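To connect the formula to the numbers shown earlier, the following base R sketch evaluates it term by term for the example vectors P and Q and cross-checks the result against stats::dist(). It does not call philentropy; d_euclid and d_base are illustrative names.

```r
# Example probability vectors from the simple example.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Direct transcription of the formula: square root of the summed
# squared absolute differences.
d_euclid <- sqrt(sum(abs(P - Q)^2))
d_euclid

# Cross-check against base R.
d_base <- as.numeric(stats::dist(rbind(P, Q), method = "euclidean"))
isTRUE(all.equal(d_euclid, d_base))
```

The value agrees with the 0.1280713 reported by distance(x, method = "euclidean") above.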

Manhattan distance

\(d = \sum_{i = 1}^N | P_i - Q_i |\)

Minkowski distance

\(d = ( \sum_{i = 1}^N | P_i - Q_i |^p)^{1/p}\)
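The Minkowski formula contains the two previous distances as special cases: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The sketch below illustrates this in base R. The helper minkowski_sketch() is a hypothetical function written for this example and is not part of philentropy.

```r
# Hypothetical helper: direct transcription of the Minkowski formula.
minkowski_sketch <- function(p_vec, q_vec, p) {
  sum(abs(p_vec - q_vec)^p)^(1 / p)
}

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

d_p1 <- minkowski_sketch(P, Q, p = 1)  # same as the Manhattan distance
d_p2 <- minkowski_sketch(P, Q, p = 2)  # same as the Euclidean distance
d_p3 <- minkowski_sketch(P, Q, p = 3)

# Compare with the explicit Manhattan and Euclidean formulas.
isTRUE(all.equal(d_p1, sum(abs(P - Q))))
isTRUE(all.equal(d_p2, sqrt(sum((P - Q)^2))))

# Cross-check p = 3 against base R.
d_p3_base <- as.numeric(stats::dist(rbind(P, Q), method = "minkowski", p = 3))
isTRUE(all.equal(d_p3, d_p3_base))
```

For a fixed pair of vectors the distance does not increase as p grows, so d_p1 >= d_p2 >= d_p3 here.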

Chebyshev distance

\(d = \max_{i} | P_i - Q_i |\)
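The Chebyshev distance is the limit of the Minkowski distance as p grows, because the largest absolute difference dominates the sum. A small base R sketch for the example vectors follows. Base R calls this measure "maximum" in stats::dist(), and the choice p = 100 is an arbitrary large value used only for illustration.

```r
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Chebyshev distance: the largest absolute coordinate difference.
d_cheb <- max(abs(P - Q))
d_cheb

# Cross-check against base R.
d_max_base <- as.numeric(stats::dist(rbind(P, Q), method = "maximum"))
isTRUE(all.equal(d_cheb, d_max_base))

# Minkowski distance with a large exponent approaches the Chebyshev
# distance from above.
d_p100 <- sum(abs(P - Q)^100)^(1 / 100)
d_p100 - d_cheb
```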