K-Medoids Distance-Based Clustering

Weksi Budiaji

2018-02-09

1. Introduction

The kmed package was designed for k-medoids-based clustering analysis. Its features include distance computation for numerical, binary, categorical, and mixed variables, a k-medoids algorithm, and bootstrap evaluation of the clustering result.

2. Distance Computation

A. Numerical variables

There are four distance options for numerical variables, i.e. manhattan weighted by rank (mrw), squared euclidean weighted by rank (ser), squared euclidean weighted by variance (sev), and squared euclidean unweighted (se). The distNumeric function computes the desired distance via its method argument.
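
The matrix below can be reproduced along these lines (a sketch; the distNumeric call with a pair of data matrices and a method argument is assumed from the package documentation):

```r
library(kmed)

# numerical variables of the iris data
num <- as.matrix(iris[, 1:4])

# manhattan weighted by rank; "ser", "sev", or "se" select the other options
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6, 1:6]
```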

##            [,1]      [,2]      [,3]      [,4]       [,5]      [,6]
## [1,] 0.00000000 0.2638889 0.2530603 0.3225047 0.06944444 0.3841808
## [2,] 0.26388889 0.0000000 0.1558380 0.1419492 0.27777778 0.6480697
## [3,] 0.25306026 0.1558380 0.0000000 0.1033427 0.26694915 0.6372411
## [4,] 0.32250471 0.1419492 0.1033427 0.0000000 0.33639360 0.6727872
## [5,] 0.06944444 0.2777778 0.2669492 0.3363936 0.00000000 0.3702919
## [6,] 0.38418079 0.6480697 0.6372411 0.6727872 0.37029190 0.0000000

B. Binary and Categorical variables

Two distance options are available for binary or categorical variables, namely the simple matching (matching) and the co-occurrence (cooccur) distance.
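
A minimal sketch, assuming the matching and cooccur functions named above; the example data is made up:

```r
library(kmed)

set.seed(1)
# 7 objects measured on 3 categorical variables
a <- matrix(sample(1:2, 7 * 3, replace = TRUE), 7, 3)

matching(a, a)  # simple matching distance
cooccur(a)      # co-occurrence distance
```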

##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.0000000 0.3333333 0.6666667 0.6666667 0.6666667 0.3333333 0.6666667
## [2,] 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000
## [3,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [4,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [5,] 0.6666667 0.3333333 0.6666667 0.6666667 0.0000000 1.0000000 0.6666667
## [6,] 0.3333333 0.6666667 0.3333333 0.3333333 1.0000000 0.0000000 0.3333333
## [7,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.0000000 0.4500000 0.7916667 0.7916667 0.7000000 0.5416667 0.7916667
## [2,] 0.4500000 0.0000000 1.2416667 1.2416667 0.2500000 0.9916667 1.2416667
## [3,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [4,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [5,] 0.7000000 0.2500000 0.9916667 0.9916667 0.0000000 1.2416667 0.9916667
## [6,] 0.5416667 0.9916667 0.2500000 0.2500000 1.2416667 0.0000000 0.2500000
## [7,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000

C. Mixed variables

There are five distances for mixed-variable data, i.e. Gower (gower), Wishart (wishart), Podani (podani), Huang (huang), and Harikumar-PV (harikumar).
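
A sketch of computing these distances, assuming a distmix function whose idnum, idbin, and idcat arguments indicate the column types (these argument names are an assumption, and the data below is made up):

```r
library(kmed)

set.seed(1)
num <- matrix(rnorm(7 * 2), 7, 2)                        # numerical
bin <- matrix(sample(1:2, 7 * 2, replace = TRUE), 7, 2)  # binary
cat <- matrix(sample(1:3, 7 * 2, replace = TRUE), 7, 2)  # categorical
mixed <- cbind(num, bin, cat)

# gower distance; "wishart", "podani", "huang", or "harikumar"
# select the other options
distmix(mixed, method = "gower", idnum = 1:2, idbin = 3:4, idcat = 5:6)
```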

##           1         2         3         4         5         6         7
## 1 0.0000000 0.4228395 0.5648148 0.4799383 0.5817901 0.3966049 0.5262346
## 2 0.4228395 0.0000000 0.6358025 0.5262346 0.3101852 0.7083333 0.6466049
## 3 0.5648148 0.6358025 0.0000000 0.1929012 0.5632716 0.6280864 0.3996914
## 4 0.4799383 0.5262346 0.1929012 0.0000000 0.5895062 0.4876543 0.3981481
## 5 0.5817901 0.3101852 0.5632716 0.5895062 0.0000000 0.8425926 0.4135802
## 6 0.3966049 0.7083333 0.6280864 0.4876543 0.8425926 0.0000000 0.7006173
## 7 0.5262346 0.6466049 0.3996914 0.3981481 0.4135802 0.7006173 0.0000000
##           1         2         3         4         5        6         7
## 1 0.0000000 0.6956580 0.8514500 0.9512342 0.5817152 1.212672 0.8039374
## 2 0.6956580 0.0000000 0.7345406 0.6560616 0.6778587 2.454716 0.8768878
## 3 0.8514500 0.7345406 0.0000000 0.4346564 0.8401231 2.804699 0.4706517
## 4 0.9512342 0.6560616 0.4346564 0.0000000 1.0477822 2.193826 0.5181166
## 5 0.5817152 0.6778587 0.8401231 1.0477822 0.0000000 1.668443 0.6046386
## 6 1.2126721 2.4547164 2.8046989 2.1938258 1.6684434 0.000000 2.3092201
## 7 0.8039374 0.8768878 0.4706517 0.5181166 0.6046386 2.309220 0.0000000

3. K-medoids algorithms

Several k-medoids algorithms exist; partitioning around medoids (PAM), for example, is available in the cluster package. At the moment, the algorithm available in the kmed package is the fast k-medoids algorithm of Park and Jun, implemented as fastkmed.

result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
table(result$cluster, iris[,5])
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         39         3
##   3      0         11        47

4. Bootstrap evaluation

A. Bootstrap replicate matrix

To evaluate the clustering algorithm, a bootstrap evaluation can be run by creating a function. This function's input arguments must be only a distance matrix and a number of clusters, and its output must be only the cluster membership.
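
As a sketch, a wrapper around fastkmed satisfying these requirements could be passed to the package's bootstrap function (here assumed to be clustboot, with argument names as an assumption):

```r
# input: a distance matrix and the number of clusters;
# output: the membership vector only
parkboot <- function(x, nclust) {
  res <- fastkmed(x, nclust, iterate = 50)
  return(res$cluster)
}

# bootstrap replicates of the iris mrw distance matrix
irisboot <- clustboot(mrwdist, nclust = 3, parkboot, nboot = 25)
irisboot[1:5, 1:10]
```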

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    1    0    1    3    0    3    1     1
## [2,]    3    0    1    1    1    3    0    0    1     1
## [3,]    0    0    0    0    1    3    3    0    0     1
## [4,]    3    0    1    0    0    3    3    0    0     1
## [5,]    0    3    0    1    1    0    3    3    1     1

We can change the algorithm, for example to kmeans from the stats package.
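
A corresponding wrapper for kmeans could look like this (a sketch; since kmeans works on the raw data rather than a distance matrix, the data matrix is passed instead, and the diss argument is an assumption):

```r
kmboot <- function(x, nclust) {
  res <- kmeans(x, nclust)
  return(res$cluster)
}

kmboot.rep <- clustboot(iris[, 1:4], nclust = 3, kmboot, nboot = 25,
                        diss = FALSE)
kmboot.rep[1:5, 1:10]
```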

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    2    0    3    3    0    1    2    2     0
## [2,]    1    0    3    3    3    2    1    2    2     2
## [3,]    1    2    3    3    3    2    1    2    2     2
## [4,]    0    2    0    3    0    2    1    2    0     2
## [5,]    1    0    3    0    0    1    1    2    0     0

B. Consensus matrix

A consensus matrix (n x n) can be produced from a bootstrap replicate matrix. The reorder input is a function to reorder the objects in the consensus matrix; its input arguments must be only a distance matrix and a number of clusters, and its output must be only the cluster membership. The consensus matrix can be visualized directly as a heatmap.
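
A sketch, assuming a consensusmatrix function and a bootstrap replicate matrix bootrep from the previous subsection (both names are assumptions); hierarchical clustering with ward linkage is used to reorder the objects:

```r
# reorder function: input a distance matrix and the number of clusters,
# output the membership vector only
wardorder <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "ward.D2")
  return(cutree(res, nclust))
}

consensusiris <- consensusmatrix(bootrep, nclust = 3, reorder = wardorder)
```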

##   1 1 1 1 1         2         2         2         2         2         2
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0 0 0 0 0 1.0000000 0.8750000 0.9310345 0.6296296 0.9047619 0.7333333
## 2 0 0 0 0 0 0.8750000 1.0000000 0.8095238 0.9500000 1.0000000 0.6000000
## 2 0 0 0 0 0 0.9310345 0.8095238 1.0000000 0.5454545 0.8500000 0.7586207
## 2 0 0 0 0 0 0.6296296 0.9500000 0.5454545 1.0000000 0.8333333 0.3200000
## 2 0 0 0 0 0 0.9047619 1.0000000 0.8500000 0.8333333 1.0000000 0.6666667
## 2 0 0 0 0 0 0.7333333 0.6000000 0.7586207 0.3200000 0.6666667 1.0000000
## 2 0 0 0 0 0 0.7272727 0.6000000 0.6875000 0.3684211 0.7500000 1.0000000
## 2 0 0 0 0 0 0.9354839 0.8333333 0.9200000 0.5238095 0.8800000 0.8064516
## 2 0 0 0 0 0 0.7407407 0.6190476 0.7000000 0.3500000 0.6842105 1.0000000
## 2 0 0 0 0 0 0.7083333 0.6000000 0.7272727 0.3000000 0.6500000 1.0000000
##           2         2         2         2
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.7272727 0.9354839 0.7407407 0.7083333
## 2 0.6000000 0.8333333 0.6190476 0.6000000
## 2 0.6875000 0.9200000 0.7000000 0.7272727
## 2 0.3684211 0.5238095 0.3500000 0.3000000
## 2 0.7500000 0.8800000 0.6842105 0.6500000
## 2 1.0000000 0.8064516 1.0000000 1.0000000
## 2 1.0000000 0.8421053 1.0000000 1.0000000
## 2 0.8421053 1.0000000 0.8214286 0.8000000
## 2 1.0000000 0.8214286 1.0000000 1.0000000
## 2 1.0000000 0.8000000 1.0000000 1.0000000

C. Visualization (Heatmap)

To produce a heatmap of the consensus matrix, clustheatmap can be applied to it. Here, the consensus matrix heatmap of the iris data clustered by the Park and Jun algorithm is produced.
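
A sketch, assuming clustheatmap takes the consensus matrix (here a hypothetical consensusiris object built as in the previous subsection) and an optional title:

```r
clustheatmap(consensusiris, "Consensus matrix of the iris data (Park and Jun)")
```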

We can also create a heatmap of the kmeans clustering of the iris data.