K-Medoids Distance-Based Clustering

Weksi Budiaji

2018-08-07

1. Introduction

The kmed package is designed for k-medoids based clustering analysis. Its features include distance computation for numerical, binary/ categorical, and mixed variables (Section 2); several k-medoids algorithms (Section 3); and bootstrap evaluation of a clustering algorithm, with a consensus matrix and its heatmap visualization (Section 4).

2. Distance Computation

A. Numerical variables

For numerical variables, there are four distance options, namely Manhattan weighted by rank (mrw), squared Euclidean weighted by rank (ser), squared Euclidean weighted by variance (sev), and unweighted squared Euclidean (se). The distNumeric function provides a method argument in which the desired distance can be chosen. In this article, the Manhattan weighted by rank distance is applied to the popular iris data.
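A minimal sketch of the call that produces the matrix below (assuming the first four iris columns are coerced to a matrix and passed twice, since the distance is computed between the rows of two matrices):

library(kmed)

# numerical part of the iris data as a matrix
num <- as.matrix(iris[, 1:4])
# pairwise Manhattan weighted by rank distances
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6, 1:6]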

##            [,1]      [,2]      [,3]      [,4]       [,5]      [,6]
## [1,] 0.00000000 0.2638889 0.2530603 0.3225047 0.06944444 0.3841808
## [2,] 0.26388889 0.0000000 0.1558380 0.1419492 0.27777778 0.6480697
## [3,] 0.25306026 0.1558380 0.0000000 0.1033427 0.26694915 0.6372411
## [4,] 0.32250471 0.1419492 0.1033427 0.0000000 0.33639360 0.6727872
## [5,] 0.06944444 0.2777778 0.2669492 0.3363936 0.00000000 0.3702919
## [6,] 0.38418079 0.6480697 0.6372411 0.6727872 0.37029190 0.0000000

B. Binary and Categorical variables

For binary and categorical variables, on the other hand, two distance methods are available, i.e. the simple matching (matching) and co-occurrence (cooccur) distances. The matching function requires two matrices/ data frames, while cooccur needs only one matrix/ data frame.
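A sketch of both calls on a toy binary matrix (the 7 x 3 matrix coded 1/2 and the seed are assumptions; the data behind the outputs below are not shown, so the exact values may differ):

set.seed(1)
# 7 objects with 3 binary variables coded 1/2
bin <- matrix(sample(1:2, 7 * 3, replace = TRUE), 7, 3)
# simple matching distance: requires two matrices/ data frames
matching(bin, bin)
# co-occurrence distance: requires a single matrix/ data frame
cooccur(bin)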

##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.0000000 0.3333333 0.6666667 0.6666667 0.6666667 0.3333333 0.6666667
## [2,] 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000
## [3,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [4,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [5,] 0.6666667 0.3333333 0.6666667 0.6666667 0.0000000 1.0000000 0.6666667
## [6,] 0.3333333 0.6666667 0.3333333 0.3333333 1.0000000 0.0000000 0.3333333
## [7,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.0000000 0.4500000 0.7916667 0.7916667 0.7000000 0.5416667 0.7916667
## [2,] 0.4500000 0.0000000 1.2416667 1.2416667 0.2500000 0.9916667 1.2416667
## [3,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [4,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [5,] 0.7000000 0.2500000 0.9916667 0.9916667 0.0000000 1.2416667 0.9916667
## [6,] 0.5416667 0.9916667 0.2500000 0.2500000 1.2416667 0.0000000 0.2500000
## [7,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000

C. Mixed variables

There are six distance methods for mixed-variable data, namely Gower (gower), Wishart (wishart), Podani (podani), Huang (huang), Harikumar-PV (harikumar), and Ahmad and Dey (ahmad). The distmix function calculates mixed-variable distances; it requires the column ids of each class of variables. As an example, we create a 7 by 9 data frame that consists of three variables of each class. Then the Gower and Wishart distances are calculated.
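A sketch of the construction and the two distance calls (the binary and categorical values are copied from the printed data frame below; idnum, idbin, and idcat are assumed to be the arguments that take the column ids of the numerical, binary, and categorical variables):

# three numerical variables from iris plus three binary (1/2)
# and three categorical (1/2/3) variables
bin <- matrix(c(1,1,2,2,1,2,2,  2,2,1,1,1,2,1,  2,1,2,2,1,2,2), 7, 3)
catg <- matrix(c(1,2,1,1,2,1,2,  3,2,2,2,2,1,3,  3,3,1,3,2,3,2), 7, 3)
mixdata <- cbind(iris[1:7, 1:3], bin, catg)
colnames(mixdata) <- c(paste0("num", 1:3), paste0("bin", 1:3), paste0("cat", 1:3))
mixdata
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)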

##   num1 num2 num3 bin1 bin2 bin3 cat1 cat2 cat3
## 1  5.1  3.5  1.4    1    2    2    1    3    3
## 2  4.9  3.0  1.4    1    2    1    2    2    3
## 3  4.7  3.2  1.3    2    1    2    1    2    1
## 4  4.6  3.1  1.5    2    1    2    1    2    3
## 5  5.0  3.6  1.4    1    1    1    2    2    2
## 6  5.4  3.9  1.7    2    2    2    1    1    3
## 7  4.6  3.4  1.4    2    1    2    2    3    2
##           1         2         3         4         5         6         7
## 1 0.0000000 0.4228395 0.5648148 0.4799383 0.5817901 0.3966049 0.5262346
## 2 0.4228395 0.0000000 0.6358025 0.5262346 0.3101852 0.7083333 0.6466049
## 3 0.5648148 0.6358025 0.0000000 0.1929012 0.5632716 0.6280864 0.3996914
## 4 0.4799383 0.5262346 0.1929012 0.0000000 0.5895062 0.4876543 0.3981481
## 5 0.5817901 0.3101852 0.5632716 0.5895062 0.0000000 0.8425926 0.4135802
## 6 0.3966049 0.7083333 0.6280864 0.4876543 0.8425926 0.0000000 0.7006173
## 7 0.5262346 0.6466049 0.3996914 0.3981481 0.4135802 0.7006173 0.0000000
##           1         2         3         4         5        6         7
## 1 0.0000000 0.6956580 0.8514500 0.9512342 0.5817152 1.212672 0.8039374
## 2 0.6956580 0.0000000 0.7345406 0.6560616 0.6778587 2.454716 0.8768878
## 3 0.8514500 0.7345406 0.0000000 0.4346564 0.8401231 2.804699 0.4706517
## 4 0.9512342 0.6560616 0.4346564 0.0000000 1.0477822 2.193826 0.5181166
## 5 0.5817152 0.6778587 0.8401231 1.0477822 0.0000000 1.668443 0.6046386
## 6 1.2126721 2.4547164 2.8046989 2.1938258 1.6684434 0.000000 2.3092201
## 7 0.8039374 0.8768878 0.4706517 0.5181166 0.6046386 2.309220 0.0000000

3. K-medoids algorithms

There are several k-medoids algorithms; partitioning around medoids (pam), for example, is available in the cluster package. In the kmed package, the available algorithms are the simple and fast k-medoids (fastkmed) (Park and Jun 2009), the ranked k-medoids (rankkmed) (Zadegan, Mirzaie, and Sadoughi 2013), and the step k-medoids (stepkmed) (Yu et al. 2018). As an example, fastkmed is applied to the Manhattan weighted by rank distance of the iris data with three clusters.

result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
# cross-tabulate the cluster memberships against the true species
(fastiris <- table(result$cluster, iris[,5]))
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         39         3
##   3      0         11        47
# misclassification rate in percent
(misclass <- (1 - sum(diag(fastiris))/length(iris[,5])) * 100)
## [1] 9.333333

Applying fastkmed to the iris data with the Manhattan weighted by rank distance, the misclassification rate is 9.33%.

4. Bootstrap evaluation

A. Bootstrap replicate matrix

To evaluate a clustering algorithm, a bootstrap evaluation function (clustboot) can be applied. Before applying clustboot, the clustering algorithm to be evaluated must be created first. It must take two arguments, i.e. a distance or data matrix and the number of clusters, and its output must be a vector of cluster memberships. We create two such functions to be evaluated by the bootstrap: the first is the simple and fast k-medoids, and the second is the k-means algorithm, taken from the kmeans function in the stats package.
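A sketch of the two wrappers (fastclust and kmclust are illustrative names):

# k-medoids wrapper: a distance matrix in, a membership vector out
fastclust <- function(x, nclust) {
  res <- fastkmed(x, nclust, iterate = 50)
  res$cluster
}
# k-means wrapper: a data matrix in, a membership vector out
kmclust <- function(x, nclust) {
  res <- kmeans(x, nclust)
  res$cluster
}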

The result is an n x b matrix, where n is the number of objects and b is the number of bootstrap replicates.
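The bootstrap can then be run for both algorithms; the first five rows and ten replicates of each result are printed below. The argument names and the 50 replicates are assumptions, and a 0 presumably marks an object not drawn in that replicate.

irisfast <- clustboot(mrwdist, nclust = 3, algorithm = fastclust, nboot = 50)
irisfast[1:5, 1:10]
kmboot <- clustboot(iris[, 1:4], nclust = 3, algorithm = kmclust, nboot = 50, diss = FALSE)
kmboot[1:5, 1:10]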

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    1    0    1    3    0    3    1     1
## [2,]    3    0    1    1    1    3    0    0    1     1
## [3,]    0    0    0    0    1    3    3    0    0     1
## [4,]    3    0    1    0    0    3    3    0    0     1
## [5,]    0    3    0    1    1    0    3    3    1     1
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    2    0    3    3    0    1    2    2     0
## [2,]    1    0    3    3    3    2    1    2    2     2
## [3,]    1    2    3    3    3    2    1    2    2     2
## [4,]    0    2    0    3    0    2    1    2    0     2
## [5,]    1    0    3    0    0    1    1    2    0     0

B. Consensus matrix

A consensus matrix (n x n) can be produced from a bootstrap replicate matrix. To create the consensus matrix, a function to order the objects is required. The consensusmatrix function provides a reorder argument in which an ordering method can be supplied. The ordering method must have two input arguments, namely a distance matrix and the number of clusters, and its output must be only a membership vector. For example, a hierarchical clustering algorithm with ward linkage is applied to order the objects in the consensus matrix.
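A sketch of a ward-linkage ordering function and the consensus matrix call (wardorder is an illustrative name; the row/ column slice is an assumption chosen to match the 15 x 15 block below):

# ordering function: a distance matrix and a number of clusters in,
# a membership vector out
wardorder <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "ward.D2")
  cutree(res, nclust)
}
consensusfast <- consensusmatrix(irisfast, nclust = 3, reorder = wardorder)
consensusfast[c(1:5, 51:60), c(1:5, 51:60)]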

##   1 1 1 1 1         2         2         2         2         2         2
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0 0 0 0 0 1.0000000 0.8750000 0.9310345 0.6296296 0.9047619 0.7333333
## 2 0 0 0 0 0 0.8750000 1.0000000 0.8095238 0.9500000 1.0000000 0.6000000
## 2 0 0 0 0 0 0.9310345 0.8095238 1.0000000 0.5454545 0.8500000 0.7586207
## 2 0 0 0 0 0 0.6296296 0.9500000 0.5454545 1.0000000 0.8333333 0.3200000
## 2 0 0 0 0 0 0.9047619 1.0000000 0.8500000 0.8333333 1.0000000 0.6666667
## 2 0 0 0 0 0 0.7333333 0.6000000 0.7586207 0.3200000 0.6666667 1.0000000
## 2 0 0 0 0 0 0.7272727 0.6000000 0.6875000 0.3684211 0.7500000 1.0000000
## 2 0 0 0 0 0 0.9354839 0.8333333 0.9200000 0.5238095 0.8800000 0.8064516
## 2 0 0 0 0 0 0.7407407 0.6190476 0.7000000 0.3500000 0.6842105 1.0000000
## 2 0 0 0 0 0 0.7083333 0.6000000 0.7272727 0.3000000 0.6500000 1.0000000
##           2         2         2         2
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.7272727 0.9354839 0.7407407 0.7083333
## 2 0.6000000 0.8333333 0.6190476 0.6000000
## 2 0.6875000 0.9200000 0.7000000 0.7272727
## 2 0.3684211 0.5238095 0.3500000 0.3000000
## 2 0.7500000 0.8800000 0.6842105 0.6500000
## 2 1.0000000 0.8064516 1.0000000 1.0000000
## 2 1.0000000 0.8421053 1.0000000 1.0000000
## 2 0.8421053 1.0000000 0.8214286 0.8000000
## 2 1.0000000 0.8214286 1.0000000 1.0000000
## 2 1.0000000 0.8000000 1.0000000 1.0000000

This consensus matrix can then be visualized directly as a heatmap.

C. Visualization (Heatmap)

To produce a heatmap of the consensus matrix, clustheatmap can be applied to it. Here, the consensus matrix heatmap of the iris data clustered by the simple and fast k-medoids is produced.
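A minimal sketch of the call (assuming clustheatmap accepts the consensus matrix directly, with any labeling left to its defaults):

clustheatmap(consensusfast)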

We can also create a heatmap of the consensus matrix from the k-means bootstrap replicates.
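Under the same assumptions, the k-means consensus matrix is built from its bootstrap replicates and passed to clustheatmap:

consensuskm <- consensusmatrix(kmboot, nclust = 3, reorder = wardorder)
clustheatmap(consensuskm)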


References

Park, H., and C. Jun. 2009. “A Simple and Fast Algorithm for K-Medoids Clustering.” Expert Systems with Applications 36 (2). Elsevier:3336–41. https://doi.org/10.1016/j.eswa.2008.01.039.

Yu, D., G. Liu, M. Guo, and X. Liu. 2018. “An Improved K-Medoids Algorithm Based on Step Increasing and Optimizing Medoids.” Expert Systems with Applications 92 (February). Elsevier:464–73. https://doi.org/10.1016/j.eswa.2017.09.052.

Zadegan, S.M.R., M. Mirzaie, and F. Sadoughi. 2013. “Ranked K-Medoids: A Fast and Accurate Rank-Based Partitioning Algorithm for Clustering Large Datasets.” Knowledge-Based Systems 39 (February). Elsevier:133–43. https://doi.org/10.1016/j.knosys.2012.10.012.