CRAN Task View: Cluster Analysis & Finite Mixture Models
This CRAN Task View contains a list of packages that can be
used for finding groups in data and modeling unobserved
cross-sectional heterogeneity. Many packages provide functionality for
more than one of the topics listed below; the section headings are
mainly meant as quick starting points rather than an ultimate
categorization. Except for packages stats and cluster (which ship with
base R and hence are part of every R installation), each package is
listed only once.
Most, but not all, of the packages listed in this CRAN Task View are
distributed under the GPL. Please have a look at the DESCRIPTION file
of each package to check under which license it is distributed.
from package stats and
are the primary
functions for agglomerative hierarchical clustering, function
can be used for divisive hierarchical
clustering. Faster alternatives to
provided by the packages
from stats and associated
methods can be used for improved visualization for cluster
package provides functions for easy
visualization (coloring labels and branches, etc.), manipulation
(rotating, pruning, etc.) and comparison of dendrograms (tanglegrams
with heuristics for optimal branch rotations, and tree correlation
measures with bootstrap and permutation tests for
contains methods for detection
of clusters in hierarchical clustering dendrograms.
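As a minimal sketch of the agglomerative workflow described above, the following uses only functions from package stats. The function names hclust(), cutree() and as.dendrogram(), the USArrests data set, the average linkage and the choice of four clusters are assumptions made for this illustration; they are not named in the text above.

```r
# Agglomerative hierarchical clustering with package stats (illustrative).
x <- scale(USArrests)                  # standardize the four variables
d <- dist(x)                           # Euclidean distance matrix
hc <- hclust(d, method = "average")    # average linkage (arbitrary choice)
groups <- cutree(hc, k = 4)            # cut the tree into 4 clusters
table(groups)                          # cluster sizes

# Convert to a dendrogram object for further manipulation and plotting.
dend <- as.dendrogram(hc)
```

Calling plot(dend) draws the tree; other linkage methods can be tried by changing the method argument.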
implements a fast hierarchical
clustering algorithm with a linkage criterion which is a variant of
the single linkage method combining it with the Gini inequality
measure to robustify the linkage method while retaining
computational efficiency to allow for the use of larger data
implements hybrid hierarchical
clustering via mutual clusters.
uses an algorithm which is based on
the classification of ordination scores from isometric feature
mapping. The classification is performed either as a hierarchical,
divisive method or as non-hierarchical partitioning.
implements a form of
hierarchical clustering that associates a prototypical element with
each interior node of the dendrogram. Using the package's
function, one can produce dendrograms that are
prototype-labeled and are therefore easier to interpret.
is a package for assessing the uncertainty in
hierarchical cluster analysis. It provides approximately
unbiased p-values as well as bootstrap p-values.
provides clustering for a set of
variables are available, where
. It adaptively chooses a set of variables
to use in clustering the observations. Sparse K-means clustering and
sparse hierarchical clustering are implemented.
from package stats provides
several algorithms for computing partitions with respect to
partitioning around medoids and can work with arbitrary
for larger data sets. Silhouette plots
and spanning ellipses can be used for visualization.
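The partitioning approaches mentioned above can be sketched as follows. This is an illustration under stated assumptions: kmeans() from stats and pam() and silhouette() from cluster are used because they ship with R, and the iris data, the Manhattan distance and k = 3 are arbitrary choices that do not come from the text.

```r
library(cluster)   # recommended package; provides pam() and silhouette()

set.seed(1)        # k-means uses random starting centers
x <- scale(iris[, 1:4])

# k-means from package stats; nstart restarts reduce the risk of a poor
# local optimum.
km <- kmeans(x, centers = 3, nstart = 20)

# Partitioning around medoids on an arbitrary dissimilarity (Manhattan here).
d <- dist(x, method = "manhattan")
pm <- pam(d, k = 3)

# Average silhouette width as a rough quality summary of the PAM partition.
sil <- silhouette(pm)
mean(sil[, "sil_width"])
```

Calling plot(sil) produces the silhouette plot for the PAM solution.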
implements Frey's and Dueck's
Affinity Propagation clustering. The algorithms in the package are analogous
to the Matlab code published by Frey and Dueck.
allows searching for the optimal
clustering procedure for a given dataset.
implements Huang’s k-prototypes
extension of k-means for mixed type data.
implements various clustering
algorithms that produce a credal partition, i.e., a set of
Dempster-Shafer mass functions representing the membership of
objects to clusters.
provides k-centroid cluster
algorithms for arbitrary distance measures, hard competitive
learning, neural gas and QT clustering. Neighborhood graphs and
image plots of partitions are available for visualization. Some of
this functionality is also provided by package
provides a weighted kernel version of
the k-means algorithm by
clustering specifically for longitudinal (joint) data.
allows spherical k-means clustering,
i.e., k-means clustering with cosine similarity. It features several
methods, including a genetic and a simple fixed-point algorithm and
an interface to the CLUTO vcluster program for clustering
provides trimmed k-means
also allows for trimmed
k-means clustering. In addition, this package allows other covariance
structures to be specified for the clusters.
For semi- or partially supervised problems, where for a part of
the observations labels are given with certainty or with some
provides belief-based and
soft-label mixture modeling for mixtures of Gaussians with the EM
provides EM algorithms and several
efficient initialization methods for model-based clustering of
finite mixtures of Gaussian distributions with unstructured dispersion in
unsupervised as well as semi-supervised learning situations.
implement model-based functional data
package implements the
algorithm, which clusters time series or, more generally,
functional data. It is based on a discriminative functional mixture
model which allows the clustering of the data in a unique and
discriminative functional subspace. This model has the
advantage of being parsimonious and can therefore handle long time
package implements the funHDDC algorithm
which allows the clustering of functional data within group-specific
functional subspaces. The funHDDC algorithm is based on a functional
mixture model which models and clusters the data into group-specific
functional subspaces. The approach allows meaningful
interpretations afterwards by looking at the group-specific functional
is a subspace clustering method
which allows for efficient unsupervised classification of
high-dimensional data. It is based on the Gaussian mixture model and
on the idea that the data live in a common and low-dimensional
subspace. An EM-like algorithm estimates both the discriminative
subspace and the parameters of the mixture model.
to fit Gaussian mixture models to high-dimensional data where it is
assumed that the data live in a lower dimension than the original
allows fitting multivariate
t-distribution mixture models (with eigen-decomposed covariance
structure) from a clustering or classification point of
allows fitting these models as
well as Gaussian mixture models to longitudinal data.
fits mixtures of Gaussians using the EM
algorithm. It allows fine control of volume and shape of covariance
matrices and agglomerative hierarchical clustering based on maximum
likelihood. It provides comprehensive strategies using hierarchical
clustering, EM and the Bayesian Information Criterion (BIC) for
clustering, density estimation, and discriminant analysis. Package
provides tools for fitting mixture models of
multivariate Gaussian or multinomial components to a given data set
with either a clustering, a density estimation or a discriminant
analysis point of view. Package
as well as packages
provide all 14 possible
variance-covariance structures based on the eigenvalue
fits mixtures of probabilistic
principal component analysis with the EM algorithm.
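The EM fitting of Gaussian mixtures and the use of BIC mentioned in the entries above can be illustrated with a small base R sketch. It is not the implementation of any package listed here: the simulated data, the quantile-based starting values and the restriction to two univariate components are assumptions chosen to keep the example short.

```r
# EM for a two-component univariate Gaussian mixture (illustrative sketch).
set.seed(42)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Crude starting values (an assumption; packages use better strategies).
p  <- 0.5                                      # proportion of component 1
mu <- as.numeric(quantile(x, c(0.25, 0.75)))   # component means
s  <- rep(sd(x), 2)                            # component standard deviations

mix_loglik <- function(x, p, mu, s) {
  sum(log(p * dnorm(x, mu[1], s[1]) + (1 - p) * dnorm(x, mu[2], s[2])))
}

loglik <- mix_loglik(x, p, mu, s)
for (iter in 1:200) {
  # E-step: posterior probability that each point belongs to component 1.
  d1 <- p * dnorm(x, mu[1], s[1])
  d2 <- (1 - p) * dnorm(x, mu[2], s[2])
  z  <- d1 / (d1 + d2)

  # M-step: weighted updates of proportion, means and standard deviations.
  p  <- mean(z)
  mu <- c(sum(z * x) / sum(z), sum((1 - z) * x) / sum(1 - z))
  s  <- c(sqrt(sum(z * (x - mu[1])^2) / sum(z)),
          sqrt(sum((1 - z) * (x - mu[2])^2) / sum(1 - z)))

  # Track the log-likelihood and stop once it no longer improves noticeably.
  loglik <- c(loglik, mix_loglik(x, p, mu, s))
  if (diff(tail(loglik, 2)) < 1e-8) break
}

# BIC with 5 free parameters (2 means, 2 sds, 1 proportion); under this
# sign convention smaller values are better.
bic <- -2 * tail(loglik, 1) + 5 * log(length(x))
```

Comparing bic across fits with different numbers of components is one common way to choose the number of clusters; some packages report BIC with the opposite sign.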
For grouped conditional data package
provides fitting with the EM algorithm for
parametric and non-parametric (multivariate) mixtures. Parametric
mixtures include mixtures of multinomials, multivariate normals,
normals with repeated measures, Poisson regressions and Gaussian
regressions (with random effects). Non-parametric mixtures include
the univariate semi-parametric case where symmetry is imposed for
identifiability and multivariate non-parametric mixtures with
conditional independence assumption. In addition, fitting mixtures of
Gaussian regressions with the Metropolis-Hastings algorithm is
Fitting finite mixtures of uni- and multivariate scale mixtures
of skew-normal distributions with the EM algorithm is provided by
fits finite mixtures of von
Mises-Fisher distributions with the EM algorithm.
fits mixtures of generalized lambda
distributions and for grouped conditional data package
can be used.
provides tools for classification using normal
mixture models and (higher resolution) hidden Markov normal mixture
models fitted by various methods.
Parsimonious Gaussian mixture models allow fitting mixtures of
factor analyzers with constraints on the components of the factor
models. Functionality to fit these models is provided in package
clusters a presence-absence matrix
object by calculating an MDS
from the distances, and applying maximum likelihood Gaussian
mixtures clustering to the MDS
estimates mixtures of the
dichotomous Rasch model (via conditional ML) and the Bradley-Terry
estimates mixture Rasch models,
including the dichotomous Rasch model, the rating scale model, and
the partial credit model with joint maximum likelihood estimation.
allows unsupervised
model-based clustering of high-dimensional and (ultra) large data. The
package uses pbdMPI to perform a parallel version of the EM
algorithm for mixtures of Gaussians.
Bayesian estimation of finite mixtures of multivariate Gaussians
is possible using package
bayesm. The package provides
functionality for sampling from such a mixture as well as estimating
the model using Gibbs sampling. Additional functionality for
analyzing the MCMC chains is available for averaging
the moments over MCMC draws, for determining the marginal densities,
for clustering observations and for plotting the uni- and bivariate
provides various Markov Chain
Monte Carlo samplers for model-based clustering of discrete-valued
time series obtained by observing a categorical variable with
several states using a Bayesian approach.
provides Bayesian estimation using
allows Bayesian clustering using a
spike-and-slab hierarchical model and is suitable for clustering
provides Bayesian Sampling for
fits Dirichlet process mixture
models using conjugate models with normal structure. Package
determines the maximum posterior estimate for
product partition models where the Dirichlet process mixture is a
specific case in the class.
fits mixtures of gamma distributions.
contains a mixture of statistical
methods including MCMC methods for analyzing normal mixtures with
possibly censored data.
implements methods for processing a
sample of (hard) clusterings, e.g. the MCMC output of a Bayesian
clustering model. Among them are methods that find a single best
clustering to represent the sample, which are based on the posterior
similarity matrix or a relabeling algorithm.
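To make the idea of a posterior similarity matrix concrete, here is a small base R sketch. The toy matrix of draws and the helper comembership() are hypothetical, and the least-squares selection rule is one common criterion used purely for illustration; it is not claimed to be the method of the package described above.

```r
# Toy sample of hard clusterings: 4 draws (rows) for 5 objects (columns).
# Label values differ across draws, but only co-membership matters.
draws <- rbind(c(1, 1, 2, 2, 2),
               c(2, 2, 1, 1, 1),
               c(1, 1, 1, 2, 2),
               c(1, 1, 2, 2, 3))

# Co-clustering indicator matrix for one clustering (hypothetical helper).
comembership <- function(labels) outer(labels, labels, "==") * 1

# Posterior similarity matrix: fraction of draws in which i and j co-cluster.
psm <- Reduce(`+`, lapply(seq_len(nrow(draws)),
                          function(s) comembership(draws[s, ]))) / nrow(draws)

# Least-squares summary: the draw whose co-membership matrix is closest
# to the posterior similarity matrix.
loss <- apply(draws, 1, function(l) sum((comembership(l) - psm)^2))
best <- draws[which.min(loss), ]
```

Because the matrix depends only on which objects share a cluster, it is unaffected by label switching between draws, which is why it is a convenient basis for summarizing MCMC output.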
is a package for profile regression,
which is a Dirichlet process Bayesian clustering where the response
is linked non-parametrically to the covariate profile.
provides an interface to the JAGS
MCMC library which includes a module for mixture modelling.
Other estimation methods:
allows fitting an adaptive mixture of Student-t
distributions to approximate a target density through its kernel
estimates densities with a penalized
Circular and orthogonal regression clustering using redescending
M-estimators is provided by package
Robust estimation using Weighted Likelihood can be done with
Other Cluster Algorithms:
allows clustering of high-dimensional
data based on a two-dimensional decision plot. This density-distance
plot shows, for each data point, the local density against the
shortest distance to any observation with a higher local density
value. The cluster centroids of this non-iterative procedure can be
selected using an interactive or automatic selection mode.
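The two quantities behind the density-distance plot described above can be computed in a few lines of base R. This is a sketch under assumptions, not the package's implementation: the Gaussian kernel, the bandwidth dc, the toy data and the rule of picking the k points with the largest product of density and distance are all illustrative choices.

```r
set.seed(7)
# Two well-separated Gaussian blobs in the plane (toy data, 50 points each).
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))
d <- as.matrix(dist(x))

# Local density: Gaussian-kernel sum over all other points.
# The bandwidth dc is a tuning parameter chosen arbitrarily here.
dc  <- 0.5
rho <- rowSums(exp(-(d / dc)^2)) - 1   # subtract the self term exp(0) = 1

# delta: distance to the nearest point with higher density; the global
# density maximum gets the largest distance in its row instead.
delta <- sapply(seq_len(nrow(d)), function(i) {
  higher <- which(rho > rho[i])
  if (length(higher) == 0) max(d[i, ]) else min(d[i, higher])
})

# Simple automatic selection: the k points with the largest rho * delta.
k <- 2
centers <- order(rho * delta, decreasing = TRUE)[1:k]

# Assign the remaining points, in order of decreasing density, to the
# cluster of their nearest higher-density neighbour.
cl <- integer(nrow(d))
cl[centers] <- seq_len(k)
for (i in order(rho, decreasing = TRUE)) {
  if (cl[i] == 0) {
    higher <- which(rho > rho[i])
    cl[i] <- cl[higher[which.min(d[i, higher])]]
  }
}
```

Calling plot(rho, delta) gives the decision plot; candidate centroids appear as points with both high density and large distance.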
provides alternative implementations
of k-means and agglomerative hierarchical clustering.
provides several algorithms to find
biclusters in two-dimensional data.
implements clustering techniques for
business analytics like "rock" and "proximus".
clusters 3-dimensional data into
their local modes based on a convergent form of Choi and Hall's
(1999) data sharpening method.
implements ensemble methods for both
hierarchical and partitioning cluster methods.
implements a cluster algorithm that
is based on copula functions and can therefore group
observations according to the multivariate dependence structure of
the generating process without any assumptions on the margins.
Fuzzy clustering and bagged clustering are available in package
e1071. Further and more extensive tools for fuzzy
clustering are available in package
hierarchical clustering which was especially designed for microarray
data to uncover structures present in the data that arise from
provides a fast reimplementation of
the DBSCAN (density-based spatial clustering of applications with
noise) algorithm using a kd-tree.
performs a combination of
factorial methods and cluster analysis.
algorithm is a hybrid between
hierarchical methods and PAM and builds a tree by
recursively partitioning a data set.
implements the algorithm of the same
name for visualizing very large high-dimensional datasets. For
clustering, optimized implementations of the HDBSCAN*, DBSCAN and
OPTICS algorithms are provided in combination with a very fast search
for approximate nearest neighbors and outlier detection.
For graphs and networks, model-based clustering approaches are
implemented in packages
contains a set of algorithms for
creating partitions and coverings of objects largely based on
operations on similarity relations (or matrices).
provides tools to perform cluster
analysis via kernel density estimation. Clusters are associated with
the maximally connected components with estimated density above a
threshold. In addition a tree structure associated with the
connected components is obtained.
provides the fitting of latent
class models which optionally also include a random effect. Package
allows for polytomous variable latent class
analysis and regression.
allows fitting Bayesian
LCA models employing the EM algorithm, Gibbs sampling or variational
fits recursively partitioned mixture
models for Beta and Gaussian Mixtures. This is a model-based
clustering algorithm that returns a hierarchy of classes, similar to
hierarchical clustering, but also similar to finite mixture
Self-organizing maps are available in package
Several packages provide cluster algorithms which have been
developed for bioinformatics applications. These packages include
for profiling microarray expression data
for order-restricted information-based clustering.
Multigroup mixtures of latent Markov models on mixed categorical
and continuous data (including time series) can be fitted using
depmixS4. The parameters are
optimized using a general purpose optimization routine given linear
and nonlinear constraints on the parameters.
allows for maximum likelihood fitting of
cluster-weighted models, a class of mixtures of regression models
with random covariates.
implements a user-extensible
framework for EM-estimation of mixtures of regression models,
including mixtures of (generalized) linear models.
provides fixed-point methods both for
model-based clustering and linear regression. A collection of
asymmetric projection methods can be used to plot various
aspects of a clustering.
fits a latent class linear mixed model
which is also known as growth mixture model or heterogeneous linear
mixed model using a maximum likelihood method.
fits mixtures of one-variable
regressions and provides the bootstrap test for the number of
fits mixtures of proportional hazard models
with the EM algorithm.
finite mixtures of gamlss family distributions.
Mixtures of univariate normal distributions can be printed
and plotted using package
to visualise the results of clustering algorithms.
contains functions for
generating random clusters and random covariance/correlation
matrices, calculating a separation index (data and population
version) for pairs of clusters or cluster distributions, and 1-D and
2-D projection plots to visualize clusters.
generates a finite mixture model
with Gaussian components for prespecified levels of maximum and/or
average overlaps. This model can be used to simulate data for
studying the performance of cluster algorithms.
For cluster validation package
reproducibility of a cluster. Package
popular internal and external cluster validation methods ready to
use for most of the outputs produced by functions from package
provides variable selection for
Functionality to compare the similarity between two cluster
solutions is provided by
The stability of k-centroid clustering solutions fitted using
functions from package
can also be validated
using bootstrap methods.
provides methods to analyze cluster
alternatives based on multi-objective optimization of cluster
implements 30 different indices which
evaluate the cluster structure and should help to determine a
suitable number of clusters.
visualizing dissimilarity matrices using seriation and matrix shading.
This also allows inspecting cluster quality by restricting objects
belonging to the same cluster to be displayed in consecutive order.
provides a statistical method for
testing the significance of clustering results.
between data points based on their leaf memberships in regression or
classification trees for each variable. It also performs the cluster
analysis using the resulting dissimilarity matrix with available
heuristic clustering algorithms in R.