Uniform Manifold Approximation and Projection in R

Introduction

Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimensional reduction proposed by McInnes and Healy.

This vignette demonstrates how to use the umap R package to perform dimensional reduction and data trasnformations with the UMAP method. The vignette uses a small dataset as an example, but the package is suited to process larger data with many thousands of data points.

The vignette covers basic usage, tuning, stability and reproducibility, and discusses toggling between different implementations of the UMAP algorithm.

Usage

For a practical demonstration, let’s use the Iris dataset.

head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

The first four columns contain data and the last column contains a label. It will be useful to separate those components.

iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels = iris[, "Species"]

Creating a projection

Let’s now load the umap package and apply the UMAP transformation.

library(umap)
iris.umap = umap(iris.data)

The output, iris.umap, is an object of class umap. We can get a minimal summary of its contents by just printing it.

iris.umap
## umap embedding of 150 items in 2 dimensions
## object components: layout, data, knn, config

The main component of the object is layout, which holds a matrix with coordinates.

head(iris.umap$layout, 3)
##          [,1]      [,2]
## [1,] 15.95804 -1.418806
## [2,] 17.80407 -1.531492
## [3,] 17.59992 -2.239721

These coordinates can be used to visualize the dataset. (The custom plot function, plot.iris, is available at the end of this vignette.)

plot.iris(iris.umap, iris.labels)

Projecting new data

Once we have a umap object describing a projection of a dataset into a low-dimensional layout, we can project other data onto the same manifold.

To perform the projection, we need a second dataset with the same data features as the training data. For this demonstration, let’s create such data by adding some noise to the original data.

iris.wnoise = iris.data + matrix(rnorm(150*40, 0, 0.1), ncol=4)
colnames(iris.wnoise) = colnames(iris.data)
head(iris.wnoise, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.183373    3.525821     1.515123   0.3285499
## 2     4.872395    3.096451     1.467537   0.2094982
## 3     4.664500    3.150872     1.421417   0.2190057

We can now arrange these perturbed observations onto the same layout as before. Following R’s design pattern for fitted models, this is performed via predict.

iris.wnoise.umap = predict(iris.umap, iris.wnoise)
head(iris.wnoise.umap, 3)
##       [,1]      [,2]
## 1 15.30520 -1.921424
## 2 18.09914 -1.361412
## 3 17.47446 -2.782222

The output is a matrix with coordinates. We can visualize these points alongside the original.

plot.iris(iris.umap, iris.labels)
plot.iris(iris.wnoise.umap, iris.labels, add=T, pch=4,
          legend.suffix=" (with noise)")

The new observations lie close to their original counterparts.

Tuning UMAP

The example above uses function umap with a single argument - the input dataset - so the embedding is performed with default settings. However, the algorithm can be tuned via configuration objects or via additional arguments.

Configuration objects

The default configuration object is called umap.defaults. This is a list encoding default values for all the parameters used within the algorithm.

umap.defaults
## umap configuration parameters
##            n_neighbors: 15
##           n_components: 2
##                 metric: euclidean
##               n_epochs: 200
##                  input: data
##                   init: spectral
##               min_dist: 0.1
##       set_op_mix_ratio: 1
##     local_connectivity: 1
##              bandwidth: 1
##                  alpha: 1
##                  gamma: 1
##   negative_sample_rate: 5
##                      a: NA
##                      b: NA
##                 spread: 1
##           random_state: NA
##        transform_state: NA
##            knn_repeats: 1
##                verbose: FALSE
##        umap_learn_args: NA

For a description of each field, see the documentation in help(umap.defaults) or the original publication.

To create a custom configuration, we can first make a copy of the defaults and then update any of the fields. For example, let’s change the seed for random number generation.

custom.config = umap.defaults
custom.config$random_state = 123

We can observe the changed settings by inspecting the object again (try it). To perform the UMAP projection with these settings, we can run the projection again and provide the configuration object as a second argument.

iris.umap.2 = umap(iris.data, custom.config)
plot.iris(iris.umap.2, iris.labels,
          main="Another UMAP visualization (different seed)")

The result is slightly different than before due to a new instantiation of the random number generator.

Additional arguments

Another way to customize the algorithm is to specify the non-default parameters, one-by-one. To achieve equivalent results to the above, we can thus use

iris.umap.3 = umap(iris.data, random_state=123)

The coordinates in this output object should match the ones from iris.umap.2 (check it!)

Stability and reproducibility

The UMAP algorithm uses random numbers and thus results may vary from run to run. As we have seen, it is possible to obtain a minimal level of reproducibility by setting seeds. Nonetheless, there are some subtle points worth covering in more detail.

Initial embedding

As demonstrated in the section on tuning, embedding of a raw dataset can be stabilized by setting a seed for random number generation. However, note that results are stable only when the raw data are re-processed in exactly the same way. Results are bound to change if the data are presented in a different order.

Prediction

Projection of new data onto an existing embedding uses a separate random number generator, which can be controlled via parameter transform_state. This seed ensures that predictions executed in the exactly the same way produce identical ouput each time. In addition to stability provided by the transform_state seed, the default algorithm further implements a mechanism by which predictions are consistent when performed individually or in batch. Let’s look at an example based on the Iris data.

# predict in batch, display first item
predict(iris.umap, iris.wnoise)[1, , drop=FALSE]
##      [,1]      [,2]
## 1 15.3052 -1.921424
# predict only first item
predict(iris.umap, iris.wnoise[1,,drop=FALSE])
##      [,1]      [,2]
## 1 15.3052 -1.921424

The mechanism that enables stability between batch and individual predictions also ensures that predictions are stable when data are presented in a different order. However, this mechanism is not enforced - for performance reasons - when the new data contain more than a thousand rows. This is an implementation detail and may change in future versions.

Implementations

The package provides two implementations of the umap method, one written in R (with help from several packages from CRAN) and one accessed via an external python module.

The implementation written in R is the default. It follows the design principles of the UMAP algorithm and its running time scales better-than-quadratically with the number of items (points) in a dataset. It is thus in principle suitable for use on datasets with thousands of points. This implementation is the default because it should be functional without extensive dependencies.

The second available implementation is a wrapper for a python package umap-learn. To enable this implementation, specify the argument method.

iris.umap.4 = umap(iris.data, method="umap-learn")

This command has several dependencies. You must have the reticulate package installed and loaded (use install.packages("reticulate") and library(reticulate)). Furthermore, you must have the umap-learn python package installed (see the package repo for instructions). If either of these components is not available, the above command will display an error message.

A separate vignette explains additional aspects of interfacing with umap-learn, including handling of discrepancies between releases.

Note that it will not be possible to produce exactly the same output from the two implementations due to inequivalent random number generators in R and python, and due to slight discrepancies in the implementations.

 

Appendix

The custom plot function used to visualize the Iris dataset:

plot.iris
## function(x, labels,
##          main="A UMAP visualization of the Iris dataset",
##          colors=c("#ff7f00", "#e377c2", "#17becf"),
##          pad=0.1, cex=0.65, pch=19, add=FALSE, legend.suffix="",
##          cex.main=1, cex.legend=1) {
## 
##   layout = x
##   if (class(x)=="umap") {
##     layout = x$layout
##   } 
##   
##   xylim = range(layout)
##   xylim = xylim + ((xylim[2]-xylim[1])*pad)*c(-0.5, 0.5)
##   if (!add) {
##     par(mar=c(0.2,0.7,1.2,0.7), ps=10)
##     plot(xylim, xylim, type="n", axes=F, frame=F)
##     rect(xylim[1], xylim[1], xylim[2], xylim[2], border="#aaaaaa", lwd=0.25)  
##   }
##   points(layout[,1], layout[,2], col=colors[as.integer(labels)],
##          cex=cex, pch=pch)
##   mtext(side=3, main, cex=cex.main)
## 
##   labels.u = unique(labels)
##   legend.pos = "topright"
##   legend.text = as.character(labels.u)
##   if (add) {
##     legend.pos = "bottomright"
##     legend.text = paste(as.character(labels.u), legend.suffix)
##   }
##   legend(legend.pos, legend=legend.text,
##          col=colors[as.integer(labels.u)],
##          bty="n", pch=pch, cex=cex.legend)
## }
## <bytecode: 0x98f9758>

Summary of R session:

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS: /software/opt/R/R-3.5.1/lib/libRblas.so
## LAPACK: /software/opt/R/R-3.5.1/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] umap_0.2.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1      lattice_0.20-38 digest_0.6.20   RSpectra_0.14-0
##  [5] grid_3.5.1      jsonlite_1.6    magrittr_1.5    evaluate_0.14  
##  [9] stringi_1.4.3   Matrix_1.2-17   reticulate_1.12 rmarkdown_1.13 
## [13] tools_3.5.1     stringr_1.4.0   xfun_0.7        yaml_2.2.0     
## [17] compiler_3.5.1  askpass_1.1     htmltools_0.3.6 openssl_1.4    
## [21] knitr_1.23