Interfacing with python package ‘umap-learn’

Introduction

(For general information on usage of package umap, see the introductory vignette.)

R package umap provides an interface to uniform manifold approximation and projection (UMAP) algorithms. Several implementations are now available, including some provided by different versions of the python package ‘umap-learn’. This vignette explains some nuanced aspects of interfacing with the python package.

Usage

As prep, let’s load the package and prepare a small dataset.

library(umap)
iris.data = iris[, grep("Sepal|Petal", colnames(iris))]

The basic command to perform dimensional reduction is umap. By default, this function uses an implementation written in R. To use an alternative implementation via the umap-learn python package, that package and its dependencies must be installed separately (see the Python Package Index or the package source). You must also install and load the reticulate package (use install.packages('reticulate') and library('reticulate')).
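Before requesting the python implementation, it can help to confirm that python is reachable from R. The lines below are a minimal sketch rather than part of the package: they assume that umap-learn is importable in python under the module name umap, and they rely on py_module_available from reticulate. The variable names have.reticulate and have.umap.learn are purely illustrative.

```r
## (not evaluated in vignette)
## check that reticulate is installed in R
have.reticulate = requireNamespace("reticulate", quietly=TRUE)
## check that python can import umap-learn
## (assumption: the python module is named "umap")
have.umap.learn = have.reticulate && reticulate::py_module_available("umap")
have.umap.learn
```

If the result is FALSE, calls with method="umap-learn" are unlikely to succeed until the installation is fixed.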

After completing installations, the UMAP transformation can be performed by specifying a method argument.

iris.umap_learn = umap(iris.data, method="umap-learn")

Tuning umap-learn

As covered in the introductory vignette, tuning parameters can be set via a configuration object and via explicit arguments in the umap function call. The default configuration is accessible as object umap.defaults.
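As a brief sketch of both approaches, the lines below copy the default configuration, change two settings, and then supply one further setting as an explicit argument. The object names custom.config and iris.tuned and the particular values are illustrative choices, not recommendations.

```r
## (not evaluated in vignette)
library(umap)
iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
## start from the defaults and change individual settings
custom.config = umap.defaults
custom.config$n_neighbors = 20
custom.config$min_dist = 0.2
## supply the configuration object and one explicit argument (n_epochs)
iris.tuned = umap(iris.data, config=custom.config,
                  method="umap-learn", n_epochs=100)
```

The configuration stored in iris.tuned$config can then be inspected to see which values were used.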

umap.defaults
## umap configuration parameters
##            n_neighbors: 15
##           n_components: 2
##                 metric: euclidean
##               n_epochs: 200
##                  input: data
##                   init: spectral
##               min_dist: 0.1
##       set_op_mix_ratio: 1
##     local_connectivity: 1
##              bandwidth: 1
##                  alpha: 1
##                  gamma: 1
##   negative_sample_rate: 5
##                      a: NA
##                      b: NA
##                 spread: 1
##           random_state: NA
##        transform_state: NA
##            knn_repeats: 1
##                verbose: FALSE
##        umap_learn_args: NA

Note the entry umap_learn_args toward the end. This is set to NA by default, indicating that appropriate arguments will be selected automatically and passed to umap-learn.

After executing dimensional reduction, the output object contains a copy of the configuration with the values actually used to produce the output.

## should display a configuration summary
iris.umap_learn$config

Note that the entry for umap_learn_args contains a vector of all the arguments passed from the configuration object to the python package. An entry in the configuration should also reveal the version of the python package used to perform the calculation.

Discussion

Verifying arguments

A configuration object can contain many components, but not all may be used in a calculation. To verify that a setting is actually used, ensure that it appears in umap_learn_args in the output.

As an example, consider setting foo (an arbitrary name that is not a umap-learn parameter) and n_epochs during the function call.

## (not evaluated in vignette)
iris.foo = umap(iris.data, method="umap-learn", foo=4, n_epochs=100)
iris.foo$config

Inspecting the output configuration will reveal that both foo and n_epochs are recorded (in the latter case, the default value is replaced by the new value). However, foo does not appear in umap_learn_args, revealing that this setting was not actually used in the calculation.
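The same check can be scripted. The sketch below defines a hypothetical helper, settings.used, which is not part of the umap package; it only tests whether names appear in the umap_learn_args component. To keep the example independent of a python installation, it is applied to a hand-made list, mock.config, that stands in for iris.foo$config and contains illustrative values only.

```r
## (not evaluated in vignette)
## hypothetical helper: which settings were passed on to umap-learn?
settings.used = function(config, settings) {
  result = settings %in% config$umap_learn_args
  names(result) = settings
  result
}
## mock configuration standing in for iris.foo$config (illustrative values)
mock.config = list(n_neighbors=15, n_epochs=100, foo=4,
                   umap_learn_args=c("n_neighbors", "n_epochs"))
## foo is reported as FALSE, n_epochs as TRUE
settings.used(mock.config, c("foo", "n_epochs"))
```

With a real result, the call would be settings.used(iris.foo$config, c("foo", "n_epochs")).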

Versions

Different versions of umap-learn take different parameters as input. The R package is coded to work with umap-learn versions 0.2 and 0.3 and will adjust arguments automatically to suit those versions.

Note, however, that some arguments that are acceptable in 0.3 are not set in the default configuration object. To use those features (see python package documentation), set the appropriate arguments manually, either by preparing a custom configuration object or by specifying the arguments during the umap function call.

Custom constructors

It is possible to set umap_learn_args manually while calling umap.

## (not evaluated in vignette) 
iris.custom = umap(iris.data, method="umap-learn",
                   umap_learn_args=c("n_neighbors", "n_epochs"))
iris.custom$config

Here, only the two specified arguments have been passed on to the calculation.

Appendix

Summary of R session:

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS: /software/opt/R/R-3.5.1/lib/libRblas.so
## LAPACK: /software/opt/R/R-3.5.1/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] umap_0.2.2.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1      lattice_0.20-38 digest_0.6.18   RSpectra_0.14-0
##  [5] grid_3.5.1      jsonlite_1.6    magrittr_1.5    evaluate_0.13  
##  [9] stringi_1.4.3   Matrix_1.2-16   reticulate_1.12 rmarkdown_1.11 
## [13] tools_3.5.1     stringr_1.4.0   xfun_0.4        yaml_2.2.0     
## [17] compiler_3.5.1  htmltools_0.3.6 knitr_1.21