This Vingette provides an overview of the largeVis package.

This package provides `LargeVis`

visualizations and fast nearest-neighbor search. The `LargeVis`

algorithm, presented in Tang et al. (2016), creates high-quality low-dimensional representaitons of large, high-dimensional datasets, similar to t-SNE.

These visualizations are useful for data exploration, for visualizing complex non-linear functions, and especially for visualizing embeddings such as learned vectors for images.

A limitation of t-SNE is that because the algorithm has complexity order \(O(n^2)\), it is not feasible for use on even moderately sized datasets. Barnes-Hut, an approximation of t-SNE, has complexity \(O(n \log n)\) but also quickly becomes infeasible as the size of data grows. `LargeVis`

is intended to address the issue by operating in linear \(O(n)\) time. It has been benchmarked at more than 30x faster than Barnes-Hut on datasets of approximately 1-million rows, and scaled linearly as long as there is sufficient RAM.

In addition, `LargeVis`

includes an algorithm for finding approximate k-Nearest Neighbors in \(O(n)\) time. This algorithm turns out to be faster at finding accurate a-NNs than any other method I was able to test.

The package also includes a function for visualizing image embeddings by plotting images at the locations given by the `LargeVis`

algorithm.

For a detailed description of the algorithm, please see the original paper, Tang et al. (2016).

The `largeVis`

package offers five functions for visualizing high-dimensional datasets and finding approximate nearest neighbors (along with some helper functions):

`randomProjectionTreeSearch`

, a method for finding approximate nearest neighbors.`projectKNNs`

, which takes as input a weighted nearest-neighbor graph and estimates a projection into a low-dimensional space.`largeVis`

, which implements the entire`LargeVis`

algorithm.`manifoldMap`

(and companon`ggManifoldMap`

), which produce a plot for visualizing embeddings of images.`buildWijMatrix`

takes a sparse matrix of the distances between nearest neighbors, and returns one with the edges properly weighted for use in`projectKNNs`

.

See the original paper for a detailed description of the algorithm.

For input to `largeVis`

, data should be scaled, NA’s, Infs and NULL removed, and transposed from the R-standard so that examples are columns and features are rows. Duplicates should be removed as well.

If there are NA’s, Infs, or NULLs in the input, `randomProjectionTreeSearch`

will definitely fail.

If the numerical range covered by the data is large, this can cause errors in or before the `buildEdgeMatrix`

function. This is because the algorithm requires calculating \(\exp(||\vec{x_i}, \vec{x_j}||^2)\) in the high-dimensional space, which will overflow if the distance between any nearest neighbors exceeds about 26.

Duplicates in the input data are likely to cause issues. If the number of duplicates is large, this can cause the random projection tree search to fail. If the number is small, the algorithm may identify a sufficient number of neighbors, but an error may then occur during `buildEdgeMatrix`

, or stochastic gradient descent.

`randomProjectionTreeSearch`

This function uses a two-phase algorithm to find approximate nearest neighbors. In the first phase, which is based on Erik Bernhardsson‘s Annoy algorithm, `n_trees`

trees are formed by recursively dividing the space by hyperplanes until at most `tree_threshold`

nodes remain in a branch. A node’s candidate nearest neighbors are the union of all nodes with which it shared a leaf on any of the trees. The `largeVis`

algorithm adds a second phase, neighborhood exploration, which considers, for each node, whether the candidate neighbors of the node’s candidate immediate neighbors are closer. The logic of the algorithm is that a node’s neighbors’ neighbors are likely to be the node’s own neighbors. In each iteration, the closest `K`

candidate neighbors for each node are kept.

(Note that this implementation of `largeVis`

differs from the approach taken by `Annoy`

, in that `Annoy`

always uses the number of features as the leaf threshold, where `largeVis`

allows this to be an adjustable parameter.)

The authors of Tang et al. (2016) suggest that a single iteration of the second phase is generally sufficient to obtain satisfactory performance.

See the vignette “ANN Benchmarks” for additional information.

`projectKNNs`

This function takes as its input a `Matrix::sparseMatrix`

, of connections between nodes. The matrix must be symmetric. A non-zero cell implies that node `i`

is a nearest neighbor of node `j`

, vice-versa, or both. Non-zero values represent the strength of the connection relative to other nearest neighbors of the two nodes.

The `LargeVis`

algorithm, explained in detail in Tang et al. (2016), estimates the embedding by sampling from the identitied nearest-neighbor connections. For each edge, the algorithm also samples `M`

non-nearest neighbor negative samples. `M`

, along with \(\gamma\) and \(\alpha\), control the visualization. \(\alpha\) controls the desired distance between nearest neighbors. \(\gamma\) controls the relative strength of the attractive force between nearest neighbors and repulsive force between non-neighbors.

The following grid illustrates the effect of the \(\alpha\) and \(\gamma\) hyperparameters:

`load(system.file(package = "largeVis", "vignettedata/vignettedata.Rda"))`

The additional hyperparameters \(\rho\) and `min-`

\(\rho\) control the starting and final learning rate for the stochastic gradient descent process.

The algorithm can treat positive edge weights in two different ways. The authors of Tang et al. (2016) suggest that edge weights should be used to generate a weighted sampling. However, the algorithm for taking a weighted sample runs in \(O(n \log n)\). Alternatively, the edge-weights can be applied to the gradients. This is controlled by the `weight_pos_samples`

parameter.

`vis`

The `vis`

function combines `randomProjectionTreeSearch`

and `projectKNNs`

, along with additional logic for calculating edge weights, to implement the complete `LargeVis`

algorithm.

The following chart illustrates the effect of the `M`

and `K`

parameters, using the `iris`

dataset. Each row re-uses the same set of identified `K`

neighbors, and initial coordinates.

`manifoldMap`

The `manifoldMap`

function is useful when the examples being clustered are themselves images. Given a coordinate matrix (as generated by `projectKNNs`

or `vis`

) and an `array`

of `N`

images, the function samples `n`

images and plots them at the coordinates given in the matrix.

The following code will generate the visualization shown in the examples:

```
dim(trainData) <- c(60000, 28, 28)
aperm(trainData, perm = c(1,3,2), resize = FALSE)
set.seed(1974)
manifoldMap(mnistCoords[,1:2],
n = 5000,
scale = 0.1,
images = trainData,
xlab = "",
ylab = "")
```

`largeVis`

supports sparse matrices. Besides facilitating very large datasets, this makes it practicable to visualize term-document-matrices directly, and compare the result with the result of visualizing topic vectors.

The `largeVis`

visualization algorithm can be used to visualize undirected weighted or unweighted acyclic graphs. The included `wiki`

dataset is an example.

The following code illustrates how to import and visualize a graph using the YouTube-communities dataset available here. The data and visualization are not included here for size reasons.

```
youtube <- readr::read_tsv(pathToGraphFile, skip=4, col_names=FALSE)
youtube <- as.matrix(youtube)
youtube <- Matrix::sparseMatrix(i = youtube[, 1],
j = youtube[, 2],
x = rep(1, nrow(youtube)),
dims = c(max(youtube), max(youtube)))
youtube <- youtube + t(youtube)
communities <- readr::read_lines(pathToCommunities)
communities <- lapply(communities,
FUN = function(x) as.numeric(unlist(strsplit(x, "\t"))))
community_assignments <- rep(0,
nrow(youtube))
for (i in 1:length(communities)) community_assignments[communities[[i]]] <- i
wij <- buildWijMatrix(youtube)
youTube_coordinates <- projectKNNs(youtube)
youTube_coordinates <- data.frame(scale(t(youTube_coordinates)))
colnames(youTube_coordinates) <- c("x", "y")
youTube_coordinates$community <- factor(community_assignments)
youTube_coordinates$alpha <- factor(ifelse(youTube_coordinates$community == 0, 0.05, 0.2))
ggplot(youTube_coordinates, aes( x = x,
y = y,
color = community,
alpha = alpha,
size = alpha)) +
geom_point() +
scale_alpha_manual(values = c(0.005, 0.2), guide = FALSE) +
scale_size_manual(values = c(0.03, 0.15), guide = FALSE) +
scale_x_continuous("",
breaks = NULL, limits = c(-2.5,2.5)) +
scale_y_continuous("",
breaks = NULL, limits = c(-2.5,2.5)) +
ggtitle("YouTube Communities")
```

The original `LargeVis`

paper used Euclidean distances exclusively. The `largeVis`

package offers a choice between Euclidean and Cosine distance measures.

The implementation is not optimized for cosine distances.

The algorithm is necessarily memory-intensive for large datasets.

A simple way to reduce peak memory usage, is to turn-off the `save_neighbors`

parameter when running `vis`

. If this is insufficient, the steps of the algorithm can be run separately with the `neighborsToVectors`

, `distance`

, and `buildEdgeMatrix`

functions. In this case, the workflow is:

```
neighbors <- randomProjectionTreeSearch(largeDataset)
edges <- buildEdgeMatrix(data = largeDataset, neighbors = neighbors)
rm(neighbors)
gc()
wij <- buildWijMatrix(edges)
rm(edges)
gc()
coords <- projectKNNs(wij)
```

Note that `gc()`

is being called explicitly. The reason is that R will not collect garbage while executing the package’s C++ functions, which can require substantial temporary RAM.

Memory requirements during the neighbor search may be managed by reducing `n_trees`

and increasing the `tree_threshold`

. The decrease in precision is marginal, and may be compensated-for by increasing `max_iters`

. See the benchmarks vignette for further detail.

Tang, Jian, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. 2016. “Visualizing Large-Scale and High-Dimensional Data.” In *Proceedings of the 25th International Conference on World Wide Web*, 287–97. International World Wide Web Conferences Steering Committee.