After Tomas Mikolov et al. released the word2vec tool, there was a boom of articles about word vector representations. One of the best of these articles is Stanford’s GloVe: Global Vectors for Word Representation, which explained why such algorithms work and reformulated word2vec optimizations as a special kind of factorization for word co-occurrence matrices.

Here I will briefly introduce the GloVe algorithm and show how to use its text2vec implementation.

The GloVe algorithm consists of the following steps:

1. Collect word co-occurrence statistics in the form of a word co-occurrence matrix \(X\). Each element \(X_{ij}\) of this matrix represents how often word *i* appears in the context of word *j*. Usually we scan our corpus in the following manner: for each term we look for context terms within some area defined by a *window_size* before the term and a *window_size* after the term. We also give less weight to more distant words, usually using this formula: \[decay = 1/offset\]

2. Define soft constraints for each word pair: \[w_i^T w_j + b_i + b_j = \log(X_{ij})\] Here \(w_i\) is the vector for the main word, \(w_j\) is the vector for the context word, and \(b_i\), \(b_j\) are scalar biases for the main and context words.

3. Define a cost function \[J = \sum_{i=1}^V \sum_{j=1}^V \; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \log X_{ij})^2\] Here \(f\) is a weighting function which helps us to prevent learning only from extremely common word pairs. The GloVe authors choose the following function:

   \[ f(X_{ij}) = \begin{cases} (\frac{X_{ij}}{x_{max}})^\alpha & \text{if } X_{ij} < x_{max} \\ 1 & \text{otherwise} \end{cases} \]
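
To make the weighting function and the cost concrete, here is a minimal base R sketch on a toy matrix. It is not text2vec's implementation: the function names `glove_weight` and `glove_cost` are hypothetical, `x_max = 10` is simply an illustrative choice, and `alpha = 0.75` is the value suggested in the GloVe paper. Since \(f(0) = 0\), the sketch sums only over observed (non-zero) pairs, which also avoids taking \(\log 0\).

```r
# Weighting function f: grows as (x / x_max)^alpha, capped at 1.
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Cost J for given word vectors and biases (illustrative, dense matrices only).
glove_cost <- function(X, W_main, W_context, b_main, b_context,
                       x_max = 10, alpha = 0.75) {
  nz <- which(X > 0, arr.ind = TRUE)  # indices (i, j) of observed pairs
  i <- nz[, "row"]
  j <- nz[, "col"]
  # w_i^T w_j + b_i + b_j for every observed pair
  pred <- rowSums(W_main[i, , drop = FALSE] * W_context[j, , drop = FALSE]) +
    b_main[i] + b_context[j]
  sum(glove_weight(X[nz], x_max, alpha) * (pred - log(X[nz]))^2)
}

# Toy co-occurrence matrix for 3 words and random 2-dimensional vectors
X_toy <- matrix(c( 0, 4, 20,
                   4, 0,  1,
                  20, 1,  0), nrow = 3, byrow = TRUE)
set.seed(1)
W_main    <- matrix(rnorm(6, sd = 0.1), nrow = 3)
W_context <- matrix(rnorm(6, sd = 0.1), nrow = 3)
toy_cost  <- glove_cost(X_toy, W_main, W_context, rep(0, 3), rep(0, 3))
```

Note how a pair seen 20 times gets the same weight (1) as any pair at or above `x_max`, while a pair seen once is down-weighted to `0.1^0.75`.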

Now let’s examine how GloVe embeddings work. As is commonly known, word2vec word vectors capture many linguistic regularities. To give the canonical example, if we take word vectors for the words “paris,” “france,” and “germany” and perform the following operation:

\[vector("paris") - vector("france") + vector("germany")\]

the resulting vector will be close to the vector for “berlin.”

Let’s download the same Wikipedia data used as a demo by word2vec:

```
library(text2vec)
# Download and unpack the text8 corpus only if it is not already present
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip("~/text8.zip", files = "text8", exdir = "~/")
}
# Read the first line of the file (n = 1)
wiki = readLines(text8_file, n = 1, warn = FALSE)
```

In the next step we will create a vocabulary, a set of words for which we want to learn word vectors. Note that all of text2vec’s functions which operate on raw text data (`create_vocabulary`, `create_corpus`, `create_dtm`, `create_tcm`) have a streaming API, so you should pass an iterator over tokens as the first argument to these functions.

```
# Tokenize the text by splitting on spaces
tokens <- space_tokenizer(wiki)
# Create iterator over tokens
it = itoken(tokens, progressbar = FALSE)
# Create vocabulary. Terms will be unigrams (simple words).
vocab <- create_vocabulary(it)
```

These words should not be too uncommon. For example, we cannot calculate a meaningful word vector for a word which we saw only once in the entire corpus. Here we will take only words which appear at least five times. text2vec provides additional options to filter the vocabulary (see `?prune_vocabulary`).

`vocab <- prune_vocabulary(vocab, term_count_min = 5L)`
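
Conceptually, pruning by a minimum term count just drops rare terms. The base R sketch below shows the idea on a toy token vector; it is only an illustration, not how `prune_vocabulary` is implemented, and the names `toy_tokens`, `toy_counts` and `toy_kept` are hypothetical.

```r
# Toy corpus: "a" appears 3 times, "b" twice, "c" once
toy_tokens <- c("a", "b", "a", "c", "a", "b")

# Count how often each term occurs
toy_counts <- table(toy_tokens)

# Keep only terms seen at least `min_count` times
min_count <- 2L
toy_kept <- names(toy_counts)[as.vector(toy_counts) >= min_count]
toy_kept
```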

Now we have 71,290 terms in the vocabulary and are ready to construct the term-co-occurrence matrix (TCM).

```
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
```
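
To see what a window with `1 / offset` weighting produces, here is a small base R sketch that builds a dense co-occurrence matrix for a toy sentence. It is an assumption-laden illustration of the scanning scheme described earlier, not text2vec's actual implementation (which works on sparse matrices and may store the result differently); `toy_tcm` and `toy_sentence` are hypothetical names.

```r
toy_tcm <- function(tokens, window = 5L) {
  terms <- sort(unique(tokens))
  X <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n <- length(tokens)
  for (pos in seq_len(n)) {
    for (offset in seq_len(window)) {
      ctx <- pos + offset
      if (ctx > n) break
      w <- 1 / offset  # more distant context words get less weight
      # Record the pair in both directions, which is equivalent to
      # looking `window` words before and after each term.
      X[tokens[pos], tokens[ctx]] <- X[tokens[pos], tokens[ctx]] + w
      X[tokens[ctx], tokens[pos]] <- X[tokens[ctx], tokens[pos]] + w
    }
  }
  X
}

toy_sentence <- c("the", "cat", "sat", "on", "the", "mat")
X_small <- toy_tcm(toy_sentence, window = 2L)
X_small
```

With a window of 2, "cat" and "on" are two positions apart, so they receive a weight of 0.5, while "cat" and "mat" never fall into the same window and stay at 0.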

Now we have a TCM and can factorize it via the GloVe algorithm.

text2vec uses a parallel stochastic gradient descent algorithm. By default it will use all cores on your machine, but you can specify the number of cores if you wish. For example, to use 4 threads, call `RcppParallel::setThreadOptions(numThreads = 4)`.

Let’s fit our model. (It can take several minutes to fit!)

```
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
# 2016-10-03 10:09:14 - epoch 1, expected cost 0.0893
# 2016-10-03 10:09:17 - epoch 2, expected cost 0.0608
# 2016-10-03 10:09:19 - epoch 3, expected cost 0.0537
# 2016-10-03 10:09:22 - epoch 4, expected cost 0.0499
# 2016-10-03 10:09:25 - epoch 5, expected cost 0.0475
# 2016-10-03 10:09:28 - epoch 6, expected cost 0.0457
# 2016-10-03 10:09:30 - epoch 7, expected cost 0.0443
# 2016-10-03 10:09:33 - epoch 8, expected cost 0.0431
# 2016-10-03 10:09:36 - epoch 9, expected cost 0.0423
# 2016-10-03 10:09:39 - epoch 10, expected cost 0.0415
# 2016-10-03 10:09:42 - epoch 11, expected cost 0.0408
# 2016-10-03 10:09:44 - epoch 12, expected cost 0.0403
# 2016-10-03 10:09:47 - epoch 13, expected cost 0.0400
# 2016-10-03 10:09:50 - epoch 14, expected cost 0.0395
# 2016-10-03 10:09:53 - epoch 15, expected cost 0.0391
# 2016-10-03 10:09:56 - epoch 16, expected cost 0.0388
# 2016-10-03 10:09:59 - epoch 17, expected cost 0.0385
# 2016-10-03 10:10:02 - epoch 18, expected cost 0.0383
# 2016-10-03 10:10:05 - epoch 19, expected cost 0.0380
# 2016-10-03 10:10:08 - epoch 20, expected cost 0.0378
```

Alternatively, we can train the model with R’s `S3` interface (but keep in mind that all text2vec models are R6 classes and they are mutable, so the `fit` and `fit_transform` methods modify the models!):

```
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
# `glove` object will be modified by `fit()` call !
fit(tcm, glove, n_iter = 20)
```
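
If mutability is unfamiliar, the following base R sketch mimics it with a plain environment (R6 objects are built on environments). It does not use R6 or text2vec; `make_toy_model` and its `n_fits` field are hypothetical and exist only to show reference semantics.

```r
# A toy "model" with reference semantics: an environment holding state
make_toy_model <- function() {
  self <- new.env()
  self$n_fits <- 0L
  self$fit <- function() {
    self$n_fits <- self$n_fits + 1L  # modifies the shared object in place
    invisible(self)
  }
  self
}

m1 <- make_toy_model()
m2 <- m1      # NOT a copy: both names refer to the same object
m2$fit()
m1$n_fits     # the change made through `m2` is visible through `m1`
```

This is why calling `fit` changes the model object itself instead of returning a modified copy.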

And now we get the word vectors:

`word_vectors <- glove$get_word_vectors()`

We can find the closest word vectors for our *paris - france + germany* example:

```
berlin <- word_vectors["paris", , drop = FALSE] -
  word_vectors["france", , drop = FALSE] +
  word_vectors["germany", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
# berlin paris munich leipzig germany
# 0.8015347 0.7623165 0.7013252 0.6616945 0.6540700
```
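
For intuition, cosine similarity between each row of a matrix and a query vector can be written in a few lines of base R. This is only a sketch of the quantity being ranked above, not the `sim2` implementation; `cosine_to_query` and the toy vectors are hypothetical.

```r
# Cosine similarity between every row of M and the query vector q
cosine_to_query <- function(M, q) {
  q <- as.numeric(q)
  drop(M %*% q) / (sqrt(rowSums(M^2)) * sqrt(sum(q^2)))
}

# Three toy 2-dimensional "word vectors"
toy_vectors <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1))

# The query points in the same direction as row "c"
toy_sims <- cosine_to_query(toy_vectors, c(2, 2))
sort(toy_sims, decreasing = TRUE)
```

Because only the direction matters, the query `c(2, 2)` is a perfect match for row "c" even though their lengths differ.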

You can achieve much better results by experimenting with `skip_grams_window` and the parameters of the `GloVe` class (including word vector size and the number of iterations). For more details and large-scale experiments on Wikipedia data, see this old post on my blog.