# Why textmineR?

textmineR was created with three principles in mind:

1. Maximize interoperability within R’s ecosystem
2. Scale well in terms of object storage and computation time
3. Use syntax that is idiomatic to R

R has many packages for text mining and natural language processing (NLP). The CRAN task view on natural language processing lists 53 unique packages. Some of these packages are interoperable. Some are not.

textmineR strives for maximum interoperability in three ways. First, it uses the dgCMatrix class from the popular Matrix package for document term matrices (DTMs) and term co-occurrence matrices (TCMs). The Matrix package is an R “recommended” package with nearly 500 packages that depend on, import, or suggest it. Compare that to the slam package used by tm and its derivatives: slam has an order of magnitude fewer dependents and is simply not as well integrated. Matrix also has methods that make the syntax for manipulating its matrices nearly identical to base R, which greatly reduces the cognitive burden on programmers.
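As a rough illustration of that last point, the sketch below builds a small dgCMatrix by hand and manipulates it with the same calls one would use on a base R matrix. The object `toy_dtm` and its contents are made up for this example; only the Matrix package is assumed.

```r
library(Matrix)

# a hypothetical 3-document, 4-term DTM stored as a sparse dgCMatrix
toy_dtm <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),   # row (document) index of each non-zero cell
  j = c(1, 2, 2, 3, 4),   # column (term) index of each non-zero cell
  x = c(2, 1, 3, 1, 1),   # the counts themselves
  dims = c(3, 4),
  dimnames = list(c("doc_1", "doc_2", "doc_3"),
                  c("cat", "dog", "fish", "bird"))
)

class(toy_dtm)    # a dgCMatrix, not a base matrix
dim(toy_dtm)      # same call as for a base matrix
colSums(toy_dtm)  # total count of each term

# indexing by name and by logical vector works as in base R
toy_dtm[ "doc_1" , ]
toy_dtm[ , colSums(toy_dtm) > 1 ]

# converting to a base matrix gives the same numbers
dense <- as.matrix(toy_dtm)
all.equal(colSums(dense), colSums(toy_dtm))
```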

Second, textmineR relies on base R objects for corpus and metadata storage. Actually, it relies on the user to do so. textmineR’s core functions CreateDtm and CreateTcm take a simple character vector as input. Users may store their corpora as character vectors, lists, or data frames. There is no need to learn a new ‘Corpus’ class.
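For instance, a corpus kept in a data frame or a list can be reduced to a named character vector with base R alone. This is a minimal sketch with made-up documents; `corpus_df`, `corpus_list`, and `doc_vec` are hypothetical names chosen for the example.

```r
# a hypothetical corpus stored in a data frame, with metadata next to the text
corpus_df <- data.frame(
  id = c("a1", "a2", "a3"),
  year = c(2015, 2016, 2016),
  text = c("Sparse matrices save memory.",
           "Base R objects are easy to share.",
           "Character vectors are enough."),
  stringsAsFactors = FALSE
)

# the text column is already a character vector; names carry the document ids
doc_vec <- corpus_df$text
names(doc_vec) <- corpus_df$id

# the same corpus stored as a named list converts back with unlist()
corpus_list <- as.list(doc_vec)
doc_vec_from_list <- unlist(corpus_list)

identical(doc_vec, doc_vec_from_list)

# metadata stays in the data frame and can be used to subset documents
doc_vec[ corpus_df$year == 2016 ]
```

A vector like `doc_vec` is the kind of input CreateDtm takes through its `doc_vec` argument, with `names(doc_vec)` available for `doc_names`.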

Third and last, textmineR represents the output of topic models in a consistent way: a list containing two matrices. This is described in more detail in the next section. Several topic models are supported, and the simple representation means that textmineR’s utility functions are usable with outputs from other packages, so long as they are represented as matrices of probabilities.

textmineR achieves scalability through three means. First, sparse matrices (like the dgCMatrix) offer significant memory savings. Second, textmineR utilizes Rcpp throughout for speedup. Finally, textmineR uses parallel processing by default where possible. textmineR offers a function TmParallelApply which implements a framework for parallel processing that is syntactically agnostic between Windows and Unix-like operating systems. TmParallelApply is used liberally within textmineR and is exposed for users.
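To give a sense of the memory savings from the first point, the sketch below compares a sparse matrix with its dense equivalent. The dimensions and the 1% density are arbitrary assumptions, not properties of any real corpus, and the exact sizes will vary by system.

```r
library(Matrix)

set.seed(42)
n_docs <- 1000
n_terms <- 2000

# a hypothetical DTM in which roughly 1% of the cells are non-zero
sparse_dtm <- rsparsematrix(nrow = n_docs, ncol = n_terms, density = 0.01,
                            rand.x = function(n) rpois(n, 1) + 1)

# the same numbers stored as an ordinary dense matrix
dense_dtm <- as.matrix(sparse_dtm)

sparse_size <- object.size(sparse_dtm)
dense_size <- object.size(dense_dtm)

format(sparse_size, units = "MB")
format(dense_size, units = "MB")
```

The dense matrix stores every zero explicitly, so its size grows with the number of cells, while the sparse matrix grows with the number of non-zero entries.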

textmineR does trade some performance for syntactic simplicity. It is designed to run on a single node in a cluster computing environment, and it can (and by default will) use all available cores of that node. If performance is your number one concern, see text2vec. textmineR uses some text2vec functionality under the hood.

textmineR strives for syntax that is idiomatic to R. This is, admittedly, a nebulous concept. textmineR does not create new classes where existing R classes exist. It strives for a functional programming paradigm. And it attempts to group closely-related sequential steps into single functions. This means that users will not have to make several temporary objects along the way. As an example, compare making a document term matrix in textmineR (example below) with tm or text2vec.

As a side note: textmineR’s framework for NLP does not need to be exclusive to textmineR. Text mining packages in R can be interoperable with a few concepts. First, use dgCMatrix for DTMs and TCMs. Second, write most text mining models in a way that they can take a dgCMatrix as the input. Finally, keep non-base R classes to a minimum, especially for corpus and metadata management.

# Corpus management

### Creating a DTM

The basic object of analysis for most text mining applications is a document term matrix, or DTM. This is a matrix where every row represents a document and every column represents a token (word, bi-gram, stem, etc.).
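To make the definition concrete, here is a minimal sketch that builds a tiny DTM by hand using only base R. The three documents are made up, and the whitespace tokenizer is a deliberate simplification; it is not how CreateDtm tokenizes text.

```r
# three hypothetical documents
docs <- c(doc_1 = "the cat sat",
          doc_2 = "the dog sat down",
          doc_3 = "cat and dog")

# naive tokenization: lowercase, then split on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# the vocabulary is the set of unique tokens across all documents
vocab <- sort(unique(unlist(tokens)))

# count each vocabulary term in each document
counts <- vapply(tokens,
                 function(x) as.integer(table(factor(x, levels = vocab))),
                 integer(length(vocab)))

# rows are documents, columns are tokens
toy_dtm <- t(counts)
rownames(toy_dtm) <- names(docs)
colnames(toy_dtm) <- vocab

toy_dtm
```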

You can create a DTM with textmineR by passing a character vector. There are options for stopword removal, creation of n-grams, and other standard data cleaning. There is an option for passing a stemming or lemmatization function if you desire. (See help(CreateDtm) for an example using Porter’s word stemmer.)

The code below uses a dataset of movie reviews included with the text2vec package. This dataset is used for sentiment analysis. In addition to the text of the reviews, there is a binary variable indicating positive or negative sentiment. More on this later…

library(textmineR)
#>
#> Attaching package: 'textmineR'
#> The following object is masked from 'package:Matrix':
#>
#>     update
#> The following object is masked from 'package:stats':
#>
#>     update

# load movie_review dataset from text2vec
data(movie_review, package = "text2vec")

str(movie_review)
#> 'data.frame':    5000 obs. of  3 variables:
#>  $ id       : chr  "5814_8" "2381_9" "7759_3" "3630_4" ...
#>  $ sentiment: int  1 1 0 0 1 1 0 0 0 1 ...
#>  $ review   : chr  "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd docu"| __truncated__ "\"The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great"| __truncated__ "The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A s"| __truncated__ "It must be assumed that those who praised this film (\"the greatest filmed opera ever,\" didn't I read some"| __truncated__ ...

# let's take a sample so the demo will run quickly
# note: textmineR is generally quite scalable, depending on your system
set.seed(123)
s <- sample(1:nrow(movie_review), 500)

movie_review <- movie_review[ s , ]

# create a document term matrix
dtm <- CreateDtm(doc_vec = movie_review$review, # character vector of documents
                 doc_names = movie_review$id, # document names, optional
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # stopwords from tm
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # by default, this will be the max number of cpus available

Even though a dgCMatrix isn’t a traditional matrix, it has methods that make it similar to standard R matrices.

dim(dtm)
#> [1]   500 55459

nrow(dtm)
#> [1] 500

ncol(dtm)
#> [1] 55459

head(colnames(dtm))
#> [1] "making_debut"       "bureau_lowly"       "scenes_thing"
#> [4] "injections"         "frying_pan"         "renounced_assassin"

head(rownames(dtm))
#> [1] "2595_9"  "8892_2"  "8620_8"  "2892_10" "232_1"   "4364_1"

# Basic corpus statistics

The code below performs some basic corpus statistics. textmineR has a built-in function for getting term frequencies across the corpus. This function, TermDocFreq, gives term frequencies (equivalent to colSums(dtm)), the number of documents in which each term appears (equivalent to colSums(dtm > 0)), and an inverse-document frequency (IDF) vector. The IDF vector can be used to create a TF-IDF matrix.

# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)

str(tf_mat)
#> 'data.frame':    55459 obs. of  4 variables:
#>  $ term     : chr  "making_debut" "bureau_lowly" "scenes_thing" "injections" ...
#>  $ term_freq: num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ doc_freq : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ idf      : num  6.21 6.21 6.21 6.21 6.21 ...

# look at the most frequent tokens
head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)

Ten most frequent tokens

| term | term_freq | doc_freq | idf |
|:-----|----------:|---------:|----------:|
| br | 2148 | 312 | 0.4716049 |
| br_br | 1078 | 312 | 0.4716049 |
| movie | 878 | 310 | 0.4780358 |
| film | 835 | 284 | 0.5656339 |
| good | 333 | 203 | 0.9014021 |
| story | 277 | 167 | 1.0966143 |
| time | 271 | 180 | 1.0216512 |
| great | 195 | 138 | 1.2873544 |

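The three quantities that TermDocFreq returns can be reproduced by hand on a toy matrix, as sketched below. The IDF formula used here, the natural log of the number of documents divided by the document frequency, is an assumption; it does reproduce the printed values (for example, log(500 / 312) ≈ 0.4716). The TF-IDF step at the end is one common way to apply the IDF vector, not necessarily the only one. `toy_dtm` and `term_stats` are hypothetical names.

```r
library(Matrix)

# a hypothetical 3-document, 4-term DTM
toy_dtm <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 2, 2, 3, 4),
  x = c(2, 1, 3, 1, 1),
  dims = c(3, 4),
  dimnames = list(c("doc_1", "doc_2", "doc_3"),
                  c("cat", "dog", "fish", "bird"))
)

# a data frame with the same columns that str(tf_mat) shows
term_stats <- data.frame(
  term = colnames(toy_dtm),
  term_freq = colSums(toy_dtm),       # total count of each term
  doc_freq = colSums(toy_dtm > 0),    # number of documents containing each term
  idf = log(nrow(toy_dtm) / colSums(toy_dtm > 0)),  # assumed IDF formula
  stringsAsFactors = FALSE
)

term_stats

# TF-IDF: multiply every column of the DTM by that term's IDF
# (transpose so that the IDF vector recycles along terms, then transpose back)
tfidf <- t(t(toy_dtm) * term_stats$idf)

tfidf
```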
# look at the most frequent bigrams
tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]

head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)

Ten most frequent bi-grams

| term | term_freq | doc_freq | idf |
|:-----|----------:|---------:|----------:|
| br_br | 1078 | 312 | 0.4716049 |
| br_film | 48 | 41 | 2.5010360 |
| br_movie | 41 | 36 | 2.6310892 |
| film_br | 32 | 26 | 2.9565116 |
| movie_br | 29 | 27 | 2.9187712 |
| special_effects | 21 | 19 | 3.2701691 |
| good_movie | 16 | 15 | 3.5065579 |
| long_time | 15 | 15 | 3.5065579 |
| high_school | 15 | 10 | 3.9120230 |
| scooby_doo | 15 | 1 | 6.2146081 |

It looks like we have stray HTML tags (“<br>”) in the documents. These aren’t giving us any relevant information about content. (Except, perhaps, that these documents were originally part of web pages.)

The most intuitive approach, perhaps, is to strip these tags from our documents, re-construct a document term matrix, and re-calculate the objects as above. A simpler approach is to remove the tokens containing “br” from the DTM we already calculated. This is much more computationally efficient and gives us essentially the same result.
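When filtering column names this way, it helps to anchor the pattern so that ordinary words containing the letters “br” survive. The sketch below shows the difference on a made-up vocabulary, using base R’s `grepl` in place of stringr; `vocab` and the other names are hypothetical.

```r
# a hypothetical vocabulary mixing tag residue with ordinary words
vocab <- c("br", "br_br", "br_film", "film_br",
           "bride", "bride_groom", "cobra", "film")

# anchored: "br" as a whole unigram, or as either half of a bigram
anchored <- "(^br$)|(_br$)|(^br_)"
drop_anchored <- grepl(anchored, vocab)

# naive: any token containing the letters "br"
drop_naive <- grepl("br", vocab)

vocab[ ! drop_anchored ]  # keeps "bride", "bride_groom", "cobra", "film"
vocab[ ! drop_naive ]     # keeps only "film"
```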

# remove offending tokens from the DTM
dtm <- dtm[ , ! stringr::str_detect(colnames(dtm),
"(^br$)|(_br$)|(^br_)") ]

# re-construct tf_mat and tf_bigrams
tf_mat <- TermDocFreq(dtm)

tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]

head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)
#>        term term_freq doc_freq       idf
#> movie movie       878      310 0.4780358
#> film   film       835      284 0.5656339
#> good   good       333      203 0.9014021
#> story story       277      167 1.0966143
#> time   time       271      180 1.0216512
#> great great       195      138 1.2873544
#> watch watch       158      127 1.3704210
#> films films       153       93 1.6820086

Ten most frequent terms, ‘<br>’ removed

| term | term_freq | doc_freq | idf |
|:-----|----------:|---------:|----------:|
| movie | 878 | 310 | 0.4780358 |
| film | 835 | 284 | 0.5656339 |
| good | 333 | 203 | 0.9014021 |
| story | 277 | 167 | 1.0966143 |
| time | 271 | 180 | 1.0216512 |
| great | 195 | 138 | 1.2873544 |
| watch | 158 | 127 | 1.3704210 |
| films | 153 | 93 | 1.6820086 |

head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)

Ten most frequent bi-grams, ‘<br>’ removed

| term | term_freq | doc_freq | idf |
|:-----|----------:|---------:|---------:|
| special_effects | 21 | 19 | 3.270169 |
| good_movie | 16 | 15 | 3.506558 |
| long_time | 15 | 15 | 3.506558 |
| high_school | 15 | 10 | 3.912023 |
| scooby_doo | 15 | 1 | 6.214608 |
| low_budget | 15 | 13 | 3.649659 |
| watch_movie | 14 | 13 | 3.649659 |
| make_film | 14 | 13 | 3.649659 |
| years_ago | 14 | 13 | 3.649659 |
| movie_good | 13 | 13 | 3.649659 |

We can also calculate how many tokens each document contains from the DTM. Note that this reflects the modifications we made in constructing the DTM (removing stop words, punctuation, numbers, etc.).

# summary of document lengths
doc_lengths <- rowSums(dtm)

summary(doc_lengths)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    23.0    96.0   140.5   186.1   245.0   768.0

Often, it’s useful to prune your vocabulary and remove any tokens that appear in a small number of documents. This will greatly reduce the vocabulary size (see Zipf’s law) and improve computation time.

# remove any tokens that were in 3 or fewer documents
dtm <- dtm[ , colSums(dtm > 0) > 3 ] # alternatively: dtm[ , tf_mat$doc_freq > 3 ]

tf_mat <- tf_mat[ tf_mat$term %in% colnames(dtm) , ]

tf_bigrams <- tf_bigrams[ tf_bigrams$term %in% colnames(dtm) , ]
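Pruning by the number of documents a token appears in is not the same as pruning by its total count. The sketch below, on a made-up matrix with a lower threshold than the one used above, shows a token that is frequent overall but confined to a single document. All object names are hypothetical.

```r
library(Matrix)

# a hypothetical 4-document, 3-term DTM
m <- matrix(c(5, 0, 0, 0,    # appears 5 times, but only in doc_1
              1, 1, 1, 0,    # appears once in each of three documents
              0, 1, 0, 1),   # appears once in each of two documents
            nrow = 4,
            dimnames = list(paste0("doc_", 1:4),
                            c("rare_but_repeated", "common", "medium")))

toy_dtm <- Matrix(m, sparse = TRUE)

# document lengths are row sums, as for the full DTM
doc_lengths <- rowSums(toy_dtm)

# keep tokens that appear in more than one document
keep_by_docs <- colSums(toy_dtm > 0) > 1

# a filter on total counts would keep a different set of tokens
keep_by_count <- colSums(toy_dtm) > 1

pruned <- toy_dtm[ , keep_by_docs, drop = FALSE ]

colnames(pruned)               # "common" "medium"
colnames(toy_dtm)[ keep_by_count ]  # all three tokens
```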

The movie review data set contains more than just text of reviews. It also contains a variable tagging the review as positive (movie_review$sentiment $$=1$$) or negative (movie_review$sentiment $$=0$$). We can examine terms associated with positive and negative reviews. If we wanted, we could use them to build a simple classifier.

However, as we will see immediately below, looking at only the most frequent terms in each category is not helpful. Because of Zipf’s law, the most frequent terms in just about any category will be the same.

# what words are most associated with sentiment?
tf_sentiment <- list(positive = TermDocFreq(dtm[ movie_review$sentiment == 1 , ]),
                     negative = TermDocFreq(dtm[ movie_review$sentiment == 0 , ]))

These are basically the same. Not helpful at all.

head(tf_sentiment$positive[ order(tf_sentiment$positive$term_freq, decreasing = TRUE) , ], 10)

Ten most-frequent positive tokens

| term | term_freq | doc_freq | idf |
|:-----|----------:|---------:|----------:|
| movie | 358 | 128 | 0.5990082 |
| film | 349 | 125 | 0.6227247 |
| story | 143 | 82 | 1.0443192 |
| good | 138 | 83 | 1.0321978 |
| time | 125 | 82 | 1.0443192 |
| great | 119 | 79 | 1.0815906 |
| watch | 82 | 59 | 1.3735010 |
| love | 71 | 49 | 1.5592182 |
| life | 69 | 49 | 1.5592182 |
| character | 69 | 53 | 1.4807465 |

head(tf_sentiment$negative[ order(tf_sentiment$negative$term_freq, decreasing = TRUE) , ], 10)

Ten most-frequent negative tokens

| term | term_freq | doc_freq | idf |
|:-----|----------:|---------:|----------:|
| movie | 520 | 182 | 0.3832420 |
| film | 486 | 159 | 0.5183445 |
| good | 195 | 120 | 0.7997569 |
| time | 146 | 98 | 1.0022812 |
| story | 134 | 85 | 1.1445974 |
| people | 104 | 68 | 1.3677410 |
| acting | 102 | 79 | 1.2178008 |
| make | 89 | 70 | 1.3387534 |

That was unhelpful. Instead, we need to re-weight the terms in each class. We’ll use a probabilistic reweighting, described below.

The most frequent words in each class are proportional to $$P(word|sentiment_j)$$. As we saw above, that puts the words in roughly the same order as $$P(word)$$ overall. However, we can use the difference in those probabilities to get a new order. That difference is

\begin{align} P(word|sentiment_j) - P(word) \end{align}

You can interpret the difference in (1) as follows: Positive values mean a word is more probable in the sentiment class than in the corpus overall. Negative values mean it is less probable. Values close to zero indicate words that are statistically independent of sentiment. Since most of the top words are the same when we sort by $$P(word|sentiment_j)$$, these words are statistically independent of sentiment and get forced towards zero.

For those paying close attention, this difference should give a similar ordering to pointwise mutual information (PMI), defined as $$PMI = \log\frac{P(word|sentiment_j)}{P(word)}$$. However, I prefer the difference as it is bounded between $$-1$$ and $$1$$.
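A small worked example may help here. The counts below are invented, and PMI is computed with a natural logarithm, which is the usual convention but an assumption as far as this document goes. The point is only that a word used equally often everywhere scores zero under both measures, while the two measures rank the remaining words the same way.

```r
# hypothetical word counts in the whole corpus and in the positive class
counts_all <- c(movie = 50, great = 10, awful = 10)
counts_pos <- c(movie = 25, great = 9, awful = 1)

# P(word) and P(word | positive)
p_word <- counts_all / sum(counts_all)
p_word_given_pos <- counts_pos / sum(counts_pos)

# the difference described above
prob_lift <- p_word_given_pos - p_word

# pointwise mutual information, assuming the log-ratio definition
pmi <- log(p_word_given_pos / p_word)

round(prob_lift, 4)
round(pmi, 4)

# both measures order the words the same way in this toy case
names(sort(prob_lift, decreasing = TRUE))
names(sort(pmi, decreasing = TRUE))
```

Here “movie” makes up the same share of the positive class as of the whole corpus, so its difference and its PMI are both zero, even though it is the most frequent word by far.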

The difference method is applied to both words overall and bi-grams in the code below.


# let's reweight by probability by class
p_words <- colSums(dtm) / sum(dtm) # alternatively: tf_mat$term_freq / sum(tf_mat$term_freq)

tf_sentiment$positive$conditional_prob <-
tf_sentiment$positive$term_freq / sum(tf_sentiment$positive$term_freq)

tf_sentiment$positive$prob_lift <- tf_sentiment$positive$conditional_prob - p_words

tf_sentiment$negative$conditional_prob <-
tf_sentiment$negative$term_freq / sum(tf_sentiment$negative$term_freq)

tf_sentiment$negative$prob_lift <- tf_sentiment$negative$conditional_prob - p_words
# let's look again with new weights
head(tf_sentiment$positive[ order(tf_sentiment$positive$prob_lift, decreasing = TRUE) , ], 10)

Reweighted: ten most relevant terms for positive sentiment

| term | term_freq | doc_freq | idf | conditional_prob | prob_lift |
|:-----|----------:|---------:|---------:|-----------------:|----------:|
| great | 119 | 79 | 1.081591 | 0.0081168 | 0.0022971 |
| heart | 42 | 17 | 2.617825 | 0.0028647 | 0.0015217 |
| story | 143 | 82 | 1.044319 | 0.0097538 | 0.0014868 |
| life | 69 | 49 | 1.559218 | 0.0047064 | 0.0012444 |
| excellent | 38 | 33 | 1.954531 | 0.0025919 | 0.0012191 |
| beautiful | 39 | 28 | 2.118834 | 0.0026601 | 0.0011977 |
| find | 51 | 41 | 1.737466 | 0.0034786 | 0.0009418 |
| world | 49 | 38 | 1.813452 | 0.0033422 | 0.0008950 |
| watch | 82 | 59 | 1.373501 | 0.0055931 | 0.0008776 |
| years | 60 | 43 | 1.689838 | 0.0040925 | 0.0008693 |

head(tf_sentiment$negative[ order(tf_sentiment$negative$prob_lift, decreasing = TRUE) , ], 10)

Reweighted: ten most relevant terms for negative sentiment

| term | term_freq | doc_freq | idf | conditional_prob | prob_lift |
|:-----|----------:|---------:|----------:|-----------------:|----------:|
| movie | 520 | 182 | 0.3832420 | 0.0275921 | 0.0013886 |
| people | 104 | 68 | 1.3677410 | 0.0055184 | 0.0011313 |
| worst | 55 | 48 | 1.7160476 | 0.0029184 | 0.0011277 |
| script | 62 | 50 | 1.6752257 | 0.0032898 | 0.0009023 |
| acting | 102 | 79 | 1.2178008 | 0.0054123 | 0.0008759 |
| film | 486 | 159 | 0.5183445 | 0.0257880 | 0.0008678 |
| guy | 55 | 33 | 2.0907411 | 0.0029184 | 0.0008591 |
| thing | 66 | 56 | 1.5618970 | 0.0035021 | 0.0007564 |
| awful | 35 | 24 | 2.4091948 | 0.0018572 | 0.0007529 |

# what about bi-grams?
tf_sentiment_bigram <- lapply(tf_sentiment, function(x){
  x <- x[ stringr::str_detect(x$term, "_") , ]
  x[ order(x$prob_lift, decreasing = TRUE) , ]
})

head(tf_sentiment_bigram$positive, 10)

Reweighted: ten most relevant bigrams for positive sentiment

| term | term_freq | doc_freq | idf | conditional_prob | prob_lift |
|:-----|----------:|---------:|---------:|-----------------:|----------:|
| highly_recommend | 11 | 11 | 3.053143 | 0.0007503 | 0.0003922 |
| big_screen | 8 | 5 | 3.841601 | 0.0005457 | 0.0002771 |
| real_life | 9 | 8 | 3.371597 | 0.0006139 | 0.0002557 |
| world_war | 8 | 5 | 3.841601 | 0.0005457 | 0.0002174 |
| watched_movie | 7 | 7 | 3.505128 | 0.0004775 | 0.0002089 |
| enjoy_watching | 6 | 6 | 3.659279 | 0.0004092 | 0.0002003 |
| years_ago | 9 | 8 | 3.371597 | 0.0006139 | 0.0001961 |
| makes_movie | 5 | 5 | 3.841601 | 0.0003410 | 0.0001918 |
| loved_movie | 5 | 5 | 3.841601 | 0.0003410 | 0.0001918 |
| movie_worth | 6 | 6 | 3.659279 | 0.0004092 | 0.0001705 |

head(tf_sentiment_bigram$negative, 10)

Reweighted: ten most relevant bigrams for negative sentiment

| term | term_freq | doc_freq | idf | conditional_prob | prob_lift |
|:-----|----------:|---------:|---------:|-----------------:|----------:|
| good_thing | 11 | 11 | 3.189353 | 0.0005837 | 0.0002554 |
| waste_time | 12 | 11 | 3.189353 | 0.0006367 | 0.0002488 |