Converting to and from Document-Term Matrix and Corpus objects

Julia Silge and David Robinson

2017-06-19

Tidying document-term matrices

Many existing text mining datasets are stored as a DocumentTermMatrix object (a class from the tm package). For example, consider the corpus of 2246 Associated Press articles included in the topicmodels package:

library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

If we want to analyze this with tidy tools, we first need to turn it into a data frame with one row per document-term pair. The tidy function does this. (For more on the tidy verb, see the broom package.)

library(dplyr)
library(tidytext)

ap_td <- tidy(AssociatedPress)

Having the text in this tidy format is convenient for analysis with the tidytext package. For example, we can perform sentiment analysis on these newspaper articles using the Bing lexicon.

ap_sentiments <- ap_td %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))

ap_sentiments
## # A tibble: 30,094 x 4
##    document    term count sentiment
##       <int>   <chr> <dbl>     <chr>
##  1        1 assault     1  negative
##  2        1 complex     1  negative
##  3        1   death     1  negative
##  4        1    died     1  negative
##  5        1    good     2  positive
##  6        1 illness     1  negative
##  7        1  killed     2  negative
##  8        1    like     2  positive
##  9        1   liked     1  positive
## 10        1 miracle     1  positive
## # ... with 30,084 more rows

We can find the most negative documents:

library(tidyr)

ap_sentiments %>%
  count(document, sentiment, wt = count) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)
## # A tibble: 2,190 x 4
##    document negative positive sentiment
##       <int>    <dbl>    <dbl>     <dbl>
##  1     1251       54        6       -48
##  2     1380       53        5       -48
##  3      531       51        9       -42
##  4       43       45       11       -34
##  5     1263       44       10       -34
##  6     2178       40        6       -34
##  7      334       45       12       -33
##  8     1664       38        5       -33
##  9     2147       47       14       -33
## 10      516       38        6       -32
## # ... with 2,180 more rows

Or visualize which words contributed to positive and negative sentiment:

library(ggplot2)

ap_sentiments %>%
  count(sentiment, term, wt = count) %>%
  ungroup() %>%
  filter(n >= 150) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(term = reorder(term, n)) %>%
  ggplot(aes(term, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment")

Note that a tidier is also available for the dfm class from the quanteda package:

library(methods)

data("data_corpus_inaugural", package = "quanteda")
d <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)

d
## Document-feature matrix of: 58 documents, 9,232 features (91.6% sparse).
tidy(d)
## # A tibble: 44,725 x 3
##           document   term count
##              <chr>  <chr> <dbl>
##  1 1789-Washington fellow     3
##  2 1793-Washington fellow     1
##  3      1797-Adams fellow     3
##  4  1801-Jefferson fellow     7
##  5  1805-Jefferson fellow     8
##  6    1809-Madison fellow     1
##  7    1813-Madison fellow     1
##  8     1817-Monroe fellow     6
##  9     1821-Monroe fellow    10
## 10      1825-Adams fellow     3
## # ... with 44,715 more rows

Casting tidy text data into a DocumentTermMatrix

Some existing text mining tools or algorithms work only on sparse document-term matrices. Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices.

ap_td
## # A tibble: 302,031 x 3
##    document       term count
##       <int>      <chr> <dbl>
##  1        1     adding     1
##  2        1      adult     2
##  3        1        ago     1
##  4        1    alcohol     1
##  5        1  allegedly     1
##  6        1      allen     1
##  7        1 apparently     2
##  8        1   appeared     1
##  9        1   arrested     1
## 10        1    assault     1
## # ... with 302,021 more rows
# cast into a Document-Term Matrix
ap_td %>%
  cast_dtm(document, term, count)
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
# cast into a Term-Document Matrix
ap_td %>%
  cast_tdm(term, document, count)
## <<TermDocumentMatrix (terms: 10473, documents: 2246)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
# cast into quanteda's dfm
ap_td %>%
  cast_dfm(term, document, count)
## Document-feature matrix of: 10,473 documents, 2,246 features (98.7% sparse).
# cast into a Matrix object
m <- ap_td %>%
  cast_sparse(document, term, count)
class(m)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(m)
## [1]  2246 10473

This workflow makes it easy to read, filter, and process the data with dplyr and other tidy tools, then convert it back into a document-term matrix for machine learning applications, as sketched below.
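
For instance, here is a minimal sketch of that workflow: filtering out stop words while the data is in tidy form, then casting back into a document-term matrix for a topic model. The stop word filtering, the two-topic choice, and the seed are illustrative assumptions rather than part of the analysis above.

library(dplyr)
library(tidytext)

# drop stop words while the data is in tidy form
ap_filtered_dtm <- ap_td %>%
  anti_join(stop_words, by = c("term" = "word")) %>%
  cast_dtm(document, term, count)

# the filtered matrix can then be handed to a machine learning routine,
# for example a two-topic LDA model from the topicmodels package
# (the number of topics and the seed are arbitrary; fitting may take a while)
ap_lda <- topicmodels::LDA(ap_filtered_dtm, k = 2, control = list(seed = 1234))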

Tidying corpus data

You can also tidy Corpus objects from the tm package. For example, consider a Corpus of 20 Reuters news articles about crude oil, taken from the Reuters-21578 example data shipped with the tm package:

reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578),
                   readerControl = list(reader = readReut21578XMLasPlain))

reuters
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 20

The tidy verb creates a table with one row per document:

reuters_td <- tidy(reuters)
reuters_td
## # A tibble: 20 x 17
##                        author       datetimestamp description
##                         <chr>              <dttm>       <chr>
##  1                       <NA> 1987-02-26 10:00:56            
##  2 BY TED D'AFFLISIO, Reuters 1987-02-26 10:34:11            
##  3                       <NA> 1987-02-26 11:18:00            
##  4                       <NA> 1987-02-26 11:21:01            
##  5                       <NA> 1987-02-26 12:00:57            
##  6                       <NA> 1987-02-28 20:25:46            
##  7   By Jeremy Clift, Reuters 1987-02-28 20:39:14            
##  8                       <NA> 1987-02-28 22:27:27            
##  9                       <NA> 1987-03-01 01:22:30            
## 10                       <NA> 1987-03-01 11:31:44            
## 11                       <NA> 1987-03-01 18:05:49            
## 12                       <NA> 1987-03-02 00:39:23            
## 13                       <NA> 1987-03-02 00:43:22            
## 14                       <NA> 1987-03-02 00:43:41            
## 15                       <NA> 1987-03-02 01:25:42            
## 16                       <NA> 1987-03-02 04:20:05            
## 17                       <NA> 1987-03-02 04:28:26            
## 18                       <NA> 1987-03-02 05:13:46            
## 19 By BERNICE NAPACH, Reuters 1987-03-02 07:38:34            
## 20                       <NA> 1987-03-02 07:49:06            
## # ... with 14 more variables: heading <chr>, id <chr>, language <chr>,
## #   origin <chr>, topics <chr>, lewissplit <chr>, cgisplit <chr>,
## #   oldid <chr>, topics_cat <list>, places <list>, people <chr>,
## #   orgs <chr>, exchanges <chr>, text <chr>

Similarly, you can tidy a corpus object from the quanteda package:

library(quanteda)

data("data_corpus_inaugural")

data_corpus_inaugural
## Corpus consisting of 58 documents and 3 docvars.
inaug_td <- tidy(data_corpus_inaugural)
inaug_td
## # A tibble: 58 x 4
##                                                                           text
##  *                                                                       <chr>
##  1 "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmon
##  2 "Fellow citizens, I am again called upon by the voice of my country to exec
##  3 "When it was first perceived, in early times, that no middle course for Ame
##  4 "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the
##  5 "Proceeding, fellow citizens, to that qualification which the Constitution 
##  6 "Unwilling to depart from examples of the most revered authority, I avail m
##  7 "About to add the solemnity of an oath to the obligations imposed by a seco
##  8 "I should be destitute of feeling if I was not deeply affected by the stron
##  9 "Fellow citizens, I shall not attempt to describe the grateful emotions whi
## 10 "In compliance with an usage coeval with the existence of our Federal Const
## # ... with 48 more rows, and 3 more variables: Year <dbl>,
## #   President <chr>, FirstName <chr>

This lets us work with tidy tools like unnest_tokens to analyze the text alongside the metadata.

inaug_words <- inaug_td %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

inaug_words
## # A tibble: 50,156 x 4
##     Year President FirstName          word
##    <dbl>     <chr>     <chr>         <chr>
##  1  2017     Trump Donald J.         bleed
##  2  2017     Trump Donald J.     mysteries
##  3  2017     Trump Donald J.   complaining
##  4  2017     Trump Donald J.   unstoppable
##  5  2017     Trump Donald J.    solidarity
##  6  2017     Trump Donald J.       bedrock
##  7  2017     Trump Donald J.      goodwill
##  8  2017     Trump Donald J.       issuing
##  9  2017     Trump Donald J. redistributed
## 10  2017     Trump Donald J.        ripped
## # ... with 50,146 more rows

We could then, for example, see how the frequency of a word changes over time:

inaug_freq <- inaug_words %>%
  count(Year, word) %>%
  ungroup() %>%
  complete(Year, word, fill = list(n = 0)) %>%
  group_by(Year) %>%
  mutate(year_total = sum(n),
         percent = n / year_total) %>%
  ungroup()

inaug_freq
## # A tibble: 501,990 x 5
##     Year        word     n year_total     percent
##    <dbl>       <chr> <dbl>      <dbl>       <dbl>
##  1  1789           1     0        529 0.000000000
##  2  1789       1,000     0        529 0.000000000
##  3  1789         100     0        529 0.000000000
##  4  1789 100,000,000     0        529 0.000000000
##  5  1789 120,000,000     0        529 0.000000000
##  6  1789         125     0        529 0.000000000
##  7  1789          13     0        529 0.000000000
##  8  1789        14th     1        529 0.001890359
##  9  1789        15th     0        529 0.000000000
## 10  1789          16     0        529 0.000000000
## # ... with 501,980 more rows

We can use the broom package to fit a logistic regression for each word that appears more than 50 times in total, modeling how its frequency changes with the year.

models <- inaug_freq %>%
  group_by(word) %>%
  filter(sum(n) > 50) %>%
  do(tidy(glm(cbind(n, year_total - n) ~ Year, .,
              family = "binomial"))) %>%
  ungroup() %>%
  filter(term == "Year")

models
## # A tibble: 114 x 6
##              word  term     estimate   std.error statistic      p.value
##             <chr> <chr>        <dbl>       <dbl>     <dbl>        <dbl>
##  1            act  Year  0.006364883 0.002152252  2.957312 3.103335e-03
##  2         action  Year  0.002092966 0.001900141  1.101479 2.706882e-01
##  3 administration  Year -0.006674172 0.001839091 -3.629061 2.844538e-04
##  4        america  Year  0.020006138 0.001543509 12.961464 2.023476e-38
##  5       american  Year  0.008182894 0.001273539  6.425321 1.315916e-10
##  6      americans  Year  0.031598925 0.003457040  9.140456 6.218973e-20
##  7      authority  Year -0.005852677 0.002316016 -2.527045 1.150266e-02
##  8       business  Year  0.003318870 0.001986348  1.670840 9.475334e-02
##  9         called  Year -0.002218557 0.002065965 -1.073860 2.828854e-01
## 10        century  Year  0.015493277 0.002416870  6.410471 1.450704e-10
## # ... with 104 more rows
models %>%
  filter(term == "Year") %>%
  arrange(desc(abs(estimate)))
## # A tibble: 114 x 6
##         word  term    estimate    std.error statistic      p.value
##        <chr> <chr>       <dbl>        <dbl>     <dbl>        <dbl>
##  1 americans  Year  0.03159892 0.0034570404  9.140456 6.218973e-20
##  2   america  Year  0.02000614 0.0015435091 12.961464 2.023476e-38
##  3   century  Year  0.01549328 0.0024168702  6.410471 1.450704e-10
##  4      live  Year  0.01401278 0.0024189653  5.792883 6.918816e-09
##  5       god  Year  0.01389399 0.0018661441  7.445293 9.672943e-14
##  6 democracy  Year  0.01346352 0.0023279003  5.783546 7.314201e-09
##  7     earth  Year  0.01294880 0.0022275185  5.813104 6.132502e-09
##  8   freedom  Year  0.01283668 0.0012847736  9.991393 1.662290e-23
##  9    powers  Year -0.01231688 0.0019732102 -6.242050 4.318728e-10
## 10     world  Year  0.01200071 0.0009741139 12.319614 7.103384e-35
## # ... with 104 more rows

You can show these models as a volcano plot, which compares the effect size with the significance:

library(ggplot2)

models %>%
  mutate(adjusted.p.value = p.adjust(p.value)) %>%
  ggplot(aes(estimate, adjusted.p.value)) +
  geom_point() +
  scale_y_log10() +
  geom_text(aes(label = word), vjust = 1, hjust = 1,
            check_overlap = TRUE) +
  xlab("Estimated change over time") +
  ylab("Adjusted p-value")

We can also use the ggplot2 package to display the six terms with the largest estimated change in frequency over time.

library(scales)

models %>%
  top_n(6, abs(estimate)) %>%
  inner_join(inaug_freq) %>%
  ggplot(aes(Year, percent)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ word) +
  scale_y_continuous(labels = percent_format()) +
  ylab("Frequency of word in speech")