Getting Started with quanteda

This vignette provides a basic overview of quanteda’s features and capabilities. For additional vignettes, see the articles at quanteda.io.

Introduction

An R package for managing and analyzing text.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, performing the most common natural language processing tasks such as tokenizing, stemming, or forming ngrams. quanteda’s functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.

Built on the text processing functions in the stringi package, which is in turn built on a C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct implementation of Unicode and the handling of text in any character set, following conversion internally to UTF-8.

quanteda is built for efficiency and speed, through its design around three infrastructures: the stringi package for text processing, the data.table package for indexing large documents efficiently, and the Matrix package for sparse matrix objects. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)
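
As a quick illustration of the sparse design, a document-feature matrix created by quanteda is stored as a sparse matrix from the Matrix package; a minimal sketch (the dfm() constructor is introduced below):

mydfm <- dfm(data_char_ukimmig2010)
is(mydfm, "Matrix")   # TRUE: the dfm class extends the Matrix package's sparse classes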

quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what constitutes the documents and the features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
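
As a taste of this workflow, here is a minimal sketch using the built-in inaugural address corpus; each of these functions is covered in detail below:

require(quanteda)
sentCorpus <- corpus_reshape(data_corpus_inaugural, to = "sentences")  # redefine documents as sentences
recentCorpus <- corpus_subset(data_corpus_inaugural, Year > 1990)      # subset on a docvar condition
presDfm <- dfm(recentCorpus, groups = "President")                     # group documents by a docvar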

quanteda Features

Corpus management tools

The tools for getting texts into a corpus object include:

The tools for working with a corpus include:

Natural-Language Processing tools

For extracting features from a corpus, quanteda provides the following tools:

Document-Feature Matrix analysis tools

For analyzing the resulting document-feature matrix created when features are abstracted from a corpus, quanteda provides:

Additional and planned features

Additional features of quanteda include:

Planned features coming soon to quanteda are:

Working with other text analysis packages

quanteda is hardly unique in providing facilities for working with text – the excellent tm package already provides many of the features we have described. quanteda is designed to complement those packages, as well as to simplify the implementation of the text-to-analysis workflow. quanteda corpus structures are simpler objects than tm’s, as are quanteda’s document-feature matrix objects compared to the sparse matrix implementation found in tm. However, there is no need to choose only one package, since we provide translator functions from one matrix or corpus object to the other in quanteda.
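
For example, the convert() function translates a quanteda dfm into the document-term matrix format used by tm; a minimal sketch, assuming the tm package is installed:

mydfm <- dfm(data_char_ukimmig2010)
tmdtm <- convert(mydfm, to = "tm")   # a tm DocumentTermMatrix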

Once constructed, a quanteda “dfm” can be easily passed to other text-analysis packages for additional analysis of topic models or scaling, such as:

How to Install

Install the package in the normal way from CRAN; for the GitHub version, see the installation instructions at https://github.com/kbenoit/quanteda.
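
For example:

# install the released version from CRAN
install.packages("quanteda")
# or the development version from GitHub, assuming the devtools package is installed
devtools::install_github("kbenoit/quanteda")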

Creating and Working with a Corpus

require(quanteda)
## Loading required package: quanteda
## quanteda version 0.99
## Using 4 of 8 threads for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

Currently available corpus sources

quanteda has a simple and powerful companion package for loading texts: readtext. The main function in this package, readtext(), takes a file or fileset from disk or a URL, and returns a type of data.frame that can be used directly with the corpus() constructor function, to create a quanteda corpus object.

readtext() works on:

The corpus constructor command corpus() works directly on:

Example: building a corpus from a character vector

The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility with his or her choice of text inputs, as there are almost endless ways to get a vector of texts into R.

If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character object of the texts about immigration policy extracted from the 2010 election manifestos of the UK political parties (called data_char_ukimmig2010).

myCorpus <- corpus(data_char_ukimmig2010)  # build a new corpus from the texts
summary(myCorpus)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences
##           BNP  1125   3280        88
##     Coalition   142    260         4
##  Conservative   251    499        15
##        Greens   322    679        21
##        Labour   298    683        29
##        LibDem   251    483        14
##            PC    77    114         5
##           SNP    88    134         4
##          UKIP   346    723        27
## 
## Source:  /private/var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T/RtmpO6LYBd/Rbuild54e15ac171b/quanteda/vignettes/* on x86_64 by kbenoit
## Created: Thu Aug 10 18:28:28 2017
## Notes:

If we wanted, we could add some document-level variables – what quanteda calls docvars – to this corpus.

We can do this using R’s names() function to get the names of the character vector data_char_ukimmig2010, and assign this to a document variable (docvar).

docvars(myCorpus, "Party") <- names(data_char_ukimmig2010)
docvars(myCorpus, "Year") <- 2010
summary(myCorpus)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences        Party Year
##           BNP  1125   3280        88          BNP 2010
##     Coalition   142    260         4    Coalition 2010
##  Conservative   251    499        15 Conservative 2010
##        Greens   322    679        21       Greens 2010
##        Labour   298    683        29       Labour 2010
##        LibDem   251    483        14       LibDem 2010
##            PC    77    114         5           PC 2010
##           SNP    88    134         4          SNP 2010
##          UKIP   346    723        27         UKIP 2010
## 
## Source:  /private/var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T/RtmpO6LYBd/Rbuild54e15ac171b/quanteda/vignettes/* on x86_64 by kbenoit
## Created: Thu Aug 10 18:28:28 2017
## Notes:

If we wanted to tag each document with additional meta-data not considered a document variable of interest for analysis, but rather something that we need to know as an attribute of the document, we could also add those to our corpus.

metadoc(myCorpus, "language") <- "english"
metadoc(myCorpus, "docsource")  <- paste("data_char_ukimmig2010", 1:ndoc(myCorpus), sep = "_")
summary(myCorpus, showmeta = TRUE)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences        Party Year _language
##           BNP  1125   3280        88          BNP 2010   english
##     Coalition   142    260         4    Coalition 2010   english
##  Conservative   251    499        15 Conservative 2010   english
##        Greens   322    679        21       Greens 2010   english
##        Labour   298    683        29       Labour 2010   english
##        LibDem   251    483        14       LibDem 2010   english
##            PC    77    114         5           PC 2010   english
##           SNP    88    134         4          SNP 2010   english
##          UKIP   346    723        27         UKIP 2010   english
##               _docsource
##  data_char_ukimmig2010_1
##  data_char_ukimmig2010_2
##  data_char_ukimmig2010_3
##  data_char_ukimmig2010_4
##  data_char_ukimmig2010_5
##  data_char_ukimmig2010_6
##  data_char_ukimmig2010_7
##  data_char_ukimmig2010_8
##  data_char_ukimmig2010_9
## 
## Source:  /private/var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T/RtmpO6LYBd/Rbuild54e15ac171b/quanteda/vignettes/* on x86_64 by kbenoit
## Created: Thu Aug 10 18:28:28 2017
## Notes:

The last command, metadoc, allows you to define your own document meta-data fields. Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
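
For example, for the nine-document corpus created above:

ndoc(myCorpus)
## [1] 9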

Example: loading in files using the readtext package

require(readtext)

# Twitter json
mytf1 <- readtext("~/Dropbox/QUANTESS/social media/zombies/tweets.json")
myCorpusTwitter <- corpus(mytf1)
summary(myCorpusTwitter, 5)
# generic json - needs a textfield specifier
mytf2 <- readtext("~/Dropbox/QUANTESS/Manuscripts/collocations/Corpora/sotu/sotu.json",
                  textfield = "text")
summary(corpus(mytf2), 5)
# text file
mytf3 <- readtext("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt", cache = FALSE)
summary(corpus(mytf3), 5)
# multiple text files
mytf4 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", cache = FALSE)
summary(corpus(mytf4), 5)
# multiple text files with docvars from filenames
mytf5 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", 
                  docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
summary(corpus(mytf5), 5)
# XML data
mytf6 <- readtext("~/Dropbox/QUANTESS/quanteda_working_files/xmlData/plant_catalog.xml", 
                  textfield = "COMMON")
summary(corpus(mytf6), 5)
# csv file
write.csv(data.frame(inaugSpeech = texts(data_corpus_inaugural), 
                     docvars(data_corpus_inaugural)),
          file = "/tmp/inaug_texts.csv", row.names = FALSE)
mytf7 <- readtext("/tmp/inaug_texts.csv", textfield = "inaugSpeech")
summary(corpus(mytf7), 5)

How a quanteda corpus works

Corpus principles

A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.

A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing, and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses – for instance those in which stems and punctuation are required, such as analyzing a reading ease index – can be performed on the same corpus.

To extract texts from a corpus, we use an extractor, called texts().

texts(data_corpus_inaugural)[2]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1793-Washington 
## "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "

To summarize the texts from a corpus, we can call a summary() method defined for a corpus.

summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents.
## 
##                                   Text Types Tokens Sentences year debate
##        2010_BUDGET_01_Brian_Lenihan_FF  1953   8641       374 2010 BUDGET
##       2010_BUDGET_02_Richard_Bruton_FG  1040   4446       217 2010 BUDGET
##         2010_BUDGET_03_Joan_Burton_LAB  1624   6393       307 2010 BUDGET
##        2010_BUDGET_04_Arthur_Morgan_SF  1595   7107       343 2010 BUDGET
##          2010_BUDGET_05_Brian_Cowen_FF  1629   6599       250 2010 BUDGET
##           2010_BUDGET_06_Enda_Kenny_FG  1148   4232       153 2010 BUDGET
##      2010_BUDGET_07_Kieran_ODonnell_FG   678   2297       133 2010 BUDGET
##       2010_BUDGET_08_Eamon_Gilmore_LAB  1181   4177       201 2010 BUDGET
##     2010_BUDGET_09_Michael_Higgins_LAB   488   1286        44 2010 BUDGET
##        2010_BUDGET_10_Ruairi_Quinn_LAB   439   1284        59 2010 BUDGET
##      2010_BUDGET_11_John_Gormley_Green   401   1030        49 2010 BUDGET
##        2010_BUDGET_12_Eamon_Ryan_Green   510   1643        90 2010 BUDGET
##      2010_BUDGET_13_Ciaran_Cuffe_Green   442   1240        45 2010 BUDGET
##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4044       176 2010 BUDGET
##  number      foren     name party
##      01      Brian  Lenihan    FF
##      02    Richard   Bruton    FG
##      03       Joan   Burton   LAB
##      04     Arthur   Morgan    SF
##      05      Brian    Cowen    FF
##      06       Enda    Kenny    FG
##      07     Kieran ODonnell    FG
##      08      Eamon  Gilmore   LAB
##      09    Michael  Higgins   LAB
##      10     Ruairi    Quinn   LAB
##      11       John  Gormley Green
##      12      Eamon     Ryan Green
##      13     Ciaran    Cuffe Green
##      14 Caoimhghin OCaolain    SF
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Wed Jun 28 22:04:18 2017
## Notes:

We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:

tokenInfo <- summary(data_corpus_inaugural)
## Corpus consisting of 58 documents.
## 
##             Text Types Tokens Sentences Year  President       FirstName
##  1789-Washington   625   1538        23 1789 Washington          George
##  1793-Washington    96    147         4 1793 Washington          George
##       1797-Adams   826   2578        37 1797      Adams            John
##   1801-Jefferson   717   1927        41 1801  Jefferson          Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson          Thomas
##     1809-Madison   535   1263        21 1809    Madison           James
##     1813-Madison   541   1302        33 1813    Madison           James
##      1817-Monroe  1040   3680       121 1817     Monroe           James
##      1821-Monroe  1259   4886       129 1821     Monroe           James
##       1825-Adams  1003   3152        74 1825      Adams     John Quincy
##     1829-Jackson   517   1210        25 1829    Jackson          Andrew
##     1833-Jackson   499   1269        29 1833    Jackson          Andrew
##    1837-VanBuren  1315   4165        95 1837  Van Buren          Martin
##    1841-Harrison  1896   9144       210 1841   Harrison   William Henry
##        1845-Polk  1334   5193       153 1845       Polk      James Knox
##      1849-Taylor   496   1179        22 1849     Taylor         Zachary
##      1853-Pierce  1165   3641       104 1853     Pierce        Franklin
##    1857-Buchanan   945   3086        89 1857   Buchanan           James
##     1861-Lincoln  1075   4006       135 1861    Lincoln         Abraham
##     1865-Lincoln   360    776        26 1865    Lincoln         Abraham
##       1869-Grant   485   1235        40 1869      Grant      Ulysses S.
##       1873-Grant   552   1475        43 1873      Grant      Ulysses S.
##       1877-Hayes   831   2716        59 1877      Hayes   Rutherford B.
##    1881-Garfield  1021   3212       111 1881   Garfield        James A.
##   1885-Cleveland   676   1820        44 1885  Cleveland          Grover
##    1889-Harrison  1352   4722       157 1889   Harrison        Benjamin
##   1893-Cleveland   821   2125        58 1893  Cleveland          Grover
##    1897-McKinley  1232   4361       130 1897   McKinley         William
##    1901-McKinley   854   2437       100 1901   McKinley         William
##   1905-Roosevelt   404   1079        33 1905  Roosevelt        Theodore
##        1909-Taft  1437   5822       159 1909       Taft  William Howard
##      1913-Wilson   658   1882        68 1913     Wilson         Woodrow
##      1917-Wilson   549   1656        59 1917     Wilson         Woodrow
##     1921-Harding  1169   3721       148 1921    Harding       Warren G.
##    1925-Coolidge  1220   4440       196 1925   Coolidge          Calvin
##      1929-Hoover  1090   3865       158 1929     Hoover         Herbert
##   1933-Roosevelt   743   2062        85 1933  Roosevelt     Franklin D.
##   1937-Roosevelt   725   1997        96 1937  Roosevelt     Franklin D.
##   1941-Roosevelt   526   1544        68 1941  Roosevelt     Franklin D.
##   1945-Roosevelt   275    647        26 1945  Roosevelt     Franklin D.
##      1949-Truman   781   2513       116 1949     Truman        Harry S.
##  1953-Eisenhower   900   2757       119 1953 Eisenhower       Dwight D.
##  1957-Eisenhower   621   1931        92 1957 Eisenhower       Dwight D.
##     1961-Kennedy   566   1566        52 1961    Kennedy         John F.
##     1965-Johnson   568   1723        93 1965    Johnson   Lyndon Baines
##       1969-Nixon   743   2437       103 1969      Nixon Richard Milhous
##       1973-Nixon   544   2012        68 1973      Nixon Richard Milhous
##      1977-Carter   527   1376        52 1977     Carter           Jimmy
##      1981-Reagan   902   2790       128 1981     Reagan          Ronald
##      1985-Reagan   925   2921       123 1985     Reagan          Ronald
##        1989-Bush   795   2681       141 1989       Bush          George
##     1993-Clinton   642   1833        81 1993    Clinton            Bill
##     1997-Clinton   773   2449       111 1997    Clinton            Bill
##        2001-Bush   621   1808        97 2001       Bush       George W.
##        2005-Bush   773   2319       100 2005       Bush       George W.
##       2009-Obama   938   2711       110 2009      Obama          Barack
##       2013-Obama   814   2317        88 2013      Obama          Barack
##       2017-Trump   582   1660        88 2017      Trump       Donald J.
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php
if (require(ggplot2))
    ggplot(data = tokenInfo, aes(x = Year, y = Tokens, group = 1)) + geom_line() + geom_point() +
        scale_x_continuous(labels = seq(1789, 2017, 12), breaks = seq(1789, 2017, 12))
## Loading required package: ggplot2


# Longest inaugural address: William Henry Harrison
tokenInfo[which.max(tokenInfo$Tokens), ] 
##                        Text Types Tokens Sentences Year President
## 1841-Harrison 1841-Harrison  1896   9144       210 1841  Harrison
##                   FirstName
## 1841-Harrison William Henry

Tools for handling corpus objects

Adding two corpus objects together

The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level meta-data is also concatenated.

library(quanteda)
mycorpus1 <- corpus(data_corpus_inaugural[1:5], note = "First five inaug speeches.")
## Warning in corpus.character(data_corpus_inaugural[1:5], note = "First five
## inaug speeches."): Argument note not used.
mycorpus2 <- corpus(data_corpus_inaugural[53:58], note = "Last six inaug speeches.")
## Warning in corpus.character(data_corpus_inaugural[53:58], note = "Last six
## inaug speeches."): Argument note not used.
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
## Corpus consisting of 11 documents.
## 
##             Text Types Tokens Sentences
##  1789-Washington   625   1538        23
##  1793-Washington    96    147         4
##       1797-Adams   826   2578        37
##   1801-Jefferson   717   1927        41
##   1805-Jefferson   804   2381        45
##     1997-Clinton   773   2449       111
##        2001-Bush   621   1808        97
##        2005-Bush   773   2319       100
##       2009-Obama   938   2711       110
##       2013-Obama   814   2317        88
##       2017-Trump   582   1660        88
## 
## Source:  Combination of corpuses mycorpus1 and mycorpus2
## Created: Thu Aug 10 18:28:29 2017
## Notes:

Subsetting corpus objects

The corpus_subset() function works on corpus objects to extract a new corpus based on logical conditions applied to docvars:

summary(corpus_subset(data_corpus_inaugural, Year > 1990))
## Corpus consisting of 7 documents.
## 
##          Text Types Tokens Sentences Year President FirstName
##  1993-Clinton   642   1833        81 1993   Clinton      Bill
##  1997-Clinton   773   2449       111 1997   Clinton      Bill
##     2001-Bush   621   1808        97 2001      Bush George W.
##     2005-Bush   773   2319       100 2005      Bush George W.
##    2009-Obama   938   2711       110 2009     Obama    Barack
##    2013-Obama   814   2317        88 2013     Obama    Barack
##    2017-Trump   582   1660        88 2017     Trump Donald J.
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
## Corpus consisting of 2 documents.
## 
##        Text Types Tokens Sentences Year President   FirstName
##  1797-Adams   826   2578        37 1797     Adams        John
##  1825-Adams  1003   3152        74 1825     Adams John Quincy
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php

Exploring corpus texts

The kwic function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs:

options(width = 200)
kwic(data_corpus_inaugural, "terror")
##                                                                                                       
##     [1797-Adams, 1325]              fraud or violence, by | terror | , intrigue, or venality          
##  [1933-Roosevelt, 112] nameless, unreasoning, unjustified | terror | which paralyzes needed efforts to
##  [1941-Roosevelt, 287]      seemed frozen by a fatalistic | terror | , we proved that this            
##    [1961-Kennedy, 866]    alter that uncertain balance of | terror | that stays the hand of           
##     [1981-Reagan, 813]     freeing all Americans from the | terror | of runaway living costs.         
##   [1997-Clinton, 1055]        They fuel the fanaticism of | terror | . And they torment the           
##   [1997-Clinton, 1655]  maintain a strong defense against | terror | and destruction. Our children    
##     [2009-Obama, 1632]     advance their aims by inducing | terror | and slaughtering innocents, we
kwic(data_corpus_inaugural, "terror", valuetype = "regex")
##                                                                                                               
##     [1797-Adams, 1325]                   fraud or violence, by |  terror   | , intrigue, or venality          
##  [1933-Roosevelt, 112]      nameless, unreasoning, unjustified |  terror   | which paralyzes needed efforts to
##  [1941-Roosevelt, 287]           seemed frozen by a fatalistic |  terror   | , we proved that this            
##    [1961-Kennedy, 866]         alter that uncertain balance of |  terror   | that stays the hand of           
##    [1961-Kennedy, 990]               of science instead of its |  terrors  | . Together let us explore        
##     [1981-Reagan, 813]          freeing all Americans from the |  terror   | of runaway living costs.         
##    [1981-Reagan, 2196]        understood by those who practice | terrorism | and prey upon their neighbors    
##   [1997-Clinton, 1055]             They fuel the fanaticism of |  terror   | . And they torment the           
##   [1997-Clinton, 1655]       maintain a strong defense against |  terror   | and destruction. Our children    
##     [2009-Obama, 1632]          advance their aims by inducing |  terror   | and slaughtering innocents, we   
##     [2017-Trump, 1117] civilized world against radical Islamic | terrorism | , which we will eradicate
kwic(data_corpus_inaugural, "communist*")
##                                                                                              
##   [1949-Truman, 834] the actions resulting from the | Communist  | philosophy are a threat to
##  [1961-Kennedy, 519]             -- not because the | Communists | may be doing it,

In the above summary, Year and President are variables associated with each document. We can access such variables with the docvars() function.

# inspect the document-level variables
head(docvars(data_corpus_inaugural))
##                 Year  President FirstName
## 1789-Washington 1789 Washington    George
## 1793-Washington 1793 Washington    George
## 1797-Adams      1797      Adams      John
## 1801-Jefferson  1801  Jefferson    Thomas
## 1805-Jefferson  1805  Jefferson    Thomas
## 1809-Madison    1809    Madison     James

# inspect the corpus-level metadata
metacorpus(data_corpus_inaugural)
## $source
## [1] "Gerhard Peters and John T. Woolley. The American Presidency Project."
## 
## $notes
## [1] "http://www.presidency.ucsb.edu/inaugurals.php"
## 
## $created
## [1] "Tue Jun 13 14:51:47 2017"

More corpora are available from the quantedaData package.

Extracting Features from a Corpus

In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the dfm function to produce such a matrix. “dfm” is short for document-feature matrix, and always refers to documents in rows and “features” as columns. We fix this dimensional orientation because it is standard in data analysis to have a unit of analysis as a row, and features or variables pertaining to each unit as columns. We call them “features” rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.
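
As a quick illustration that features can be more general than raw terms, here is a sketch that forms bigram features from tokens, using the tokens() function introduced in the next section:

toks <- tokens("The quick brown fox jumped over the lazy dog")
tokens_ngrams(toks, n = 2)   # bigram features such as "quick_brown"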

Tokenizing texts

To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document.

tokens() is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.

txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!", 
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"        "999"       "different" "ways"      ","         "up"        "and"       "down"      ";"         "left"      "and"       "right"    
## [17] "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"             "#quanteda"      "2day"           "4ever"          ","              "http"           ":"              "/"             
## [12] "/"              "textasdata.com" "?"              "page"           "="              "123"            "."
tokens(txt, remove_numbers = TRUE,  remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "in"        "different" "ways"      "up"        "and"       "down"      "left"      "and"       "right"    
## 
## text2 :
## [1] "@kenbenoit"     "working"        "on"             "#quanteda"      "2day"           "4ever"          "http"           "textasdata.com" "page"
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "10"        "in"        "999"       "different" "ways"      "up"        "and"       "down"      "left"      "and"       "right"    
## 
## text2 :
##  [1] "@kenbenoit"     "working"        "on"             "#quanteda"      "2day"           "4ever"          "http"           "textasdata.com" "page"           "123"
tokens(txt, remove_numbers = TRUE,  remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "in"        "different" "ways"      ","         "up"        "and"       "down"      ";"         "left"      "and"       "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"             "#quanteda"      "2day"           "4ever"          ","              "http"           ":"              "/"             
## [12] "/"              "textasdata.com" "?"              "page"           "="              "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"        "999"       "different" "ways"      ","         "up"        "and"       "down"      ";"         "left"      "and"       "right"    
## [17] "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"             "#quanteda"      "2day"           "4ever"          ","              "http"           ":"              "/"             
## [12] "/"              "textasdata.com" "?"              "page"           "="              "123"            "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      " "         "is"        " "         "$"         "10"        " "         "in"        " "         "999"       " "         "different" " "         "ways"      ","         "\n"       
## [17] " "         "up"        " "         "and"       " "         "down"      ";"         " "         "left"      " "         "and"       " "         "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     " "              "working"        ":"              " "              "on"             " "              "#quanteda"      " "              "2day"           "\t"            
## [12] "4ever"          ","              " "              "http"           ":"              "/"              "/"              "textasdata.com" "?"              "page"           "="             
## [23] "123"            "."

We also have the option to tokenize characters:

tokens("Great website: http://textasdata.com?page=123.", what = "character")
## tokens from 1 document.
## text1 :
##  [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."
tokens("Great website: http://textasdata.com?page=123.", what = "character", 
         remove_separators = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."

and sentences:

# sentence level         
tokens(c("Kurt Vongeut said; only assholes use semi-colons.", 
           "Today is Thursday in Canberra:  It is yesterday in London.", 
           "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"), 
          what = "sentence")
## tokens from 3 documents.
## text1 :
## [1] "Kurt Vongeut said; only assholes use semi-colons."
## 
## text2 :
## [1] "Today is Thursday in Canberra:  It is yesterday in London."
## 
## text3 :
## [1] "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"

Constructing a document-feature matrix

Tokenizing texts is an intermediate step, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as lower-casing the texts (toLower() is also available as a separate function) and removing punctuation. All of the options to tokens() can be passed to dfm(), however.

myCorpus <- corpus_subset(data_corpus_inaugural, Year > 1990)

# make a dfm
myDfm <- dfm(myCorpus)
myDfm[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (0% sparse).
## 7 x 5 sparse Matrix of class "dfmSparse"
##               features
## docs           my fellow citizens   , today
##   1993-Clinton  7      5        2 139    10
##   1997-Clinton  6      7        7 131     5
##   2001-Bush     3      1        9 110     2
##   2005-Bush     2      3        6 120     3
##   2009-Obama    2      1        1 130     6
##   2013-Obama    3      3        6  99     4
##   2017-Trump    1      1        4  96     4

Other options for a dfm() include removing stopwords, and stemming the tokens.

# make a dfm, removing stopwords and applying stemming
myStemMat <- dfm(myCorpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
myStemMat[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (17.1% sparse).
## 7 x 5 sparse Matrix of class "dfmSparse"
##               features
## docs           fellow citizen today celebr mysteri
##   1993-Clinton      5       2    10      4       1
##   1997-Clinton      7       8     6      1       0
##   2001-Bush         1      10     2      0       0
##   2005-Bush         3       7     3      2       0
##   2009-Obama        1       1     6      2       0
##   2013-Obama        3       8     6      1       0
##   2017-Trump        1       4     5      3       1

The option remove provides a list of tokens to be ignored. Most users will supply a list of pre-defined “stop words”, defined for numerous languages, accessed through the stopwords() function:

head(stopwords("english"), 20)
##  [1] "i"          "me"         "my"         "myself"     "we"         "our"        "ours"       "ourselves"  "you"        "your"       "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"
head(stopwords("russian"), 10)
##  [1] "и"   "в"   "во"  "не"  "что" "он"  "на"  "я"   "с"   "со"
head(stopwords("arabic"), 10)
##  [1] "فى"  "في"  "كل"  "لم"  "لن"  "له"  "من"  "هو"  "هي"  "قوة"

Viewing the document-feature matrix

The dfm can be inspected in the Environment pane in RStudio, or by calling R’s View() function. Calling textplot_wordcloud() on a dfm will display a wordcloud using the wordcloud package.

mydfm <- dfm(data_char_ukimmig2010, remove = stopwords("english"), remove_punct = TRUE)
mydfm
## Document-feature matrix of: 9 documents, 1,547 features (83.8% sparse).

To access a list of the most frequently occurring features, we can use topfeatures():

topfeatures(mydfm, 20)  # 20 top words
## immigration     british      people      asylum     britain          uk      system  population     country         new  immigrants      ensure       shall citizenship      social    national 
##          66          37          35          29          28          27          27          21          20          19          17          17          17          16          14          14 
##         bnp     illegal        work     percent 
##          13          13          13          12

Plotting a word cloud is done using textplot_wordcloud(), for a dfm class object. This function passes arguments through to wordcloud() from the wordcloud package, and can prettify the plot using the same arguments:

set.seed(100)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Grouping documents by document variable

Often, we are interested in analysing how texts differ according to substantive factors which may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents which share the same value for a document variable when creating a dfm:

byPartyDfm <- dfm(data_corpus_irishbudget2010, groups = "party", remove = stopwords("english"), remove_punct = TRUE)

We can sort this dfm, and inspect it:

dfm_sort(byPartyDfm)[, 1:10]
## Document-feature matrix of: 5 documents, 10 features (0% sparse).
## 5 x 10 sparse Matrix of class "dfmSparse"
##        features
## docs    people budget government public minister tax economy pay jobs billion
##   FF        23     44         47     65       11  60      37  41   41      32
##   FG        78     71         61     47       62  11      20  29   17      21
##   LAB       69     66         36     32       54  47      37  24   20      34
##   SF        81     53         73     31       39  34      50  24   27      29
##   Green     15     26         19      4        4  11      16   4   15       3

Note that the most frequently occurring feature is “will”, a word usually on English stop lists, but one that is not included in quanteda’s built-in English stopword list.
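
If we wanted “will” removed as well, we could simply append it to the list of removed features when constructing the dfm; a sketch:

byPartyDfm <- dfm(data_corpus_irishbudget2010, groups = "party",
                  remove = c(stopwords("english"), "will"), remove_punct = TRUE)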

Grouping words by dictionary or equivalence class

For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance. In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes.

For example, let’s look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus. From the original corpus, we select Presidents since Clinton:

recentCorpus <- corpus_subset(data_corpus_inaugural, Year > 1991)

Now we define a demonstration dictionary:

myDict <- dictionary(list(terror = c("terrorism", "terrorists", "threat"),
                          economy = c("jobs", "business", "grow", "work")))

We can use the dictionary when making the dfm:

byPresMat <- dfm(recentCorpus, dictionary = myDict)
byPresMat
## Document-feature matrix of: 7 documents, 2 features (14.3% sparse).
## 7 x 2 sparse Matrix of class "dfmSparse"
##               features
## docs           terror economy
##   1993-Clinton      0       8
##   1997-Clinton      1       8
##   2001-Bush         0       4
##   2005-Bush         1       6
##   2009-Obama        1      10
##   2013-Obama        1       6
##   2017-Trump        1       5

The constructor function dictionary() also works with two common “foreign” dictionary formats: the LIWC and Provalis Research’s Wordstat formats. For instance, we can load the LIWC dictionary and apply it to the Presidential inaugural speech corpus:

liwcdict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2001_English.dic",
                       format = "LIWC")
liwcdfm <- dfm(data_corpus_inaugural[52:58], dictionary = liwcdict, verbose = FALSE)
liwcdfm[, 1:10]

Further Examples

Similarities between texts

presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), 
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
obamaSimil <- textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), 
                             margin = "documents", method = "cosine")
obamaSimil
##              2009-Obama 2013-Obama
## 2009-Obama    1.0000000  0.6815711
## 2013-Obama    0.6815711  1.0000000
## 1981-Reagan   0.6229949  0.6376412
## 1985-Reagan   0.6434472  0.6629428
## 1989-Bush     0.6253944  0.5784290
## 1993-Clinton  0.6280946  0.6265428
## 1997-Clinton  0.6593018  0.6466030
## 2001-Bush     0.6018113  0.6193608
## 2005-Bush     0.5266249  0.5867178
## 2017-Trump    0.5192075  0.5160104
# dotchart(as.list(obamaSimil)$"2009-Obama", xlab = "Cosine similarity")

We can use these distances to plot a dendrogram, clustering presidents:

data(data_corpus_SOTU, package = "quantedaData")
presDfm <- dfm(corpus_subset(data_corpus_SOTU, Date > as.Date("1980-01-01")), 
               stem = TRUE, remove_punct = TRUE,
               remove = stopwords("english"))
presDfm <- dfm_trim(presDfm, min_count = 5, min_docfreq = 3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- textstat_dist(dfm_weight(presDfm, "relfreq"))
# hierarchical clustering of the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster, xlab = "", sub = "", main = "Euclidean Distance on Normalized Token Frequency")

(try it!)

We can also look at term similarities:

sim <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", margin = "features")
lapply(as.list(sim), head, 10)
## $fair
##   economi     begin jefferson    author     faith      call   struggl      best     creat    courag 
## 0.9080252 0.9075951 0.8981462 0.8944272 0.8866586 0.8608285 0.8451543 0.8366600 0.8347300 0.8326664 
## 
## $health
##     shape   generat     wrong    common  knowledg    planet      task    demand       eye     defin 
## 0.9045340 0.8971180 0.8944272 0.8888889 0.8888889 0.8819171 0.8728716 0.8666667 0.8660254 0.8642416 
## 
## $terror
##    potenti  adversari commonplac     miracl     racial     bounti     martin      dream      polit   guarante 
##  0.9036961  0.9036961  0.8944272  0.8944272  0.8944272  0.8944272  0.8944272  0.8624394  0.8500000  0.8485281

Scaling document positions

We have a lot of development work to do on the textmodel() function, but here is a demonstration of unsupervised document scaling using the “wordfish” model:

# fit a wordfish scaling model to the Irish budget speeches
ieDfm <- dfm(data_corpus_irishbudget2010)
textmodel(ieDfm, model = "wordfish", dir = c(2, 1))
## Fitted wordfish model:
## Call:
##  textmodel_wordfish.dfm(x = x, dir = ..1)
## 
## Estimated document positions:
## 
##                                Documents      theta         SE       lower       upper
## 1        2010_BUDGET_01_Brian_Lenihan_FF  1.8209548 0.02032349  1.78112076  1.86078883
## 2       2010_BUDGET_02_Richard_Bruton_FG -0.5932769 0.02818832 -0.64852597 -0.53802775
## 3         2010_BUDGET_03_Joan_Burton_LAB -1.1136739 0.01540253 -1.14386288 -1.08348497
## 4        2010_BUDGET_04_Arthur_Morgan_SF -0.1219338 0.02846315 -0.17772156 -0.06614601
## 5          2010_BUDGET_05_Brian_Cowen_FF  1.7724233 0.02364101  1.72608694  1.81875970
## 6           2010_BUDGET_06_Enda_Kenny_FG -0.7145779 0.02650249 -0.76652278 -0.66263303
## 7      2010_BUDGET_07_Kieran_ODonnell_FG -0.4844816 0.04171470 -0.56624240 -0.40272079
## 8       2010_BUDGET_08_Eamon_Gilmore_LAB -0.5616724 0.02967344 -0.61983237 -0.50351250
## 9     2010_BUDGET_09_Michael_Higgins_LAB -0.9703099 0.03850532 -1.04578029 -0.89483943
## 10       2010_BUDGET_10_Ruairi_Quinn_LAB -0.9589222 0.03892364 -1.03521255 -0.88263190
## 11     2010_BUDGET_11_John_Gormley_Green  1.1807220 0.07221472  1.03918113  1.32226282
## 12       2010_BUDGET_12_Eamon_Ryan_Green  0.1866451 0.06294116  0.06328043  0.31000978
## 13     2010_BUDGET_13_Ciaran_Cuffe_Green  0.7421882 0.07245441  0.60017757  0.88419885
## 14 2010_BUDGET_14_Caoimhghin_OCaolain_SF -0.1840848 0.03666249 -0.25594329 -0.11222633
## 
## Estimated feature scores: showing first 30 beta-hats for features
## 
##            when               i       presented             the   supplementary          budget              to            this           house            last           april               , 
##     -0.09916819      0.38806041      0.39882660      0.25598049      1.11589792      0.09919078      0.37011548      0.30697151      0.19910796      0.28975340     -0.09522679      0.34538935 
##            said              we           could            work             our             way         through          period              of          severe        economic        distress 
##     -0.71927451      0.47995942     -0.52972864      0.58230920      0.74377002      0.33615251      0.65986240      0.55625435      0.33935953      1.27914150      0.47870695      1.84457405 
##               .           today             can          report            that notwithstanding 
##      0.27356243      0.17423655      0.36382153      0.69179921      0.08837233      1.84457405

Topic models

quanteda makes it very easy to fit topic models as well, e.g.:

quantdfm <- dfm(data_corpus_irishbudget2010, 
                remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"))
quantdfm <- dfm_trim(quantdfm, min_count = 4, max_docfreq = 10, verbose = TRUE)
## Removing features occurring:
##   - fewer than 4 times: 3,427
##   - in more than 10 documents: 72
##   Total features removed: 3,499 (73.5%).
quantdfm
## Document-feature matrix of: 14 documents, 1,263 features (64.5% sparse).

if (require(topicmodels)) {
    myLDAfit20 <- LDA(convert(quantdfm, to = "topicmodels"), k = 20)
    get_terms(myLDAfit20, 5)
}
## Loading required package: topicmodels
##      Topic 1        Topic 2     Topic 3     Topic 4   Topic 5    Topic 6    Topic 7     Topic 8   Topic 9    Topic 10     Topic 11     Topic 12    Topic 13    Topic 14      Topic 15   Topic 16  
## [1,] "society"      "review"    "fáil"      "child"   "hit"      "system"   "ministers" "care"    "fianna"   "welfare"    "support"    "taoiseach" "taoiseach" "alternative" "welfare"  "measures"
## [2,] "enterprising" "reduction" "taoiseach" "benefit" "policies" "taxation" "irish"     "per"     "fáil"     "million"    "employment" "employees" "gael"      "citizenship" "system"   "spending"
## [3,] "sense"        "reduce"    "irish"     "day"     "rate"     "child"    "strong"    "workers" "national" "investment" "energy"     "rate"      "fine"      "wealth"      "stimulus" "million" 
## [4,] "equal"        "million"   "evidence"  "bank"    "welfare"  "welfare"  "start"     "child"   "support"  "back"       "million"    "referred"  "may"       "adjustment"  "fianna"   "level"   
## [5,] "nation"       "increases" "choice"    "today"   "system"   "fáil"     "know"      "fianna"  "irish"    "confidence" "adjustment" "debate"    "stimulus"  "breaks"      "taken"    "scheme"  
##      Topic 17  Topic 18  Topic 19    Topic 20  
## [1,] "benefit" "levy"    "side"      "failed"  
## [2,] "irish"   "million" "kind"      "strategy"
## [3,] "say"     "carbon"  "education" "needed"  
## [4,] "burden"  "change"  "fianna"    "vision"  
## [5,] "savings" "welfare" "fáil"      "system"