The corpustools package offers various tools for analyzing text corpora. What sets it apart from other text analysis packages is that it focuses on the use of a
tokenlist format for storing tokenized texts. By a tokenlist we mean a data.frame in which each token (i.e. word) of a text is a row, and columns contain information about each token. The advantage of this approach is that all information from the full text is preserved, and more information can be added. This format can also be used to work with the output from natural language processing pipelines such as SpaCy, UDPipe and Stanford CoreNLP. Furthermore, by preserving the full text information, it is possible to reconstruct texts with annotations from text analysis techniques. This enables qualitative analysis and manual validation of the results of computational text analysis methods (e.g., highlighting search results or dictionary terms, coloring words based on topic models).
The problem is that the tokenlist format quickly leads to huge data.frames that can be difficult to manage. This is where corpustools comes in. The backbone of corpustools is the
tCorpus class (i.e. tokenlist corpus), which builds on the
data.table package to work efficiently with huge tokenlists. corpustools provides functions to create the
tcorpus from full text, apply basic and advanced text preprocessing, and to use various analysis and visualization techniques.
An example application that combines functionalities could be as follows. Given full-text data, you create a tcorpus. With the built-in search functionalities you use a Lucene style Boolean query to find where in these texts certain issues are discussed. You subset the tcorpus to focus your analysis on the text within 100 words of the search results. You now train a topic model on this data, and annotate the
tcorpus with the word assignments (i.e. the topics assigned to individual words). This information can then be used in other analyses or visualizations, and a topic browser can be created in which full texts are shown with the topic words highlighted.
This document explains how to create and use a
tcorpus. For a quick reference, you can also access the documentation hub from within R.
A tcorpus consists of two data.tables (i.e. enhanced data.frames supported by the data.table package): $tokens and $meta.
There are two ways to create a tcorpus: creating a tcorpus from full text, or importing an existing tokenlist.
The create_tcorpus function creates a tcorpus from full-text input. The full text can be provided as a single character vector or as a data.frame in which the text is given in one of the columns. We recommend using the data.frame approach, because this automatically imports all other columns as document meta data.
As an example we have provided the
sotu_texts demo data. This is a data.frame in which each row represents a paragraph of the State of the Union speeches from Bush and Obama.
##  "id" "date" "party" "text" "president"
We can pass
sotu_texts to the
create_tcorpus function. Here we also need to specify which column contains the text (text_columns), and which column contains the document id (doc_column). Note that multiple text columns can be given.
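The call, matching the create_tcorpus usage shown later in this vignette, looks like this:

```r
library(corpustools)

## create a tcorpus from the sotu_texts demo data.frame,
## using the 'id' column as document id and the 'text' column as text
tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')
tc
```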
Printing tc shows the number of tokens (i.e. words) in the tcorpus, the number of documents, and the columns in the tokens and meta data.
## tCorpus containing 90827 tokens ## grouped by documents (n = 1090) ## contains: ## - 3 columns in $tokens: doc_id, token_id, token ## - 4 columns in $meta: doc_id, date, party, president
We can also look at the tokens and meta data directly (for changing this data, please read the
Managing the tcorpus section below).
## doc_id token_id token ## 1: 111541965 1 It ## 2: 111541965 2 is ## 3: 111541965 3 our ## 4: 111541965 4 unfinished ## 5: 111541965 5 task ## 6: 111541965 6 to
## doc_id date party president ## 1: 111541965 2013-02-12 Democrats Barack Obama ## 2: 111541995 2013-02-12 Democrats Barack Obama ## 3: 111542001 2013-02-12 Democrats Barack Obama ## 4: 111542006 2013-02-12 Democrats Barack Obama ## 5: 111542013 2013-02-12 Democrats Barack Obama ## 6: 111542018 2013-02-12 Democrats Barack Obama
The create_tcorpus function has some additional parameters.
The split_sentences argument can be set to TRUE to perform sentence boundary detection. This adds the
sentence column to the tokens data.table, which can be used in several techniques in corpustools.
The max_tokens and max_sentences parameters can be set to limit the tcorpus to contain only the first x sentences/tokens of each text.
The udpipe_model argument can be used to tokenize the texts with a natural language processing pipeline from the
udpipe package. This is discussed in the section on advanced preprocessing with UDPipe below.
If you already have a tokenlist it can be imported as a tcorpus with the
tokens_to_tcorpus function. The tokenlist has to be formatted as a data.frame. As an example we provide the corenlp_tokens demo data, which is the output of the Stanford CoreNLP parser.
This type of data.frame can be passed to the
tokens_to_tcorpus function. The names of the columns that describe the token positions (document_id, sentence and token_id) also need to be specified.
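A minimal sketch of importing the corenlp_tokens tokenlist; the exact name of the token position column in corenlp_tokens ('id') is an assumption based on typical CoreNLP output:

```r
library(corpustools)

## import a tokenlist, specifying the columns that describe token positions
tc = tokens_to_tcorpus(corenlp_tokens, doc_col = 'doc_id',
                       sentence_col = 'sentence', token_id_col = 'id')
tc
```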
## tCorpus containing 36 tokens ## grouped by documents (n = 1) and sentences (n = 6) ## contains: ## - 13 columns in $tokens: doc_id, sentence, token_id, token, lemma, CharacterOffsetBegin, CharacterOffsetEnd, POS, NER, Speaker, parent, relation, pos1 ## - 1 column in $meta: doc_id
To include document meta data, a separate data.frame with meta data has to be passed to the
meta argument. This data.frame needs to have a column with the document id, using the name specified in the doc_col argument.
Most operations for modifying data (e.g., preprocessing, rule-based annotations) have specialized functions, but in some cases it’s necessary to manually modify the tCorpus. In particular:
While it is possible to directly work in the tc$tokens and
tc$meta data.tables, it is recommended to use the special accessor methods (e.g., set, subset). The tCorpus makes several assumptions about how the data is structured in order to work fast, and directly editing the tokens and meta data can break the tCorpus. Using the accessor functions should ensure that this doesn't happen. Also, the accessor methods make use of the modify-by-reference semantics of the data.table package.
To add, remove or mutate columns in the tc$tokens and tc$meta data.tables, the tc$set() and tc$set_meta() methods can be used (also see ?tCorpus_data). These methods change the tCorpus by reference (for more details about R6 methods and changing by reference, see the section
Using the tCorpus R6 methods).
The set method (?set) has two mandatory arguments: the
column argument takes the name of the column to add or mutate, and the
value argument takes the value to assign to the column.
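For a small example corpus, this could look as follows (the output below matches this call):

```r
library(corpustools)

tc = create_tcorpus('This is an example')

## add a new column; a single value is recycled over all rows
tc$set(column = 'new_column', value = 'any value')
tc$tokens
```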
## doc_id token_id token new_column ## 1: 1 1 This any value ## 2: 1 2 is any value ## 3: 1 3 an any value ## 4: 1 4 example any value
value can also be an expression, which will be evaluated in the $tokens data.table. Here we overwrite the new_column with the values of the token column in uppercase.
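A sketch of such an expression:

```r
## overwrite new_column with the uppercased token values;
## the expression is evaluated inside the $tokens data.table
tc$set('new_column', toupper(token))
```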
subset argument can be used to only mutate values for the rows where the subset expression evaluates to TRUE. For example, the following operation changes the new_column values to lowercase if token_id <= 2.
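A sketch of this operation:

```r
## only mutate rows where the subset expression evaluates to TRUE
tc$set('new_column', tolower(token), subset = token_id <= 2)
```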
Set cannot be used to change a column name, so for this we have the set_name method.
To remove columns, you can use the delete_columns method.
To add, remove and mutate columns in the $meta data.table, the
set_meta and delete_meta_columns methods can be used.
The subset function can be used to subset the tCorpus on both the tokens and meta data. To subset on tokens, the subset argument can be used.
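The exact subset used to produce the counts below is not shown; a sketch with a hypothetical condition, assuming the $n field holds the token count:

```r
tc$n  # number of tokens in the full corpus

## keep only rows where the subset expression holds (hypothetical condition)
tc2 = subset(tc, subset = token_id < 20)
tc2$n
```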
##  90827
##  23233
To subset on meta, the subset_meta argument can be used.
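For example (a sketch; the president values match the demo data):

```r
## keep only the documents (and their tokens) from Obama speeches
tc_obama = subset(tc, subset_meta = president == 'Barack Obama')
tc_obama$n
```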
##  45956
As an alternative, subsetting can also be performed with the $subset() R6 method. This will modify the tCorpus by reference, which is faster and more memory efficient.
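A sketch of this, assuming tc was created with split_sentences = TRUE so that the sentence column exists:

```r
## subset tc itself (by reference) to the first sentence of Obama documents
tc$subset(subset = sentence == 1,
          subset_meta = president == 'Barack Obama')
```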
##  12274
Now tc itself has been subsetted, and only contains tokens of the first sentence of Obama documents.
A common problem in text analysis is that there can be duplicate or near duplicate documents in a corpus. For example, a corpus of news articles might have multiple versions of the same news article if it has been updated. It is often appropriate to delete such duplicates. The tCorpus has a flexible deduplicate method that can delete duplicates based on document similarity.
Document similarity is calculated using a similarity measure for the weighted frequencies of words (or any selected feature) in documents. By default, this is the cosine similarity of token frequencies with a tf-idf weighting scheme. To ease computation a minimum document frequency (default is 2) and maximum document frequency percentage (default is 50%) is used.
It is often possible to limit comparisons by including meta data about date and categories. For example, duplicate articles in newspapers should occur within a limited time frame and within the same newspaper. The meta_cols argument can be used to only compare documents with identical meta values, and the date_col argument can be used to only compare documents within the time interval specified in hour_window. Also, by using date_col, it can be specified whether the duplicate with the first or last date should be removed.
d = data.frame(doc_id = paste('doc', 1:3),
               text = c('I am a text', 'I am also text', 'Ich bin ein Berliner'),
               date = as.POSIXct(c('2010-01-01', '2010-01-02', '2010-01-03')))

# document 1 and 2 are duplicates; by default the one with the last date (doc 2) is deleted
tc = create_tcorpus(d)
tc$deduplicate(feature = 'token', date_col = 'date', similarity = 0.75, verbose = F)
tc$meta$doc_id
##  "doc 1" "doc 3"
##  "doc 2" "doc 3"
Splitting texts into a tokenlist is a form of text preprocessing called tokenization. For many text analysis techniques it is furthermore necessary (or at least strongly recommended) to use additional preprocessing techniques (see e.g., Welbers, van Atteveldt & Benoit, 2017; Denny & Spirling, 2018).
Corpustools supports various preprocessing techniques. We make a rough distinction between basic and advanced preprocessing. By basic preprocessing we mean the common lightweight techniques such as stemming, lowercasing and stopword removal. By advanced preprocessing we refer to techniques such as lemmatization, part-of-speech tagging and dependency parsing, which require more sophisticated NLP pipelines. Also, it is often useful to filter out certain terms. For instance, you might want to look only at nouns, drop all terms that occur less than 10 times in the corpus, or perhaps use more stopwords.
The basic preprocessing techniques can be performed on a
tcorpus with the
preprocess method. The main arguments control common operations such as lowercasing, punctuation removal, stemming and stopword removal.
Lowercasing and removing punctuation are so common that they are the default. In the following example we also apply stemming and stopword removal. Note that since we work with English data we do not need to set the language, but if other languages are used the language argument needs to be used. Also note that we do not need to specify the input column, because a
tcorpus created with
create_tcorpus always has a "token" column with the raw token text.
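The example call (matching the output below):

```r
## stem, lowercase and remove stopwords;
## the results go into a new 'feature' column
tc$preprocess(use_stemming = T, remove_stopwords = T)
```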
## doc_id token_id token feature ## 1: 111541965 1 It <NA> ## 2: 111541965 2 is <NA> ## 3: 111541965 3 our <NA> ## 4: 111541965 4 unfinished unfinish ## 5: 111541965 5 task task ## --- ## 90823: 111552781 3 continue continu ## 90824: 111552781 4 to <NA> ## 90825: 111552781 5 bless bless ## 90826: 111552781 6 America america ## 90827: 111552781 7 . <NA>
The tokens data.table now has the new column “feature” which contains the preprocessed token.
There are several packages for using NLP pipelines in R, such as
spacyr for SpaCy,
coreNLP for Stanford CoreNLP, and
udpipe for UDPipe. Most of these rely on external dependencies (Python for SpaCy, Java for CoreNLP), but the
UDPipe pipeline runs in C++ which plays nicely with R. We have built a wrapper for the
udpipe package which provides bindings for using UDPipe directly from within R. Unlike basic preprocessing, this feature has to be used in the create_tcorpus function, because it requires the raw text input.
To use a UDPipe model in create_tcorpus, you simply need to use the
udpipe_model argument. The value for this argument is the language/name of the model you want to use (if you don't know the name, just give the language and you'll get suggestions). The first time you use a language, R will automatically download the model. By default the udpipe model is stored in the current working directory, in a folder called udpipe_models, but you can set the path with the udpipe_model_path argument.
By default, UDPipe will not perform dependency parsing. You can activate this part of the pipeline by using the
use_parser = TRUE argument, but it takes longer to compute and not all languages have good dependency parser performance.
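Putting this together (a sketch; the model name 'english-ewt' is an assumption, any valid udpipe model name or just the language should work):

```r
## tokenize and parse with UDPipe, including dependency parsing
tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text',
                    udpipe_model = 'english-ewt', use_parser = TRUE)
```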
The feats column contains a bunch of nested features, such as Person (first, second or third) and Number (singular or plural). If you don't need this, it's good to remove the column (
tc$delete_columns('feats')). If you want to use (some of) this information, you can use the
tc$feats_to_columns method to cast these nested features to columns. Here we specifically select the Tense, Person and Number features. Note that the default behavior is to also delete the feats column.
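A sketch of the call described above; the keep argument name is an assumption, see ?tCorpus for the exact signature:

```r
## cast the Tense, Person and Number features to separate columns
## (by default this also deletes the feats column)
tc$feats_to_columns(keep = c('Tense', 'Person', 'Number'))
```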
Parsing lots of articles with an NLP pipeline can take quite a while, and you don't want to spend several hours parsing just to see that you crashed your R session and lost all progress. By default, create_tcorpus therefore keeps a persistent cache of the three most recent unique uses of create_tcorpus with udpipe (i.e. calling create_tcorpus with the same input and arguments). If you would like to keep more caches, you can set the udpipe_cache argument.
Thus, if you run the create_tcorpus function (using udpipe) a second time with the same input and parameters, it will continue where you last left off, or load all data from cache if it already finished the first time. So if your R session crashes or you need to shut down your laptop, simply fire create_tcorpus up again next time.
If you are parsing a lot of data and you have CPU cores to spare, you can use them to parse the data faster. Simply set the udpipe_cores argument in create_tcorpus. The following example processes the sotu_texts data with four cores (if you have them).
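A sketch of that example (the model name is an assumption):

```r
## parse the sotu_texts data with four cores (if you have them)
tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text',
                    udpipe_model = 'english-ewt', udpipe_cores = 4)
```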
In basic preprocessing we saw that
NA values were introduced where stopwords were filtered out. If you want to filter out more tokens, you can use the
tc$feature_subset() method. This works as a regular subset, but the difference is that the corpus is kept intact. That is, the rows of the tokens that are filtered out are not deleted.
As a (silly) example, we can specifically filter out the token with token_id 5. For this we take the subset of tokens that is not 5.
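A sketch of this call:

```r
## set the 'feature' value to NA for all tokens where token_id is 5
tc$feature_subset('feature', subset = token_id != 5)
```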
This sets all values in the 'feature' column with token_id 5 to NA. This is just an example that is easy to run. A more useful application is to use this to subset on part-of-speech tags.
It is often useful to filter out tokens based on frequency. For instance, to only include words in a model that are not too rare to be interesting and not too common to be informative. The feature_subset method has four arguments to do so: min_freq, max_freq, min_docfreq and max_docfreq.
freq indicates how often a term occurred in the corpus, and
docfreq indicates the number of unique documents in which a term occurred. For example, the following code filters out all tokens that occur less than 10 times.
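A sketch of the call:

```r
## set features that occur less than 10 times in the corpus to NA
tc$feature_subset('feature', min_freq = 10)
```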
For the sake of convenience, the min/max frequency filters are also integrated in the
preprocess method. So it would also have been possible to use
tc$preprocess(use_stemming = T, remove_stopwords=T, min_freq=10). Also, if you want to inspect the freq and docfreq of terms, you could use the
feature_stats(tc, 'feature') function.
The data in a tCorpus can easily be converted to a document-term matrix (dtm), or to the document-feature matrix (dfm) class of the quanteda package.
In general, we recommend using the quanteda dfm class, which can be created from a tCorpus by using the
get_dfm function. Alternatively, the
get_dtm function can be used to create a regular sparse matrix (dgTMatrix from the Matrix package) or the DocumentTermMatrix class from the tm package. Here we only demonstrate get_dfm.
Here we first preprocess the tokens, creating the
feature column, and then create a dfm with these features.
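A sketch of these two steps (the exact preprocessing parameters used for the output below may differ):

```r
tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')

## create the 'feature' column with preprocessed tokens
tc$preprocess(use_stemming = T, remove_stopwords = T)

## create a quanteda dfm from these features
dfm = get_dfm(tc, feature = 'feature')
dfm
```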
## Document-feature matrix of: 1,090 documents, 1,422 features (97.82% sparse) and 3 docvars. ## features ## docs 1 10 100 11 11th 12 15 18 19 1990s ## 111541965 0 0 0 0 0 0 0 0 0 0 ## 111541995 0 0 0 0 0 0 0 0 0 0 ## 111542001 0 0 0 0 0 0 0 0 0 0 ## 111542006 0 0 0 0 0 0 0 0 0 0 ## 111542013 0 0 0 0 0 0 0 0 0 0 ## 111542018 0 0 0 0 0 0 0 0 0 0 ## [ reached max_ndoc ... 1,084 more documents, reached max_nfeat ... 1,412 more features ]
The get_dfm function has several useful parameters. We won't show each in detail, but list some applications.
## Use sentences as rows instead of documents
dfm_sent = get_dfm(tc, 'feature', context_level = 'sentence')

## Use a weighting scheme
dfm_weighted = get_dfm(tc, 'feature', weight = 'tfidf')

## Only use a subset of the tCorpus
dfm_obama = get_dfm(tc, 'feature', subset_meta = president == "Barack Obama")
By preprocessing the tokens within the tCorpus, the features in the dfm can be linked to the original tokens in the tCorpus. In the next section,
Keeping the full corpus intact, we show how this can be used to annotate the tCorpus with results from bag-of-words style text analyses.
In the tCorpus the original tokens are still intact after preprocessing. This is not very memory efficient, so why do we do it?
Keeping the corpus intact in this way has the benefit that the results of an analysis, as performed with the preprocessed tokens, can be linked to the full corpus. For example, here we show a quick and dirty example of annotating a tCorpus based on a topic model.
## create tcorpus and preprocess tokens
tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')
tc$preprocess(use_stemming = T, remove_stopwords = T, min_docfreq = 5)

## fit lda model, using the create_feature argument to store the topic assignments
m = tc$lda_fit('feature', create_feature = 'topic', K = 5, alpha = 0.001)
The lda_fit R6 method is simply a wrapper for the LDA function in the topicmodels package. However, next to returning the topic model (
m), it also adds a column to
tc$tokens with the topic assignments.
## doc_id token_id token feature topic ## 1: 111541965 1 It <NA> NA ## 2: 111541965 2 is <NA> NA ## 3: 111541965 3 our <NA> NA ## 4: 111541965 4 unfinished unfinish 1 ## 5: 111541965 5 task task 1 ## 6: 111541965 6 to <NA> NA ## 7: 111541965 7 restore restor 1 ## 8: 111541965 8 the <NA> NA ## 9: 111541965 9 basic basic 1 ## 10: 111541965 10 bargain <NA> NA
This is just an example (with a poorly trained topic model) but it demonstrates the purpose. In the tokens we now see that certain words (unfinished, restore) have topic assignments. These topic assignments are based on the topic model trained with the preprocessed versions of the tokens (unfinish, restor). Thus, we can relate the results of a text analysis technique (in this case topic modeling, but the same would work with word scaling, dictionaries, etc.) to the original text. This is important, because in the end we apply these techniques to make inferences about the text, and this approach allows us to validate the results and/or perform additional analyses on the original texts.
We can now also reconstruct the full text with the topic words coloured to visualize the topic assignments. The
browse_texts function creates an HTML browser of the full texts, and here we use the
category = 'topic' argument to indicate that the
tokens$topic column should be used to colour the words. If you use the
view = TRUE argument, you’ll also directly see the browser in the viewer panel.
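A sketch of this call:

```r
## create an HTML browser with words colored by their topic assignment
url = browse_texts(tc, category = 'topic', view = TRUE)
```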
By default, browse_texts only uses the first 500 documents to prevent making huge HTML files. The number of documents can be changed (n = 100), and the selection can be set to random (select = 'random'), optionally with a seed (seed = 1) to create a reproducible sample for validation purposes.
The browse_texts function can also be used to visualize types of annotations (highlight values between 0 and 1, scale values between -1 and 1, categories such as topics or search results), or to only create a text browser.
One of the nice features of corpustools is the rather extensive querying capabilities. We’ve actually implemented a detailed Lucene-like boolean query language. Not only can you use the common AND, OR and NOT operators (also with parentheses), but you can also look for words within a given word distance. You can also include all features in the tokens data.table in a query, such as part-of-speech tags or lemma.
Furthermore, since we are not concerned with competitive performance on huge databases, we can support some features that are often not supported or accessible in search engines. For instance:
Wildcards are often only supported at the end of a term (e.g., econom* for economy, economist, economic) but not at the start, which can be a problem in languages that like to stick words together, such as Dutch or German. We support wildcards anywhere you like.
A description of the query language can be found in the documentation of the search_features() function.
To demonstrate the search_features() function, we first make a tcorpus of the SOTU speeches. Note that we use split_sentences = T, so that we can also view results at the sentence level.
The search_features() function takes the tcorpus as the first argument. The query should be a character vector with one or multiple queries. Here we look for two queries, "terror*" and "war*".
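A sketch of this call (matching the hit counts shown below):

```r
hits = search_features(tc, query = c('terror*', 'war*'))
hits
```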
The first time a query search is performed, the token column is indexed to enable fast binary search (the tCorpus remembers the index).
The output of
search_features is a featureHits object.
## 284 hits (in 175 documents)
## code hits documents ## 1 query_1 184 126 ## 2 query_2 100 80
The regular output shows the total number of hits, and the summary shows the hits per query. Note, however, that the
code is now query_1 and query_2 because we didn’t label our queries. There are two ways to label queries in corpustools. One is to provide the code labels with the
code argument. The other is to prefix the label in the query string with a hashtag. Here we use the latter.
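A sketch, assuming the hashtag labeling syntax `label# query`:

```r
## label the queries by prefixing a label with a hashtag
hits = search_features(tc, query = c('Terror# terror*', 'War# war*'))
summary(hits)
```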
## code hits documents ## 1 Terror 184 126 ## 2 War 100 80
It is useful to know that the
featureHits object has a
queries slot and a hits slot. The queries slot contains the query input for provenance, and the hits slot shows the tokens that are matched. Note that queries can match multiple words. The hit_id column indicates the unique matches.
## code query ## 1 Terror terror* ## 2 War war*
## code feature doc_id sentence token_id hit_id ## 1: Terror terrorists 111542025 NA 99 1 ## 2: Terror terrorists 111542114 NA 84 2 ## 3: War war 111542119 NA 5 1 ## 4: Terror terrorism 111542119 NA 34 3 ## 5: War war 111542189 NA 50 2 ## 6: War war 111542284 NA 54 3
Say we have the following search results.
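A sketch of queries that could produce such results; the labels match the tables below, but the exact query terms are assumptions:

```r
hits = search_features(tc, query = c('Economy# econom*',
                                     'Education# educat*',
                                     'Terrorism# terroris*',
                                     'War# war*'))
```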
We can use the count_tcorpus method to count hits. Here we use
wide = TRUE to return results in wide format.
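The calls producing the two tables below (a sketch based on the count_tcorpus usage later in this section):

```r
## total counts across the corpus
count_tcorpus(tc, hits, wide = TRUE)

## counts grouped by a meta column
count_tcorpus(tc, hits, meta_cols = 'president', wide = TRUE)
```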
## group N V Economy Education Terrorism War ## 1: total 1090 90827 182 126 95 97
## president N V Economy Education Terrorism War ## 1: Barack Obama 554 45956 115 80 14 38 ## 2: George W. Bush 536 44871 67 46 81 59
By setting the wide argument to FALSE, the query results are stacked in a long format, with the columns code and
count for the labels and scores. Among other things, this makes it easy to prepare data for plotting with ggplot2.
library(ggplot2)

date_hits = count_tcorpus(tc, hits, meta_cols = 'date', wide = F)
ggplot(date_hits, aes(x = date, y = count, group = code)) +
  geom_line(aes(linetype = code))

pres_hits = count_tcorpus(tc, hits, meta_cols = 'president', wide = F)
ggplot(pres_hits, aes(president, count)) +
  geom_col(aes(fill = code), width = .5, position = "dodge")
We can create semantic networks based on the co-occurrence of queries in documents. For more information on using the semnet and
semnet_window functions and visualizing semantic networks, see the brief tutorial below.
g is a network in the igraph format. The data can also be extracted as an edgelist (
igraph::get.data.frame(g, 'edges')) or adjacency matrix (
igraph::get.adjacency(g, attr = 'weight')).
## 4 x 4 sparse Matrix of class "dgCMatrix" ## Economy Education Terrorism War ## Economy . 0.1429 0.0385 0.0385 ## Education 0.2063 . 0.0238 0.0635 ## Terrorism 0.0737 0.0316 . 0.2842 ## War 0.0722 0.0825 0.2784 .
We like the idea of using semantic networks for getting a quick indication of the occurrence and co-occurrence of search results, so we also made it the default plot for a featureHits object (output of search_features) and contextHits object (output of search_contexts). The size of nodes indicates relative frequency, edge width indicates how often queries occur in the same documents, and colors indicate clusters.
We can create HTML browsers that highlight hits in full text. If you run this command in RStudio with
view = TRUE, you’ll directly see the browser in the viewer pane.
The function also returns the path where the html file is saved. A nice thing to know here is that you can use the base function
browseURL(url) to open urls in your default browser.
We can also view hits in keyword-in-context (kwic) listings. This shows the keywords within a window of a given word distance. If keywords represent queries with multiple terms (AND statements, or word proximities), the window is given for each keyword (with a […] separator if the windows do not overlap).
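A sketch of such a kwic listing; the query is an assumption based on the output shown:

```r
## keyword-in-context listing for a proximity/AND query
get_kwic(tc, query = 'america* AND freedom*', n = 2)
```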
## doc_id code hit_id feature ## 1 111542576 1 America -> freedom ## 2 111542742 2 America -> freedom ## kwic ## 1 ...misguided idealism. In reality, the future security of <America> depends on it. On September the 11th, 2001 [...] and join the fight against terror. Every step toward <freedom> in the world makes our country safer, so we... ## 2 ...In Afghanistan, <America>, our 25 NATO allies, and 15 partner nations are helping the Afghan people defend their <freedom> and rebuild their country. Thanks to the courage of...
Technically, you can do pretty much anything with query hits that you can do with regular tokens. The format of the tcorpus allows any token-level annotations to be added to the tokens data.frame. For adding query hits, we also have a wrapper function that does this, called code_features.
## doc_id sentence token_id token code ## 1: 111541965 1 1 It <NA> ## 2: 111541965 1 2 is <NA> ## 3: 111541965 1 3 our <NA> ## 4: 111541965 1 4 unfinished Example ## 5: 111541965 1 5 task <NA> ## 6: 111541965 1 6 to <NA> ## 7: 111541965 1 7 restore Example ## 8: 111541965 1 8 the <NA> ## 9: 111541965 1 9 basic Example ## 10: 111541965 1 10 bargain Example
tc$code_features is an R6 method, and adds the ‘code’ column by reference. If this is new to you, an explanation is provided below in the
Using the tcorpus R6 methods section.
The search_contexts function is very similar to
search_features, but it only looks for whether a query occurs in a document or sentence.
## 68 documents
You can also subset a tCorpus with a query, using the subset_query function.
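For example (a sketch):

```r
## keep only the documents in which the query matches
tc_war = subset_query(tc, 'war*')
```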
As with regular subset, there is also an R6 method for subset_query that subsets the tCorpus by reference.
This subsets tc without making a copy. Subsetting by reference is explained in more detail in the
Using the tcorpus R6 methods section.
The search_dictionary function, and the related code_dictionary method, are alternatives to search_features and search_contexts for working with larger but less complex dictionaries. For example, dictionaries with names of people, organizations or countries, or scores for certain emotions. Most of these dictionaries do not use Boolean operators such as AND or more complicated patterns such as word proximity.
The following code shows an example (not shown in this vignette, because it depends on quanteda). Here we use one of the dictionaries provided by quanteda. The quanteda dictionary class (dictionary2) can be used as input for the search dictionary functions.
Given a tCorpus, the dictionary can be used (by default on the "token" column) with the search_dictionary function.
The output of this function is the same as
search_features, so we can do anything with it as discussed above. For example, we can count and plot the data.
The code_dictionary R6 method can be used to annotate the tokens data with the results. For the LSD2015 dictionary this would add a column with the labels "positive" and "negative", and the negated terms "neg_positive" and "neg_negative".
Instead of using these labels, we can also convert them to scores on a scale from -1 (negative) to 1 (positive). While it is possible to create the numeric scores for sentiment within the tCorpus, we’ll show another option here. Instead of using the quanteda dictionary, we can use a data.frame as a dictionary as long as it has a column named “string” that contains the dictionary pattern. All other columns in the data.frame are then added to $tokens. With the
melt_quanteda_dict function we can also convert a quanteda dictionary to this type of data.frame. The code is then as follows:
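A sketch of that code, using quanteda's LSD2015 sentiment dictionary; the value coding shown is an assumption:

```r
library(quanteda)

## convert the quanteda dictionary to a data.frame with a 'string' column
dict = melt_quanteda_dict(data_dictionary_LSD2015)

## add a numeric sentiment score: positive and negated-negative terms get 1,
## negative and negated-positive terms get -1
dict$value = ifelse(dict$code %in% c('positive', 'neg_negative'), 1, -1)

## annotate the tokens; the 'value' column is added to tc$tokens
tc$code_dictionary(dict)
```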
The browse_texts function can also color words based on a scale value between -1 and 1. This is a nice way to inspect and validate the results of a sentiment dictionary.
We can use
agg_tcorpus to aggregate the results. This is a convenient wrapper for data.table aggregation for tokens data. Especially important is the .id argument, which allows us to specify an id column. In the dictionary results, there are matches that span multiple rows (e.g., “not good”), and we only want to count these values once. By using the code_id column that code_dictionary added to indicate unique matches, we only aggregate unique non-NA rows in tokens.
The by argument also lets us group by columns in the tokens or meta data.
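A sketch, assuming a 'value' column with sentiment scores was added by code_dictionary:

```r
## mean sentiment score per president,
## counting multi-token matches once via the code_id column
agg_tcorpus(tc, sent = mean(value), .id = 'code_id', by = 'president')
```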
The tCorpus is mainly intended for managing, querying, annotating and browsing tokens in a tokenlist format, but there are some built-in techniques for finding patterns in frequencies and co-occurrences of features. Here we use tokens with basic preprocessing as an example, but note that these techniques can be applied for all features and annotations in a tCorpus.
Semantic networks can be created based on the co-occurrence of features in contexts (documents, sentences) or word windows. For this we use the semnet and semnet_window functions.
It is recommended to first apply preprocessing to filter out terms below a given document frequency, because the number of comparisons and the size of the output matrix/network increases exponentially with vocabulary size.
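A sketch of these two steps; the window.size argument name is an assumption, see ?semnet_window:

```r
tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')
tc$preprocess(use_stemming = T, remove_stopwords = T, min_docfreq = 10)

## co-occurrence of features within a word window of 10 tokens
g = semnet_window(tc, 'feature', window.size = 10)
```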
We can view the network with the
plot_semnet function. However, this network will be a mess if we do not apply some filter on nodes (i.e. features) and edges (i.e. similarity scores). One way is to take only the most frequent words, and use a hard threshold for similarity.
A nice alternative is to use backbone extraction using the
backbone_filter function. We then give an alpha score (similar to a p-value, but see the paper referenced in the function documentation for details) to filter edges. In addition, the max_vertices argument can be used to specify the maximum number of nodes/features. Instead of looking for the most frequent features, this approach keeps lowering the alpha and deleting isolates until the max is reached. As a result, the filter focuses on features with strong co-occurrences.
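A sketch of this (the alpha value is a hypothetical choice; max_vertices = 100 matches the message below):

```r
## filter edges by backbone alpha and cap the number of nodes
gb = backbone_filter(g, alpha = 0.01, max_vertices = 100)
plot_semnet(gb)
```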
## Used cutoff edge-weight 2.13863525362624e-05 to keep number of vertices under 100
## (For the edges the original weight is still used)
We can compare feature frequencies between two corpora (
compare_corpus), or between subsets of a corpus (
compare_subset). Here we compare the subset of speeches from Obama with the other speeches (in this demo data, only those of George W. Bush).
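A sketch of this comparison, assuming the subset_meta_x argument for selecting the subset:

```r
## compare feature frequencies in Obama speeches with the rest of the corpus
comp = compare_subset(tc, feature = 'feature',
                      subset_meta_x = president == 'Barack Obama')
```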
comp is a data.frame with, for each feature, how often it occurred in the subset (Obama) and in the rest of the corpus (Bush), including (smoothed) probability scores, the ratio of probabilities, and a chi2 value.
For a quick peek, we can plot the most strongly overrepresented (ratio > 1) and underrepresented (ratio < 1) features. By default, the plot method for the output of a vocabulary comparison function (such as compare_subset) plots the log ratio on the x axis, and uses the chi2 score to determine the size of words (chi2 can be larger for more frequent terms, but also relies on how strongly a term is over/under-represented).
For alternative settings, see the function documentation.
Similar to a corpus comparison, we can compare how frequently words occur close to the results of a query to the overall frequency of these words. This gives a nice indication of the context in which the queried terms tend to occur.
The feature_associations function requires either a query or hits (results of search_features). With the feature argument we indicate which feature column to compare the values of. Note that this is not necessarily the column on which we want to perform the query. For example, we might want to compare frequencies of stemmed, lowercase words, or topics from a topic model, but perform the query on the raw tokens.
By default, the query is always performed on the ‘token’ column. In the following example we thus query on the tokens, but look at the frequencies of the preprocessed features. We query for "terror*".
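A minimal sketch, assuming a tCorpus whose preprocessed tokens are stored in a 'feature' column (e.g., as created with tc$preprocess):

```r
## query the raw 'token' column, but count frequencies of the
## preprocessed 'feature' column near the query hits
fa = feature_associations(tc, feature = 'feature', query = 'terror*')
head(fa)
```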
fa is almost identical to the output of a corpus comparison, but by default feature_associations only returns the top features with the strongest overrepresentation (underrepresentation is not really interesting here).
##     feature freq freq_NOT ratio  chi2
## 837     war   25       61  7.35 103.1
## 74   attack   16       30  9.57  84.4
## 540  offens    8        5 28.41  74.8
## 843  weapon   20       54  6.64  74.2
## 115    camp    7        4 30.97  66.7
## 308   fight   19       57  5.98  62.8
We can also plot the output as a wordcloud.
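For instance, with fa as computed above (the plot method's default settings are used here):

```r
plot(fa)  # wordcloud of the most overrepresented features
```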
For more details, see the feature_associations documentation.
The downside of preserving all text information of the corpus is that all this information has to be kept in memory. To work with this type of data efficiently, we need to prevent making unnecessary copies of the data. The tCorpus therefore uses the data.table package to manage the tokens and meta data, and uses R6 class methods to modify data by reference (i.e. without copying).
R6 methods are accessed with the dollar symbol, similar to how you access columns in a data.frame. For example, this is how you access the tcorpus subset method.
R6 methods of the tCorpus modify the data by reference. That is, instead of creating a new tCorpus that can be assigned to a new name, the tCorpus from which the method is used is modified directly. In the following example we use the R6 subset method to subset tc by reference.
## doc_id token_id token ## 1: 1 1 this ## 2: 1 2 is
Note that the subset has correctly been performed. We did not need to assign the output of tc$subset() to overwrite tc; that is, we didn't need to do tc = tc$subset(token_id < 3). The advantage of this approach is that tc does not have to be copied, which can be slow or problematic memory-wise with a large corpus.
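The steps above can be sketched with a tiny example corpus:

```r
library(corpustools)

tc = create_tcorpus('this is an example')
tc$subset(token_id < 3)  # modifies tc by reference; no assignment needed
tc$tokens                # only the first two tokens remain
```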
However, the catch is that you need to be aware of the distinction between deep and shallow copies (one of the nice things about R is that you normally do not have to worry about this). If this is new to you, please read the following section. Otherwise, skip to Copying a tCorpus to see how you can make deep copies if you want to.
Here we show two examples of possible mistakes in using corpustools that you might make if you are not familiar with modifying by reference. The following section shows how to do this properly.
Firstly, you can assign the output of a tCorpus R6 method, but this will only be a shallow copy of the tCorpus. The original corpus will still be modified. In the following example, we create a subset of tc and assign it to tc2, but we see that tc itself is still modified.
##  TRUE
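A sketch of this pitfall (using a tiny example corpus):

```r
library(corpustools)

tc = create_tcorpus('this is an example')
tc2 = tc$subset(token_id < 3)        # tc2 is only a shallow copy...
nrow(tc$tokens) == nrow(tc2$tokens)  # ...tc itself was modified as well
```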
The second thing to be aware of is that you do not copy a tCorpus by assigning it to a different name. In the following example, tc2 is a shallow copy of tc, and by modifying tc2 we also modify tc.
##  TRUE
The tCorpus R6 methods modify the tCorpus by reference. If you do not want this, you have several options. Firstly, for methods where it makes sense, we provide identical regular functions that do not modify the tCorpus by reference. For example, instead of the subset R6 method you could use the subset function in classic R style.
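For example (a sketch; tc is assumed to be an existing tCorpus):

```r
tc2 = subset(tc, token_id < 3)  # classic R style: tc2 is a new tCorpus, tc is unchanged
```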
Alternatively, these R6 methods also have a copy argument, which can be set to TRUE to force a deep copy.
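For example:

```r
tc2 = tc$subset(token_id < 3, copy = TRUE)  # forces a deep copy; tc is not modified
```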
Finally, you could always make a deep copy of the entire tCorpus. This is not the same as simply assigning the tCorpus to a new name.
The difference is that in the shallow copy, if you change tc2 by reference (e.g., by using the subset R6 method), it will also change tc! With a deep copy, this does not happen, because the entire tCorpus is copied and stored separately in memory.
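A sketch, assuming the tCorpus provides a copy method for deep copies (R6 objects also support clone()):

```r
tc2 = tc$copy()           # deep copy of the entire tCorpus
tc2$subset(token_id < 3)  # modifies only tc2; tc is untouched
```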
Of course, making deep copies is ‘expensive’ in speed and memory. Our recommended approach is therefore to not overwrite or remove data from the corpus, but rather add columns or make values NA. The tCorpus was designed to keep all information in one big corpus, and manages to do this fairly efficiently by properly using the data.table package to prevent copying where possible.