Read text files with readtext()

# Load readtext package
library(readtext)

1. Introduction

This vignette walks you through importing a variety of text files into R using the readtext package. Currently, readtext supports plain text files (.txt), JavaScript Object Notation data (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), and PDF and Microsoft Word files (.pdf, .doc, .docx).

readtext also handles multiple files and file types at once, for instance through a “glob” expression, and can read files from a URL or from an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually you do not have to specify the format of the files explicitly: readtext infers it from the file extension.
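As a rough illustration of what a “glob” expression selects, the following sketch uses base R's Sys.glob() on a throwaway directory. The directory name glob_demo and the file names are invented for the example, and this only shows the pattern matching itself, not how readtext resolves globs internally.

```r
# Create a throwaway directory with three empty files (hypothetical names)
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.txt", "b.txt", "notes.csv"))))

# "*" matches any run of characters, so "*.txt" selects only the .txt files
txt_files <- Sys.glob(file.path(demo_dir, "*.txt"))
basename(txt_files)
```

The same kind of pattern can be handed to readtext() in place of a single file name, as the examples in section 2 do.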

The readtext package comes with a data directory called extdata that contains examples of all the file types listed above. This vignette uses that directory throughout.

# Get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

The extdata directory contains several subfolders that include different text files. In the following examples, we load one or more files stored in each of these folders. The paste0() function is used to concatenate the path of the extdata folder with the names of the subfolders. When reading your own text files, you will need to point to your own data directory (see ?setwd).
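For readers who prefer not to manage the slashes by hand, base R's file.path() builds the same kind of path. This is only a sketch: the directory /path/to/extdata is a placeholder, not a real location.

```r
# Placeholder directory, standing in for the value of DATA_DIR
data_dir <- "/path/to/extdata"

# Concatenating with paste0(), with explicit slashes
p1 <- paste0(data_dir, "/txt/UDHR/*")

# Building the same path with file.path(), which inserts the separators
p2 <- file.path(data_dir, "txt", "UDHR", "*")

identical(p1, p2)
```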

2. Reading one or more text files

2.1 Plain text files (.txt)

The folder “txt” contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.

# Read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
## readtext object consisting of 13 documents and 0 docvars.
## # data.frame [13 × 2]
##              doc_id                      text
##               <chr>                     <chr>
## 1  UDHR_chinese.txt "世界人权宣言\n联合国"...
## 2    UDHR_czech.txt           "VŠEOBECNÁ "...
## 3   UDHR_danish.txt           "Den 10. de"...
## 4  UDHR_english.txt           "Universal "...
## 5   UDHR_french.txt           "Déclaratio"...
## 6 UDHR_georgian.txt           "FLFVBFYBC "...
## # ... with 7 more rows

We can specify document-level metadata (docvars) based on the file names or on a separate data.frame. Below we take the docvars from the filenames (docvarsfrom = "filenames") and set the names for each variable (docvarnames = c("unit", "context", "year", "language", "party")). The argument dvsep = "_" specifies the separator (a regular expression character string) that delimits the docvar elements in the filenames.

# Manifestos with docvars from filenames
readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
         docvarsfrom = "filenames", 
         docvarnames = c("unit", "context", "year", "language", "party"),
         dvsep = "_", 
         encoding = "ISO-8859-1")
## readtext object consisting of 17 documents and 5 docvars.
## # data.frame [17 × 7]
##                    doc_id             text  unit context  year language
##                     <chr>            <chr> <chr>   <chr> <int>    <chr>
## 1 EU_euro_2004_de_PSE.txt  "PES · PSE "...    EU    euro  2004       de
## 2   EU_euro_2004_de_V.txt  "Gemeinsame"...    EU    euro  2004       de
## 3 EU_euro_2004_en_PSE.txt  "PES · PSE "...    EU    euro  2004       en
## 4   EU_euro_2004_en_V.txt "Manifesto\n"...    EU    euro  2004       en
## 5 EU_euro_2004_es_PSE.txt  "PES · PSE "...    EU    euro  2004       es
## 6   EU_euro_2004_es_V.txt "Manifesto\n"...    EU    euro  2004       es
## # ... with 11 more rows, and 1 more variables: party <chr>
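To make the idea of filename-based docvars concrete, here is a minimal base R sketch that splits two of the file names shown above on the separator and labels the pieces. It illustrates the principle only and is not readtext's own implementation; the explicit integer conversion of year is an assumption made to mirror the output above.

```r
# Two file names taken from the output above
filenames <- c("EU_euro_2004_de_PSE.txt", "EU_euro_2004_en_V.txt")

# Drop the extension, then split each name on the separator
stems <- tools::file_path_sans_ext(filenames)
parts <- strsplit(stems, "_")

# Bind the pieces into one row per file and name the columns
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars) <- c("unit", "context", "year", "language", "party")

# Convert the year from character to integer (done by hand in this sketch)
docvars$year <- as.integer(docvars$year)
docvars
```

The approach only works when every file name contains the same number of separator-delimited elements.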

readtext can also recurse through subdirectories. In our example, the folder txt/movie_reviews contains two subfolders (called neg and pos). We can load all texts included in both folders.

# Recurse through subdirectories
readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))
## readtext object consisting of 10 documents and 0 docvars.
## # data.frame [10 × 2]
##                    doc_id            text
##                     <chr>           <chr>
## 1 neg/neg_cv000_29416.txt "plot : two"...
## 2 neg/neg_cv001_19502.txt "the happy "...
## 3 neg/neg_cv002_17424.txt "it is movi"...
## 4 neg/neg_cv003_12683.txt " " quest f"...
## 5 neg/neg_cv004_12641.txt "synopsis :"...
## 6 pos/pos_cv000_29590.txt "films adap"...
## # ... with 4 more rows
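The doc_id values above keep the subfolder as a prefix. A base R sketch of a recursive listing shows the same relative-path shape; the directory reviews_demo and its files are invented for the example, and list.files() is used here only as an analogy to what readtext does.

```r
# Build a small directory tree with two subfolders (hypothetical names)
demo_dir <- file.path(tempdir(), "reviews_demo")
dir.create(file.path(demo_dir, "neg"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "pos"), recursive = TRUE, showWarnings = FALSE)
writeLines("a dull film", file.path(demo_dir, "neg", "neg_001.txt"))
writeLines("a great film", file.path(demo_dir, "pos", "pos_001.txt"))

# A recursive listing returns paths relative to the top-level folder
rel_paths <- list.files(demo_dir, recursive = TRUE)
rel_paths
```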

2.2 Comma- or tab-separated values (.csv, .tab, .tsv)

readtext can read comma-separated values (.csv files) that contain textual data. We specify the texts variable in our .csv file as the text_field. This is the column that contains the actual text. The other columns of the original .csv file (Year, President, FirstName) are by default treated as document-level variables.

# Read in comma-separated values
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
## readtext object consisting of 5 documents and 3 docvars.
## # data.frame [5 × 5]
##              doc_id            text  Year  President FirstName
##               <chr>           <chr> <int>      <chr>     <chr>
## 1 inaugCorpus.csv.1 "Fellow-Cit"...  1789 Washington    George
## 2 inaugCorpus.csv.2 "Fellow cit"...  1793 Washington    George
## 3 inaugCorpus.csv.3 "When it wa"...  1797      Adams      John
## 4 inaugCorpus.csv.4 "Friends an"...  1801  Jefferson    Thomas
## 5 inaugCorpus.csv.5 "Proceeding"...  1805  Jefferson    Thomas
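The split between the text field and the docvars can be sketched in base R. The CSV content, the file name sample.csv and the doc_id scheme below are invented to mimic the output above; this is an illustration of the layout, not readtext's code.

```r
# A tiny CSV with one text column and two other columns (invented content)
csv_text <- 'texts,Year,President
"First sample text",1789,Washington
"Second sample text, with a comma",1793,Washington'
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The column named by text_field becomes "text"; the rest become docvars
text_field <- "texts"
result <- data.frame(
  doc_id = paste0("sample.csv.", seq_len(nrow(raw))),
  text = raw[[text_field]],
  raw[setdiff(names(raw), text_field)],
  stringsAsFactors = FALSE
)
result
```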

The same procedure applies to tab-separated values.

# Read in tab-separated values
readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech")
## readtext object consisting of 33 documents and 9 docvars.
## # data.frame [33 × 11]
##             doc_id            text speechID memberID partyID constID
##              <chr>           <chr>    <int>    <int>   <int>   <int>
## 1 dailsample.tsv.1 "Molaimse d"...        1      977      22     158
## 2 dailsample.tsv.2 "Is bród mó"...        2     1603      22     103
## 3 dailsample.tsv.3 "' A cháird"...        3      116      22     178
## 4 dailsample.tsv.4 "Tá ceathra"...        4      116      22     178
## 5 dailsample.tsv.5 "Léighfead "...        5      116      22     178
## 6 dailsample.tsv.6 "-Braithean"...        6      116      22     178
## # ... with 27 more rows, and 5 more variables: title <chr>, date <chr>,
## #   member_name <chr>, party_name <chr>, const_name <chr>

2.3 JSON data (.json)

You can also read .json data. Again you need to specify the text_field.

# Read in JSON data
readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts")
## Warning in doTryCatch(return(expr), name, parentenv, handler): Doesn't look
## like Tweets json file, trying general JSON
## readtext object consisting of 3 documents and 3 docvars.
## # data.frame [3 × 5]
##                    doc_id            text  Year  President FirstName
##                     <chr>           <chr> <int>      <chr>     <chr>
## 1 inaugural_sample.json.1 "Fellow-Cit"...  1789 Washington    George
## 2 inaugural_sample.json.2 "Fellow cit"...  1793 Washington    George
## 3 inaugural_sample.json.3 "When it wa"...  1797      Adams      John

2.4 PDF files

readtext can also read in and convert .pdf files.

In the example below we load all .pdf files stored in the UDHR folder and specify that the docvars are taken from the filenames. We call the document-level variables document and language, and specify the delimiter (dvsep).

# Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"), 
                    docvarsfrom = "filenames", 
                    docvarnames = c("document", "language"),
                    dvsep = "_"))
## readtext object consisting of 11 documents and 2 docvars.
## # data.frame [11 × 4]
##             doc_id                      text document language
##              <chr>                     <chr>    <chr>    <chr>
## 1 UDHR_chinese.pdf "世界人权宣言\n联合国"...     UDHR  chinese
## 2   UDHR_czech.pdf           "VŠEOBECNÁ "...     UDHR    czech
## 3  UDHR_danish.pdf           "Den 10. de"...     UDHR   danish
## 4 UDHR_english.pdf           "Universal "...     UDHR  english
## 5  UDHR_french.pdf           "Déclaratio"...     UDHR   french
## 6   UDHR_greek.pdf           "ΟΙΚΟΥΜΕΝΙΚ"...     UDHR    greek
## # ... with 5 more rows

2.5 Microsoft Word files (.doc, .docx)

Microsoft Word files are converted using the antiword package for older .doc files, and by parsing the XML for newer .docx files.

# Read in Word data (.docx)
readtext(paste0(DATA_DIR, "/word/*.docx"))
## readtext object consisting of 2 documents and 0 docvars.
## # data.frame [2 × 2]
##                        doc_id            text
##                         <chr>           <chr>
## 1 UK_2015_EccentricParty.docx "The Eccent"...
## 2     UK_2015_LoonyParty.docx "The Offici"...

2.6 Text from URLs

You can also read in text directly from a URL.

# Note: Example required: which URL should we use?

2.7 Text from archive files (.zip, .tar, .tar.gz, .tar.bz)

Finally, it is possible to read text from archive files.

# Note: Archive file required. The only zip archive included in readtext has 
# different encodings and is difficult to import (see section 4.2).

3. Inter-operability with quanteda

readtext was originally developed in early versions of the quanteda package for the quantitative analysis of textual data. It was spawned from the textfile() function from that package, and now lives exclusively in readtext. Because quanteda’s corpus constructor recognizes the data.frame format returned by readtext(), it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.

library(quanteda)

You can easily construct a corpus from a readtext object.

# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")

# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
## Corpus consisting of 5 documents.
## 
##               Text Types Tokens Sentences Year  President FirstName
##  inaugCorpus.csv.1   626   1542        23 1789 Washington    George
##  inaugCorpus.csv.2    96    147         4 1793 Washington    George
##  inaugCorpus.csv.3   826   2584        37 1797      Adams      John
##  inaugCorpus.csv.4   716   1935        41 1801  Jefferson    Thomas
##  inaugCorpus.csv.5   804   2381        45 1805  Jefferson    Thomas
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/readtext/vignettes/* on x86_64 by kbenoit
## Created: Sun May 21 20:24:55 2017
## Notes:

4. Solving common problems

4.1 Remove page numbers using regular expressions

When a document contains page numbers, they are imported as well. If you want to remove them, you can use a regular expression. We strongly recommend using the stringi package. For the most common regular expressions you can look at this cheatsheet.

First check the original file to see which format the page numbers take (e.g., “1”, “-1-”, “page 1” etc.). We can make use of the fact that page numbers are almost always preceded and followed by a line break (\n). After loading the text with readtext, you can remove the page numbers.

# Load stringi package
library(stringi)

In the first example, the page numbers have the format “page X”.

# Make some text with page numbers
sample_text_a <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, 
page 1 
with the newspaper from a boy named quick Seamus, in his mouth.
page 2
The quicker brown fox jumped over 2 lazy dogs."

sample_text_a
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, \npage 1 \nwith the newspaper from a boy named quick Seamus, in his mouth.\npage 2\nThe quicker brown fox jumped over 2 lazy dogs."

# Remove "page" and respective digit
sample_text_a2 <- unlist(stri_split_fixed(sample_text_a, '\n'), use.names = FALSE)
sample_text_a2 <- stri_replace_all_regex(sample_text_a2, "page \\d*", "")
sample_text_a2 <- stri_trim_both(sample_text_a2)
sample_text_a2 <- sample_text_a2[sample_text_a2 != '']
stri_paste(sample_text_a2, collapse = '\n')
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,\nwith the newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."

In the second example we remove page numbers which have the format “- X -”.

sample_text_b <- "The quick brown fox named Seamus 
- 1 - 
jumps over the lazy dog also named Seamus, with 
- 2 - 
the newspaper from a boy named quick Seamus, in his mouth. 
- 33 - 
The quicker brown fox jumped over 2 lazy dogs."

sample_text_b
## [1] "The quick brown fox named Seamus \n- 1 - \njumps over the lazy dog also named Seamus, with \n- 2 - \nthe newspaper from a boy named quick Seamus, in his mouth. \n- 33 - \nThe quicker brown fox jumped over 2 lazy dogs."

sample_text_b2 <- unlist(stri_split_fixed(sample_text_b, '\n'), use.names = FALSE)
sample_text_b2 <- stri_replace_all_regex(sample_text_b2, "[-] \\d* [-]", "")
sample_text_b2 <- stri_trim_both(sample_text_b2)
sample_text_b2 <- sample_text_b2[sample_text_b2 != '']
stri_paste(sample_text_b2, collapse = '\n')
## [1] "The quick brown fox named Seamus\njumps over the lazy dog also named Seamus, with\nthe newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."

Such stringi functions can also be applied to readtext objects.
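One way to apply such a cleaning step to every document is to work on the text column of the data.frame. The sketch below makes several assumptions: a hand-built data.frame with doc_id and text columns stands in for a readtext object, base R functions are used in place of stringi so that it runs without extra packages, and the helper remove_page_lines() and the sample texts are invented for illustration. Unlike the examples above, it drops only lines that consist entirely of a page marker.

```r
# Drop lines that consist only of a page marker, then re-join the text
remove_page_lines <- function(x, pattern) {
  lines <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])
  lines <- lines[!grepl(pattern, lines)]
  lines <- lines[lines != ""]
  paste(lines, collapse = "\n")
}

# Stand-in for a readtext object: one row per document (invented content)
rt <- data.frame(
  doc_id = c("a.txt", "b.txt"),
  text = c("First part,\npage 1 \nsecond part.",
           "Intro\n- 2 - \nBody text with 2 dogs."),
  stringsAsFactors = FALSE
)

# Match either "page X" or "- X -" when it fills the whole line
page_pattern <- "^(page [0-9]+|- [0-9]+ -)$"
rt$text <- vapply(rt$text, remove_page_lines, character(1),
                  pattern = page_pattern, USE.NAMES = FALSE)
rt$text
```

Anchoring the pattern to the whole line leaves numbers inside running text, such as "2 dogs", untouched.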

4.2 Read files with different encodings

Sometimes files of the same type have different encodings. If the encoding of a file is included in the file name, we can extract this information and import the texts correctly.

# create a temporary directory to extract the .zip file
FILEDIR <- tempdir()
# unzip file
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = FILEDIR)

Here, we will get the encoding from the filenames themselves.

# get encoding from filename
filenames <- list.files(FILEDIR, "^(Indian|UDHR_).*\\.txt$")

head(filenames)
## [1] "IndianTreaty_English_UTF-16LE.txt" 
## [2] "IndianTreaty_English_UTF-8-BOM.txt"
## [3] "UDHR_Arabic_ISO-8859-6.txt"        
## [4] "UDHR_Arabic_UTF-8.txt"             
## [5] "UDHR_Arabic_WINDOWS-1256.txt"      
## [6] "UDHR_Chinese_GB2312.txt"

# Strip the extension
filenames <- gsub("\\.txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)

head(fileencodings)
## [1] "UTF-16LE"     "UTF-8-BOM"    "ISO-8859-6"   "UTF-8"       
## [5] "WINDOWS-1256" "GB2312"

# Check whether certain file encodings are not supported
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
## [1] "UTF-8-BOM"

If we read the text files without specifying the encoding, we get erroneously formatted text. To avoid this, we pass the encodings stored in the character vector fileencodings created above.
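A small base R sketch shows why the declared encoding matters. It builds the bytes of a French word as they would be stored in an ISO-8859-1 file and converts them with iconv(); the variable names are invented, and the example stands apart from how readtext performs the conversion.

```r
# Bytes of the French word "Declaration" (with e-acute) in ISO-8859-1,
# where the e-acute is the single byte 0xE9
latin1_bytes <- as.raw(c(0x44, 0xe9, 0x63, 0x6c, 0x61, 0x72,
                         0x61, 0x74, 0x69, 0x6f, 0x6e))
latin1_text <- rawToChar(latin1_bytes)

# Interpreted as UTF-8, the lone 0xE9 byte is not a valid sequence
validUTF8(latin1_text)

# Declaring the source encoding lets iconv() convert the bytes correctly
utf8_text <- iconv(latin1_text, from = "ISO-8859-1", to = "UTF-8")
validUTF8(utf8_text)
```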

We can also add docvars based on the filenames.

txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"), 
                 encoding = fileencodings,
                 docvarsfrom = "filenames", 
                 docvarnames = c("document", "language", "input_encoding"))
print(txts, n = 50)
## readtext object consisting of 36 documents and 3 docvars.
## # data.frame [36 × 5]
##                                doc_id                      text
##                                 <chr>                     <chr>
## 1   IndianTreaty_English_UTF-16LE.txt           "WHEREAS, t"...
## 2  IndianTreaty_English_UTF-8-BOM.txt           "ARTICLE 1."...
## 3          UDHR_Arabic_ISO-8859-6.txt          "الديباجة\nل"...
## 4               UDHR_Arabic_UTF-8.txt          "الديباجة\nل"...
## 5        UDHR_Arabic_WINDOWS-1256.txt          "الديباجة\nل"...
## 6             UDHR_Chinese_GB2312.txt "世界人权宣言\n联合国"...
## 7                UDHR_Chinese_GBK.txt "世界人权宣言\n联合国"...
## 8              UDHR_Chinese_UTF-8.txt "世界人权宣言\n联合国"...
## 9           UDHR_English_UTF-16BE.txt           "Universal "...
## 10          UDHR_English_UTF-16LE.txt           "Universal "...
## 11             UDHR_English_UTF-8.txt           "Universal "...
## 12      UDHR_English_WINDOWS-1252.txt           "Universal "...
## 13         UDHR_French_ISO-8859-1.txt           "Déclaratio"...
## 14              UDHR_French_UTF-8.txt           "Déclaratio"...
## 15       UDHR_French_WINDOWS-1252.txt           "Déclaratio"...
## 16         UDHR_German_ISO-8859-1.txt           "Die Allgem"...
## 17              UDHR_German_UTF-8.txt           "Die Allgem"...
## 18       UDHR_German_WINDOWS-1252.txt           "Die Allgem"...
## 19              UDHR_Greek_CP1253.txt           "ΟΙΚΟΥΜΕΝΙΚ"...
## 20          UDHR_Greek_ISO-8859-7.txt           "ΟΙΚΟΥΜΕΝΙΚ"...
## 21               UDHR_Greek_UTF-8.txt           "ΟΙΚΟΥΜΕΝΙΚ"...
## 22               UDHR_Hindi_UTF-8.txt           "मानव अधिका"...
## 23      UDHR_Icelandic_ISO-8859-1.txt           "Mannréttin"...
## 24           UDHR_Icelandic_UTF-8.txt           "Mannréttin"...
## 25    UDHR_Icelandic_WINDOWS-1252.txt           "Mannréttin"...
## 26            UDHR_Japanese_CP932.txt  "『世界人権宣言』\n "...
## 27      UDHR_Japanese_ISO-2022-JP.txt  "『世界人権宣言』\n "...
## 28            UDHR_Japanese_UTF-8.txt  "『世界人権宣言』\n "...
## 29      UDHR_Japanese_WINDOWS-936.txt  "『世界人権宣言』\n "...
## 30        UDHR_Korean_ISO-2022-KR.txt      "세 계 인 권 선 "...
## 31              UDHR_Korean_UTF-8.txt      "세 계 인 권 선 "...
## 32        UDHR_Russian_ISO-8859-5.txt           "Всеобщая д"...
## 33            UDHR_Russian_KOI8-R.txt           "Всеобщая д"...
## 34             UDHR_Russian_UTF-8.txt           "Всеобщая д"...
## 35      UDHR_Russian_WINDOWS-1251.txt           "Всеобщая д"...
## 36                UDHR_Thai_UTF-8.txt            "ปฏิญญาสากล"...
## # ... with 3 more variables: document <chr>, language <chr>,
## #   input_encoding <chr>

From this readtext object we can easily create a quanteda corpus object.

corpus_txts <- corpus(txts)
summary(corpus_txts, 5)
## Corpus consisting of 36 documents, showing 5 documents.
## 
##                                Text Types Tokens Sentences     document
##   IndianTreaty_English_UTF-16LE.txt   690   2938       155 IndianTreaty
##  IndianTreaty_English_UTF-8-BOM.txt   646   3104       154 IndianTreaty
##          UDHR_Arabic_ISO-8859-6.txt   753   1555        86         UDHR
##               UDHR_Arabic_UTF-8.txt   753   1555        86         UDHR
##        UDHR_Arabic_WINDOWS-1256.txt   753   1555        86         UDHR
##  language input_encoding
##   English       UTF-16LE
##   English      UTF-8-BOM
##    Arabic     ISO-8859-6
##    Arabic          UTF-8
##    Arabic   WINDOWS-1256
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/readtext/vignettes/* on x86_64 by kbenoit
## Created: Sun May 21 20:24:55 2017
## Notes: