Introduction to wikisourcer

Félix Luginbül

The digital library Wikisource, a sister project of Wikipedia, hosts books in the public domain in almost all languages. More than 100,000 books are accessible in each of English, Spanish, French, German, Russian and Chinese.

The wikisourcer R package helps you download any book or page from Wikisource. The text is returned as a tidy data frame, so it can be analyzed within the tidyverse ecosystem, as explained for example in the book Text Mining with R.

Download books

To download Voltaire’s philosophical novel Candide, simply pass the URL of its table of contents to the wikisource_book function. Note that the text is already organized by chapter through the page variable.

library(wikisourcer)

wikisource_book("https://en.wikisource.org/wiki/Candide")
## # A tibble: 894 x 5
##    text                           page language url                  title
##    <chr>                         <int> <chr>    <chr>                <chr>
##  1 ""                                1 en       https://en.wikisour… Cand…
##  2 In the country of Westphalia…     1 en       https://en.wikisour… Cand…
##  3 "The Baron was one of the mo…     1 en       https://en.wikisour… Cand…
##  4 My Lady Baroness, who weighe…     1 en       https://en.wikisour… Cand…
##  5 Master Pangloss taught the m…     1 en       https://en.wikisour… Cand…
##  6 "\"It is demonstrable,\" sai…     1 en       https://en.wikisour… Cand…
##  7 Candide listened attentively…     1 en       https://en.wikisour… Cand…
##  8 One day when Miss Cunegund w…     1 en       https://en.wikisour… Cand…
##  9 On her way back she happened…     1 en       https://en.wikisour… Cand…
## 10 ""                                1 en       https://en.wikisour… Cand…
## # ... with 884 more rows

Multiple books can easily be downloaded using the purrr package. For example, we can download Candide in French, English, Spanish and Italian.

library(purrr)

fr <- "https://fr.wikisource.org/wiki/Candide,_ou_l%E2%80%99Optimisme/Garnier_1877"
en <- "https://en.wikisource.org/wiki/Candide"
es <- "https://es.wikisource.org/wiki/C%C3%A1ndido,_o_el_optimismo"
it <- "https://it.wikisource.org/wiki/Candido"
urls <- c(fr, en, es, it)

candide <- purrr::map_df(urls, wikisource_book)

Before analyzing the text, we should remove any remaining Wikisource metadata.

library(stringr)
library(dplyr)

candide_cleaned <- candide %>%
  filter(!str_detect(text, "CHAPITRE|↑")) %>% #clean French
  filter(!str_detect(text, "CAPITULO")) %>% #clean Spanish
  filter(!str_detect(text, "\\.\\./|IncludiIntestazione|Romanzi|^\\d+")) #clean Italian (dots escaped to match a literal "../")
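
In a regular expression, an unescaped dot matches any character, so a pattern meant to remove lines containing the literal string ../ needs escaped dots. The following minimal sketch shows the difference using base R on three made-up lines; the sample strings are assumptions for illustration, not actual Wikisource output.

```r
# Made-up lines standing in for downloaded text (illustrative only)
lines <- c("../Capitolo I", "et/ou", "Candido")

# Unescaped: "." matches any character, so any two characters
# followed by a slash are matched
unescaped <- grepl("../", lines)

# Escaped: only a literal "../" is matched
escaped <- grepl("\\.\\./", lines)

# Alternative: treat the whole pattern as a fixed string
literal <- grepl("../", lines, fixed = TRUE)

print(unescaped)  # TRUE  TRUE FALSE
print(escaped)    # TRUE FALSE FALSE
print(literal)    # TRUE FALSE FALSE
```

The same patterns behave identically inside stringr::str_detect(), where fixed("../") is the equivalent of fixed = TRUE.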

We can now compare the number of words in each chapter by language.

library(tidytext)
library(ggplot2)

candide_cleaned %>%
  tidytext::unnest_tokens(word, text) %>%
  count(page, language, sort = TRUE) %>%
  ggplot(aes(x = as.factor(page), y = n, fill = language)) +
    geom_col(position = "dodge") +
    theme_minimal() +
    labs(x = "chapter", y = "number of words",
         title = "Multilingual Text analysis of Voltaire's Candide")

Download pages

The wikisource_book function does not work for every book. It fails when the URL path of the main page differs from the paths of the linked pages. In that case, the pages can be downloaded with the wikisource_page function instead.

The wikisource_page function takes two arguments: the Wikisource URL and an optional title for the page. For example, we can download Shakespeare’s Sonnet 18.

library(wikisourcer)

wikisource_page("https://en.wikisource.org/wiki/Sonnet_18_(Shakespeare)", "Sonnet 18")
## # A tibble: 26 x 4
##    text                       page    language url                        
##    <chr>                      <chr>   <chr>    <chr>                      
##  1 ""                         Sonnet… en       https://en.wikisource.org/…
##  2 ""                         Sonnet… en       https://en.wikisource.org/…
##  3 Shall I compare thee to a… Sonnet… en       https://en.wikisource.org/…
##  4 Thou art more lovely and … Sonnet… en       https://en.wikisource.org/…
##  5 Rough winds do shake the … Sonnet… en       https://en.wikisource.org/…
##  6 And summer's lease hath a… Sonnet… en       https://en.wikisource.org/…
##  7 Sometime too hot the eye … Sonnet… en       https://en.wikisource.org/…
##  8 And often is his gold com… Sonnet… en       https://en.wikisource.org/…
##  9 And every fair from fair … Sonnet… en       https://en.wikisource.org/…
## 10 By chance, or nature's ch… Sonnet… en       https://en.wikisource.org/…
## # ... with 16 more rows

Let’s try to download the 154 Sonnets of William Shakespeare using wikisource_book.

wikisource_book("https://en.wikisource.org/wiki/The_Sonnets")
## Warning in wikisource_book("https://en.wikisource.org/wiki/The_Sonnets"):
## Could not download a book at https://en.wikisource.org/wiki/The_Sonnets
## # A tibble: 0 x 1
## # ... with 1 variable: title <chr>

The download failed because the main wiki url wiki/The_Sonnets differs from the wiki path of the pages, i.e. wiki/Sonnet_.
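
The condition described above can be illustrated with a small base R check. This is only a sketch of the idea, not how wikisource_book is implemented: shares_main_path is a hypothetical helper, and the Candide chapter URLs are assumed for illustration.

```r
# Hypothetical helper: does each linked URL start with the main URL?
shares_main_path <- function(main_url, linked_urls) {
  startsWith(linked_urls, main_url)
}

# Assumed layout: chapters live under the main page's path
main_candide <- "https://en.wikisource.org/wiki/Candide"
links_candide <- paste0(main_candide, "/Chapter_", 1:3)
candide_ok <- shares_main_path(main_candide, links_candide)

# The sonnets live under a different path than the main page
main_sonnets <- "https://en.wikisource.org/wiki/The_Sonnets"
links_sonnets <- paste0("https://en.wikisource.org/wiki/Sonnet_", 1:3, "_(Shakespeare)")
sonnets_ok <- shares_main_path(main_sonnets, links_sonnets)

print(candide_ok)  # TRUE TRUE TRUE
print(sonnets_ok)  # FALSE FALSE FALSE
```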

We have to use the wikisource_page function to download the 154 Sonnets.

Note that the base R function paste0 is very useful for creating a vector of URLs. We will also use paste0 to name the pages, passed as the second argument of the wikisource_page function.

urls <- paste0("https://en.wikisource.org/wiki/Sonnet_", 1:154, "_(Shakespeare)") #154 urls

sonnets <- purrr::map2_df(urls, paste0("Sonnet ", 1:154), wikisource_page)
sonnets
## # A tibble: 3,275 x 4
##    text                        page    language url                       
##    <chr>                       <chr>   <chr>    <chr>                     
##  1 ""                          Sonnet… en       https://en.wikisource.org…
##  2 ""                          Sonnet… en       https://en.wikisource.org…
##  3 From fairest creatures we … Sonnet… en       https://en.wikisource.org…
##  4 That thereby beauty's rose… Sonnet… en       https://en.wikisource.org…
##  5 But as the riper should by… Sonnet… en       https://en.wikisource.org…
##  6 His tender heir might bear… Sonnet… en       https://en.wikisource.org…
##  7 But thou, contracted to th… Sonnet… en       https://en.wikisource.org…
##  8 Feed'st thy light's flame … Sonnet… en       https://en.wikisource.org…
##  9 Making a famine where abun… Sonnet… en       https://en.wikisource.org…
## 10 Thyself thy foe, to thy sw… Sonnet… en       https://en.wikisource.org…
## # ... with 3,265 more rows

We can now run a text similarity analysis. Which sonnets are closest to each other in terms of words used?

library(widyr)
library(SnowballC)
library(igraph)
library(ggraph)

sonnets_similarity <- sonnets %>%
  filter(!str_detect(text, "public domain|Public domain")) %>% #clean text
  tidytext::unnest_tokens(word, text) %>%
  anti_join(tidytext::get_stopwords("en"), by = "word") %>%
  anti_join(tibble(word = c("thy", "thou", "thee")), by = "word") %>% #archaic pronouns treated as stop words
  mutate(wordStem = SnowballC::wordStem(word)) %>% #Stemming
  count(page, wordStem) %>%
  widyr::pairwise_similarity(page, wordStem, n) %>%
  filter(similarity > 0.3)

# themes by sonnet (the eight groups below sum to 154)
theme <- tibble(page = unique(sonnets$page),
                theme = c(rep("Procreation", times = 17), rep("Fair Youth", times = 60),
                          rep("Rival Poet", times = 9), rep("Fair Youth", times = 12),
                          rep("Irregular", times = 1), rep("Fair Youth", times = 26),
                          rep("Irregular", times = 1), rep("Dark Lady", times = 28))) %>%
  filter(page %in% sonnets_similarity$item1 |
         page %in% sonnets_similarity$item2)

set.seed(1234)

sonnets_similarity %>%
  graph_from_data_frame(vertices = theme) %>%
  ggraph() +
  geom_edge_link(aes(edge_alpha = similarity)) +
  geom_node_point(aes(color = theme), size = 3) +
  geom_node_text(aes(label = name), size = 3.5, check_overlap = TRUE, vjust = 1) +
  theme_void() +
  labs(title = "Shakespeare's sonnets most similar to each other in terms of words used")
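
The pairwise_similarity function from widyr computes the cosine similarity between the word-count vectors of each pair of pages. As a minimal sketch of what that measure means, the base R code below computes it by hand for two made-up count vectors; the vocabulary and counts are assumptions for illustration, not values taken from the sonnets.

```r
# Cosine similarity: dot product divided by the product of the norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up stem counts over a shared four-word vocabulary
vocab    <- c("beauti", "love", "time", "summer")
sonnet_a <- c(2, 1, 0, 1)
sonnet_b <- c(1, 1, 1, 0)

sim_ab   <- cosine_similarity(sonnet_a, sonnet_b)          # 3 / sqrt(18), about 0.707
sim_same <- cosine_similarity(sonnet_a, sonnet_a)          # identical counts give 1
sim_none <- cosine_similarity(c(1, 0, 0, 0), c(0, 1, 0, 0)) # no shared words give 0

print(round(sim_ab, 3))
```

With non-negative counts the score ranges from 0 to 1, so the filter similarity > 0.3 keeps only pairs of sonnets with a substantial overlap in vocabulary.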

Sentiment analysis

Because the wikisource_book function stores the chapter number in the page variable, a tidy sentiment analysis of a book by chapter is straightforward.

library(tidyr)

jane <- wikisource_book("https://en.wikisource.org/wiki/Pride_and_Prejudice")

jane_sent <- jane %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  anti_join(get_stopwords("en"), by = "word") %>%
  count(page, sentiment) %>%
  spread(key = sentiment, value = n, fill = 0) %>% #fill = 0 avoids NA scores when a chapter lacks one polarity
  mutate(sentiment = positive - negative)

ggplot(jane_sent, aes(page, sentiment)) +
  geom_col() +
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal() +
  labs(title = "Sentiment analysis of “Pride and Prejudice”",
       subtitle = "Positive-negative words difference, by chapter",
       x = "chapter", y = "sentiment score")
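
The score plotted above is the number of positive words minus the number of negative words in each chapter. The sketch below reproduces that computation on a tiny made-up example, so it can be run without downloading anything. Both the words table and the mini-lexicon are hypothetical stand-ins (the lexicon replaces get_sentiments("bing")), and fill = 0 keeps a chapter with no words of one polarity from getting an NA score.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per word, tagged with its chapter
words <- tibble(
  page = c(1, 1, 1, 2, 2, 3),
  word = c("happy", "love", "sad", "angry", "gloomy", "table")
)

# Hypothetical mini-lexicon standing in for get_sentiments("bing")
lexicon <- tibble(
  word = c("happy", "love", "sad", "angry", "gloomy"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

toy_sent <- words %>%
  inner_join(lexicon, by = "word") %>%          # keep only words found in the lexicon
  count(page, sentiment) %>%                    # words per chapter and polarity
  spread(key = sentiment, value = n, fill = 0) %>%
  mutate(sentiment = positive - negative)       # net score per chapter

print(toy_sent)
```

Chapter 1 scores 1 (two positive words, one negative), chapter 2 scores -2, and chapter 3 is absent from the result because none of its words appear in the lexicon.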