What do Wikipedia’s readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul?
What can you find?
Source: http://stats.grok.se/

The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/ .

If you want to know how often an article has been viewed over time and work with the data from within R, this package is for you. If you want to compare how much attention articles in different languages got and when, this package is for you. Are you into policy studies or epidemiology? Have a look at page counts for Flu, Ebola, Climate Change or Millennium Development Goals and maybe build a model or two. Again, this package is for you.

If you simply want to browse Wikipedia page view statistics without all that coding, visit http://stats.grok.se/ and have a look around.

If only big data will do, get the raw data in its entirety at https://dumps.wikimedia.org/other/pagecounts-raw/ .

If you think days are too crude a measure of time, seconds would do, and you do not need to know which articles the views belong to, go to https://datahub.io/dataset/english-wikipedia-pageviews-by-second.

To get further information on the data source (Who? When? How? How good?) there is a Wikipedia article for that: https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics and another one: https://en.wikipedia.org/wiki/Wikipedia:About_page_view_statistics .

1 Installation

stable CRAN version (https://cran.r-project.org/package=wikipediatrend)

install.packages("wikipediatrend")

development version (https://github.com/petermeissner/wikipediatrend)

devtools::install_github("petermeissner/wikipediatrend")

… and load it via:

library(wikipediatrend)

2 A first try

The workhorse of the package is the wp_trend() function that allows you to get page view counts as neat data frames like this:

page_views <- wp_trend("main_page", from = "2015-10-01", to = "2015-11-30")
    
page_views
##    date       count    lang page      rank month  title    
## 1  2015-10-01 11463737 en   Main_page 2    201510 Main_page
## 2  2015-10-02 10756303 en   Main_page 2    201510 Main_page
## 3  2015-10-03 10220428 en   Main_page 2    201510 Main_page
## 6  2015-10-06 11048888 en   Main_page 2    201510 Main_page
## 7  2015-10-07 10888350 en   Main_page 2    201510 Main_page
## 8  2015-10-08 11202379 en   Main_page 2    201510 Main_page
## 9  2015-10-09 10584602 en   Main_page 2    201510 Main_page
## 10 2015-10-10 10791344 en   Main_page 2    201510 Main_page
## 11 2015-10-11 10650765 en   Main_page 2    201510 Main_page
## 16 2015-10-17 10418354 en   Main_page 2    201510 Main_page
## 20 2015-10-21 13761105 en   Main_page 2    201510 Main_page
## 23 2015-10-24  9759098 en   Main_page 2    201510 Main_page
## 24 2015-10-25 12554813 en   Main_page 2    201510 Main_page
## 25 2015-10-26 13618903 en   Main_page 2    201510 Main_page
## 28 2015-10-29 11696058 en   Main_page 2    201510 Main_page
## 29 2015-10-30 11124427 en   Main_page 2    201510 Main_page
## 33 2015-11-03 13554961 en   Main_page 2    201511 Main_page
## 34 2015-11-04 12793061 en   Main_page 2    201511 Main_page
## 39 2015-11-09 12873821 en   Main_page 2    201511 Main_page
## 42 2015-11-12 11199735 en   Main_page 2    201511 Main_page
## 45 2015-11-15 11473133 en   Main_page 2    201511 Main_page
## 48 2015-11-18 12237389 en   Main_page 2    201511 Main_page
## 50 2015-11-20 10119118 en   Main_page 2    201511 Main_page
## 51 2015-11-21 11203957 en   Main_page 2    201511 Main_page
## 52 2015-11-22 11606721 en   Main_page 2    201511 Main_page
## 53 2015-11-23 12908870 en   Main_page 2    201511 Main_page
## 55 2015-11-25 11250691 en   Main_page 2    201511 Main_page
## 57 2015-11-27 10305111 en   Main_page 2    201511 Main_page
## 60 2015-11-30 13704672 en   Main_page 2    201511 Main_page
## 
## ... 31 rows of data not shown

… that can easily be turned into a plot …

plot(
  page_views[, c("date","count")], 
  type="b"
)

… or like that …

library(ggplot2)

ggplot(page_views, aes(x=date, y=count)) + 
  geom_line(size=1.5, colour="steelblue") + 
  geom_smooth(method="loess", colour="#00000000", fill="#001090", alpha=0.1) +
  scale_y_continuous(
    breaks = seq(5e6, 50e6, 5e6),
    labels = paste(seq(5, 50, 5), "M")
  ) +
  theme_bw()

3 wp_trend() options

wp_trend() has several options and most of them are set to defaults:

3.1 page

The page option lets you specify one or more article titles for which data should be retrieved.

These titles should be in the same format as shown in the address bar of your browser to ensure that the pages are found. If we want to get page views for the United Nations Millennium Development Goals and the article is found here: https://en.wikipedia.org/wiki/Millennium_Development_Goals the page title to pass to wp_trend() should be Millennium_Development_Goals not Millennium Development Goals or Millennium_development_goals or any other ‘mostly-like-the-original’ variation.
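As a small illustration of the rule above, the title can be copied straight out of the article URL. The helper title_from_url() below is hypothetical and not part of wikipediatrend; it is a minimal sketch that assumes the URL contains the usual /wiki/ path segment.

```r
# Hypothetical helper (not part of wikipediatrend): keep everything that
# follows "/wiki/" in a Wikipedia article URL.
title_from_url <- function(url) {
  sub("^.*/wiki/", "", url)
}

url   <- "https://en.wikipedia.org/wiki/Millennium_Development_Goals"
title <- title_from_url(url)

# The result keeps the exact spelling and capitalisation used in the URL.
title
```

The resulting string is what would then be passed as the page option.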

To ease data gathering, the page option of wp_trend() accepts whole vectors of page titles and will retrieve data for each title, one after another.

page_views <- 
  wp_trend( 
    page = c( "Millennium_Development_Goals", "Climate_Change"), 
    from = "2015-01-01",
    to   = "2015-01-30"
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, group=page, color=page)) + 
  geom_line(size=1.5) + theme_bw()

3.2 from and to

These two options determine the time frame for which data is retrieved. By default the last 30 days are retrieved, but both options can be set to cover larger time frames as well. Note that there is no data prior to December 2007, so any earlier date will be set to this minimum.

page_views <- 
  wp_trend( 
    page = "Millennium_Development_Goals" ,
    from = "2000-01-01",
    to   = "2009-01-30"
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, color=wp_year(date))) + 
  geom_line() + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 22), color="#CD0000a0", size=1.2) +
  theme_bw() 
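The handling of dates before December 2007 can be pictured roughly as follows. This is only a sketch of the idea, not the package's internal code: clamp_from() is a hypothetical helper, and the first of December is an assumed cut-off, since the text above only names the month.

```r
# Assumed earliest day with data; the text only says "December 2007".
min_date <- as.Date("2007-12-01")

# Hypothetical helper: move every date before min_date up to min_date
# and leave all later dates untouched.
clamp_from <- function(from) {
  from <- as.Date(from)
  from[from < min_date] <- min_date
  from
}

clamp_from(c("2000-01-01", "2009-01-30"))
```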

3.3 lang

This option determines for which Wikipedia the page views are retrieved: English, German, Chinese, Spanish, … . The default is "en" for the English Wikipedia. Pass either a single language shorthand, which is then used for all pages, or one language shorthand per page.

page_views <- 
  wp_trend( 
    page = c("Objetivos_de_Desarrollo_del_Milenio", "Millennium_Development_Goals") ,
    lang = c("es", "en"),
    from = "2015-01-01",
    to   = "2015-04-30"
  )
library(ggplot2)

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang, fill=lang)) + 
  geom_smooth(size=1.5) + 
  geom_point() +
  theme_bw() 

3.4 file

This last option allows storing the data retrieved by a call to wp_trend() in a file, e.g. file = "MyCache.csv". MyCache.csv will be created if it does not exist already, but it will never be overwritten by wp_trend(), which makes it possible to accumulate data from subsequent calls to wp_trend(). To get the stored data back into R use wp_load(file = "MyCache.csv").

wp_trend("Cheese", file="cheeeeese.csv", from = "2015-01-01", to="2015-01-30")
wp_trend("K\u00e4se", lang="de", file="cheeeeese.csv", from = "2015-01-01", to="2015-01-30")

cheeeeeese <- wp_load( file="cheeeeese.csv" )
cheeeeeese
##    date       count lang page      rank month  title 
## 49 2015-01-04  254  de   K%C3%A4se 6057 201501 Käse  
## 34 2015-01-10  226  de   K%C3%A4se 6057 201501 Käse  
## 35 2015-01-11  307  de   K%C3%A4se 6057 201501 Käse  
## 36 2015-01-12  380  de   K%C3%A4se 6057 201501 Käse  
## 39 2015-01-15  320  de   K%C3%A4se 6057 201501 Käse  
## 32 2015-01-18  290  de   K%C3%A4se 6057 201501 Käse  
## 58 2015-01-20  352  de   K%C3%A4se 6057 201501 Käse  
## 60 2015-01-22  352  de   K%C3%A4se 6057 201501 Käse  
## 54 2015-01-24  243  de   K%C3%A4se 6057 201501 Käse  
## 56 2015-01-26  457  de   K%C3%A4se 6057 201501 Käse  
## 55 2015-01-27  401  de   K%C3%A4se 6057 201501 Käse  
## 20 2015-01-02 1623  en   Cheese    705  201501 Cheese
## 19 2015-01-03 1727  en   Cheese    705  201501 Cheese
## 18 2015-01-04 1746  en   Cheese    705  201501 Cheese
## 17 2015-01-05 2239  en   Cheese    705  201501 Cheese
## 16 2015-01-06 2366  en   Cheese    705  201501 Cheese
## 15 2015-01-07 2301  en   Cheese    705  201501 Cheese
## 14 2015-01-08 2248  en   Cheese    705  201501 Cheese
## 13 2015-01-09 2240  en   Cheese    705  201501 Cheese
## 3  2015-01-10 1670  en   Cheese    705  201501 Cheese
## 4  2015-01-11 1731  en   Cheese    705  201501 Cheese
## 5  2015-01-12 2223  en   Cheese    705  201501 Cheese
## 7  2015-01-14 2358  en   Cheese    705  201501 Cheese
## 8  2015-01-15 2285  en   Cheese    705  201501 Cheese
## 9  2015-01-16 2261  en   Cheese    705  201501 Cheese
## 1  2015-01-18 1943  en   Cheese    705  201501 Cheese
## 2  2015-01-19 2353  en   Cheese    705  201501 Cheese
## 23 2015-01-24 1693  en   Cheese    705  201501 Cheese
## 30 2015-01-29 2483  en   Cheese    705  201501 Cheese
## 
## ... 33 rows of data not shown
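The create-once, append-afterwards behaviour of the file option can be imitated in base R. The sketch below is not how wp_trend() is implemented; append_csv() is a hypothetical helper and the two small data frames are made-up toy data.

```r
# Hypothetical helper: append rows to a CSV file, creating it on first use.
append_csv <- function(df, file) {
  is_new <- !file.exists(file)
  write.table(
    df, file,
    sep       = ",",
    row.names = FALSE,
    col.names = is_new,   # write the header only when the file is created
    append    = !is_new   # existing rows are never overwritten
  )
  invisible(file)
}

cache  <- tempfile(fileext = ".csv")
first  <- data.frame(date = c("2015-01-01", "2015-01-02"), count = c(10, 12), page = "A")
second <- data.frame(date = c("2015-01-01", "2015-01-02"), count = c(3, 4),   page = "B")

append_csv(first,  cache)   # creates the file
append_csv(second, cache)   # adds to it

# Reading the file back returns the rows of both calls.
accumulated <- read.csv(cache)
accumulated
```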

4 Caching

4.1 Session caching

When using wp_trend() you will notice that subsequent calls to the function might take considerably less time than previous calls, provided that the later calls ask for data that has been downloaded already. This is due to the caching system running in the background and keeping track of everything downloaded so far. You can tell that wp_trend() had to download something when it reports one or more links to the stats.grok.se server, e.g. …

wp_trend("Cheese", from = "2015-01-01", to = "2015-01-30")
## http://stats.grok.se/json/en/201501/Cheese
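The general idea behind such a session cache can be sketched with an environment used as a key-value store. This is a toy model under stated assumptions, not the package's actual caching code: fake_download() and get_month() are hypothetical, and the key merely mimics the language/month/page pattern visible in the link above.

```r
# Toy cache: an environment used as a key-value store.
cache       <- new.env()
n_downloads <- 0

# Stand-in for a real download; it only counts how often it is called.
fake_download <- function(lang, month, page) {
  n_downloads <<- n_downloads + 1
  data.frame(lang = lang, month = month, page = page, count = NA)
}

# Hypothetical helper: download only if the key is not in the cache yet.
get_month <- function(lang, month, page) {
  key <- paste(lang, month, page, sep = "/")
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, fake_download(lang, month, page), envir = cache)
  }
  get(key, envir = cache)
}

first_call  <- get_month("en", "201501", "Cheese")  # triggers the fake download
second_call <- get_month("en", "201501", "Cheese")  # served from the cache

# Number of downloads actually performed.
n_downloads
```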

The current cache in memory can be accessed via:

wp_get_cache()
##      date       count    lang       page                   rank month 
## 6758 2015-04-26        0 ar         %D8%AA%D9%86%D8%B8 ... -1   201504
## 3166 2014-10-05       71 ar         %D8%AF%D8%A7%D8%B9 ... -1   201410
## 7377 2015-05-19     4393 de         Islamischer_Staat_ ... -1   201505
## 7402 2015-06-07     2823 de         Islamischer_Staat_ ... -1   201506
## 2902 2014-10-20    33747 en         Islamic_State_of_I ... -1   201410
## 724  2008-03-18      393 en         Millennium_Develop ... 7435 200803
## 773  2008-04-03      437 en         Millennium_Develop ... 7435 200804
## 865  2008-08-26      460 en         Millennium_Develop ... 7435 200808
## 909  2008-09-11      594 en         Millennium_Develop ... 7435 200809
## 948  2008-11-12      888 en         Millennium_Develop ... 7435 200811
## 1064 2009-03-30      800 en         Millennium_Develop ... 7435 200903
## 1124 2009-05-01      544 en         Millennium_Develop ... 7435 200905
## 1399 2010-02-03      968 en         Millennium_Develop ... 7435 201002
## 1479 2010-04-11      695 en         Millennium_Develop ... 7435 201004
## 1729 2010-12-01     2036 en         Millennium_Develop ... 7435 201012
## 2472 2013-01-18     2469 en         Millennium_Develop ... 7435 201301
## 466  2015-01-30     1389 en         Millennium_Develop ... 7435 201501
## 6335 2015-08-23      589 en         Millennium_Develop ... 7435 201508
## 4803 2011-09-19     3942 en         Syria                  1802 201109
## 5077 2012-06-23    12575 en         Syria                  1802 201206
## 7707 2014-09-21      612 ru         %D0%98%D1%81%D0%BB ... -1   201409
## 7745 2014-10-15      768 ru         %D0%98%D1%81%D0%BB ... -1   201410
## 4011 2014-10-18     5447 ru         %D0%98%D1%81%D0%BB ... -1   201410
## 7802 2014-12-19      303 ru         %D0%98%D1%81%D0%BB ... -1   201412
## 4103 2015-01-28     5549 ru         %D0%98%D1%81%D0%BB ... -1   201501
## 7865 2015-02-18      436 ru         %D0%98%D1%81%D0%BB ... -1   201502
## 8046 2015-08-11      186 ru         %D0%98%D1%81%D0%BB ... -1   201508
## 7106 2015-01-24        0 zh-min-nan Iraq_kap_Levant_Is ... -1   201501
## 7155 2015-03-15        0 zh-min-nan Iraq_kap_Levant_Is ... -1   201503
##      title                 
## 6758 تنظيم_الدولة_الإسل ...
## 3166 داعش                  
## 7377 Islamischer_Staat_ ...
## 7402 Islamischer_Staat_ ...
## 2902 Islamic_State_of_I ...
## 724  Millennium_Develop ...
## 773  Millennium_Develop ...
## 865  Millennium_Develop ...
## 909  Millennium_Develop ...
## 948  Millennium_Develop ...
## 1064 Millennium_Develop ...
## 1124 Millennium_Develop ...
## 1399 Millennium_Develop ...
## 1479 Millennium_Develop ...
## 1729 Millennium_Develop ...
## 2472 Millennium_Develop ...
## 466  Millennium_Develop ...
## 6335 Millennium_Develop ...
## 4803 Syria                 
## 5077 Syria                 
## 7707 Исламское_государс ...
## 7745 Исламское_государс ...
## 4011 Исламское_государс ...
## 7802 Исламское_государс ...
## 4103 Исламское_государс ...
## 7865 Исламское_государс ...
## 8046 Исламское_государс ...
## 7106 Iraq_kap_Levant_Is ...
## 7155 Iraq_kap_Levant_Is ...
## 
## ... 8212 rows of data not shown

… and emptied by a call to wp_cache_reset().

4.2 Caching across sessions 1

While everything that is downloaded during a session is cached in memory, it can come in handy to also save the cache on disk and reuse it in the next R session. To activate disk-caching for a session simply use:

wp_set_cache_file( file = "myCache.csv" )

The function will reload whatever is stored in the file and in subsequent calls to wp_trend() will automatically add data as it is downloaded. The file used for disk-caching can be changed by another call to wp_set_cache_file( file = "myOtherCache.csv") or turned off completely by leaving the file argument empty.

4.3 Caching across sessions 2

If disk-caching should be enabled by default, one can define a path in the environment variable WP_CACHE_FILE, e.g. by putting something like this into ~/.Renviron: WP_CACHE_FILE=~/.wp_trend_cache.csv. When the package is loaded it will look for this variable via Sys.getenv("WP_CACHE_FILE") and use the path for caching as if …

wp_set_cache_file( Sys.getenv("WP_CACHE_FILE") )

… had been typed in by the user.

If the package finds a value for WP_CACHE_FILE it will mention that when loading.
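To check by hand whether the variable is visible to the current R session, something along these lines should do. It is a plain base R sketch and the messages are made up for illustration; they are not the ones printed by the package.

```r
# Sys.getenv() returns an empty string when the variable is not set.
cache_file <- Sys.getenv("WP_CACHE_FILE")

if (nzchar(cache_file)) {
  message("WP_CACHE_FILE is set, the disk cache would be: ", cache_file)
} else {
  message("WP_CACHE_FILE is not set, caching stays in memory only")
}
```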

5 Counts for other languages

If comparing languages is important one needs to specify the exact article titles for each language: While the article about the Millennium Goals has an English title in the English Wikipedia, it of course is named differently in Spanish, German, Chinese, … . One might look these titles up by hand or use the handy wp_linked_pages() function like this:

titles <- wp_linked_pages("Islamic_State_of_Iraq_and_the_Levant", "en")
titles <- titles[titles$lang %in% c("de", "es", "ar", "ru"),]
titles 
##   page                   lang title                 
## 1 %D8%AA%D9%86%D8%B8 ... ar   تنظيم_الدولة_الإسل ...
## 2 Islamischer_Staat_ ... de   Islamischer_Staat_ ...
## 3 Estado_Isl%C3%A1mi ... es   Estado_Islámico       
## 4 %D0%98%D1%81%D0%BB ... ru   Исламское_государс ...

… then we can use the information to get data for several languages …

page_views <- 
  wp_trend(
    page = titles$page, 
    lang = titles$lang,
    from = "2015-06-01",
    to   = "2015-11-30"
  )
library(ggplot2)

for(i in unique(page_views$lang) ){
  iffer <- page_views$lang==i
  page_views[iffer, ]$count <- scale(page_views[iffer, ]$count)
}

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang)) + 
  geom_line(size=1.2, alpha=0.5) + 
  ylab("standardized count\n(by lang: m=0, var=1)") +
  theme_bw() + 
  scale_colour_brewer(palette="Set1") + 
  guides(colour = guide_legend(override.aes = list(alpha = 1))) 
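The per-language standardization done in the loop above can also be written without a loop. The following sketch uses ave() from base R on made-up toy counts, so that each language ends up with mean 0 and standard deviation 1.

```r
# Toy data: two languages with very different levels of page views.
toy <- data.frame(
  lang  = rep(c("de", "en"), each = 4),
  count = c(10, 20, 30, 40, 1000, 3000, 2000, 6000)
)

# Standardize counts within each language.
toy$z <- ave(toy$count, toy$lang, FUN = function(x) (x - mean(x)) / sd(x))

# Check: group means are (numerically) 0, group standard deviations are 1.
tapply(toy$z, toy$lang, mean)
tapply(toy$z, toy$lang, sd)
```

Standardizing by group puts a small and a large Wikipedia on the same scale, which is what makes the lines in the plot comparable.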

6 Going beyond Wikipediatrend – Anomalies and mean shifts

6.1 Identifying anomalies with AnomalyDetection

Currently the AnomalyDetection package is not available on CRAN, so we have to use install_github() from the devtools package to get it.

RUN_IT <- require(AnomalyDetection) & require(BreakoutDetection) & F
## Loading required package: AnomalyDetection
## Loading required package: BreakoutDetection
# install.packages( "AnomalyDetection", repos="https://ghrr.github.io/drat",  type="source")
library(AnomalyDetection)
library(dplyr)
library(ggplot2)

The package is a little picky about the data it accepts for processing so we have to build a new data frame. It should contain only the date and count variable. Furthermore, date should be named timestamp and transformed to type POSIXct.

page_views <- wp_trend("Syria", from = "2014-01-01", to="2015-11-30")

page_views_br <- 
  page_views  %>% 
  select(date, count)  %>% 
  rename(timestamp=date)  %>% 
  unclass()  %>% 
  as.data.frame() %>% 
  mutate(timestamp = as.POSIXct(timestamp))
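For readers who prefer base R, the same reshaping can be sketched without dplyr. The data frame toy below is made up and only stands in for the result of wp_trend(); the point is the target layout with a POSIXct column named timestamp followed by count.

```r
# Toy stand-in for a page view data frame (made-up numbers).
toy <- data.frame(
  date  = as.Date("2015-01-01") + 0:2,
  count = c(5, 7, 6),
  lang  = "en"
)

# Keep only date and count, rename date to timestamp, convert to POSIXct.
toy_ts <- data.frame(
  timestamp = as.POSIXct(toy$date),
  count     = toy$count
)

str(toy_ts)
```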

Having transformed the data we can detect anomalies via AnomalyDetectionTs(). The function offers various options, e.g. the significance level for rejecting normal values (alpha); the maximum fraction of the data that is allowed to be detected as anomalies (max_anoms); whether upward deviations, downward deviations or irregularities in both directions form the basis of anomaly detection (direction); and last but not least whether or not the time frame for detection is larger than one month (longterm).

Let's choose a greedy set of parameters and detect possible anomalies:

res <- 
AnomalyDetectionTs(
  x         = page_views_br, 
  alpha     = 0.05, 
  max_anoms = 0.40,
  direction = "both",
  longterm  = T
)$anoms

res$timestamp <- as.Date(res$timestamp)

head(res)

… and play back the detected anomalies to our page_views data set:

page_views <- 
  page_views  %>% 
  mutate(normal = !(page_views$date %in% res$timestamp))  %>% 
  mutate(anom   =   page_views$date %in% res$timestamp )

class(page_views) <- c("wp_df", "data.frame")

Now we can plot counts and anomalies …

(
  p <-
    ggplot( data=page_views, aes(x=date, y=count) ) + 
      geom_line(color="steelblue") +
      geom_point(data=filter(page_views, anom==T), color="red2", size=2) +
      theme_bw()
)

… as well as compare running means:

p + 
  geom_line(stat = "smooth", size=2, color="red2", alpha=0.7) + 
  geom_line(data=filter(page_views, anom==F), 
  stat = "smooth", size=2, color="dodgerblue4", alpha=0.5) 

It seems like upward and downward anomalies cancel each other out most of the time, since the two smooth lines (with and without anomalies) do not differ much. Nonetheless, keeping anomalies in will bias the counts slightly upward, so we proceed with a cleaned-up data set:

page_views_clean <- 
  page_views  %>% 
  filter(anom==F)  %>% 
  select(date, count, lang, page, rank, month, title)

page_views_br_clean <- 
  page_views_br  %>% 
  filter(page_views$anom==F)

6.2 Identifying mean shifts with BreakoutDetection

BreakoutDetection is a package that searches data for mean level shifts by dividing it into timespans of change and timespans of stability in the presence of seasonal noise. Like AnomalyDetection, the BreakoutDetection package is not available on CRAN and has to be obtained from GitHub.

# install.packages(  "BreakoutDetection",   repos="https://ghrr.github.io/drat", type="source")
library(BreakoutDetection)
library(dplyr)
library(ggplot2)
library(magrittr)

… again the workhorse function (breakout()) is picky and requires “a data.frame which has ‘timestamp’ and ‘count’ components” like our page_views_br_clean.

The function has two general options: one sets the minimum length of a timespan (min.size); the other determines how many mean level changes might occur during the whole time frame (method). There are also several method-specific options, e.g. degree, beta, and percent, which control the sensitivity to adding further breakpoints. In the following call, percent tells the function that adding a breakpoint should improve overall model fit by at least 5 percent.

br <- 
  breakout(
    page_views_br_clean, 
    min.size = 30, 
    method   = 'multi', 
    percent  = 0.05,
    plot     = TRUE
  )
br

In the following snippet we combine the break information with our page views data and can have a look at the dates at which the breaks occurred.

breaks <- page_views_clean[br$loc,]
breaks

Next, we add a span variable capturing which page_view observations belong to which span, allowing us to aggregate data.

page_views_clean$span <- 0
for (d in breaks$date ) {
  page_views_clean$span[ page_views_clean$date > d ] %<>% add(1)
}

page_views_clean$mcount <- 0
for (s in unique(page_views_clean$span) ) {
  iffer <- page_views_clean$span == s
  page_views_clean$mcount[ iffer ] <- mean(page_views_clean$count[iffer])
}
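What the two loops above compute can be checked on a tiny example. The sketch below uses made-up dates, counts and break dates and only base R; as in the loops, an observation moves to the next span once its date lies strictly after a break date.

```r
# Toy series with two assumed break dates (all values made up).
dates       <- as.Date("2015-01-01") + 0:9
counts      <- c(10, 12, 11, 20, 22, 21, 19, 5, 6, 4)
break_dates <- dates[c(3, 7)]

# Span index = number of break dates that lie strictly before the observation.
span <- vapply(
  seq_along(dates),
  function(i) sum(dates[i] > break_dates),
  integer(1)
)

# Mean count within each span, repeated for every observation of that span.
mcount <- ave(counts, span)

data.frame(date = dates, count = counts, span = span, mcount = mcount)
```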

spans <- 
  page_views_clean  %>% 
  as_data_frame() %>% 
  group_by(span) %>% 
  summarize(
    start      = min(date), 
    end        = max(date), 
    length     = end-start,
    mean_count = round(mean(count)),
    min_count  = min(count),
    max_count  = max(count),
    var_count  = var(count)
  )
spans

Also, we can now plot the shifting mean.

ggplot(page_views_clean, aes(x=date, y=count) ) + 
  geom_line(alpha=0.5, color="steelblue") + 
  geom_line(aes(y=mcount), alpha=0.5, color="red2", size=1.2) + 
  theme_bw()