webtrackR


webtrackR is an R package to preprocess and analyse web tracking data in conjunction with survey data of panelists. The package is built on top of data.table and can thus comfortably handle very large web tracking datasets.

Installation

You can install the development version of webtrackR from GitHub with:

# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")

S3 class

The package adds an S3 class called wt_dt, which inherits most of its functionality from the data.table class. A summary and a print method are included in the package.
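As a rough illustration of how an S3 class can sit on top of data.table, here is a minimal sketch. The class name `my_dt`, the constructor `as_my_dt()`, and the toy data are hypothetical and are not part of webtrackR; the package's own implementation of wt_dt may differ.

```r
library(data.table)

# Hypothetical constructor: prepend a custom class so that
# data.table methods remain available through inheritance.
as_my_dt <- function(x) {
  x <- as.data.table(x)
  setattr(x, "class", c("my_dt", class(x)))
  x
}

# Custom print method: write a short header, then defer to
# the print method of the parent class.
print.my_dt <- function(x, ...) {
  cat("<my_dt> with", nrow(x), "rows\n")
  NextMethod()
  invisible(x)
}

# Toy data with made-up column names
visits <- as_my_dt(data.frame(id = c(1, 1, 2),
                              url = c("a.com", "b.com", "a.com")))

# data.table syntax still works on the subclass
visits[, .N, by = id]
```

Because the custom class is placed in front of the existing class vector, any method not defined for `my_dt` falls through to the data.table method.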

Preprocessing

Raw web tracking data is assumed to have (at least) the following variables:

All preprocessing functions check if these are present. Otherwise an error is thrown.
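A check of this kind could look roughly like the sketch below. The helper `check_required()` and the column names `col_a`, `col_b` and `col_c` are placeholders chosen for illustration; they are not the package's actual function or its required variables.

```r
# Hypothetical helper: throw an informative error if any
# required column is missing from the data.
check_required <- function(data, required) {
  missing_vars <- setdiff(required, names(data))
  if (length(missing_vars) > 0) {
    stop("missing required variables: ",
         paste(missing_vars, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Placeholder column names, not the package's real requirements
toy <- data.frame(col_a = 1:2, col_b = c("x", "y"))

# Passes silently when all columns are present
check_required(toy, c("col_a", "col_b"))

# Capture the error message when a column is absent
result <- tryCatch(
  check_required(toy, c("col_a", "col_c")),
  error = function(e) conditionMessage(e)
)
```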

Several other variables can be derived from these with the package:

A typical workflow looks like this:

# load webtrack data as data.table
library(data.table)
library(webtrackR)

# webtrack data
wt <- fread("<path/to/file>")

# domain dictionary (there is also an inbuilt dictionary)
domain_dict <- fread("<path/to/file>")

# dummy file (should just be a vector of urls)
political_urls <- c("...")

# survey data
survey <- fread("<path/to/file>")

# convert to wt_dt object
wt <- as.wt_dt(wt)

# derive visit durations and domains from the raw variables
wt <- add_duration(wt)
wt <- extract_domain(wt)

# classify domains and only return rows with type news
wt <- classify_domains(wt, domain_classes = domain_dict, return.only = "news")

# create a dummy variable for political news
wt <- create_urldummy(wt, dummy = political_urls, name = "political")

# add survey data
wt <- add_panelist_data(wt, data = survey)

Analysis

Ideology

The top 500 Bakshy scores are available in the package:

data("bakshy")

Audience Networks

Create an audience network:

audience_network(wt, cutoff = 3, type = "pmi")
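How the "pmi" type is computed is not described here. Assuming it stands for pointwise mutual information between the audiences of two domains, the idea can be sketched in base R as follows. The incidence matrix and all object names are made up for illustration, and the package's own calculation (including how `cutoff` is applied) may differ.

```r
# Toy panelist-by-domain incidence matrix (1 = panelist visited domain)
A <- matrix(c(1, 1, 0,
              1, 1, 0,
              0, 0, 1,
              1, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("p", 1:4), c("a", "b", "c")))

n <- nrow(A)

# Share of panelists visiting each domain
p_single <- colSums(A) / n

# Share of panelists visiting both domains of each pair
p_joint <- crossprod(A) / n

# Pointwise mutual information: log of the observed co-visit share
# over the share expected if visits were independent
pmi <- log(p_joint / outer(p_single, p_single))

# The diagonal compares a domain with itself and is not meaningful
diag(pmi) <- NA
pmi
```

Positive values indicate that two domains share more audience than independence would predict; pairs with no shared panelists yield `-Inf` in this sketch.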