webtrackR is an R package to preprocess and analyse web tracking data in conjunction with survey data of panelists. The package is built on top of data.table and can thus comfortably handle very large web tracking datasets.
You can install the development version of webtrackR from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")
```
The package adds an S3 class called `wt_dt`, which inherits most of its functionality from the `data.table` class. A `summary()` and a `print()` method for this class are included in the package.
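A minimal sketch of creating a `wt_dt` object and using the two methods; the column names used here (`panelist_id`, `url`, `timestamp`) are assumptions for illustration, not necessarily the names the package expects:

``` r
library(webtrackR)

# toy tracking data; column names are assumptions for illustration
raw <- data.table::data.table(
  panelist_id = c("p1", "p1"),
  url = c("https://example.com/a", "https://example.com/b"),
  timestamp = as.POSIXct(c("2021-01-01 10:00:00", "2021-01-01 10:05:00"))
)

wt <- as.wt_dt(raw)
summary(wt) # wt_dt summary method
print(wt)   # wt_dt print method; data.table syntax still works on wt
```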
Raw web tracking data is assumed to contain (at least) a small set of required variables. All preprocessing functions check whether these variables are present and throw an error otherwise.
Several other variables can be derived from these with the package:

- `add_duration()` (and `aggregate_duration()` to summarize consecutive visits to the same website)
- `extract_domain()`
- `classify_domains()`
- `create_urldummy()`
- `add_panelist_data()`

A typical workflow looks like this:
``` r
# load webtrack data as data.table
library(data.table)
library(webtrackR)

# webtrack data
wt <- fread("<path/to/file>")

# domain dictionary (there is also an inbuilt dictionary)
domain_dict <- fread("<path/to/file>")

# dummy file (should just be a vector of urls)
political_urls <- c("...")

# survey data
survey <- fread("<path/to/file>")

# convert to wt_dt object
wt <- as.wt_dt(wt)

# add duration of visits
wt <- add_duration(wt)

# extract domains
wt <- extract_domain(wt)

# classify domains and only return rows with type news
wt <- classify_domains(wt, domain_classes = domain_dict, return.only = "news")

# create a dummy variable for political news
wt <- create_urldummy(wt, dummy = political_urls, name = "political")

# add survey data
wt <- add_panelist_data(wt, data = survey)
```
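Because a `wt_dt` object is still a `data.table`, the preprocessed data can be summarized with standard `data.table` syntax. A sketch, assuming that `add_duration()` and `extract_domain()` create columns named `duration` and `domain`:

``` r
# average visit duration per domain
# (column names 'duration' and 'domain' are assumptions here)
wt[, .(mean_duration = mean(duration, na.rm = TRUE)), by = domain]
```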
Top 500 Bakshy scores are available in the package:

``` r
data("bakshy")
```
Create audience networks:

``` r
audience_network(wt, cutoff = 3, type = "pmi")
```

`cutoff` indicates the minimal duration for a page view to count as a visit. `type` can be one of “pmi”, “phi”, “disparity”, “sdsm”, or “fdsm”.
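The `type` argument selects how edges of the projected network are weighted or filtered; a sketch of comparing two of the documented options on the same data:

``` r
# weighted by pointwise mutual information
net_pmi <- audience_network(wt, cutoff = 3, type = "pmi")

# backbone via the disparity filter instead
net_disparity <- audience_network(wt, cutoff = 3, type = "disparity")
```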