Introduction to rzeit2

Jan Dix

2019-01-07

Authentication

You need an API key to access the content endpoint. You can apply for a key at http://developer.zeit.de/quickstart/. The function below appends the key to your R environment file, so the key is loaded every time you start R. get_content and get_content_all access your key automatically by calling Sys.getenv("ZEIT_ONLINE_KEY"). Replace api_key with the key you receive from ZEIT ONLINE and path with the location of your R environment file.

# save the api key in the .Renviron file
set_api_key(api_key = "xxx", 
            path = "~/.Renviron")
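After restarting R, you can verify that the key was stored correctly by inspecting the environment variable directly (a quick sanity check; an empty string means the key was not found):

```r
# the key should now be available in every new R session
Sys.getenv("ZEIT_ONLINE_KEY")
```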

Download metadata

Important: This function requires an API key. See Authentication.

You can query the whole ZEIT and ZEIT ONLINE archive using the content endpoint. Define your query using the query argument. The API supports query syntax such as + and AND/OR. You can find more information in the ZEIT API documentation. As stated above, get_content and get_content_all access your key automatically by calling Sys.getenv("ZEIT_ONLINE_KEY"), but you can also provide your own key via the api_key argument.
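For example, you can combine several search terms with the supported operators and pass the key explicitly. The search terms below are purely illustrative:

```r
# combine search terms with OR and pass the key explicitly
election_articles <- get_content(query = "Merkel OR Schulz",
                                 api_key = Sys.getenv("ZEIT_ONLINE_KEY"))
```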

Why are there two functions?

rzeit2 provides two functions to query the content endpoint: get_content and get_content_all. get_content returns at most 1000 rows per call because of the API specification. Use get_content_all to retrieve all articles matching your query. Internally, get_content_all calls get_content repeatedly and fills in all arguments automatically. Please set a timeout to be polite.

# fetch articles up to 1000 rows
tatort_articles <- get_content(query = "Tatort",
                               begin_date = "20180101",
                               end_date = "20180131")

# fetch ALL articles
tatort_articles <- get_content_all(query = "Tatort",
                                   timeout = 2)

Download text (vectorized)

get_article_text allows you to download the article text for a given URL. The function is vectorized, so you can pass multiple URLs. If an article spans multiple pages, the function automatically scrapes all of them. Please set a timeout to be polite.

tatort_content <- get_article_text(url = tatort_articles$content$href, 
                                   timeout = 1)

Download comments (not vectorized)

get_article_comments allows you to download the comments for a given article URL. The function is not vectorized and may take quite a while due to the structure of the ZEIT ONLINE website: since ZEIT ONLINE does not provide an API for comments, the function downloads all comments just as your browser would. Please set a timeout to be polite.

tatort_comments <- get_article_comments(url = tatort_articles$content$href[1], 
                                        timeout = 1)

Download images (vectorized)

get_article_images allows you to download the article images for a given URL. The function is vectorized, so you can pass multiple URLs. If an article spans multiple pages, the function automatically scrapes all of them. You can download the images directly by providing a path via the download argument. Please ensure that the folder you specify exists. The file name is derived as the MD5 hash of the image URL, so it should be unique. Please set a timeout to be polite.

Please set a timeout to be polite.
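Because the download folder has to exist before the call, you may want to create it first; base R's dir.create handles this (the path below mirrors the example and is only an assumption):

```r
# create the target folder for the images if it does not exist yet
img_dir <- "~/Documents/tatort-img/"
if (!dir.exists(img_dir)) dir.create(img_dir, recursive = TRUE)
```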

tatort_images <- get_article_images(url = tatort_articles$content$href, 
                                    timeout = 1, 
                                    download = "~/Documents/tatort-img/")