Thomas Leeper, Scott Chamberlain, Patrick Mair, Karthik Ram, Christopher Gandrud
thosjleeper at gmail.com
This Task View contains information about how to use R and the world wide web together. The base version of R does not ship with many tools for interacting with the web. Thankfully, there is a growing number of add-on tools for doing so. This task view focuses on packages for obtaining web-based data and information, frameworks for building web-based R applications, and online services that can be accessed from R. A list of available packages and functions is presented below, grouped by the type of activity. The Open Data Task View provides further discussion of online data sources that can be accessed from R.
If you have any comments or suggestions for additions or improvements for this Task View, go to GitHub and submit an issue, or make some changes and submit a pull request. If you can't contribute on GitHub, email the maintainer at the address above. If you have an issue with one of the packages discussed below, please contact the maintainer of that package. If you know of a web service, API, data source, or other online resource that is not yet supported by an R package, consider adding it to the package development to do list on GitHub.
Tools for Working with the Web from R
Core Tools For HTTP Requests
There are two packages that should cover most use cases of interacting with the web from R.
httr provides a user-friendly interface for executing HTTP methods (GET, POST, PUT, HEAD, DELETE, etc.) and provides support for modern web authentication protocols (OAuth 1.0, OAuth 2.0). HTTP status codes are helpful for debugging HTTP calls. httr makes this easier using, for example,
stop_for_status(), which gets the HTTP status code from a response object and stops the function if the call was not successful. (See also
warn_for_status().) Note that you can pass in additional libcurl options to the
parameter in http calls.
RCurl is a lower-level package that provides a closer interface between R and the
libcurl C library
, but is less user-friendly. It may be useful for operations on web-based XML or to perform FTP operations. For more specific situations, the following resources may be useful:
is another libcurl client that provides the
function as an SSL-compatible replacement for base R's
and support for http 2.0, ssl (https, ftps), gzip, deflate and more. For websites serving insecure HTTP (i.e. using the "http" not "https" prefix), most R functions can extract data directly, including
read.csv; this also applies to functions in add-on packages such as
is another low-level package for HTTP requests that implements the GET, POST and multipart POST verbs.
is a useful package for converting curl command-line code, for example from a browser developer's console, into R code.
) provides a high-level package that is useful for developing other API client packages.
) provides simplified tools to ping and time HTTP requests, around httr calls.
) provides a mechanism for caching HTTP requests.
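The status-checking workflow described above for httr can be sketched as follows. This is a minimal illustration under stated assumptions rather than the package's canonical usage: the helper name fetch_text() and its seconds argument are hypothetical, and the function assumes that httr is installed.

```r
# Hypothetical helper (not part of httr): fetch a URL and return the
# response body as text, failing loudly if the request did not succeed.
fetch_text <- function(url, seconds = 10) {
  # GET() performs the request; timeout() caps the total request time.
  resp <- httr::GET(url, httr::timeout(seconds))

  # stop_for_status() turns an unsuccessful HTTP status into an R error,
  # so the code below only runs when the call was successful.
  httr::stop_for_status(resp)

  # Return the response body as a single character string.
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

Calling fetch_text() on a URL of your choice returns the page body, or stops with an error that includes the HTTP status. Swapping stop_for_status() for warn_for_status() would issue a warning instead of stopping.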
For dynamically generated webpages (i.e., those requiring user interaction to display results),
) can be used to automate those interactions and extract page contents. It provides a set of bindings for the Selenium 2.0 webdriver using the
. It can also aid in automated application testing, load testing, and web scraping.
(not on CRAN) uses
to access a webpage's Document Object Model (DOM).
Another, higher-level alternative package useful for webscraping is
), which is designed to work with
to make it easy to express common web scraping tasks.
Many base R tools can be used to download web content, provided that the website does not use SSL (i.e., the URL does not have the "https" prefix).
is a general purpose function that can be used to download a remote file. For SSL, the
download.file(), and takes all the same arguments.
Tabular data sets (e.g., txt, csv, etc.) can be input using
read.csv(), and friends, again assuming that the files are not hosted via SSL. An alternative is to use
RCurl::getURL) to first read the file into R as a character vector before parsing with
read.table(text=...), or you can download the file to a local directory.
) provides an
function that can read a number of common data formats directly from an https:// URL. The
can load and cache plain-text data from a URL (either http or https). That package also includes
for downloading/caching plain-text data from non-public Dropbox folders and
for downloading/caching Excel xlsx sheets.
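The "read into a character vector, then parse" approach mentioned above for tabular data can be sketched with base R alone. In this illustration the csv_text string is a made-up stand-in for text that would normally be fetched from a remote file; the column names and values are assumptions for the example.

```r
# Stand-in for text fetched from a remote file (made-up example data).
csv_text <- "id,name,score\n1,alpha,3.5\n2,beta,4.0\n"

# Parse the in-memory text exactly as if it were a file on disk.
# read.csv() accepts the same text= argument as read.table().
scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the resulting data frame.
str(scores)
```

Separating the download step from the parsing step in this way means the parser never needs to know how the text was obtained, so the same parsing code works for http, https, or local sources.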
Authentication: Using web resources can require authentication, either via API keys, OAuth, a username:password combination, or other means. Additionally, some web resources require that authentication be sent in the header of an HTTP call, which requires a little extra work. API keys and username:password combos can be combined within a URL for a call to a web resource (API key: http://api.foo.org/?key=yourkey; user/pass: http://username:password@api.foo.org), or can be specified via commands in
httr. OAuth is the most complicated authentication process, and can be most easily done using
httr. See the 6 demos within
httr, three for OAuth 1.0 (linkedin, twitter, vimeo) and three for OAuth 2.0 (facebook, GitHub, google).
is a package that provides a separate R interface to OAuth. OAuth is easier to do in
httr, so start there.
provides an OAuth 2.0 setup specifically for Google web services.
Parsing Structured Web Data
: There are two packages for working with XML:
). Both support general XML (and HTML) parsing, including XPath queries. The package
is less fully featured, but more user friendly with respect to memory management, classes (e.g., XML node vs. node set vs. document), and namespaces. Of the two, only the
creation of XML nodes and documents. The
) package is a collection of convenient functions for coercing XML into data frames. An alternative to
, which parses CSS3 Selectors and translates them to XPath 1.0 expressions.
package is often used for parsing XML and HTML, but selectr translates CSS selectors to XPath, so you can use CSS selectors instead of XPath. The
selectorgadget browser extension
can be used to identify page elements.
reads HTML documents and obtains a description of each of the forms it contains, along with the different elements and hidden fields.
provides additional tools for scraping data from HTML and XML documents.
extracts structured information from HTML tables, similar to
package, but automatically expands row and column spans in the header and body cells, and users are given more control over the identification of header and body rows which will end up in the R table.
function can be used to extract portions of a URL. The
functions can be used to encode character strings for use in URLs.
decodes back to the original strings.
) can also handle URL encoding, decoding, parsing, and parameter extraction.
(not on CRAN) provides access to Google's secure HTTP-based DNS resolution service.
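The URL encoding and decoding described in this section can be tried out with base R's URLencode() and URLdecode(), used here instead of any particular add-on package. This is a minimal sketch: the query string is made up, and api.foo.org is the same placeholder host used in the authentication example above.

```r
# A query value containing characters that are not safe inside a URL.
query <- "R & the web: 100% fun?"

# reserved = TRUE also percent-encodes reserved characters such as & : ?
encoded <- URLencode(query, reserved = TRUE)

# URLdecode() reverses the percent-encoding.
decoded <- URLdecode(encoded)

# Build a request URL from the encoded value (placeholder host).
request_url <- paste0("http://api.foo.org/search?q=", encoded)

print(request_url)
```

Encoding the value before pasting it into the URL keeps characters like & and ? from being misread as query-string delimiters.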
Tools for Working with Scraped Webpage Contents
Several packages can be used for parsing HTML documents.
provides generic extraction of main text content from HTML files; removal of ads, sidebars and headers using the boilerpipe Java library.
interfaces to the libtidy library for correcting HTML documents that are not well-formed. This library corrects common errors in HTML documents.
provides an R Interface to W3C Markup Validation Services for validating HTML documents.
For XML documents, the
package provides facilities in R for reading XML schema documents and processing them to create definitions for R classes and functions for converting XML nodes to instances of those classes. It provides the framework for meta-computing with XML schema in R.
is a package providing an interface to the
an XML processing library that provides an XSLT engine for transforming XML data using a transform stylesheet. (It can be seen as a modern replacement for
Sxslt, which is an interface to Dan Veillard's libxslt translator, and the
package.) This may be useful for webscraping, as well as transforming XML markup into another human- or machine-readable format (e.g., HTML, JSON, plain text, etc.).
provides a client-side SOAP (Simple Object Access Protocol) mechanism. It aims to provide a high-level interface to invoke SOAP methods provided by a SOAP server.
provides an implementation of XML-RPC, a relatively simple remote procedure call mechanism that uses HTTP and XML. This can be used for communicating between processes on a single machine or for accessing Web services from within R.
(not on CRAN): Interface to zlib and bzip2 libraries for performing in-memory compression and decompression in R. This is useful when receiving or sending contents to remote servers, e.g. Web services, HTTP requests via RCurl.
tm.plugin.webmining: Extensible text retrieval framework for news feeds in XML (RSS, ATOM) and JSON formats. Currently, the following feeds are implemented: Google Blog Search, Google Finance, Google News, NYTimes Article Search, Reuters News Feed, Yahoo Finance and Yahoo Inplay.
to provide screenshots of web pages without a browser. It can be useful for testing websites (such as Shiny applications).
Other Useful Packages and Functions
is an interface to Apache Commons Email to send emails from within R.
provides a simple SMTP client.
provides access to Google's gmail.com RESTful API.
) contains various functions for developing web applications, including parsers for
as well as
) guesses the MIME type for a file from its extension.
) provides tools to read data and metadata documents exchanged through the Statistical Data and Metadata Exchange (SDMX) framework. The package currently focuses on the SDMX XML standard format (SDMX-ML).
(not on CRAN) provides R6 classes for parsing and checking robots.txt files.
Web and Server Frameworks
is a server-based framework for integrating R into other applications via Web Services.
package makes it easy to build interactive web applications with R.
(not on CRAN) is a new, lightweight web framework that uses magrittr-style syntax and is modeled after
(not on CRAN) provides an iPython notebook-style web-based R interface.
web server interface contains the specification and convenience software for building and running Rook applications.
framework for embedded statistical computation and reproducible research exposes a web API interfacing R, LaTeX and Pandoc. This API is used for example to integrate statistical functionality into systems, share and execute scripts or reports on centralized servers, and build R based apps.
provide server and client functionality for TCP/IP or local socket interfaces.
provides low-level socket and protocol support for handling HTTP and WebSocket requests directly within R. Another related package, perhaps which
provides a simple HTTP server to serve files under a given directory based on httpuv.
package provides tools to process Web Application Description Language (WADL) documents and to programmatically generate R functions to interface to the REST methods described in those WADL documents. (not on CRAN)
provides a mechanism to export R objects as (D)COM objects in Windows. It can be used along with the
package which provides user-level access from R to other COM servers. (not on CRAN)
provides an online environment (SaaS) to host and run
statistical report templates in the cloud.
(not on CRAN) allows one to use R scripts as CGI programs for generating dynamic Web content. HTML forms and other mechanisms to submit dynamic requests can be used to provide input to R scripts via the Web to create content that is determined within that R script.
Amazon Web Services is a popular, proprietary cloud service offering a suite of computing, storage, and infrastructure tools.
provides functionality for generating AWS API request signatures.
Simple Storage Service (S3)
is a commercial server that allows one to store content and retrieve it from any machine connected to the Internet.
(not on CRAN) provides basic infrastructure for communicating with S3.
) interacts with S3 and EC2 using the AWS command line interface (an external system dependency). The CRAN version is archived.
(not on CRAN) is another package using the AWS Command Line Interface to control EC2 and S3, which is only available for Linux and Mac OS.
Elastic Compute Cloud (EC2)
is a cloud computing service. AWS.tools and
(not on CRAN) both use the AWS command line interface to control EC2.
(not on CRAN) is another package for managing EC2 instances and S3 storage, which includes a parallel version of
for the Elastic Map Reduce (EMR) engine called
emrlapply(). It uses Hadoop Streaming on Amazon's EMR in order to get simple parallel computation.
provides an interface to Amazon's Simple DB API.
The cloudyr project
, which is currently under active development on GitHub, aims to provide a unified interface to the full Amazon Web Services suite without the need for external system dependencies.
) is a lightweight, high-level interface for the
; not on CRAN) is a Dropbox interface that provides access to a full suite of file operations, including dir/copy/move/delete operations, account information (including quotas) and the ability to upload and download files from any Dropbox account.
) provides access to the Backblaze B2 storage API.
is a general purpose client for the Digital Ocean v2 API. In addition, the package includes functions to install various R tools including base R, RStudio server, and more. There's an improving interface to interact with docker on your remote droplets via this package.
) works with GitHub gists (
) from R, allowing you to create new gists, update gists with new files, rename files, delete files, get and delete gists, star and un-star gists, fork gists, open a gist in your default browser, get embed code for a gist, list gist commits, and get rate limit information when authenticated.
provides bindings to the git version control system and
(not on CRAN) provides access to the GitHub.com API, both of which can facilitate code or data sharing via GitHub.
Google Drive/Google Documents
(not on CRAN) is a thin client for the Google Drive API. The
package is an example of using the RCurl and XML packages to quickly develop an interface to the Google Documents API.
provides programmatic access to the Google Storage API. This allows R users to access and store data on Google's storage. We can upload and download content, create, list and delete folders/buckets, and set access control permissions on objects and buckets.
) can access private or public Google Sheets by title, key, or URL. Extract data or edit data. Create, delete, rename, copy, upload, or download spreadsheets and worksheets.
) can download Google Sheets using just the sharing link. Spreadsheets can be downloaded as a data frame, or as plain text to parse manually.
) is a package to share plots using the image hosting service
. knitr also has a function
to load images from literate programming documents.
: Amazon Mechanical Turk is a paid crowdsourcing platform that can be used to semi-automate tasks that are not easily automated.
) provides access to the Amazon Mechanical Turk Requester API.
(not on CRAN) can distribute tasks and retrieve results for the Microworkers.com platform.
provides infrastructure to access OpenStreetMap data from different sources, to work with the data in a common R manner, and to convert the data into structures provided by existing R packages (e.g., into sp and igraph objects).
) provides shortest paths and travel times from OpenStreetMap.
serves two purposes: it provides a comfortable R interface to query the Google server for static maps, and use the map as a background image to overlay plots within R.
allows users to create R graphics in Keyhole Markup Language (KML) format in a manner that allows them to be displayed on Google Earth (or Google Maps), and
provides users with high-level facilities to generate KML.
can visualize spatial and spatio-temporal objects in Google Earth.
plots SP or SPT (STIDF, STFDF) data as an HTML map mashup over Google Maps.
allows for the easy visualization of spatial data and models on top of Google Maps, OpenStreetMaps, Stamen Maps, or CloudMade Maps using ggplot2.
(not on CRAN) provides an API interface to
: Plot.ly is a company that allows you to create visualizations on the web using R (and Python). They have an R package in development
(not on CRAN), as well as access to their services via
a REST API
provides an interface between R and the Google chart tools. The
makes it easy to describe interactive web graphics in R. It fuses the ideas of ggplot2 and
shiny, rendering graphics on the web with Vega.
(not on CRAN) and
(not on CRAN) is an R wrapper for Vega.
provides an interface to the Twitter web API.
(not on CRAN) is yet another Twitter client.
(not on CRAN) focuses on report generation based on Twitter data.
provides a series of functions that allow users to access Twitter's filter, sample, and user streams, and to parse the output into data frames. OAuth authentication is supported.
is an alternative implementation geared toward SQLite and PostGIS databases.
produces a network graph from a data.frame of tweets.
(not on CRAN) implements a political ideology scaling measure for specified Twitter users.
Web Analytics Services
(not on CRAN) offers functions to perform and display Google Trends queries. Another GitHub package (
) is now deprecated, but supported a previous version of Google Trends and may still be useful for developers.
provides another alternative.
provides an easy-to-use interface for the Pushbullet service which provides fast and efficient notifications between computers, phones and tablets.
) can send push notifications to mobile devices (iOS and Android) and desktop using
imports and manages BibTeX and BibLaTeX references with RefManager.
: Implementation of the Mendeley API in R. It has been archived on CRAN temporarily until it is updated for the new Mendeley API.
(not on CRAN) can get scholarly metadata from around the web.
) is a programmatic interface to the
API, which can be used for identifying scientific authors and their publications (e.g., by DOI).
is a programmatic interface to the Web Service methods provided by the Public Library of Science journals for search.
(not on CRAN) provides tools for extracting and processing Pubmed and Pubmed Central records.
provides functions to extract citation data from Google Scholar. Convenience functions are also provided for comparing multiple scholars and predicting future h-index values.
is a package for text mining of
that supports fetching text and XML from PubMed.
) connects to
harvest metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard.
(not on CRAN) provides simple text mining of journal articles from JSTOR's Data for Research service.
) is a client for the arXiv API, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.
is a wrapper to the altmetrics API platform developed by PLoS.
(not on CRAN): An interface to Google Fusion Tables. Google Fusion Tables is a data management system in the cloud. This package provides functions to browse the Fusion Tables catalog, retrieve data from Fusion Tables storage to R, and upload data from R to Fusion Tables.
jSonarR: Enables users to access MongoDB by running queries and returning their results in data.frames. jSonarR uses data processing and conversion capabilities in the jSonar Analytics Platform and the
JSON Studio Gateway
, to convert JSON to a tabular format.
allows both public and private API calls to interact with Bitcoin.
is a package for the
API. From their website: "Bitcoincharts provides financial and technical data related to the Bitcoin network and this data can be accessed via a JSON application programming interface (API)."
) wraps the API for the
crypto-currency trading platform.
; not on CRAN): A generic R client to interact with any ERDDAP instance, which is a special case of OPeNDAP (Open-source Project for a Network Data Access Protocol). Allows users to swap out the base URL to use any ERDDAP instance.