The curl package provides bindings to the libcurl C library for R. The package supports retrieving data in-memory, downloading to disk, or streaming using the R “connection” interface. Some knowledge of curl is recommended to use this package. For a more user-friendly HTTP client, have a look at the httr package, which builds on curl with HTTP-specific tools and logic.

Request interfaces

The curl package implements several interfaces to retrieve data from a URL:

  • curl_fetch_memory() saves response in memory
  • curl_download() or curl_fetch_disk() writes response to disk
  • curl() or curl_fetch_stream() streams response data
  • curl_fetch_multi() (advanced) processes responses via callback functions

Each interface performs the same HTTP request; they differ only in how the response data is processed.

Getting in memory

The curl_fetch_memory function is a blocking interface which waits for the request to complete and returns a list with all content (data, headers, status, timings) of the server response.

req <- curl_fetch_memory("https://httpbin.org/get")
str(req)
List of 6
 $ url        : chr "https://httpbin.org/get"
 $ status_code: int 200
 $ headers    : raw [1:303] 48 54 54 50 ...
 $ modified   : POSIXct[1:1], format: NA
 $ times      : Named num [1:6] 0 0.0317 0.1293 0.4425 0.5442 ...
  ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
 $ content    : raw [1:300] 7b 0a 20 20 ...
parse_headers(req$headers)
 [1] "HTTP/1.1 200 OK"                        "Connection: keep-alive"                
 [3] "Server: meinheld/0.6.1"                 "Date: Thu, 05 Oct 2017 11:21:50 GMT"   
 [5] "Content-Type: application/json"         "Access-Control-Allow-Origin: *"        
 [7] "Access-Control-Allow-Credentials: true" "X-Powered-By: Flask"                   
 [9] "X-Processed-Time: 0.000731945037842"    "Content-Length: 300"                   
[11] "Via: 1.1 vegur"                        
cat(rawToChar(req$content))
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "R (3.4.1 x86_64-apple-darwin15.6.0 x86_64 darwin15.6.0)"
  }, 
  "origin": "80.101.61.181", 
  "url": "https://httpbin.org/get"
}

The curl_fetch_memory interface is the easiest to use and the most powerful for building API clients. However, it is not suitable for downloading very large files because the entire response is kept in memory. If you are expecting 100 GB of data, you probably need one of the other interfaces.

Downloading to disk

The second method is curl_download, which has been designed as a drop-in replacement for download.file in base R. It writes the response straight to disk, which is useful for downloading (large) files.

tmp <- tempfile()
curl_download("https://httpbin.org/get", tmp)
cat(readLines(tmp), sep = "\n")
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "R (3.4.1 x86_64-apple-darwin15.6.0 x86_64 darwin15.6.0)"
  }, 
  "origin": "80.101.61.181", 
  "url": "https://httpbin.org/get"
}

Streaming data

The most flexible interface is the curl function, which has been designed as a drop-in replacement for base url. It will create a so-called connection object, which allows for incremental (asynchronous) reading of the response.

con <- curl("https://httpbin.org/get")
open(con)

# Get 3 lines
out <- readLines(con, n = 3)
cat(out, sep = "\n")
{
  "args": {}, 
  "headers": {
# Get 3 more lines
out <- readLines(con, n = 3)
cat(out, sep = "\n")
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
# Get remaining lines
out <- readLines(con)
close(con)
cat(out, sep = "\n")
    "Host": "httpbin.org", 
    "User-Agent": "R (3.4.1 x86_64-apple-darwin15.6.0 x86_64 darwin15.6.0)"
  }, 
  "origin": "80.101.61.181", 
  "url": "https://httpbin.org/get"
}

The example shows how to use readLines on an open connection to read n lines at a time. Similarly, readBin can be used to read n bytes at a time when stream-parsing binary data.
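As a minimal sketch of the binary variant, the loop below reads raw chunks with readBin until the stream is exhausted (the httpbin /bytes endpoint and the 4096-byte chunk size are just illustrative choices):

```r
library(curl)

# Stream-parse binary data in fixed-size chunks
con <- curl("https://httpbin.org/bytes/10000")
open(con, "rb")
total <- 0
while(length(buf <- readBin(con, raw(), 4096))){
  # Each iteration 'buf' holds up to 4096 bytes of response data
  total <- total + length(buf)
}
close(con)
print(total)
```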

Non blocking connections

As of version 2.3 it is also possible to open connections in non-blocking mode. In this case readBin and readLines will return immediately with the data that is available, without waiting. For non-blocking connections we use isIncomplete to check if the download has completed.

con <- curl("https://httpbin.org/drip?duration=1&numbytes=50")
open(con, "rb", blocking = FALSE)
while(isIncomplete(con)){
  buf <- readBin(con, raw(), 1024)
  if(length(buf)) 
    cat("received: ", rawToChar(buf), "\n")
}
received:  ************************************************** 
close(con)

The curl_fetch_stream function provides a very simple wrapper around a non-blocking connection.
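For example, curl_fetch_stream takes a callback function that is invoked with each chunk of raw data as it arrives (the httpbin /drip endpoint below is just a convenient slow-response test URL):

```r
library(curl)

# Process response data incrementally as chunks arrive
curl_fetch_stream("https://httpbin.org/drip?duration=1&numbytes=50", function(data){
  cat("received chunk of", length(data), "bytes\n")
})
```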

Async requests

As of curl 2.0 the package provides an async interface which can perform multiple simultaneous requests concurrently. The curl_fetch_multi function adds a request to a pool and returns immediately; it does not actually perform the request.

pool <- new_pool()
cb <- function(req){cat("done:", req$url, ": HTTP:", req$status, "\n")}
curl_fetch_multi('https://www.google.com', done = cb, pool = pool)
curl_fetch_multi('https://cloud.r-project.org', done = cb, pool = pool)
curl_fetch_multi('https://httpbin.org/blabla', done = cb, pool = pool)

When we call multi_run(), all scheduled requests are performed concurrently. The callback functions get triggered when each request completes.

# This actually performs requests:
out <- multi_run(pool = pool)
done: https://www.google.nl/?gfe_rd=cr&dcr=0&ei=0BXWWa-SGqGk8wfAlo7YAw : HTTP: 200 
done: https://httpbin.org/blabla : HTTP: 404 
done: https://cloud.r-project.org/ : HTTP: 200 
print(out)
$success
[1] 3

$error
[1] 0

$pending
[1] 0

The system allows for running many concurrent non-blocking requests. However, it is quite complex and requires careful specification of handler functions.
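For instance, besides the done callback, each request can also be given a fail callback which receives an error message when the request cannot be completed (the unreachable hostname below is made up for illustration):

```r
library(curl)

pool <- new_pool()
success <- function(req){cat("OK:", req$url, "\n")}
failure <- function(msg){cat("Failed:", msg, "\n")}

# One request that should succeed, one that should fail to connect
curl_fetch_multi("https://httpbin.org/get", done = success, fail = failure, pool = pool)
curl_fetch_multi("https://no.such.host.example", done = success, fail = failure, pool = pool)
multi_run(pool = pool)
```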

Exception handling

An HTTP request can encounter two types of errors:

  1. Connection failure: network down, host not found, invalid SSL certificate, etc.
  2. HTTP non-success status: 401 (Unauthorized), 404 (Not Found), 503 (Service Unavailable), etc.

The first type of error (connection failure) will always raise an error in R, for each interface. However, if the request succeeds and the server returns a non-success HTTP status code, only curl() and curl_download() will raise an error. Let’s dive a little deeper into this.
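Because connection failures surface as regular R errors, they can be caught with the usual tools; a minimal sketch (the hostname is made up so the lookup fails):

```r
library(curl)

# A connection failure raises an R condition that tryCatch can handle
result <- tryCatch(
  curl_fetch_memory("https://no.such.host.example"),
  error = function(e){
    cat("Request failed:", conditionMessage(e), "\n")
    NULL
  }
)
is.null(result)
```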

Error automatically

The curl and curl_download functions are the safest to use because they automatically raise an error if the request was completed but the server returned a non-success (400 or higher) HTTP status. This mimics the behavior of the base functions url and download.file. Therefore we can safely write code like this:

# This is OK
curl_download('https://cran.r-project.org/CRAN_mirrors.csv', 'mirrors.csv')
mirrors <- read.csv('mirrors.csv')
unlink('mirrors.csv')

If the HTTP request was unsuccessful, R will not continue:

# Oops! A typo in the URL!
curl_download('https://cran.r-project.org/CRAN_mirrorZ.csv', 'mirrors.csv')
Error in curl_download("https://cran.r-project.org/CRAN_mirrorZ.csv", : HTTP error 404.
con <- curl('https://cran.r-project.org/CRAN_mirrorZ.csv')
open(con)
Error in open.connection(con): HTTP error 404.

Check manually

When using any of the curl_fetch_* functions it is important to realize that these do not raise an error if the request was completed but returned a non-success status code. When using curl_fetch_memory or curl_fetch_disk you need to implement such application logic yourself and check whether the response was successful.

req <- curl_fetch_memory('https://cran.r-project.org/CRAN_mirrors.csv')
print(req$status_code)
[1] 200

The same holds for downloading to disk. If you do not check the status, you might have downloaded an error page!

# Oops a typo!
req <- curl_fetch_disk('https://cran.r-project.org/CRAN_mirrorZ.csv', 'mirrors.csv')
print(req$status_code)
[1] 404
# This is not the CSV file we were expecting!
head(readLines('mirrors.csv'))
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"                               
[2] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\""               
[3] "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"                 
[4] "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">"
[5] "<head>"                                                                   
[6] "<title>Object not found!</title>"                                         
unlink('mirrors.csv')

If you do want the curl_fetch_* functions to raise an error automatically, set the failonerror option to TRUE in the handle for the request.

h <- new_handle(failonerror = TRUE)
curl_fetch_memory('https://cran.r-project.org/CRAN_mirrorZ.csv', handle = h)
Error in curl_fetch_memory("https://cran.r-project.org/CRAN_mirrorZ.csv", : The requested URL returned error: 404 Not Found

Customizing requests

By default libcurl uses HTTP GET to issue a request to an HTTP URL. To send a customized request, we first need to create and configure a curl handle object which is passed to the specific download interface.

Configuring a handle

Creating a new handle is done using new_handle. After creating a handle object, we can set the libcurl options and HTTP request headers.

h <- new_handle()
handle_setopt(h, copypostfields = "moo=moomooo")
handle_setheaders(h,
  "Content-Type" = "text/moo",
  "Cache-Control" = "no-cache",
  "User-Agent" = "A cow"
)

Use the curl_options() function to get a list of the options supported by your version of libcurl. The libcurl documentation explains what each option does. Option names are not case-sensitive.
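A quick sketch of looking up available options (the "timeout" option is just one example; your libcurl version determines the full list):

```r
library(curl)

# curl_options() returns a named vector of supported libcurl options
opts <- curl_options()
length(opts)
head(names(opts))

# Check whether a particular option is available
"timeout" %in% names(opts)
```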

After the handle has been configured, it can be used with any of the download interfaces to perform the request. For example curl_fetch_memory will store the output of the request in memory:

req <- curl_fetch_memory("http://httpbin.org/post", handle = h)
cat(rawToChar(req$content))
{
  "args": {}, 
  "data": "moo=moomooo", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "no-cache", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "text/moo", 
    "Host": "httpbin.org", 
    "User-Agent": "A cow"
  }, 
  "json": null, 
  "origin": "80.101.61.181", 
  "url": "http://httpbin.org/post"
}

Alternatively we can use curl() to read the data via a connection interface:

con <- curl("http://httpbin.org/post", handle = h)
cat(readLines(con), sep = "\n")
{
  "args": {}, 
  "data": "moo=moomooo", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "no-cache", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "text/moo", 
    "Host": "httpbin.org", 
    "User-Agent": "A cow"
  }, 
  "json": null, 
  "origin": "80.101.61.181", 
  "url": "http://httpbin.org/post"
}

Or we can use curl_download to write the response to disk:

tmp <- tempfile()
curl_download("http://httpbin.org/post", destfile = tmp, handle = h)
cat(readLines(tmp), sep = "\n")
{
  "args": {}, 
  "data": "moo=moomooo", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "no-cache", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "text/moo", 
    "Host": "httpbin.org", 
    "User-Agent": "A cow"
  }, 
  "json": null, 
  "origin": "80.101.61.181", 
  "url": "http://httpbin.org/post"
}

Or perform the same request with a multi pool:

curl_fetch_multi("http://httpbin.org/post", handle = h, done = function(res){
  cat("Request complete! Response content:\n")
  cat(rawToChar(res$content))
})

# Perform the request
out <- multi_run()
Request complete! Response content:
{
  "args": {}, 
  "data": "moo=moomooo", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "no-cache", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "text/moo", 
    "Host": "httpbin.org", 
    "User-Agent": "A cow"
  }, 
  "json": null, 
  "origin": "80.101.61.181", 
  "url": "http://httpbin.org/post"
}

Reading cookies

Curl handles automatically keep track of cookies set by the server. At any given point we can use handle_cookies to see a list of current cookies in the handle.

# Start with a fresh handle
h <- new_handle()

# Ask server to set some cookies
req <- curl_fetch_memory("http://httpbin.org/cookies/set?foo=123&bar=ftw", handle = h)
req <- curl_fetch_memory("http://httpbin.org/cookies/set?baz=moooo", handle = h)
handle_cookies(h)
       domain  flag path secure expiration name value
1 httpbin.org FALSE    /  FALSE       <NA>  bar   ftw
2 httpbin.org FALSE    /  FALSE       <NA>  foo   123
3 httpbin.org FALSE    /  FALSE       <NA>  baz moooo
# Unset a cookie
req <- curl_fetch_memory("http://httpbin.org/cookies/delete?foo", handle = h)
handle_cookies(h)
       domain  flag path secure          expiration name value
1 httpbin.org FALSE    /  FALSE                <NA>  bar   ftw
2 httpbin.org FALSE    /  FALSE 2017-10-05 11:21:55  foo  <NA>
3 httpbin.org FALSE    /  FALSE                <NA>  baz moooo

The handle_cookies function returns a data frame with 7 columns as specified in the Netscape cookie file format.

On reusing handles

In most cases you should not reuse a single handle object for more than one request. The only benefit of reusing a handle for multiple requests is to keep track of cookies set by the server (seen above). This may be needed if your server uses session cookies, but this is rare these days. Most APIs set state explicitly via HTTP headers or parameters, rather than implicitly via cookies.

In recent versions of the curl package there is no performance benefit to reusing handles. The overhead of creating and configuring a new handle object is negligible. The safest way to issue multiple requests, either to a single server or to multiple servers, is to use a separate handle for each request (which is the default).

req1 <- curl_fetch_memory("https://httpbin.org/get")
req2 <- curl_fetch_memory("http://www.r-project.org")

In past versions of this package you needed to manually reuse a handle to take advantage of HTTP Keep-Alive. However, as of version 2.3 this is no longer the case: curl automatically maintains a global pool of open HTTP connections shared by all handles. When performing many requests to the same server, curl automatically reuses existing connections when possible, eliminating much of the connection overhead.