Databases

As well as working with local in-memory data like data frames and data tables, dplyr also works with remote on-disk data stored in databases. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle. The reason you'd want to use dplyr with a database is that either your data is already in a database (and you don't want to work with static CSV files that someone else has dumped out for you), or you have so much data that it does not fit in memory and you have to use a database. Currently dplyr supports the three most popular open source databases (SQLite, MySQL and PostgreSQL), and Google's BigQuery.

Since R almost exclusively works with in-memory data, if you do have a lot of data in a database, you can't just dump it into R. Instead, you'll have to work with subsets or aggregates, and dplyr aims to make that as easy as possible. If you're working with large data, it's also likely that you'll need support to get the data into the database and to ensure you have the right indices for good performance. dplyr provides some simple tools to help with these tasks but they are no substitute for a local expert.

The motivation for supporting databases in dplyr is that you never pull down the right subset or aggregate from the database the first time, and usually you have to iterate between R and SQL many times before you get the perfect dataset. Switching between languages is cognitively challenging (especially because R and SQL are so perilously similar), so dplyr allows you to write R code that is automatically translated to SQL. The goal of dplyr is not to replace every SQL function with an R function: that would be difficult and error prone. Instead, dplyr only generates SELECT statements, the SQL you write most often as an analyst.

To get the most out of this chapter, you'll need to be familiar with querying SQL databases using the SELECT statement. If you have some familiarity with SQL and you'd like to learn more, I found "How indexes work in SQLite" and "10 easy steps to a complete understanding of SQL" to be particularly helpful.

Getting started

To experiment with databases, it's easiest to get started with SQLite because everything you need is included in the R package. You don't need to install anything else, and you don't need to deal with the hassle of setting up a database server. Using a SQLite database in dplyr is really easy: just give it a path and the ok to create it.

my_db <- src_sqlite("my_db.sqlite3", create = TRUE)

The main new concept here is the src, which is a collection of tables. Use src_sqlite(), src_mysql(), src_postgres() and src_bigquery() to connect to the different databases supported by dplyr.

my_db currently has no data in it, so we'll load it up with the hflights data using the convenient copy_to() function. This is a quick and dirty way of getting data into a database, but it's not suitable for very large datasets because all the data has to flow through R.

library(hflights)
hflights_sqlite <- copy_to(my_db, hflights, temporary = FALSE, indexes = list(
  c("Year", "Month", "DayofMonth"), "UniqueCarrier", "TailNum"))

As you can see, the copy_to() operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, by carrier and by plane. copy_to() also executes the SQL ANALYZE command: this ensures that the database has up-to-date table statistics and can pick appropriate query optimisations.

For this particular dataset, there's a built-in src that will cache hflights in a standard location:

hflights_sqlite <- tbl(hflights_sqlite(), "hflights")
#> Loading required package: hflights
#> Loading required package: RSQLite
#> Loading required package: DBI
#> Loading required package: RSQLite.extfuns
hflights_sqlite
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 21]
#> 
#>    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1  2011     1          1         6    1400    1500            AA       428
#> 2  2011     1          2         7    1401    1501            AA       428
#> 3  2011     1          3         1    1352    1502            AA       428
#> 4  2011     1          4         2    1403    1513            AA       428
#> ..  ...   ...        ...       ...     ...     ...           ...       ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#>   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#>   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#>   CancellationCode (chr), Diverted (int)

You can also create a tbl from arbitrary SQL:

tbl(my_db, sql("SELECT * FROM hflights"))

Basic verbs

Remote data sources use exactly the same five verbs as local data sources:

select(hflights_sqlite, Year:DayofMonth, DepDelay, ArrDelay)
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 5]
#> 
#>    Year Month DayofMonth DepDelay ArrDelay
#> 1  2011     1          1        0      -10
#> 2  2011     1          2        1       -9
#> 3  2011     1          3       -8       -8
#> 4  2011     1          4        3        3
#> ..  ...   ...        ...      ...      ...
filter(hflights_sqlite, DepDelay > 240)
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: hflights [389 x 21]
#> Filter: DepDelay > 240 
#> 
#>    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1  2011     1         28         5    1516    1916            CO         1
#> 2  2011     1         27         4    2137    2254            CO       150
#> 3  2011     1         20         4     635     807            CO        59
#> 4  2011     1         17         1    1838    2109            CO       746
#> ..  ...   ...        ...       ...     ...     ...           ...       ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#>   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#>   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#>   CancellationCode (chr), Diverted (int)
arrange(hflights_sqlite, Year, Month, DayofMonth)
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 21]
#> Arrange: Year, Month, DayofMonth 
#> 
#>    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1  2011     1          1         6    1400    1500            AA       428
#> 2  2011     1          1         6     728     840            AA       460
#> 3  2011     1          1         6    1631    1736            AA      1121
#> 4  2011     1          1         6    1756    2112            AA      1294
#> ..  ...   ...        ...       ...     ...     ...           ...       ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#>   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#>   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#>   CancellationCode (chr), Diverted (int)
mutate(hflights_sqlite, speed = AirTime / Distance)
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 22]
#> 
#>    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1  2011     1          1         6    1400    1500            AA       428
#> 2  2011     1          2         7    1401    1501            AA       428
#> 3  2011     1          3         1    1352    1502            AA       428
#> 4  2011     1          4         2    1403    1513            AA       428
#> ..  ...   ...        ...       ...     ...     ...           ...       ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#>   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#>   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#>   CancellationCode (chr), Diverted (int), speed (int)
summarise(hflights_sqlite, delay = mean(DepTime))
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: <derived table> [?? x 1]
#> 
#>    delay
#> 1   1396
#> ..   ...

The most important difference is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database. This translation is almost perfect for the most common operations but there are some limitations, which you'll learn about later.

Laziness

When working with databases, dplyr tries to be as lazy as possible. It's lazy in two ways: it never pulls data into R unless you explicitly ask for it, and it delays doing any work until the last possible moment, collecting together everything you want to do and then sending it to the database in a single step.

For example, take the following code:

c1 <- filter(hflights_sqlite, DepDelay > 0)
c2 <- select(c1, Year, Month, DayofMonth, UniqueCarrier, DepDelay, AirTime, Distance)
c3 <- mutate(c2, Speed = Distance / AirTime * 60)
c4 <- arrange(c3, Year, Month, DayofMonth, UniqueCarrier)

Surprisingly, this sequence of operations never actually touches the database. It's not until you ask for the data (e.g. by printing c4) that dplyr generates the SQL and requests the results from the database, and even then it only pulls down 10 rows.

c4
#> Source: sqlite 3.7.17 [/private/tmp/RtmptxkEBK/Rinst1424b59c21ad0/dplyr/db/hflights.sqlite]
#> From: hflights [109,996 x 8]
#> Filter: DepDelay > 0 
#> Arrange: Year, Month, DayofMonth, UniqueCarrier 
#> 
#>    Year Month DayofMonth UniqueCarrier DepDelay AirTime Distance Speed
#> 1  2011     1          1            AA        8      41      224   300
#> 2  2011     1          1            AA        1      37      224   360
#> 3  2011     1          1            AA        1     113      964   480
#> 4  2011     1          1            AA        6      39      224   300
#> ..  ...   ...        ...           ...      ...     ...      ...   ...

To pull down all the results use collect(), which returns a tbl_df():

collect(c4)
#> Source: local data frame [109,996 x 8]
#> 
#>    Year Month DayofMonth UniqueCarrier DepDelay AirTime Distance Speed
#> 1  2011     1          1            AA        8      41      224   300
#> 2  2011     1          1            AA        1      37      224   360
#> 3  2011     1          1            AA        1     113      964   480
#> 4  2011     1          1            AA        6      39      224   300
#> ..  ...   ...        ...           ...      ...     ...      ...   ...

You can see the query dplyr has generated by looking at the query component of the object:

c4$query
#> <Query> SELECT "Year", "Month", "DayofMonth", "UniqueCarrier", "DepDelay", "AirTime", "Distance", "Distance" / "AirTime" * 60.0 AS "Speed"
#> FROM "hflights"
#> WHERE "DepDelay" > 0.0
#> ORDER BY "Year", "Month", "DayofMonth", "UniqueCarrier"
#> <SQLiteConnection: DBI CON (82546, 0)>

You can also ask the database how it plans to execute the query with explain(). The output for SQLite is explained in more detail on the SQLite website and is helpful if you're trying to figure out what indexes are being used.

explain(c4)
#> <SQL>
#> SELECT "Year", "Month", "DayofMonth", "UniqueCarrier", "DepDelay", "AirTime", "Distance", "Distance" / "AirTime" * 60.0 AS "Speed"
#> FROM "hflights"
#> WHERE "DepDelay" > 0.0
#> ORDER BY "Year", "Month", "DayofMonth", "UniqueCarrier"
#> 
#> <PLAN>
#>   selectid order from                            detail
#> 1        0     0    0 SCAN TABLE hflights (~75832 rows)
#> 2        0     0    0      USE TEMP B-TREE FOR ORDER BY

Forcing computation

There are three ways to force the computation of a query: collect() executes the query and returns the results to R; compute() executes the query and stores the results in a temporary table in the database; and collapse() turns the query into a table expression that subsequent queries can build on.

You are most likely to use collect(): once you have interactively converged on the right set of operations, use collect() to pull down the data into a local tbl_df(). If you have some knowledge of SQL, you can use compute() and collapse() to optimise performance.
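As a rough sketch (reusing c4 from above), the three functions differ mainly in where the results end up:

```r
# collect() runs the query and brings the results back into R
c4_local  <- collect(c4)

# compute() runs the query and stores the results in a temporary
# table on the database, returning a new tbl that points at it
c4_cached <- compute(c4)

# collapse() doesn't run anything: it wraps the operations built up
# so far into a subquery for later verbs to build on
c4_nested <- collapse(c4)
```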

Performance considerations

dplyr tries to prevent you from accidentally performing expensive query operations: because there's generally no way to determine how many rows a query will return without actually running it, nrow() is always NA, and because you can't find the last few rows without executing the whole query, you can't use tail().
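You can see this conservative behaviour interactively (a sketch, using the hflights_sqlite tbl from above):

```r
# Cheap operations work; expensive ones are blocked or return NA
nrow(hflights_sqlite)     # NA: the row count isn't known without running the query
head(hflights_sqlite, 3)  # fine: translates to a query with a LIMIT clause
# tail(hflights_sqlite)   # not supported: the last rows need the whole result
```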

SQL translation

When doing simple mathematical operations of the form you normally use when filtering, mutating and summarising, it's relatively straightforward to translate R code to SQL (or indeed to any programming language).

To experiment with the translation, use translate_sql(). The following examples work through some basic differences between R and SQL.

# In SQLite variable names are escaped by double quotes:
translate_sql(x)
#> <SQL> "x"
# And strings are escaped by single quotes
translate_sql("x")
#> <SQL> 'x'

# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))
#> <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
#> <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
#> <SQL> "x" % 2.0 = 10.0

# R and SQL have different defaults for integers vs reals.
# In R, 1 is a real, and 1L is an integer
# In SQL, 1 is an integer, and 1.0 is a real
translate_sql(1)
#> <SQL> 1.0
translate_sql(1L)
#> <SQL> 1

dplyr knows how to convert the following R functions to SQL: basic maths operators (+, -, *, /, %%, ^), maths functions (abs, ceiling, exp, floor, log, log10, round, sqrt and friends), logical comparisons (<, <=, !=, >=, >, ==, %in%), boolean operations (&, &&, |, ||, !, xor) and basic aggregations (mean, sum, min, max, sd, var).

The basic techniques underlying the implementation of translate_sql() are described in the Advanced R book. translate_sql() is built on top of R's parsing engine and has been carefully designed to generate correct SQL. It also protects you against SQL injection attacks by correctly escaping strings and variable names as needed by the database that you're connecting to.

It's not possible to provide a perfect translation because databases don't have all the functions that R does. The goal of dplyr is to provide a semantic translation: to translate what you mean, not the precise details. Even for functions that exist both in databases and R, you shouldn't expect results to be exactly the same; database programmers have different priorities to R core. For example, in R, mean() loops through the data twice in order to get a higher level of numerical accuracy at the cost of being twice as slow. R's mean() also provides a trim option for computing trimmed means, which databases do not provide. Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. This means the essence of simple calls like mean(x) will be translated accurately, but more complicated calls like mean(x, trim = 0.5, na.rm = TRUE) will raise an error:

translate_sql(mean(x, trim = TRUE))
# Error: Invalid number of args to SQL AVG. Expecting 1

Any function that dplyr doesn't know how to convert is left as is, so if you want to use any other function that your database provides, you can. Here are a couple of examples that will work with SQLite:

translate_sql(glob(x, y))
#> <SQL> GLOB("x", "y")
translate_sql(x %like% "ab*")
#> <SQL> "x" LIKE 'ab*'

Grouping

SQLite lacks window functions, which are needed for grouped mutation and filtering. This means that the only really useful operation for grouped SQLite tables is summarise(). The grouped summarise from the introduction translates well - the only difference is that databases always drop NULLs (their equivalent of missing values), so we don't supply na.rm = TRUE.

planes <- group_by(hflights_sqlite, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance),
  delay = mean(ArrDelay)
)
delay <- filter(delay, count > 20, dist < 2000)
delay_local <- collect(delay)

Other databases do support window functions and you can learn about them in the corresponding vignette. It's sometimes possible to simulate grouped filters and mutates using self joins, where you join the original table with a summarised version, but that topic is beyond the scope of this intro.

Other databases

Using other databases instead of SQLite works similarly: the overall workflow is the same regardless of which database you're connecting to. The following sections go into more detail on the peculiarities of each database engine. All of these databases follow a client-server model - as well as your computer, which is connecting to the database, there is another computer actually running it (that might be your computer, but usually isn't). Setting up one of these databases is beyond the scope of this article, but there are plenty of tutorials available on the web.

PostgreSQL

src_postgres() has five arguments: dbname, host, port, user and password. If you are running a local PostgreSQL database with the default settings you'll only need dbname, but in most cases you'll need all five. dplyr uses the RPostgreSQL package to connect to PostgreSQL databases, which means you can't currently connect to remote databases that require an SSL connection (e.g. Heroku).
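A connection to a remote server, with all five arguments supplied, looks something like this (the host and credentials here are placeholders; substitute your own server's details):

```r
# Hypothetical connection details for a remote PostgreSQL server
my_postgres <- src_postgres(
  dbname = "hflights",
  host = "db.example.com",
  port = 5432,
  user = "analyst",
  password = "secret"
)
```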

For example, the following code allows me to connect to a local PostgreSQL database that contains a copy of the hflights data:

if (has_lahman("postgres")) {
  hflights_postgres <- tbl(src_postgres("hflights"), "hflights")
}
#> Loading required package: RPostgreSQL

Postgres is a considerably more powerful database than SQLite: it has a much wider range of built-in functions, and it supports window functions, which make grouped mutates and filters possible.

The following example shows the grouped filter and mutate operations possible with PostgreSQL. The SQL generated from the grouped filter is quite complex because you can't filter on window functions directly; instead they have to go in a subquery.

if (has_lahman("postgres")) {
  daily <- group_by(hflights_postgres, Year, Month, DayofMonth)

  # Find the most and least delayed flight each day
  bestworst <- filter(daily, ArrDelay == min(ArrDelay) ||
    ArrDelay == max(ArrDelay))
  bestworst$query

  # Rank each flight within each day
  ranked <- mutate(daily, rank = rank(desc(ArrDelay)))
  ranked$query
}
#> Auto-disconnecting postgres connection (82546, 2)
#> <Query> SELECT "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "ArrTime", "UniqueCarrier", "FlightNum", "TailNum", "ActualElapsedTime", "AirTime", "ArrDelay", "DepDelay", "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted", rank() OVER (PARTITION BY ("Year", "Month", "DayofMonth") ORDER BY "ArrDelay" DESC) AS "rank"
#> FROM "hflights"
#> An object of class "PostgreSQLConnection"
#> Slot "Id":
#> [1] 82546     1

MySQL and MariaDB

You can connect to MySQL and MariaDB (a recent fork of MySQL) through src_mysql(), mediated by the RMySQL package. As with PostgreSQL, you'll need to provide a dbname, username, password, host and port.

In terms of functionality, MySQL lies somewhere between SQLite and PostgreSQL. It provides a wider range of built-in functions, but it does not support window functions (so you can't do grouped mutates and filters).
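The connection call mirrors the PostgreSQL one; the values below are placeholders for your own server's details:

```r
# Hypothetical connection details for a MySQL/MariaDB server
my_mysql <- src_mysql(
  dbname = "hflights",
  host = "db.example.com",
  port = 3306,
  user = "analyst",
  password = "secret"
)
```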

BigQuery

BigQuery is a hosted database service provided by Google. To connect, you need to provide your project, dataset and, optionally, a project for billing (if billing isn't enabled for your main project). After you create the src, your web browser will open and ask you to authenticate. Your credentials are stored in a local cache, so you should only need to do this once.
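A minimal connection sketch, where "my-project" and "my-dataset" are placeholders for your own project and dataset names:

```r
# Creating the src triggers browser-based authentication on first use
my_bq <- src_bigquery("my-project", "my-dataset")
flights <- tbl(my_bq, "hflights")
```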

BigQuery supports only a single SQL statement: SELECT. Fortunately this is all you need for data analysis, and within SELECT, BigQuery provides comprehensive coverage at a similar level to PostgreSQL.

Picking a database

If you don't already have a database, here's some advice based on my experience setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built-in functions. Don't bother with MySQL/MariaDB: it's a pain to set up and the documentation is subpar. Google BigQuery might be a good fit if you have very large data, or if you're willing to pay (a small amount of) money for someone else to look after your database.