Creating

data_frame() is a nice way to create data frames. It encapsulates best practices for data frames:

Coercion

To complement data_frame(), dplyr provides as_data_frame() to coerce lists into data frames. It does two things:

This is much simpler than as.data.frame(). It’s hard to explain precisely what as.data.frame() does, but it’s similar to do.call(cbind, lapply(x, data.frame)), i.e. it coerces each component to a data frame and then cbind()s them all together. Consequently as_data_frame() is much faster than as.data.frame():

l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l2),
  as.data.frame(l2)
)
#> Unit: microseconds
#>               expr      min        lq      mean   median        uq
#>  as_data_frame(l2)  102.631  113.7605  135.0108  121.490  142.8445
#>  as.data.frame(l2) 1524.588 1619.3670 1901.9522 1739.363 2063.0425
#>       max neval cld
#>   318.934   100  a 
#>  3705.862   100   b

The speed of as.data.frame() is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.
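
To make the comparison with do.call(cbind, lapply(x, data.frame)) concrete, here is a minimal base R sketch of that approximation. The list l is made-up example data, and the explicit names<- call is an addition of this sketch, since the intermediate one-column data frames do not carry the list names. It illustrates the idea only; it is not a description of what as.data.frame() does internally.

```r
# Hypothetical example data: a named list of equal-length vectors.
l <- list(a = 1:3, b = c(2.5, 3.5, 4.5))

# Step 1: coerce each component to its own one-column data frame.
pieces <- lapply(l, data.frame)

# Step 2: bind the pieces together column-wise.
by_hand <- do.call(cbind, pieces)

# Restore the list names explicitly rather than relying on cbind().
names(by_hand) <- names(l)

# The direct route, for comparison.
direct <- as.data.frame(l)

# Compare the column values produced by the two routes.
same_a <- identical(by_hand$a, direct$a)
same_b <- identical(by_hand$b, direct$b)
```

Building one intermediate data frame per component is the kind of repeated work that adds up when the list has many elements.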

Memory

One of the reasons that dplyr is fast is that it is very careful about when it makes copies. This section describes how this works, and gives you some useful tools for understanding the memory usage of data frames in R.

The first tool we’ll use is dplyr::location(). It tells us the memory location of three components of a data frame object:

location(iris)
#> <0x7fc5ad268b40>
#> Variables:
#>  * Sepal.Length: <0x7fc5ad1f8a00>
#>  * Sepal.Width:  <0x7fc5ad258a00>
#>  * Petal.Length: <0x7fc5ad274000>
#>  * Petal.Width:  <0x7fc5ad273200>
#>  * Species:      <0x7fc5ae8364e0>
#> Attributes:
#>  * names:        <0x7fc5ad268ba8>
#>  * row.names:    <0x7fc5ae817410>
#>  * class:        <0x7fc5ad294798>

It’s useful to know the memory address, because if the address changes, then you’ll know that R has made a copy. Copies are bad because they take time to create. This isn’t usually a bottleneck if you have a few thousand values, but if you have millions or tens of millions of values it starts to take significant amounts of time. Unnecessary copies are also bad because they take up memory.

R tries to avoid making copies where possible. For example, if you just assign iris to another variable, it continues to point to the same location:

iris2 <- iris
location(iris2)
#> <0x7fc5ad268b40>
#> Variables:
#>  * Sepal.Length: <0x7fc5ad1f8a00>
#>  * Sepal.Width:  <0x7fc5ad258a00>
#>  * Petal.Length: <0x7fc5ad274000>
#>  * Petal.Width:  <0x7fc5ad273200>
#>  * Species:      <0x7fc5ae8364e0>
#> Attributes:
#>  * names:        <0x7fc5ad268ba8>
#>  * row.names:    <0x7fc5aafd30c0>
#>  * class:        <0x7fc5ad294798>

Rather than having to compare hard-to-read memory locations, we can instead use the dplyr::changes() function to highlight changes between two versions of a data frame. The code below shows us that iris and iris2 are identical: both names point to the same location in memory.

changes(iris2, iris)
#> <identical>

What do you think happens if you modify a single column of iris2? In R 3.1.0 and above, R knows to modify only that one column and to leave the others pointing to their existing locations:

iris2$Sepal.Length <- iris2$Sepal.Length * 2
changes(iris, iris2)
#> Changed variables:
#>              old            new           
#> Sepal.Length 0x7fc5ad1f8a00 0x7fc5af730c00
#> 
#> Changed attributes:
#>              old            new           
#> row.names    0x7fc5aac62ab0 0x7fc5aac96a40

(This was not the case prior to version 3.1.0, where R created a deep copy of the entire data frame.)
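
If dplyr::location() and dplyr::changes() are not at hand, the same behaviour can be checked in base R with tracemem(), which returns an object's memory address as a string. The sketch below assumes R 3.1.0 or later and an R build with memory profiling enabled (capabilities("profmem") returns TRUE). addr() is a hypothetical helper defined here, and df is made-up example data.

```r
# Hypothetical helper: return the address of an object as a string.
# tracemem() marks the object for tracing and returns its address;
# untracemem() removes the mark again so no copy messages appear later.
addr <- function(x) {
  a <- tracemem(x)
  untracemem(x)
  a
}

# Hypothetical example data.
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Plain assignment does not copy: both names refer to the same object.
df2 <- df
same_object <- identical(addr(df), addr(df2))

# Modifying one column of df2 replaces that column only.
df2$x <- df2$x * 2
x_changed <- !identical(addr(df$x), addr(df2$x))
y_shared  <- identical(addr(df$y), addr(df2$y))
```

Under these assumptions, same_object, x_changed and y_shared should all come out TRUE: the modified column gets a new address while the untouched column keeps its original one.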

dplyr is equally smart:

iris3 <- mutate(iris, Sepal.Length = Sepal.Length * 2)
changes(iris3, iris)
#> Changed variables:
#>              old            new           
#> Sepal.Length 0x7fc5b008a400 0x7fc5ad1f8a00
#> 
#> Changed attributes:
#>              old            new           
#> class        0x7fc5ae76c138 0x7fc5ad294798
#> names        0x7fc5af732a70 0x7fc5ad268ba8
#> row.names    0x7fc5aaca8ae0 0x7fc5aaca8d60

It creates only one new column while all the other columns continue to point at their original locations. You might notice that the attributes are still copied. However, this has little impact on performance because attributes are usually short vectors. The internal dplyr code needed to copy them is also considerably simpler.

dplyr never makes copies unless it has to:

In short, dplyr lets you work with data frames with very little memory overhead.

data.table takes this idea one step further: it provides functions that modify a data table in place. This avoids the need to make copies of pointers to existing columns and attributes, and speeds up operations when you have many columns. dplyr doesn’t do this with data frames (although it could) because I think it’s safer to keep data immutable: even if the resulting data frame shares practically all the data of the original data frame, all dplyr data frame methods return a new data frame.
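
A small sketch of what this immutability means in practice: mutate() hands back a new data frame and leaves its input untouched. The data frame df is made-up example data.

```r
library(dplyr)

# Hypothetical example data.
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# mutate() returns a new data frame; df itself is not modified.
df_new <- mutate(df, x = x * 2)

df$x      # the original values
df_new$x  # the doubled values
```

Because df is never altered, earlier results stay valid no matter what later steps do with df_new.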