One of the reasons that dplyr is fast is that it’s very careful about when to make copies. This section describes how this works, and gives you some useful tools for understanding the memory usage of data frames in R.
The first tool we’ll use is dplyr::location(). It tells us the memory location of three components of a data frame object (the data frame itself, each variable, and each attribute):
location(iris)
#> <0x7fdc68e309a8>
#> Variables:
#>  * Sepal.Length: <0x7fdc68f06200>
#>  * Sepal.Width: <0x7fdc68f25000>
#>  * Petal.Length: <0x7fdc68f25600>
#>  * Petal.Width: <0x7fdc68f25c00>
#>  * Species: <0x7fdc6843e9e0>
#> Attributes:
#>  * names: <0x7fdc68e30940>
#>  * row.names: <0x7fdc6843fa00>
#>  * class: <0x7fdc688bbb48>
It’s useful to know the memory address, because if the address changes, then you’ll know that R has made a copy. Copies are bad because they take time to create. This isn’t usually a bottleneck if you have a few thousand values, but if you have millions or tens of millions of values it starts to take significant amounts of time. Unnecessary copies are also bad because they take up memory.
R tries to avoid making copies where possible. For example, if you just assign iris to another variable, it continues to point to the same location:
iris2 <- iris
location(iris2)
#> <0x7fdc68e309a8>
#> Variables:
#>  * Sepal.Length: <0x7fdc68f06200>
#>  * Sepal.Width: <0x7fdc68f25000>
#>  * Petal.Length: <0x7fdc68f25600>
#>  * Petal.Width: <0x7fdc68f25c00>
#>  * Species: <0x7fdc6843e9e0>
#> Attributes:
#>  * names: <0x7fdc68e30940>
#>  * row.names: <0x7fdc6843c9e0>
#>  * class: <0x7fdc688bbb48>
Rather than having to compare hard-to-read memory locations, we can instead use the dplyr::changes() function to highlight changes between two versions of a data frame. The code below shows us that iris and iris2 are identical: both names point to the same location in memory.
changes(iris2, iris)
#> <identical>
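If location() and changes() are not available in your version of dplyr, the same check can be sketched with the lobstr package. This is only a minimal sketch: it assumes lobstr is installed, and df and df2 are made-up names standing in for iris and iris2.

```r
library(lobstr)  # assumption: the lobstr package is installed

# A small data frame of our own, standing in for iris
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- df  # plain assignment: both names are bound to the same object

# Compare the address of the data frame object itself
same_object <- obj_addr(df) == obj_addr(df2)

# Compare the addresses of the individual columns
same_columns <- all(obj_addrs(df) == obj_addrs(df2))

same_object   # TRUE
same_columns  # TRUE
```

Comparing addresses programmatically avoids reading hexadecimal values by eye, which is the same convenience changes() provides.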
What do you think happens if you modify a single column of iris2? In R 3.1.0 and above, R knows to modify only that one column and to leave the others pointing to their existing locations:
iris2$Sepal.Length <- iris2$Sepal.Length * 2
changes(iris, iris2)
#> Changed variables:
#>              old            new
#> Sepal.Length 0x7fdc68f06200 0x7fdc68fda800
#>
#> Changed attributes:
#>           old            new
#> row.names 0x7fdc6ba02760 0x7fdc6ba029e0
(This was not the case prior to version 3.1.0, where R created a deep copy of the entire data frame.)
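The same behaviour can be checked column by column. The sketch below assumes the lobstr package is installed and R 3.1.0 or later; df and df2 are illustrative names, not objects from the text above.

```r
library(lobstr)  # assumption: the lobstr package is installed

df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- df

# Modify a single column of the second data frame
df2$x <- df2$x * 2

before <- obj_addrs(df)   # column addresses in the original
after  <- obj_addrs(df2)  # column addresses after the modification

x_copied <- before[[1]] != after[[1]]  # the modified column is a new vector
y_shared <- before[[2]] == after[[2]]  # the untouched column is shared

x_copied  # TRUE
y_shared  # TRUE
```

The original df is left unchanged, which is why the modified column has to live at a new address.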
dplyr is equally smart:
iris3 <- mutate(iris, Sepal.Length = Sepal.Length * 2)
changes(iris3, iris)
#> Changed variables:
#>              old            new
#> Sepal.Length 0x7fdc6952e000 0x7fdc68f06200
#>
#> Changed attributes:
#>           old            new
#> class     0x7fdc6ab8ef58 0x7fdc688bbb48
#> names     0x7fdc699bed48 0x7fdc68e30940
#> row.names 0x7fdc6840c2d0 0x7fdc68427560
It creates only one new column while all the other columns continue to point at their original locations. You might notice that the attributes are still copied. However, this has little impact on performance because attributes are usually short vectors, and copying them keeps the internal dplyr code considerably simpler.
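To check this for mutate() without relying on changes(), one option is to compare column addresses directly. This is a sketch under assumptions: it needs both dplyr and lobstr installed, df and df3 are made-up names, and whether the untouched column is shared depends on the dplyr version behaving as described above.

```r
library(dplyr)   # assumption: dplyr is installed
library(lobstr)  # assumption: lobstr is installed

df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# Overwrite one existing column; the input data frame is not modified
df3 <- mutate(df, x = x * 2)

before <- obj_addrs(df)
after  <- obj_addrs(df3)

x_copied <- before[[1]] != after[[1]]  # the modified column must be a new vector
y_shared <- before[[2]] == after[[2]]  # TRUE when the untouched column is shared

x_copied
y_shared
```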
dplyr never makes copies unless it has to:
group_by() doesn’t copy columns
select() never copies columns, even when you rename them
mutate() never copies columns, except when you modify an existing column
arrange() must always copy all columns because you’re changing the order of every one. This is an expensive operation for big data, but you can generally avoid it using the order argument to window functions
summarise() creates new data, but it’s usually at least an order of magnitude smaller than the original data.
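Two of the claims in this list can be probed with a short experiment. The sketch below assumes dplyr and lobstr are installed; df, renamed, and sorted are illustrative names. It contrasts select() with a rename, which is described above as never copying, against arrange(), which has to build new vectors because the row order changes.

```r
library(dplyr)   # assumption: dplyr is installed
library(lobstr)  # assumption: lobstr is installed

df <- data.frame(x = c(3, 1, 2), y = c(30, 10, 20))

# select() with a rename: the column gets a new name
renamed <- select(df, z = x)
rename_shared <- obj_addrs(df)[[1]] == obj_addrs(renamed)[[1]]
rename_shared  # TRUE when the renamed column is shared rather than copied

# arrange() reorders every column, so every column is a new vector
sorted <- arrange(df, x)
all_copied <- all(obj_addrs(df) != obj_addrs(sorted))
all_copied  # TRUE
```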
In short, dplyr lets you work with data frames with very little memory overhead.
data.table takes this idea one step further: it provides functions that modify a data table in place. This avoids the need to make copies of pointers to existing columns and attributes, and speeds up operations when you have many columns. dplyr doesn’t do this with data frames (although it could) because I think it’s safer to keep data immutable: even if the resulting data frame shares practically all the data of the original data frame, all dplyr data frame methods return a new data frame.
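For comparison, here is a minimal sketch of in-place modification with data.table. It assumes the data.table package is installed; dt is a made-up name. The := operator updates the table by reference, so the table object keeps its address.

```r
library(data.table)  # assumption: the data.table package is installed

dt <- data.table(x = c(1, 2, 3), y = c(4, 5, 6))

addr_before <- address(dt)  # address of the data.table object

# := modifies dt by reference; no new data.table is returned to reassign
dt[, x := x * 2]

addr_after <- address(dt)

same_table <- addr_before == addr_after
same_table  # TRUE
```

Note the trade-off: any other name bound to the same data.table sees the change too, which is the mutability that dplyr avoids for data frames.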