Efficient reshaping using data.tables

2017-01-31

This vignette discusses the default usage of reshaping functions melt (wide to long) and dcast (long to wide) for data.tables as well as the new extended functionalities of melting and casting on multiple columns available from v1.9.6.


Data

We will load the data sets directly within sections.

Introduction

The melt and dcast functions for data.tables are extensions of the corresponding functions from the reshape2 package.

In this vignette, we will

  1. first briefly look at the default melting and casting of data.tables to convert them from wide to long format and vice versa,

  2. then look at scenarios where the current functionalities becomes cumbersome and inefficient,

  3. and finally look at the new improvements to both melt and dcast methods for data.tables to handle multiple columns simultaneously.

The extended functionalities are in line with data.table’s philosophy of performing operations efficiently and in a straightforward manner.

Note:

From v1.9.6 on, you don’t have to load reshape2 package to use these functions for data.tables. You just need to load data.table. If you’ve to load reshape2 for melting or casting matrices and/or data.frames, then make sure to load it before loading data.table.

1. Default functionality

a) melting data.tables (wide to long)

Suppose we have a data.table (artificial data) as shown below:

DT = fread("melt_default.csv")
DT
#    family_id age_mother dob_child1 dob_child2 dob_child3
# 1:         1         30 1998-11-26 2000-01-29         NA
# 2:         2         27 1996-06-22         NA         NA
# 3:         3         26 2002-07-11 2004-04-05 2007-09-02
# 4:         4         32 2004-10-10 2009-08-27 2012-07-21
# 5:         5         29 2000-12-05 2005-02-28         NA
## dob stands for date of birth.

str(DT)
# Classes 'data.table' and 'data.frame':    5 obs. of  5 variables:
#  $ family_id : int  1 2 3 4 5
#  $ age_mother: int  30 27 26 32 29
#  $ dob_child1: chr  "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
#  $ dob_child2: chr  "2000-01-29" NA "2004-04-05" "2009-08-27" ...
#  $ dob_child3: chr  NA NA "2007-09-02" "2012-07-21" ...
#  - attr(*, ".internal.selfref")=<externalptr>

- Convert DT to long form where each dob is a separate observation.

We could accomplish this using melt() by specifying id.vars and measure.vars arguments as follows:

DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"),
                measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
DT.m1
#     family_id age_mother   variable      value
#  1:         1         30 dob_child1 1998-11-26
#  2:         2         27 dob_child1 1996-06-22
#  3:         3         26 dob_child1 2002-07-11
#  4:         4         32 dob_child1 2004-10-10
#  5:         5         29 dob_child1 2000-12-05
#  6:         1         30 dob_child2 2000-01-29
#  7:         2         27 dob_child2         NA
#  8:         3         26 dob_child2 2004-04-05
#  9:         4         32 dob_child2 2009-08-27
# 10:         5         29 dob_child2 2005-02-28
# 11:         1         30 dob_child3         NA
# 12:         2         27 dob_child3         NA
# 13:         3         26 dob_child3 2007-09-02
# 14:         4         32 dob_child3 2012-07-21
# 15:         5         29 dob_child3         NA
str(DT.m1)
# Classes 'data.table' and 'data.frame':    15 obs. of  4 variables:
#  $ family_id : int  1 2 3 4 5 1 2 3 4 5 ...
#  $ age_mother: int  30 27 26 32 29 30 27 26 32 29 ...
#  $ variable  : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ...
#  $ value     : chr  "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
#  - attr(*, ".internal.selfref")=<externalptr>

- Name the variable and value columns to child and dob respectively

DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
               variable.name = "child", value.name = "dob")
DT.m1
#     family_id age_mother      child        dob
#  1:         1         30 dob_child1 1998-11-26
#  2:         2         27 dob_child1 1996-06-22
#  3:         3         26 dob_child1 2002-07-11
#  4:         4         32 dob_child1 2004-10-10
#  5:         5         29 dob_child1 2000-12-05
#  6:         1         30 dob_child2 2000-01-29
#  7:         2         27 dob_child2         NA
#  8:         3         26 dob_child2 2004-04-05
#  9:         4         32 dob_child2 2009-08-27
# 10:         5         29 dob_child2 2005-02-28
# 11:         1         30 dob_child3         NA
# 12:         2         27 dob_child3         NA
# 13:         3         26 dob_child3 2007-09-02
# 14:         4         32 dob_child3 2012-07-21
# 15:         5         29 dob_child3         NA

b) Casting data.tables (long to wide)

In the previous section, we saw how to get from wide form to long form. Let’s see the reverse operation in this section.

- How can we get back to the original data table DT from DT.m?

That is, we’d like to collect all child observations corresponding to each family_id, age_mother together under the same row. We can accomplish it using dcast as follows:

dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
#    family_id age_mother dob_child1 dob_child2 dob_child3
# 1:         1         30 1998-11-26 2000-01-29         NA
# 2:         2         27 1996-06-22         NA         NA
# 3:         3         26 2002-07-11 2004-04-05 2007-09-02
# 4:         4         32 2004-10-10 2009-08-27 2012-07-21
# 5:         5         29 2000-12-05 2005-02-28         NA

  • dcast uses formula interface. The variables on the LHS of formula represents the id vars and RHS the measure vars.

  • value.var denotes the column to be filled in with while casting to wide format.

  • dcast also tries to preserve attributes in result wherever possible.

- Starting from DT.m, how can we get the number of children in each family?

You can also pass a function to aggregate by in dcast with the argument fun.aggregate. This is particularly essential when the formula provided does not identify single observation for each cell.

dcast(DT.m1, family_id ~ ., fun.agg = function(x) sum(!is.na(x)), value.var = "dob")
#    family_id .
# 1:         1 2
# 2:         2 1
# 3:         3 3
# 4:         4 3
# 5:         5 2

Check ?dcast for other useful arguments and additional examples.

2. Limitations in current melt/dcast approaches

So far we’ve seen features of melt and dcast that are based on reshape2 package, but implemented efficiently for data.tables, using internal data.table machinery (fast radix ordering, binary search etc..).

However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the data.table shown below:

DT = fread("melt_enhanced.csv")
DT
#    family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
# 1:         1         30 1998-11-26 2000-01-29         NA             1             2            NA
# 2:         2         27 1996-06-22         NA         NA             2            NA            NA
# 3:         3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
# 4:         4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
# 5:         5         29 2000-12-05 2005-02-28         NA             2             1            NA
## 1 = female, 2 = male

And you’d like to combine (melt) all the dob columns together, and gender columns together. Using the current functionality, we can do something like this:

DT.m1 = melt(DT, id = c("family_id", "age_mother"))
# Warning in melt.data.table(DT, id = c("family_id", "age_mother")): 'measure.vars' [dob_child1,
# dob_child2, dob_child3, gender_child1, ...] are not all of the same type. By order of hierarchy, the
# molten data value column will be of type 'character'. All measure variables not of type 'character'
# will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
#     family_id age_mother variable      value  child
#  1:         1         30      dob 1998-11-26 child1
#  2:         2         27      dob 1996-06-22 child1
#  3:         3         26      dob 2002-07-11 child1
#  4:         4         32      dob 2004-10-10 child1
#  5:         5         29      dob 2000-12-05 child1
#  6:         1         30      dob 2000-01-29 child2
#  7:         2         27      dob         NA child2
#  8:         3         26      dob 2004-04-05 child2
#  9:         4         32      dob 2009-08-27 child2
# 10:         5         29      dob 2005-02-28 child2
# 11:         1         30      dob         NA child3
# 12:         2         27      dob         NA child3
# 13:         3         26      dob 2007-09-02 child3
# 14:         4         32      dob 2012-07-21 child3
# 15:         5         29      dob         NA child3
# 16:         1         30   gender          1 child1
# 17:         2         27   gender          2 child1
# 18:         3         26   gender          2 child1
# 19:         4         32   gender          1 child1
# 20:         5         29   gender          2 child1
# 21:         1         30   gender          2 child2
# 22:         2         27   gender         NA child2
# 23:         3         26   gender          2 child2
# 24:         4         32   gender          1 child2
# 25:         5         29   gender          1 child2
# 26:         1         30   gender         NA child3
# 27:         2         27   gender         NA child3
# 28:         3         26   gender          1 child3
# 29:         4         32   gender          1 child3
# 30:         5         29   gender         NA child3
#     family_id age_mother variable      value  child
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1
#     family_id age_mother  child        dob gender
#  1:         1         30 child1 1998-11-26      1
#  2:         1         30 child2 2000-01-29      2
#  3:         1         30 child3         NA     NA
#  4:         2         27 child1 1996-06-22      2
#  5:         2         27 child2         NA     NA
#  6:         2         27 child3         NA     NA
#  7:         3         26 child1 2002-07-11      2
#  8:         3         26 child2 2004-04-05      2
#  9:         3         26 child3 2007-09-02      1
# 10:         4         32 child1 2004-10-10      1
# 11:         4         32 child2 2009-08-27      1
# 12:         4         32 child3 2012-07-21      1
# 13:         5         29 child1 2000-12-05      2
# 14:         5         29 child2 2005-02-28      1
# 15:         5         29 child3         NA     NA

str(DT.c1) ## gender column is character type now!
# Classes 'data.table' and 'data.frame':    15 obs. of  5 variables:
#  $ family_id : int  1 1 1 2 2 2 3 3 3 4 ...
#  $ age_mother: int  30 30 30 27 27 27 26 26 26 32 ...
#  $ child     : chr  "child1" "child2" "child3" "child1" ...
#  $ dob       : chr  "1998-11-26" "2000-01-29" NA "1996-06-22" ...
#  $ gender    : chr  "1" "2" NA "2" ...
#  - attr(*, ".internal.selfref")=<externalptr> 
#  - attr(*, "sorted")= chr  "family_id" "age_mother" "child"

Issues

  1. What we wanted to do was to combine all the dob and gender type columns together respectively. Instead we are combining everything together, and then splitting them again. I think it’s easy to see that it’s quite roundabout (and inefficient).

    As an analogy, imagine you’ve a closet with four shelves of clothes and you’d like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!

  2. The columns to melt may be of different types, as in this case (character and integer types). By melting them all together, the columns will be coerced in result, as explained by the warning message above and shown from output of str(DT.c1), where gender has been converted to character type.

  3. We are generating an additional column by splitting the variable column into two columns, whose purpose is quite cryptic. We do it because we need it for casting in the next step.

  4. Finally, we cast the data set. But the issue is it’s a much more computationally involved operation than melt. Specifically, it requires computing the order of the variables in formula, and that’s costly.

In fact, base::reshape is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!

3. Enhanced (new) functionality

a) Enhanced melt

Since we’d like for data.tables to perform this operation straightforward and efficient using the same interface, we went ahead and implemented an additional functionality, where we can melt to multiple columns simultaneously.

- melt multiple columns simultaneously

The idea is quite simple. We pass a list of columns to measure.vars, where each element of the list contains the columns that should be combined together.

colA = paste("dob_child", 1:3, sep = "")
colB = paste("gender_child", 1:3, sep = "")
DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
DT.m2
#     family_id age_mother variable        dob gender
#  1:         1         30        1 1998-11-26      1
#  2:         2         27        1 1996-06-22      2
#  3:         3         26        1 2002-07-11      2
#  4:         4         32        1 2004-10-10      1
#  5:         5         29        1 2000-12-05      2
#  6:         1         30        2 2000-01-29      2
#  7:         2         27        2         NA     NA
#  8:         3         26        2 2004-04-05      2
#  9:         4         32        2 2009-08-27      1
# 10:         5         29        2 2005-02-28      1
# 11:         1         30        3         NA     NA
# 12:         2         27        3         NA     NA
# 13:         3         26        3 2007-09-02      1
# 14:         4         32        3 2012-07-21      1
# 15:         5         29        3         NA     NA

str(DT.m2) ## col type is preserved
# Classes 'data.table' and 'data.frame':    15 obs. of  5 variables:
#  $ family_id : int  1 2 3 4 5 1 2 3 4 5 ...
#  $ age_mother: int  30 27 26 32 29 30 27 26 32 29 ...
#  $ variable  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
#  $ dob       : chr  "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
#  $ gender    : int  1 2 2 1 2 2 NA 2 1 1 ...
#  - attr(*, ".internal.selfref")=<externalptr>

- Using patterns()

Usually in these problems, the columns we’d like to melt can be distinguished by a common pattern. We can use the function patterns(), implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:

DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
#     family_id age_mother variable        dob gender
#  1:         1         30        1 1998-11-26      1
#  2:         2         27        1 1996-06-22      2
#  3:         3         26        1 2002-07-11      2
#  4:         4         32        1 2004-10-10      1
#  5:         5         29        1 2000-12-05      2
#  6:         1         30        2 2000-01-29      2
#  7:         2         27        2         NA     NA
#  8:         3         26        2 2004-04-05      2
#  9:         4         32        2 2009-08-27      1
# 10:         5         29        2 2005-02-28      1
# 11:         1         30        3         NA     NA
# 12:         2         27        3         NA     NA
# 13:         3         26        3 2007-09-02      1
# 14:         4         32        3 2012-07-21      1
# 15:         5         29        3         NA     NA

That’s it!

  • We can remove the variable column if necessary.

  • The functionality is implemented entirely in C, and is therefore both fast and memory efficient in addition to being straightforward.

b) Enhanced dcast

Okay great! We can now melt into multiple columns simultaneously. Now given the data set DT.m2 as shown above, how can we get back to the same format as the original data we started with?

If we use the current functionality of dcast, then we’d have to cast twice and bind the results together. But that’s once again verbose, not straightforward and is also inefficient.

- Casting multiple value.vars simultaneously

We can now provide multiple value.var columns to dcast for data.tables directly so that the operations are taken care of internally and efficiently.

## new 'cast' functionality - multiple value.vars
DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
#    family_id age_mother      dob_1      dob_2      dob_3 gender_1 gender_2 gender_3
# 1:         1         30 1998-11-26 2000-01-29         NA        1        2       NA
# 2:         2         27 1996-06-22         NA         NA        2       NA       NA
# 3:         3         26 2002-07-11 2004-04-05 2007-09-02        2        2        1
# 4:         4         32 2004-10-10 2009-08-27 2012-07-21        1        1        1
# 5:         5         29 2000-12-05 2005-02-28         NA        2        1       NA

  • Attributes are preserved in result wherever possible.

  • Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.

Multiple functions to fun.aggregate:

You can also provide multiple functions to fun.aggregate to dcast for data.tables. Check the examples in ?dcast which illustrates this functionality.