Capture melt

This vignette explains how to use functions for “melting” wide data tables, i.e. converting to tall data tables:

Both are useful mainly in the case when you want to use a regular expression to specify both (1) the set of input columns to reshape and (2) some information to extract from those column names. To clarify the discussion we introduce the following three terms:

Capture data.frame column names and melt matching columns

Sometimes you want to melt a “wide” data table which has several distinct pieces of information encoded in each column name. One example is the familiar iris data, which have flower part and measurement dimension encoded in each of four column names:

library(data.table)
(iris.dt <- data.table(i=1:nrow(iris), iris[,1:4], Species=paste(iris$Species)))
#>        i Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:   1          5.1         3.5          1.4         0.2    setosa
#>   2:   2          4.9         3.0          1.4         0.2    setosa
#>   3:   3          4.7         3.2          1.3         0.2    setosa
#>   4:   4          4.6         3.1          1.5         0.2    setosa
#>   5:   5          5.0         3.6          1.4         0.2    setosa
#>  ---                                                                
#> 146: 146          6.7         3.0          5.2         2.3 virginica
#> 147: 147          6.3         2.5          5.0         1.9 virginica
#> 148: 148          6.5         3.0          5.2         2.0 virginica
#> 149: 149          6.2         3.4          5.4         2.3 virginica
#> 150: 150          5.9         3.0          5.1         1.8 virginica

Those four reshape column names can be specified via a regex in nc::capture_melt_single. The first argument is the input data table to reshape, and the subsequent arguments are interpreted as a pattern which is passed to nc::capture_first_vec. Any input column names which match the specified regex will be passed as measure.vars to melt:

(iris.tall <- nc::capture_melt_single(
  iris.dt,
  part=".*",
  "[.]",
  dim=".*"))
#>        i   Species  part    dim value
#>   1:   1    setosa Sepal Length   5.1
#>   2:   2    setosa Sepal Length   4.9
#>   3:   3    setosa Sepal Length   4.7
#>   4:   4    setosa Sepal Length   4.6
#>   5:   5    setosa Sepal Length   5.0
#>  ---                                 
#> 596: 146 virginica Petal  Width   2.3
#> 597: 147 virginica Petal  Width   1.9
#> 598: 148 virginica Petal  Width   2.0
#> 599: 149 virginica Petal  Width   2.3
#> 600: 150 virginica Petal  Width   1.8

The output comes from joining the result of nc::capture_first_vec to the result of melt. Note the output has two copy columns (i, Species), one reshape column (value), and two capture columns (part, dim).

After the data have been converted to the tall table above, we can do a dcast to get a wider data table:

(iris.part.cols <- dcast(
  iris.tall,
  i + Species + dim ~ part))
#>        i   Species    dim Petal Sepal
#>   1:   1    setosa Length   1.4   5.1
#>   2:   1    setosa  Width   0.2   3.5
#>   3:   2    setosa Length   1.4   4.9
#>   4:   2    setosa  Width   0.2   3.0
#>   5:   3    setosa Length   1.3   4.7
#>  ---                                 
#> 296: 148 virginica  Width   2.0   3.0
#> 297: 149 virginica Length   5.4   6.2
#> 298: 149 virginica  Width   2.3   3.4
#> 299: 150 virginica Length   5.1   5.9
#> 300: 150 virginica  Width   1.8   3.0

Rather than use capture_melt_single and then dcast, we could instead use capture_melt_multiple to get the same result:

nc::capture_melt_multiple(
  iris.dt,
  column=".*",
  "[.]",
  dim=".*")
#>        i   Species    dim Petal Sepal
#>   1:   1    setosa Length   1.4   5.1
#>   2:   2    setosa Length   1.4   4.9
#>   3:   3    setosa Length   1.3   4.7
#>   4:   4    setosa Length   1.5   4.6
#>   5:   5    setosa Length   1.4   5.0
#>  ---                                 
#> 296: 146 virginica  Width   2.3   3.0
#> 297: 147 virginica  Width   1.9   2.5
#> 298: 148 virginica  Width   2.0   3.0
#> 299: 149 virginica  Width   2.3   3.4
#> 300: 150 virginica  Width   1.8   3.0

Note that the reshaped table above contains two copy columns (i, Species), two reshape columns (Sepal, Petal), and one capture columns (dim). These can help us visualize whether or not sepals are bigger than petals:

library(ggplot2)
ggplot()+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  facet_grid(dim ~ Species)+
  coord_equal()+
  geom_abline(slope=1, intercept=0, color="grey")+
  geom_point(aes(
    Petal, Sepal),
    data=iris.part.cols)

plot of chunk unnamed-chunk-5

It is clear from the plot above that sepals are indeed both longer and wider than petals, on each measured plant.

Another data set where it is useful to do column name pattern matching followed by melting is the World Health Organization data:

data(who, package="tidyr")
names(who)
#>  [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
#>  [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
#> [11] "new_sp_m65"   "new_sp_f014"  "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
#> [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
#> [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"  
#> [26] "new_sn_f014"  "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
#> [31] "new_sn_f5564" "new_sn_f65"   "new_ep_m014"  "new_ep_m1524" "new_ep_m2534"
#> [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
#> [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
#> [46] "new_ep_f65"   "newrel_m014"  "newrel_m1524" "newrel_m2534" "newrel_m3544"
#> [51] "newrel_m4554" "newrel_m5564" "newrel_m65"   "newrel_f014"  "newrel_f1524"
#> [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

Each column which starts with new has three distinct pieces of information encoded in its name: diagnosis type (e.g. sp or rel), gender (m or f), and age range (e.g. 5564 or 1524). We would like to use a regex to match these column names, then using the matching columns as measure.vars in a melt, then join the two results.

new.diag.gender <- list(
  "new_?",
  diagnosis=".*",
  "_",
  gender=".")
nc::capture_melt_single(who, new.diag.gender, ages=".*")
#>                          country iso2 iso3 year diagnosis gender ages value
#>     1:               Afghanistan   AF  AFG 1997        sp      m  014     0
#>     2:               Afghanistan   AF  AFG 1998        sp      m  014    30
#>     3:               Afghanistan   AF  AFG 1999        sp      m  014     8
#>     4:               Afghanistan   AF  AFG 2000        sp      m  014    52
#>     5:               Afghanistan   AF  AFG 2001        sp      m  014   129
#>    ---                                                                     
#> 76042:                  Viet Nam   VN  VNM 2013       rel      f   65  3110
#> 76043: Wallis and Futuna Islands   WF  WLF 2013       rel      f   65     2
#> 76044:                     Yemen   YE  YEM 2013       rel      f   65   360
#> 76045:                    Zambia   ZM  ZMB 2013       rel      f   65   669
#> 76046:                  Zimbabwe   ZW  ZWE 2013       rel      f   65   725

Note the output includes the new reshape column called value by default, as in melt. The input reshape column names which matched the specified pattern, and there is a new column for each group in that pattern. The following example shows how to rename the value column and use numeric type conversion functions:

years.pattern <- list(new.diag.gender, ages=list(
  min.years="0|[0-9]{2}", as.numeric,
  max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
(who.typed <- nc::capture_melt_single(
  who, years.pattern,
  value.name="count"))
#>                          country iso2 iso3 year diagnosis gender ages min.years
#>     1:               Afghanistan   AF  AFG 1997        sp      m  014         0
#>     2:               Afghanistan   AF  AFG 1998        sp      m  014         0
#>     3:               Afghanistan   AF  AFG 1999        sp      m  014         0
#>     4:               Afghanistan   AF  AFG 2000        sp      m  014         0
#>     5:               Afghanistan   AF  AFG 2001        sp      m  014         0
#>    ---                                                                         
#> 76042:                  Viet Nam   VN  VNM 2013       rel      f   65        65
#> 76043: Wallis and Futuna Islands   WF  WLF 2013       rel      f   65        65
#> 76044:                     Yemen   YE  YEM 2013       rel      f   65        65
#> 76045:                    Zambia   ZM  ZMB 2013       rel      f   65        65
#> 76046:                  Zimbabwe   ZW  ZWE 2013       rel      f   65        65
#>        max.years count
#>     1:        14     0
#>     2:        14    30
#>     3:        14     8
#>     4:        14    52
#>     5:        14   129
#>    ---                
#> 76042:       Inf  3110
#> 76043:       Inf     2
#> 76044:       Inf   360
#> 76045:       Inf   669
#> 76046:       Inf   725
str(who.typed)
#> Classes 'data.table' and 'data.frame':   76046 obs. of  10 variables:
#>  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ iso2     : chr  "AF" "AF" "AF" "AF" ...
#>  $ iso3     : chr  "AFG" "AFG" "AFG" "AFG" ...
#>  $ year     : int  1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
#>  $ diagnosis: chr  "sp" "sp" "sp" "sp" ...
#>  $ gender   : chr  "m" "m" "m" "m" ...
#>  $ ages     : chr  "014" "014" "014" "014" ...
#>  $ min.years: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ max.years: num  14 14 14 14 14 14 14 14 14 14 ...
#>  $ count    : int  0 30 8 52 129 90 127 139 151 193 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

The result above shows that

Comparison with tidyr

The nc::capture_melt_single function was inspired by the tidyr package, which can accomplish the same computation via:

tidyr::pivot_longer(
  who, new_sp_m014:newrel_f65,
  names_to=c("diagnosis", "gender", "ages"),
  names_pattern="new_?(.*)_(.)(.*)")
#> # A tibble: 405,440 x 8
#>    country     iso2  iso3   year diagnosis gender ages  value
#>    <chr>       <chr> <chr> <int> <chr>     <chr>  <chr> <int>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # ... with 405,430 more rows

The syntax above is somewhat repetitive, since it requires specifying columns to melt as the second argument, and it also requires names_pattern which could be used to identify the columns to melt, e.g.

names.pattern <- "new_?(.*)_(.)(.*)"
tidyr::pivot_longer(
  who, grep(names.pattern, names(who)),
  names_to=c("diagnosis", "gender", "ages"),
  names_pattern=names.pattern)
#> # A tibble: 405,440 x 8
#>    country     iso2  iso3   year diagnosis gender ages  value
#>    <chr>       <chr> <chr> <int> <chr>     <chr>  <chr> <int>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # ... with 405,430 more rows

To accomplish the type conversion using tidyr we need to do a post-processing step, which requires additional repetition of the column names we want to convert:

(years.pattern.str <- nc::var_args_list(years.pattern)$pattern)
#> [1] "(?:(?:new_?(.*)_(.))((0|[0-9]{2})([0-9]{0,2})))"
(tidyr.nona <- tidyr::pivot_longer(
  who,
  grep(years.pattern.str, names(who)),
  names_to=c("diagnosis", "gender", "ages", "ymin", "ymax"),
  values_drop_na=TRUE,
  names_pattern=years.pattern.str))
#> # A tibble: 76,046 x 10
#>    country     iso2  iso3   year diagnosis gender ages  ymin  ymax  value
#>    <chr>       <chr> <chr> <int> <chr>     <chr>  <chr> <chr> <chr> <int>
#>  1 Afghanistan AF    AFG    1997 sp        m      014   0     "14"      0
#>  2 Afghanistan AF    AFG    1997 sp        m      1524  15    "24"     10
#>  3 Afghanistan AF    AFG    1997 sp        m      2534  25    "34"      6
#>  4 Afghanistan AF    AFG    1997 sp        m      3544  35    "44"      3
#>  5 Afghanistan AF    AFG    1997 sp        m      4554  45    "54"      5
#>  6 Afghanistan AF    AFG    1997 sp        m      5564  55    "64"      2
#>  7 Afghanistan AF    AFG    1997 sp        m      65    65    ""        0
#>  8 Afghanistan AF    AFG    1997 sp        f      014   0     "14"      5
#>  9 Afghanistan AF    AFG    1997 sp        f      1524  15    "24"     38
#> 10 Afghanistan AF    AFG    1997 sp        f      2534  25    "34"     36
#> # ... with 76,036 more rows
tidyr.typed <- transform(
  tidyr.nona,
  min.years=as.numeric(ymin),
  max.years=ifelse(is.na(ymax), Inf, as.numeric(ymax)))
str(tidyr.typed)
#> 'data.frame':    76046 obs. of  12 variables:
#>  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ iso2     : chr  "AF" "AF" "AF" "AF" ...
#>  $ iso3     : chr  "AFG" "AFG" "AFG" "AFG" ...
#>  $ year     : int  1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
#>  $ diagnosis: chr  "sp" "sp" "sp" "sp" ...
#>  $ gender   : chr  "m" "m" "m" "m" ...
#>  $ ages     : chr  "014" "1524" "2534" "3544" ...
#>  $ ymin     : chr  "0" "15" "25" "35" ...
#>  $ ymax     : chr  "14" "24" "34" "44" ...
#>  $ value    : int  0 10 6 3 5 2 0 5 38 36 ...
#>  $ min.years: num  0 15 25 35 45 55 65 0 15 25 ...
#>  $ max.years: num  14 24 34 44 54 64 NA 14 24 34 ...

In conclusion nc::capture_melt_single provides an alternative which (1) makes it easier to define complex patterns (2) supports type conversion without a post-processing step, and (3) reduces repetition in user code.

Melting columns of different types into multiple result columns

In the previous sections we discussed methods for melting wide data columns that all have the same type. Sometimes wide data have columns of different types that we want to melt. The example below is taken from a data.table vignette,

family.dt <- fread(text="
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1         30 1998-11-26 2000-01-29         NA             1             2            NA
2         27 1996-06-22         NA         NA             2            NA            NA
3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
5         29 2000-12-05 2005-02-28         NA             2             1            NA")

The data table above has one row for each family, and two columns for each child. One column is the date of birth (character), the other is the gender (coded as an integer). We can use standard data.table functions to melt these data into a tall table with one row for each child:

melt(family.dt, measure.vars=patterns(
  dob="^dob", gender="^gender"
))
#>     family_id age_mother variable        dob gender
#>  1:         1         30        1 1998-11-26      1
#>  2:         2         27        1 1996-06-22      2
#>  3:         3         26        1 2002-07-11      2
#>  4:         4         32        1 2004-10-10      1
#>  5:         5         29        1 2000-12-05      2
#>  6:         1         30        2 2000-01-29      2
#>  7:         2         27        2       <NA>     NA
#>  8:         3         26        2 2004-04-05      2
#>  9:         4         32        2 2009-08-27      1
#> 10:         5         29        2 2005-02-28      1
#> 11:         1         30        3       <NA>     NA
#> 12:         2         27        3       <NA>     NA
#> 13:         3         26        3 2007-09-02      1
#> 14:         4         32        3 2012-07-21      1
#> 15:         5         29        3       <NA>     NA

The melt above results in a table with one row for each child, and one column for each argument of patterns. However the code is somewhat repetitive, as dob and gender must be repeated. To avoid this repetition we can instead use:

(children.dt <- nc::capture_melt_multiple(
  family.dt,
  column=".*",
  "_",
  nc::field("child", "", "[1-3]", as.integer),
  na.rm=TRUE))
#>     family_id age_mother child        dob gender
#>  1:         1         30     1 1998-11-26      1
#>  2:         2         27     1 1996-06-22      2
#>  3:         3         26     1 2002-07-11      2
#>  4:         4         32     1 2004-10-10      1
#>  5:         5         29     1 2000-12-05      2
#>  6:         1         30     2 2000-01-29      2
#>  7:         3         26     2 2004-04-05      2
#>  8:         4         32     2 2009-08-27      1
#>  9:         5         29     2 2005-02-28      1
#> 10:         3         26     3 2007-09-02      1
#> 11:         4         32     3 2012-07-21      1
str(children.dt)
#> Classes 'data.table' and 'data.frame':   11 obs. of  5 variables:
#>  $ family_id : int  1 2 3 4 5 1 3 4 5 3 ...
#>  $ age_mother: int  30 27 26 32 29 30 26 32 29 26 ...
#>  $ child     : int  1 1 1 1 1 2 2 2 2 3 ...
#>  $ dob       : chr  "1998-11-26" "1996-06-22" "2002-07-11" "2004-10-10" ...
#>  $ gender    : int  1 2 2 1 2 2 2 1 1 1 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

The result above is similar to the previous result (and in fact melt.data.table is used internally), but the code is less repetitive. The first argument of capture_melt_multiple is the subject data table and the following arguments form a pattern which is matched to the input data column names. The pattern must have at least two groups:

Another example is the iris data set which usually has one 5-dimensional observation per row (four numeric measures, and one character Species). Below we create a shuffled version of the iris data with two observations per row:

set.seed(1)
iris.rand <- iris.dt[sample(.N)]
iris.wide <- cbind(treatment=iris.rand[1:75], control=iris.rand[76:150])
print(iris.wide, topn=2, nrows=10)
#>     treatment.i treatment.Sepal.Length treatment.Sepal.Width
#>  1:          68                    5.8                   2.7
#>  2:         129                    6.4                   2.8
#> ---                                                         
#> 74:          91                    5.5                   2.6
#> 75:          64                    6.1                   2.9
#>     treatment.Petal.Length treatment.Petal.Width treatment.Species control.i
#>  1:                    4.1                   1.0        versicolor        60
#>  2:                    5.6                   2.1         virginica       113
#> ---                                                                         
#> 74:                    4.4                   1.2        versicolor        57
#> 75:                    4.7                   1.4        versicolor        72
#>     control.Sepal.Length control.Sepal.Width control.Petal.Length
#>  1:                  5.2                 2.7                  3.9
#>  2:                  6.8                 3.0                  5.5
#> ---                                                              
#> 74:                  6.3                 3.3                  4.7
#> 75:                  6.1                 2.8                  4.0
#>     control.Petal.Width control.Species
#>  1:                 1.4      versicolor
#>  2:                 2.1       virginica
#> ---                                    
#> 74:                 1.6      versicolor
#> 75:                 1.3      versicolor

Below we show the usual data.table syntax for getting the original iris back.

iris.melted <- melt(iris.wide, measure.vars = patterns(
  i="i$",
  Sepal.Length="Sepal.Length$",
  Sepal.Width="Sepal.Width$",
  Petal.Length="Petal.Length$",
  Petal.Width="Petal.Width$",
  Species="Species$"))
identical(iris.melted[order(i), names(iris.dt), with=FALSE], iris.dt)
#> [1] TRUE

We can do the same thing with less repetition via:

(nc.melted <- nc::capture_melt_multiple(
  iris.wide,
  group=".*?",
  "[.]",
  column=".*"))
#>          group Petal.Length Petal.Width Sepal.Length Sepal.Width    Species   i
#>   1:   control          3.9         1.4          5.2         2.7 versicolor  60
#>   2:   control          5.5         2.1          6.8         3.0  virginica 113
#>   3:   control          5.6         1.4          6.1         2.6  virginica 135
#>   4:   control          1.5         0.1          4.9         3.1     setosa  10
#>   5:   control          1.4         0.2          5.1         3.5     setosa   1
#>  ---                                                                           
#> 146: treatment          1.6         0.2          4.8         3.1     setosa  31
#> 147: treatment          1.3         0.4          5.4         3.9     setosa  17
#> 148: treatment          5.4         2.1          6.9         3.1  virginica 140
#> 149: treatment          4.4         1.2          5.5         2.6 versicolor  91
#> 150: treatment          4.7         1.4          6.1         2.9 versicolor  64
identical(nc.melted[order(i), names(iris.dt), with=FALSE], iris.dt)
#> [1] TRUE

To conclude this section, we have seen that a simple non-repetitive regex syntax can be used with nc::capture_melt_multiple to melt wide data into several columns of different types.

Comparison with tidyr syntax

The syntax below can be used to get the tall data table of children:

tidyr::pivot_longer(
  family.dt,
  cols=dob_child1:gender_child3,
  names_pattern="(.*)_child([1-3])",
  names_to=c(".value", "child.str"),
  values_drop_na=TRUE)
#> # A tibble: 11 x 5
#>    family_id age_mother child.str dob        gender
#>        <int>      <int> <chr>     <chr>       <int>
#>  1         1         30 1         1998-11-26      1
#>  2         1         30 2         2000-01-29      2
#>  3         2         27 1         1996-06-22      2
#>  4         3         26 1         2002-07-11      2
#>  5         3         26 2         2004-04-05      2
#>  6         3         26 3         2007-09-02      1
#>  7         4         32 1         2004-10-10      1
#>  8         4         32 2         2009-08-27      1
#>  9         4         32 3         2012-07-21      1
#> 10         5         29 1         2000-12-05      2
#> 11         5         29 2         2005-02-28      1

However the code is somewhat repetitive, as the columns to melt must be specified in two arguments: cols and names_pattern. A workaround is to use a pre-defined pattern for both arguments:

child.pat.str <- "(.*)_child([1-3])"
tidyr.children <- tidyr::pivot_longer(
  family.dt,
  cols=grep(child.pat.str, names(family.dt)),
  names_pattern=child.pat.str,
  names_to=c(".value", "child.str"),
  values_drop_na=TRUE)
(tidyr.converted <- transform(
  tidyr.children,
  child=as.integer(child.str)))
#>    family_id age_mother child.str        dob gender child
#> 1          1         30         1 1998-11-26      1     1
#> 2          1         30         2 2000-01-29      2     2
#> 3          2         27         1 1996-06-22      2     1
#> 4          3         26         1 2002-07-11      2     1
#> 5          3         26         2 2004-04-05      2     2
#> 6          3         26         3 2007-09-02      1     3
#> 7          4         32         1 2004-10-10      1     1
#> 8          4         32         2 2009-08-27      1     2
#> 9          4         32         3 2012-07-21      1     3
#> 10         5         29         1 2000-12-05      2     1
#> 11         5         29         2 2005-02-28      1     2
str(tidyr.converted)
#> 'data.frame':    11 obs. of  6 variables:
#>  $ family_id : int  1 1 2 3 3 3 4 4 4 5 ...
#>  $ age_mother: int  30 30 27 26 26 26 32 32 32 29 ...
#>  $ child.str : chr  "1" "2" "1" "1" ...
#>  $ dob       : chr  "1998-11-26" "2000-01-29" "1996-06-22" "2002-07-11" ...
#>  $ gender    : int  1 2 2 2 2 1 1 1 1 2 ...
#>  $ child     : int  1 2 1 1 2 3 1 2 3 1 ...

However there is still some repetition: child appears in both the pattern string and the names_to argument. That repetition is avoided if nc::capture_melt_multiple is used with nc::field.

Another difference is that nc allows specification of type conversion functions in the pattern, which results in an integer child column. In contrast with tidyr output columns such as child.str which are created using a regex capture group are always character, so type conversions must be specified as a post-processing step.

To conclude this section, we have seen that tidyr::pivot_longer uses a different syntax to perform a similar computation to nc::capture_melt_multiple. Two differences are that nc syntax is less repetitive, and allows specifying type conversion functions in the pattern, which avoids the need for a post-processing step.