Variable argument syntax

Toby Dylan Hocking

2019-02-25

This is the second vignette; we assume you have already read the “three argument syntax” vignette, which covers the most basic namedCapture functions, str_match_named and str_match_all_named. Here we introduce the syntax used in the namedCapture::*_variable functions, which is motivated by the desire to avoid repetitive/boilerplate code. In the previous vignette we used the following code to extract the first match from each subject:

subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]*)",
  ")?")
namedCapture::str_match_named(subject.vec, chr.pos.pattern)
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"

Note that the pattern above is defined using paste0, which is boilerplate used to break the pattern over several lines for clarity. Using the variable argument syntax, we can omit paste0, and simply supply the pattern strings to str_match_variable directly:

namedCapture::str_match_variable(
  subject.vec, 
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"

We can further simplify by removing the named capture groups from the strings, and adding names to the corresponding arguments. For name1="pattern1", namedCapture internally generates/uses the regex (?P<name1>pattern1).

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+",
  "(?:",
    "-",
    chromEnd="[0-9,]+",
  ")?")
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"
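To make the correspondence between name1="pattern1" and (?P<name1>pattern1) concrete, here is a minimal base R sketch that builds the same regex by hand and matches it with regexpr(perl=TRUE). The named_group helper is hypothetical (it is not a namedCapture function), and the sketch only illustrates the generated pattern, not how the package is implemented.

```r
# Hypothetical helper, not part of namedCapture:
# wrap a pattern string in a named capture group.
named_group <- function(name, pattern) {
  paste0("(?P<", name, ">", pattern, ")")
}
pattern <- paste0(
  named_group("chrom", "chr.*?"),
  ":",
  named_group("chromStart", "[0-9,]+"),
  "(?:",
    "-",
    named_group("chromEnd", "[0-9,]+"),
  ")?")
subject <- "chr10:213,054,000-213,055,000"
# With perl=TRUE, regexpr reports the position of each capture group
# in the capture.start/capture.length/capture.names attributes.
m <- regexpr(pattern, subject, perl=TRUE)
first <- as.vector(attr(m, "capture.start"))
last <- first + as.vector(attr(m, "capture.length")) - 1L
captured <- substring(subject, first, last)
names(captured) <- attr(m, "capture.names")
print(captured)
```

The names of the captured vector are the group names, which is the information that str_match_variable uses for the column names of its result.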

We can add type conversion functions on the same line as the definition of the named group:

keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+", keep.digits,
  "(?:",
    "-",
    chromEnd="[0-9,]+", keep.digits,
  ")?")
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111
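The type conversion function is an ordinary R function of a character vector, so it can be checked on its own. The small sketch below applies keep.digits to some of the strings captured in the earlier examples; it also shows why the empty chromEnd for chrM becomes NA after conversion.

```r
# Same definition as above: drop everything except digits, then convert.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
# Captured strings as they appear in the character matrix output:
# two positions, an empty optional group, and a missing value (no match).
captured <- c("213,054,000", "111,000", "", NA)
converted <- keep.digits(captured)
print(converted)
# The empty string and the NA are both converted to NA.
```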

Note the repetition in the chromStart/chromEnd lines: the same pattern and type conversion function are used for each group. This repetition can be avoided by creating and using a sub-pattern list variable:

pos.pattern <- list("[0-9,]+", keep.digits)
namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  "(?:",
    "-",
    chromEnd=pos.pattern,
  ")?")
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111

Finally, the non-capturing group can be replaced by an un-named list:

pos.pattern <- list("[0-9,]+", keep.digits)
namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?")
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111
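One way to think about the list syntax is as a recursive translation into a single regex string. The sketch below is a hypothetical illustration of that idea (the to_regex function is our own name, and this is not the namedCapture implementation): named arguments become named capture groups, un-named lists become non-capturing groups, and functions contribute nothing to the regex.

```r
# Hypothetical sketch, not the namedCapture implementation:
# translate a (possibly nested) pattern list into one regex string.
to_regex <- function(pattern.list) {
  arg.names <- names(pattern.list)
  if (is.null(arg.names)) arg.names <- rep("", length(pattern.list))
  pieces <- character()
  for (i in seq_along(pattern.list)) {
    arg <- pattern.list[[i]]
    if (is.function(arg)) next # type conversion functions are not regex.
    piece <- if (is.list(arg)) to_regex(arg) else arg
    if (arg.names[[i]] != "") {
      # named string or list => named capture group.
      piece <- paste0("(?P<", arg.names[[i]], ">", piece, ")")
    } else if (is.list(arg)) {
      # un-named list => non-capturing group.
      piece <- paste0("(?:", piece, ")")
    }
    pieces <- c(pieces, piece)
  }
  paste(pieces, collapse="")
}
pos.pattern <- list("[0-9,]+", as.integer)
generated <- to_regex(list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?"))
print(generated)
```

Under these assumptions the generated string is the same regex that was written by hand with paste0 at the start of this vignette (with + rather than * in the chromEnd group).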

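In summary, the str_match_variable function takes a variable number of arguments, and allows for a shorter, less repetitive, and thus more user-friendly syntax. The first argument is the subject, and the remaining arguments define the pattern: character strings are concatenated, a named argument becomes a named capture group, a function converts the type of the named group that precedes it, and a list defines a sub-pattern, which becomes a non-capturing group when it is un-named.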

Extract all patterns from a file

The variable argument syntax can also be used with str_match_all_variable, which is for the common case of extracting each match from a multi-line text file. In this section we demonstrate how to use str_match_all_variable to extract data.frames from a loosely structured text file.

trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="namedCapture")
trackDb.vec <- readLines(trackDb.txt.gz)

Some representative lines from that file are shown below.

cat(trackDb.vec[78:107], sep="\n")
#> track peaks_summary
#> type bigBed 5
#> shortLabel _model_peaks_summary
#> longLabel Regions with a peak in at least one sample
#> visibility pack
#> itemRgb off
#> spectrum on
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#> 
#> 
#>  track bcell_McGill0091
#>  parent bcell
#>  container multiWig
#>  type bigWig
#>  shortLabel bcell_McGill0091
#>  longLabel bcell | McGill0091
#>  graphType points
#>  aggregate transparentOverlay
#>  showSubtrackColorOnUi on
#>  maxHeightPixels 25:12:8
#>  visibility full
#>  autoScale on
#> 
#>   track bcell_McGill0091Coverage
#>   bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#>   shortLabel bcell_McGill0091Coverage
#>   longLabel bcell | McGill0091 | Coverage
#>   parent bcell_McGill0091
#>   type bigWig
#>   color 141,211,199

Each block of text begins with “track” and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:

fields.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")

Note that this function assumes that its first argument is a character vector with one element for each line in a file. Therefore the result contains no information about which subject element each match comes from (to get that, use str_match_all_named). The code above creates a character matrix (despite the fields.df name, because no type conversion functions were specified) with one row for each track block, with rownames taken from the track name (because of the capture group named name), and one fields column which contains the rest of the text in that block.

head(fields.df)
#>                        fields                                                                                                      
#> bcell                  "\nsuperTrack on show\nshortLabel bcell\nlongLabel bcell ChIP-seq samples"                                  
#> kidneyCancer           "\nsuperTrack on show\nshortLabel kidneyCancer\nlongLabel kidneyCancer ChIP-seq samples"                    
#> kidney                 "\nsuperTrack on show\nshortLabel kidney\nlongLabel kidney ChIP-seq samples"                                
#> leukemiaCD19CD10BCells "\nsuperTrack on show\nshortLabel leukemiaCD19CD10BCells\nlongLabel leukemiaCD19CD10BCells ChIP-seq samples"
#> monocyte               "\nsuperTrack on show\nshortLabel monocyte\nlongLabel monocyte ChIP-seq samples"                            
#> skeletalMuscleCtrl     "\nsuperTrack on show\nshortLabel skeletalMuscleCtrl\nlongLabel skeletalMuscleCtrl ChIP-seq samples"
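The block regex used above can be tried on a small example with base R only. This sketch assumes that the lines are joined with newlines before matching (which the \n in the pattern implies); the track names and fields in the lines vector are made up for illustration.

```r
# Made-up lines in the same loose format as trackDb.txt.
lines <- c(
  "track a", "type bigBed", "",
  "track b", "parent a", "type bigWig", "")
text <- paste(lines, collapse="\n")
# Hand-written equivalent of "track ", name=..., fields=..., "\n".
block.pattern <- paste0(
  "track ",
  "(?P<name>\\S+)",
  "(?P<fields>(?:\n[^\n]+)*)",
  "\n")
# gregexpr finds every match; capture.start has one row per match
# and one column per capture group.
m <- gregexpr(block.pattern, text, perl=TRUE)[[1]]
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
block.mat <- matrix(
  substring(text, first, last),
  nrow=nrow(first),
  dimnames=list(NULL, attr(m, "capture.names")))
print(block.mat)
```

Each row of block.mat corresponds to one track block, and the fields column starts with a newline, as in the output above.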

Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:

fields.list <- namedCapture::str_match_all_named(
  fields.df[, "fields"], paste0(
    "\\s+",
    "(?P<name>.*?)",
    " ",
    "(?P<value>[^\n]+)"))

Note that we used str_match_all_named, which outputs a list, in order to keep track of which match came from which subject. The result is a list of character matrices, one for each track block.

fields.list[12:14]
#> $peaks_summary
#>            value                                                                  
#> type       "bigBed 5"                                                             
#> shortLabel "_model_peaks_summary"                                                 
#> longLabel  "Regions with a peak in at least one sample"                           
#> visibility "pack"                                                                 
#> itemRgb    "off"                                                                  
#> spectrum   "on"                                                                   
#> bigDataUrl "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed"
#> 
#> $bcell_McGill0091
#>                       value               
#> parent                "bcell"             
#> container             "multiWig"          
#> type                  "bigWig"            
#> shortLabel            "bcell_McGill0091"  
#> longLabel             "bcell | McGill0091"
#> graphType             "points"            
#> aggregate             "transparentOverlay"
#> showSubtrackColorOnUi "on"                
#> maxHeightPixels       "25:12:8"           
#> visibility            "full"              
#> autoScale             "on"                
#> 
#> $bcell_McGill0091Coverage
#>            value                                                                                      
#> bigDataUrl "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig"
#> shortLabel "bcell_McGill0091Coverage"                                                                 
#> longLabel  "bcell | McGill0091 | Coverage"                                                            
#> parent     "bcell_McGill0091"                                                                         
#> type       "bigWig"                                                                                   
#> color      "141,211,199"
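To see how the field regex splits one block into names and values, the following base R sketch applies the same pattern to a single fields string (copied in shortened form from the bcell_McGill0091 block). It is only an illustration with gregexpr, not what str_match_all_named does internally.

```r
# One fields string, as captured by the block regex (shortened).
fields <- "\nparent bcell\nlongLabel bcell | McGill0091"
field.pattern <- paste0(
  "\\s+",
  "(?P<name>.*?)", # lazy, so the name stops at the first space.
  " ",
  "(?P<value>[^\n]+)") # the value is the rest of the line.
m <- gregexpr(field.pattern, fields, perl=TRUE)[[1]]
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
field.mat <- matrix(
  substring(fields, first, last),
  nrow=nrow(first),
  dimnames=list(NULL, attr(m, "capture.names")))
# Use the name column to label the values, for lookup by field name.
value.vec <- structure(field.mat[, "value"], names=field.mat[, "name"])
print(value.vec[["longLabel"]])
```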

There is a list element for each block, named by track. Each list element is a character matrix with one row per field defined in that block (rownames are field names). The names/rownames make it easy to write R code that selects individual elements by name, e.g.

fields.list$bcell_McGill0091Coverage["bigDataUrl",]
#> [1] "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig"
fields.list$monocyte_McGill0001Peaks["color",]
#> [1] "0,0,0"
has.bigDataUrl <- sapply(fields.list, function(m)"bigDataUrl" %in% rownames(m))
bigDataUrl.list <- fields.list[has.bigDataUrl]
length(bigDataUrl.list)
#> [1] 78
length(fields.list)
#> [1] 123

So there are 78 tracks which define the bigDataUrl field, out of 123 total tracks.

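In the example above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the bigDataUrl field for each track, and split sample names into separate columns (using a single regex for the track). The example also demonstrates nested named capture groups (via named lists which contain named regex strings). The un-named "|" in name.pattern is a regex alternation, so track names which do not follow the cellType_sampleName pattern are still matched by [^\n]+, with the nested groups left empty (or NA after type conversion), as in the first four rows of the output.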

name.pattern <- list(
  cellType=".*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks",
  "|", # alternation: otherwise, accept any other track name.
  "[^\n]+")
match.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name=name.pattern,
  "(?:\n[^\n]+)*",
  "\\s+bigDataUrl ",
  bigDataUrl="[^\n]+")
head(match.df)
#>                          cellType sampleName sampleID dataType
#> all_labels                                         NA         
#> problems                                           NA         
#> jointProblems                                      NA         
#> peaks_summary                                      NA         
#> bcell_McGill0091Coverage    bcell McGill0091       91 Coverage
#> bcell_McGill0091Peaks       bcell McGill0091       91    Peaks
#>                                                                                                            bigDataUrl
#> all_labels                                         http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/all_labels.bigBed
#> problems                                             http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/problems.bigBed
#> jointProblems                                   http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/jointProblems.bigBed
#> peaks_summary                                   http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#> bcell_McGill0091Coverage    http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#> bcell_McGill0091Peaks    http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/joint_peaks.bigWig
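The nesting and the alternation in name.pattern may be easier to follow when the regex is written out by hand. The sketch below assumes that each named list is translated to a named group enclosing its contents, as described earlier in this vignette; it uses base R only and two track names taken from the output above.

```r
# Hand-written regex assumed to correspond to name=name.pattern.
name.regex <- paste0(
  "(?P<name>",
    "(?P<cellType>.*?)",
    "_",
    "(?P<sampleName>McGill(?P<sampleID>[0-9]+))",
    "(?P<dataType>Coverage|Peaks)",
  "|", # second alternative: any other track name.
    "[^\n]+",
  ")")
track.names <- c("bcell_McGill0091Coverage", "all_labels")
m <- regexpr(name.regex, track.names, perl=TRUE)
first <- attr(m, "capture.start")
last <- first + attr(m, "capture.length") - 1L
# substring recycles track.names down the columns of first/last,
# so the result can be reshaped to one row per track name.
group.mat <- matrix(
  substring(track.names, first, last),
  nrow=length(track.names),
  dimnames=list(track.names, attr(m, "capture.names")))
print(group.mat["bcell_McGill0091Coverage", ])
# all_labels matches via the second alternative, so only the
# enclosing name group is set for that track.
print(group.mat["all_labels", "name"])
```

Note that sampleID is nested inside sampleName, so the same digits appear in both columns (and as.integer converts the sampleID string to 91).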

Exercise for the reader: modify the above regex in order to capture three additional columns (red, green, blue) from the color field.

Extract several columns of a data frame

We also provide namedCapture::df_match_variable which extracts text from several columns of a data.frame, using a different named capture regular expression for each column.

(sacct.df <- data.frame(
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  JobID=c(
    "13937810_25",
    "13937810_25.batch",
    "13937810_25.extern",
    "14022192_[1-3]",
    "14022204_[4]"),
  stringsAsFactors=FALSE))
#>    Elapsed              JobID
#> 1 07:04:42        13937810_25
#> 2 07:04:42  13937810_25.batch
#> 3 07:04:49 13937810_25.extern
#> 4 00:00:00     14022192_[1-3]
#> 5 00:00:00       14022204_[4]

Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:

## Define some sub-patterns separately for clarity.
range.pattern <- list(
  "[[]",
  task1="[0-9]+", as.integer,
  "(?:-",#begin optional end of range.
  taskN="[0-9]+", as.integer,
  ")?", #end is optional.
  "[]]")
task.pattern <- list(
  "(?:",#begin alternate
  task="[0-9]+", as.integer,
  "|",#either one task(above) or range(below)
  range.pattern,
  ")")#end alternate
(task.df <- namedCapture::df_match_variable(
  sacct.df,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))
#>    Elapsed              JobID JobID.job JobID.task JobID.task1 JobID.taskN
#> 1 07:04:42        13937810_25  13937810         25          NA          NA
#> 2 07:04:42  13937810_25.batch  13937810         25          NA          NA
#> 3 07:04:49 13937810_25.extern  13937810         25          NA          NA
#> 4 00:00:00     14022192_[1-3]  14022192         NA           1           3
#> 5 00:00:00       14022204_[4]  14022204         NA           4          NA
#>   JobID.type Elapsed.hours Elapsed.minutes Elapsed.seconds
#> 1                        7               4              42
#> 2      batch             7               4              42
#> 3     extern             7               4              49
#> 4                        0               0               0
#> 5                        0               0               0
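With integer columns, the filtering mentioned above becomes ordinary data frame subsetting. The sketch below re-creates four of the converted columns by hand (values copied from the output above) so that it runs without the package; the one hour threshold and the job id are arbitrary choices for illustration.

```r
# Columns copied from the df_match_variable output shown above.
task.df <- data.frame(
  JobID.job=c(13937810L, 13937810L, 13937810L, 14022192L, 14022204L),
  Elapsed.hours=c(7L, 7L, 7L, 0L, 0L),
  Elapsed.minutes=c(4L, 4L, 4L, 0L, 0L),
  Elapsed.seconds=c(42L, 42L, 49L, 0L, 0L))
# Total elapsed time in seconds.
task.df$Elapsed.total <- with(
  task.df,
  Elapsed.hours*3600L + Elapsed.minutes*60L + Elapsed.seconds)
# Keep rows for one base job id which ran for more than one hour
# (both values are arbitrary, for illustration only).
long.jobs <- task.df[
  task.df$Elapsed.total > 3600L & task.df$JobID.job == 13937810L, ]
print(long.jobs)
```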

The result is another data frame with an additional column for each named capture group. Note that this also works with data.table:

library(data.table)
sacct.dt <- data.table(sacct.df)
(task.dt <- namedCapture::df_match_variable(
  sacct.dt,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))
#>     Elapsed              JobID JobID.job JobID.task JobID.task1
#> 1: 07:04:42        13937810_25  13937810         25          NA
#> 2: 07:04:42  13937810_25.batch  13937810         25          NA
#> 3: 07:04:49 13937810_25.extern  13937810         25          NA
#> 4: 00:00:00     14022192_[1-3]  14022192         NA           1
#> 5: 00:00:00       14022204_[4]  14022204         NA           4
#>    JobID.taskN JobID.type Elapsed.hours Elapsed.minutes Elapsed.seconds
#> 1:          NA                        7               4              42
#> 2:          NA      batch             7               4              42
#> 3:          NA     extern             7               4              49
#> 4:           3                        0               0               0
#> 5:          NA                        0               0               0