Three argument syntax

Toby Dylan Hocking

2019-02-25

The str_match_named and str_match_all_named functions take exactly three arguments as input: - subject is the character vector from which we want to extract tabular data. - pattern is the (character scalar) regular expression with named capture groups used for extraction. The named capture groups should be literally specified, e.g. (?P<group_name>subpattern) - fun.list is a list with names that correspond to capture groups, and values are functions used to convert the extracted character data to other (typically numeric) types.

Genomic range example

For example, consider the character vector of subject strings below:

subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.

With these genomic range subjects, the goal is to extract the chromosome name (before the colon), along with the numeric start and end locations (separated by a dash). That pattern is coded below,

chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")

Note that it is often preferable (as above) to code the pattern using paste0, in order to put different parts of the pattern on each line. In particular, I recommend putting each named capture group on its own line – this is one of the ideas that is used in the variable argument syntax (see other vignette for more info).

Extract first match from each subject

Using the pattern on the subjects above results in

(match.mat <- namedCapture::str_match_named(subject.vec, chr.pos.pattern))
#>      chrom   chromStart    chromEnd     
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""           
#> [3,] NA      NA            NA           
#> [4,] NA      NA            NA           
#> [5,] "chr1"  "110"         "111"
str(match.mat)
#>  chr [1:5, 1:3] "chr10" "chrM" NA NA "chr1" "213,054,000" "111,000" NA ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:3] "chrom" "chromStart" "chromEnd"

Note that the third argument (list of conversion functions) is omitted in the code above. In that case, the return value is a character matrix, in which missing values indicate missing subjects or no match. The empty string is used for optional groups which are not used in the match (e.g. chromEnd group/column for second subject).

Third argument: list of type conversion functions

However we often want to extract numeric data – in this case we want to convert chromStart/End to integers. You can do that by supplying a named list of conversion functions as the third argument. Each function should take exactly one argument, a character vector (data in the matched column/group), and return a vector of the same size. The code below specifies the keep.digits function for both chromStart and chromEnd.

keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
conversion.list <- list(chromStart=keep.digits, chromEnd=keep.digits)
(match.df <- namedCapture::str_match_named(
  subject.vec, chr.pos.pattern, conversion.list))
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111
str(match.df)
#> 'data.frame':    5 obs. of  3 variables:
#>  $ chrom     : chr  "chr10" "chrM" NA NA ...
#>  $ chromStart: int  213054000 111000 NA NA 110
#>  $ chromEnd  : int  213055000 NA NA NA 111

Note that a data.frame is returned when the third argument is specified, in order to handle non-character data types returned by the conversion functions.

Extract all matches from each subject

Note in the examples above that the last subject has two possible matches, but only the first is returned by str_match_named. Use str_match_all_named to get ALL matches in each subject (not just the first match).

namedCapture::str_match_all_named(
  subject.vec, chr.pos.pattern, conversion.list)
#> [[1]]
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 
#> [[2]]
#>   chrom chromStart chromEnd
#> 1  chrM     111000       NA
#> 
#> [[3]]
#> data frame with 0 columns and 0 rows
#> 
#> [[4]]
#> data frame with 0 columns and 0 rows
#> 
#> [[5]]
#>   chrom chromStart chromEnd
#> 1  chr1        110      111
#> 2  chr2        220      222

As shown above, the result is a list with one element for each subject. Each list element is a data.frame with one row for each match.

Named output

If the pattern specifies the name group, then it will be used for the rownames of the output, and it will not be included as a column. For example the pattern below uses name for the first group:

name.pattern <- paste0(
  "(?P<name>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
try(named.mat <- namedCapture::str_match_named(
  subject.vec, name.pattern, conversion.list))
#>   name chromStart chromEnd
#> 3 <NA>       <NA>     <NA>
#> 4 <NA>       <NA>     <NA>
(named.mat <- namedCapture::str_match_named(
  subject.vec[-(3:4)], name.pattern, conversion.list))
#>       chromStart  chromEnd
#> chr10  213054000 213055000
#> chrM      111000        NA
#> chr1         110       111
(named.list <- namedCapture::str_match_all_named(
  subject.vec, name.pattern, conversion.list))
#> [[1]]
#>       chromStart  chromEnd
#> chr10  213054000 213055000
#> 
#> [[2]]
#>      chromStart chromEnd
#> chrM     111000       NA
#> 
#> [[3]]
#> data frame with 0 columns and 0 rows
#> 
#> [[4]]
#> data frame with 0 columns and 0 rows
#> 
#> [[5]]
#>      chromStart chromEnd
#> chr1        110      111
#> chr2        220      222

Note in the above code we use try because it is an error if any name groups are missing (and they are for the subjects 3 and 4).

The named output feature makes it easy to select particular elements of the extracted data by name, e.g.

named.mat["chr1", "chromStart"]
#> [1] 110
named.list[[5]]["chr2", "chromStart"]
#> [1] 220

Note that if the subject is named, its names will be used to name the output (rownames or list names).

named.subject.vec <- c(
  ten="chr10:213,054,000-213,055,000",
  M="chrM:111,000",
  nomatch="this will not match",
  missing=NA, # neither will this.
  two="chr1:110-111 chr2:220-222") # two possible matches.
namedCapture::str_match_named(
  named.subject.vec, chr.pos.pattern, conversion.list)
#>         chrom chromStart  chromEnd
#> ten     chr10  213054000 213055000
#> M        chrM     111000        NA
#> nomatch  <NA>         NA        NA
#> missing  <NA>         NA        NA
#> two      chr1        110       111
namedCapture::str_match_all_named(
  named.subject.vec, chr.pos.pattern, conversion.list)
#> $ten
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 
#> $M
#>   chrom chromStart chromEnd
#> 1  chrM     111000       NA
#> 
#> $nomatch
#> data frame with 0 columns and 0 rows
#> 
#> $missing
#> data frame with 0 columns and 0 rows
#> 
#> $two
#>   chrom chromStart chromEnd
#> 1  chr1        110      111
#> 2  chr2        220      222

If the subject has names, and the name group is specified, then the subject names are used to name the output (and the name column is included in the output).

named.subject.vec <- c(
  ten="chr10:213,054,000-213,055,000",
  M="chrM:111,000",
  nomatch="this will not match",
  missing=NA, # neither will this.
  two="chr1:110-111 chr2:220-222") # two possible matches.
namedCapture::str_match_named(
  named.subject.vec, name.pattern, conversion.list)
#>          name chromStart  chromEnd
#> ten     chr10  213054000 213055000
#> M        chrM     111000        NA
#> nomatch  <NA>         NA        NA
#> missing  <NA>         NA        NA
#> two      chr1        110       111
namedCapture::str_match_all_named(
  named.subject.vec, name.pattern, conversion.list)
#> $ten
#>       chromStart  chromEnd
#> chr10  213054000 213055000
#> 
#> $M
#>      chromStart chromEnd
#> chrM     111000       NA
#> 
#> $nomatch
#> data frame with 0 columns and 0 rows
#> 
#> $missing
#> data frame with 0 columns and 0 rows
#> 
#> $two
#>      chromStart chromEnd
#> chr1        110      111
#> chr2        220      222