Rule Packs

Evgeni Chasnovski

2019-02-15

This vignette describes common ways of creating rule packs and explains the logic behind them.

Overview

A rule is a function that converts a data unit of interest (data, group, column, row, or cell) to a logical value indicating whether that unit satisfies a certain condition.

A rule pack is a function that combines several rules for a common data unit into one functional block. The recommended way of creating rules is to create packs right away using dplyr and magrittr’s pipe operator.
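To make the two definitions concrete, here is a minimal sketch. The pack name dims_pack and its rule names has_rows and has_cols are hypothetical, chosen only for illustration. The sketch relies on the fact that a magrittr pipeline starting with . is itself a function, so it can be stored and applied to any data frame.

```r
library(dplyr)  # also makes the %>% pipe available

# Hypothetical pack: two rules about the data frame as a whole.
# Each named expression inside summarise() is one rule.
dims_pack <- . %>% summarise(
  has_rows = nrow(.) > 0,
  has_cols = ncol(.) > 0
)

# The pipeline is an ordinary function ...
is.function(dims_pack)

# ... that returns one logical value per rule
res <- dims_pack(mtcars)
res
```

At this point dims_pack is only a plain function. It becomes a rule pack of a particular type once it is passed to one of the constructors described below.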

Some of ruler’s functionality is powered by the keyholder package, so it is highly recommended to use the functions it supports when constructing rule packs. All one- and two-table dplyr verbs applied to local data frames are supported and are considered the most appropriate way to create rule packs.

As described in the vignette about the design process, a rule pack must have a type because outputs for different data units have different structure. For this reason ruler has a family of *_packs() constructors (where * stands for the name of the data unit): data_packs(), group_packs(), col_packs(), row_packs(), and cell_packs().

Data rule packs

To check whether the dimensions of mtcars obey some rules, one can write the following dplyr pipeline:

mtcars %>% summarise(
  nrow_low = nrow(.) > 10,
  nrow_high = nrow(.) < 30,
  ncol = ncol(.) == 12
)
#>   nrow_low nrow_high  ncol
#> 1     TRUE     FALSE FALSE

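The output is a data frame with one row. Its column names are rule names, and its logical values indicate whether the data as a whole obeys each rule.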

There is an easy way to transform this pipeline into a function that can be applied to any data: replace mtcars with the . placeholder. To indicate that this function is a rule pack for the data unit ‘data’, wrap it with data_packs().

The following code creates a list my_data_packs with one data rule pack named my_data_pack_1. That rule pack defines rules named nrow_low, nrow_high, and ncol.

my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)
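As a hedged usage sketch, the packs can then be applied to data. The functions expose() and get_report() belong to ruler but are not covered in this section, so treat their use here as an assumption about a typical workflow rather than a prescribed one.

```r
library(dplyr)
library(ruler)

my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)

# expose() applies the packs to the data and stores the results;
# get_report() extracts the tidy validation report
report <- mtcars %>%
  expose(my_data_packs) %>%
  get_report()

report
```

Because the pack is a function of ., the same my_data_packs list can be exposed on any other data frame without changes.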

Group rule packs

To check whether certain groups of rows of mtcars obey some rules, one can write the following dplyr pipeline:

mtcars %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6))
#> # A tibble: 4 x 3
#> # Groups:   vs [2]
#>      vs    am any_cyl_6
#>   <dbl> <dbl> <lgl>    
#> 1     0     0 FALSE    
#> 2     0     1 TRUE     
#> 3     1     0 TRUE     
#> 4     1     1 FALSE

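The output is a data frame with one row per group. The grouping columns (vs and am) identify the group, and every other column is named after a rule and holds logical values indicating whether the group obeys it.

The following code creates a list with one nameless group rule pack (its name will be imputed during exposure). The pack contains one rule, any_cyl_6, which checks every group defined by the vs and am columns.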

my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

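Note that the grouping columns are passed to group_packs() a second time, through its .group_vars argument.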

Column rule packs

To check whether certain columns of mtcars obey some rules, one can write the following dplyr pipeline:

is_integerish <- function(x) {all(x == as.integer(x))}

mtcars %>%
  summarise_if(is_integerish, funs(mean_low = mean(.) > 0.5))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> please use list() instead
#> 
#> # Before:
#> funs(name = f(.)
#> 
#> # After: 
#> list(name = ~f(.))
#> This warning is displayed once per session.
#>   cyl_mean_low hp_mean_low vs_mean_low am_mean_low gear_mean_low
#> 1         TRUE        TRUE       FALSE       FALSE          TRUE
#>   carb_mean_low
#> 1          TRUE

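The output is a data frame with one row. Each column name combines the name of a validated column with a rule name (for example, cyl_mean_low), and the logical values indicate whether the column obeys the rule.

In general it is hard to automatically split the output’s column names into ‘validated column name’ and ‘rule name’ because the default separator _ is commonly used inside names. For this reason ruler has the function rules(), a wrapper around funs() designed to make this separation possible.

The following code creates a list with two elements: a column rule pack named my_col_pack_1, whose rule mean_low is applied to every integer-like column, and a nameless column rule pack with one nameless rule that checks whether the sum of column vs is greater than 300.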

my_col_packs <- col_packs(
  my_col_pack_1 = . %>% summarise_if(
    is_integerish,
    rules(mean_low = mean(.) > 0.5)
  ),
  . %>% summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)
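The naming problem that motivates rules() can be reproduced with base R alone. The sketch below is only an illustration: the separator "._." is an assumption chosen for the example and should not be read as a statement about how rules() names its output.

```r
# Splitting on "_" cannot tell where the column name ends
# and where the rule name begins
ambiguous <- strsplit("cyl_mean_low", "_", fixed = TRUE)[[1]]
ambiguous
# gives three pieces: "cyl", "mean", "low"

# With a separator that is unlikely to occur inside names
# (assumed here for illustration), the split is unambiguous
clear <- strsplit("cyl._.mean_low", "._.", fixed = TRUE)[[1]]
clear
# gives two pieces: "cyl", "mean_low"
```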

Row rule packs

To check whether certain rows of mtcars are not outliers, one can write the following dplyr pipeline:

z_score <- function(x) {(x - mean(x)) / sd(x)}

mtcars %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)
#>   is_common_row_mean
#> 1               TRUE
#> 2               TRUE
#> 3               TRUE
#> 4               TRUE
#> 5               TRUE
#> 6              FALSE

The output is a data frame with one row per checked row of the data. Its column names are rule names, and its logical values indicate whether the row obeys the rule.

A pipeline like the one above is quite common: for every row, compute some value based on all rows and then validate only some of the rows. However, in the validation report the id column should represent the row index in the original data frame, and this information is lost after applying slice().

This problem is solved by using the keyholder package, whose main purpose is to track information about rows while a data frame is modified. During exposure, the pack is applied to a keyed version of the input data with the key equal to the row index. Note that to use this feature one should create rule packs by composing functions supported by keyholder.

The following code creates a list with one row pack, my_row_pack_1. It contains one rule, is_common_row_mean, which checks six rows (10 to 15) for not being outliers in terms of row means, based on information from all rows.
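The key tracking described above can be sketched directly with keyholder. This is a minimal illustration, assuming keyholder's use_id() (which assigns row numbers as a key column named .id) and keys() (which returns the stored keys). It is not meant to show what ruler does internally.

```r
library(dplyr)
library(keyholder)

kept <- mtcars %>%
  use_id() %>%     # store row numbers as a key before modifying the data
  slice(10:15)     # slice() is supported, so the keys follow the rows

# The keys still point at the original row numbers
keys(kept)
```

Without the key, the sliced data frame would only know that it has six rows, not which six rows of mtcars they were.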

my_row_packs <- row_packs(
  my_row_pack_1 = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)
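A hedged sketch of how this row pack might be used follows. As before, expose() and get_report() are ruler functions not covered in this section, and the block repeats the definitions it needs so that it can be run on its own.

```r
library(dplyr)
library(ruler)

z_score <- function(x) {(x - mean(x)) / sd(x)}

my_row_packs <- row_packs(
  my_row_pack_1 = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)

row_report <- mtcars %>%
  expose(my_row_packs) %>%
  get_report()

# The id column refers to row numbers of mtcars,
# not to positions 1-6 within the sliced output
row_report
```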

Cell rule packs

To check whether certain cells of mtcars are not outliers, one can write the following dplyr pipeline:

mtcars %>% transmute_if(
    is_integerish,
    funs(is_common = abs(z_score(.)) < 1)
  ) %>%
    slice(20:24)
#>   cyl_is_common hp_is_common vs_is_common am_is_common gear_is_common
#> 1         FALSE        FALSE        FALSE        FALSE           TRUE
#> 2         FALSE         TRUE        FALSE         TRUE           TRUE
#> 3         FALSE         TRUE         TRUE         TRUE           TRUE
#> 4         FALSE         TRUE         TRUE         TRUE           TRUE
#> 5         FALSE        FALSE         TRUE         TRUE           TRUE
#>   carb_is_common
#> 1          FALSE
#> 2          FALSE
#> 3           TRUE
#> 4           TRUE
#> 5           TRUE

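The output is a data frame with one row per checked row of the data. Each column name combines the name of a validated column with a rule name (for example, cyl_is_common), and the logical values indicate whether the cell obeys the rule.

Basically, a cell rule pack is a combination of column and row rule packs. This means that rules() should be used to define rules, as in column packs, and that functions supported by keyholder should be used to keep track of rows, as in row packs.

The following code creates a list with one cell pack, my_cell_pack_1. It checks the cells of every integer-like column in rows 20-24 for not being outliers within their column.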

my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    slice(20:24)
)