Design Process and Exposure Format

Evgeni Chasnovski

2019-02-15

The main idea of the ruler package is to create a format for validation results (along with a functional API) that works naturally with tidyverse tools. This vignette walks through the design process behind that format and then describes the exposure format itself.

Design process

The preferred local data structure in the tidyverse is the tibble: “A modern re-imagining of the data frame”, which is itself built on the data frame. That is why ruler uses data frames as the preferred format for data to be validated. The initial goal, however, is to use tibbles as much as possible when building the validation result format.

Basically, a data frame is a list of variables with the same length. It is easier to think of it as a two-dimensional structure whose columns can be of different types.
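The "list of equal-length variables" view can be checked directly in base R. The snippet below is a minimal sketch; the data frame `df` and its column names are made up for illustration.

```r
# A data frame is a list whose elements (columns) all have the same length.
df <- data.frame(
  name = c("a", "b", "c"),
  score = c(1.5, 2.0, 3.5),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is_list <- is.list(df)                        # data frames are lists underneath
col_lengths <- lengths(df)                    # one length per column, all equal
col_types <- vapply(df, class, character(1))  # columns may differ in type
```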

In abstract form, validation of a data frame can be put as asking whether a certain subset of the data frame (a data unit) obeys a certain rule. The result of validation is a logical value representing the answer.
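As a small illustration of "data unit plus rule gives a logical answer", the sketch below asks two questions about a made-up data frame. The data and the thresholds are hypothetical and only serve the example.

```r
scores <- data.frame(
  student = c("ann", "bob", "cy"),
  points = c(12, 7, 9),
  stringsAsFactors = FALSE
)

# Data unit: the column `points` as a whole. Rule: no value is negative.
points_non_negative <- all(scores$points >= 0)

# Data unit: a single cell (row 2 of `points`). Rule: value is at least 8.
bob_passes <- scores$points[2] >= 8
```

Each answer is a single logical value: the first data unit obeys its rule, the second one breaks it.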

Influenced by dplyr’s grammar of data manipulation, ruler represents a data frame in terms of five basic data units: data, group, column, row and cell. They can all be described by a combination of two variables: var (the variable name of the data unit) and id (the row index of the data unit).
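One way to picture this is as a pair of columns, assuming the two variables are the `var` and `id` columns of the validation report described in the Exposure section. The helper `describe_unit()` below is hypothetical, and the placeholder values `".all"` (all columns as a whole) and `0` (all rows as a whole) are assumptions of this sketch. Group units are left out for brevity.

```r
# Hypothetical helper: describe a data unit by a variable name and a row index.
# ".all" and 0 are placeholder values assumed for "all columns" and "all rows".
describe_unit <- function(var = ".all", id = 0L) {
  data.frame(var = var, id = as.integer(id), stringsAsFactors = FALSE)
}

units <- rbind(
  describe_unit(),                    # data as a whole
  describe_unit(var = "mpg"),         # column `mpg` as a whole
  describe_unit(id = 3),              # third row as a whole
  describe_unit(var = "mpg", id = 3)  # cell in column `mpg`, row 3
)
```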

Validation of data units can be done with dplyr functions. Applying one of them to a data unit can answer multiple questions at once. That is why, by design, rules (functions that answer one specific question about one type of data unit) are combined into rule packs (functions that answer multiple questions about one type of data unit).
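The relation between rules and rule packs can be sketched with plain functions. This example uses base R instead of dplyr, and does not use ruler's own pack constructors. The function names and the one-column-per-rule layout are assumptions made for illustration.

```r
# A rule answers one question about one type of data unit
# (here: the data as a whole).
rule_enough_rows <- function(df) nrow(df) > 10
rule_not_too_wide <- function(df) ncol(df) < 20

# A rule pack answers several questions about the same type of data unit.
# Hypothetical layout: one logical column per rule.
data_pack_dims <- function(df) {
  data.frame(
    enough_rows = rule_enough_rows(df),
    not_too_wide = rule_not_too_wide(df)
  )
}

result <- data_pack_dims(mtcars)
```

Keeping each rule as a named column means the rule names travel with the answers, which makes it straightforward to reshape them into a long report later.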

Application of a rule pack to data is connected with several points:

In ruler, exposing data to rules means applying rule packs to the data, collecting the results in a common format and attaching them to the data as an exposure attribute. This way, the actual exposure can be done in multiple steps and can also be part of a general data preparation pipeline.
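The idea of attaching results as an attribute, so that exposure can happen in several steps, can be sketched in base R. This is not ruler's implementation: `expose_sketch()` and the two pack functions are hypothetical stand-ins, and the collected result is simplified to three columns.

```r
# Hypothetical stand-in for exposing data to one rule pack.
expose_sketch <- function(df, pack_name, pack_fun) {
  answers <- pack_fun(df)  # one-row data frame of logical answers
  new_rows <- data.frame(
    pack = pack_name,
    rule = names(answers),
    value = unlist(answers, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  # Append to results already attached by earlier steps, if any.
  attr(df, "exposure") <- rbind(attr(df, "exposure"), new_rows)
  df
}

dims_pack <- function(df) {
  data.frame(enough_rows = nrow(df) > 10, not_too_wide = ncol(df) < 20)
}
na_pack <- function(df) data.frame(no_missing = !any(is.na(df)))

# Exposure in two steps; the data itself is passed along unchanged.
exposed <- expose_sketch(mtcars, "dims", dims_pack)
exposed <- expose_sketch(exposed, "missing", na_pack)
report <- attr(exposed, "exposure")
```

Because each step returns the data with an updated attribute, such a function can sit between ordinary data preparation steps without interrupting them.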

Exposure

Exposure is a format designed to contain uniform information about validation of different data units. For reproducibility, it also saves information about the applied packs. Basically, an exposure is a list with two elements:

  1. Packs info: a tibble with the following structure:
    • name <chr> : Name of the pack. If not set manually it will be imputed during exposure.
    • type <chr> : Name of the pack type. Indicates which data unit the pack checks.
    • fun <list> : List (preferably unnamed) of rule pack functions.
    • remove_obeyers <lgl> : Whether rows about obeyers (data units that obey a certain rule) were removed from the report after applying the pack.
  2. Tidy data validation report: a tibble with the following structure:
    • pack <chr> : Name of the rule pack (from column ‘name’ in packs info).
    • rule <chr> : Name of the rule defined in the rule pack.
    • var <chr> : Name of the data unit variable.
    • id <int> : Row index of the data unit.
    • value <lgl> : Whether the described data unit obeys the rule.
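The two-element structure above can be mocked up by hand to see how the pieces fit together. The sketch below uses base data frames rather than tibbles to stay dependency-free. The element names `packs_info` and `report`, the type label `"data_pack"`, and the placeholder values `".all"` and `0` are assumptions of this example.

```r
# Hypothetical rule pack function stored in the packs info.
dims_pack <- function(df) data.frame(enough_rows = nrow(df) > 10)

packs_info <- data.frame(
  name = "dims",
  type = "data_pack",  # assumed label for a pack that checks data as a whole
  remove_obeyers = FALSE,
  stringsAsFactors = FALSE
)
packs_info$fun <- list(dims_pack)  # list-column holding the pack function
packs_info <- packs_info[, c("name", "type", "fun", "remove_obeyers")]

report <- data.frame(
  pack = "dims",         # matches `name` in packs info
  rule = "enough_rows",
  var = ".all",          # assumed placeholder: all columns as a whole
  id = 0L,               # assumed placeholder: all rows as a whole
  value = TRUE,
  stringsAsFactors = FALSE
)

exposure_sketch <- list(packs_info = packs_info, report = report)
```

Since the report is one row per rule and data unit, it can be filtered or summarised with the usual data frame tools, for example keeping only rows where `value` is `FALSE` to list rule breakers.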