Introduction

The validate package is intended to make checking your data easy, maintainable and reproducible. The package allows you to define validation rules, confront data sets with them, and summarise or plot the results.

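The main nouns related to the infrastructure offered by validate are the validator, an object that stores a set of rules, and the confrontation, an object that stores the results of checking data against those rules.

There is also a single verb, namely confront: a data set is confronted with a validator.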

A quick example

Here’s an example demonstrating the typical workflow. We’ll use the built-in women data set (average heights and weights for American women aged 30-39).

data(women)
summary(women)
##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0

Validating data is all about checking whether a data set meets presumptions or expectations you have about it, and the validate package makes it easy for you to define those expectations. Let’s do a quick check on variables in the women data set.

library(validate)
cf <- check_that(women, height > 0, weight > 0, height/weight > 0.5)
summary(cf)
##   rule items passes fails nNA error warning          expression
## 1   V1    15     15     0   0 FALSE   FALSE          height > 0
## 2   V2    15     15     0   0 FALSE   FALSE          weight > 0
## 3   V3    15      2    13   0 FALSE   FALSE height/weight > 0.5

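check_that returns an object containing all sorts of information on the validation results. The easiest way to check the results is with summary, which returns a data.frame with one row per rule: the rule name, the number of items checked, the number of passes, fails and missing (NA) results, whether evaluating the rule raised an error or a warning, and the expression that was evaluated.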

If you’re a fan of the pipe operator provided by the magrittr package, the above statement can also be written as follows.

library(magrittr)
women %>% check_that(height > 0, weight > 0, height/weight > 0.5) %>% summary()

The same information can be summarized graphically as well.

barplot(cf,main="Checks on the women data set")
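
Because summary returns an ordinary data.frame, the rules that need attention can be pulled out with standard subsetting. The following is a minimal sketch, assuming the column names shown in the summary output above (fails, error, warning). The object names s and failing are only illustrative.

```r
library(validate)

cf <- check_that(women, height > 0, weight > 0, height/weight > 0.5)
s  <- summary(cf)

# keep only the rules that failed for at least one item,
# or that could not be evaluated cleanly
failing <- s[s$fails > 0 | s$error | s$warning, ]
failing$expression
```

With the rules used here, only the ratio rule should remain in failing.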

Validator objects

Validator objects are used to store, investigate and manipulate rule sets.

v <- validator(height > 0, weight > 0, height/weight > 0)
v
## Object of class 'validator' with 3 elements:
##  V1: height > 0
##  V2: weight > 0
##  V3: height/weight > 0

The validator object has stored the rules and assigned names to them for future reference. To check the data, we confront the data set with the validation rules we’ve just defined:

cf <- confront(women,v)
cf
## Object of class 'validation'
## Call:
##     confront(x = women, dat = v)
## 
## Confrontations: 3
## With fails    : 0
## Warnings      : 0
## Errors        : 0

The object cf contains the result of checking the data in women against the expectations in v. The fact that there are no warnings or errors means that each rule could indeed be evaluated successfully (an error would occur, for example, if we misspelled height). Now let’s take a look at the actual results.

summary(cf)
##   rule items passes fails nNA error warning        expression
## 1   V1    15     15     0   0 FALSE   FALSE        height > 0
## 2   V2    15     15     0   0 FALSE   FALSE        weight > 0
## 3   V3    15     15     0   0 FALSE   FALSE height/weight > 0

Now suppose we expect the BMI (weight divided by height squared) of each item to be below 23. The women data set records weight in pounds and height in inches, so we need to convert to kg and meters, and the equation for BMI becomes

\[ BMI = \frac{weight\times0.45359}{(height\times0.0254)^2} \]

Moreover, assume that we suspect that the average BMI is between 22 and 22.5. Let’s create another validator object that first computes the BMI and then tests whether the BMI values conform to our suspicion.

v <- validator(
  BMI := (weight*0.45359)/(height*0.0254)^2
  , height > 0
  , weight > 0
  , BMI < 23
  , mean(BMI) > 22 & mean(BMI) < 22.5
)
v
## Object of class 'validator' with 5 elements:
##  V1: `:=`(BMI, (weight * 0.45359)/(height * 0.0254)^2)
##  V2: height > 0
##  V3: weight > 0
##  V4: BMI < 23
##  V5: mean(BMI) > 22 & mean(BMI) < 22.5

Checking is as easy as before:

cf <- confront(women,v)
summary(cf)
##   rule items passes fails nNA error warning
## 1   V2    15     15     0   0 FALSE   FALSE
## 2   V3    15     15     0   0 FALSE   FALSE
## 3   V4    15     10     5   0 FALSE   FALSE
## 4   V5     1      0     1   0 FALSE   FALSE
##                                                                                                    expression
## 1                                                                                                  height > 0
## 2                                                                                                  weight > 0
## 3                                                               ((weight * 0.45359)/(height * 0.0254)^2) < 23
## 4 mean(((weight * 0.45359)/(height * 0.0254)^2)) > 22 & mean(((weight * 0.45359)/(height * 0.0254)^2)) < 22.5

Observe that the validation expressions have been rewritten: wherever BMI was used, it has been replaced with the computation defined in the first rule. The assignment itself (V1) is not a check, so it does not appear in the summary.
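
The BMI rules can be cross-checked by hand. The sketch below uses base R only and does not depend on validate. It applies the same conversion factors as the rule set above, and the name bmi is just an illustrative variable.

```r
data(women)

# BMI with weight converted from pounds to kg and height from inches to meters
bmi <- with(women, (weight * 0.45359) / (height * 0.0254)^2)

# number of records satisfying BMI < 23 (compare with the passes for V4)
sum(bmi < 23)

# the aggregate rule: is the mean BMI strictly between 22 and 22.5?
mean(bmi) > 22 & mean(bmi) < 22.5
```

The counts should agree with the summary above: 10 records pass the BMI < 23 rule, and the rule on the mean fails because the mean BMI exceeds 22.5.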

Confrontation objects

The outcome of confronting a validator object with a data set is an object of class confrontation. There are several ways to extract information from a confrontation object.

By default, aggregates are produced by rule.

cf <- check_that(women, height>0, weight>0,height/weight < 0.5)
aggregate(cf) 
##    npass nfail nNA rel.pass rel.fail rel.NA
## V1    15     0   0      1.0      0.0      0
## V2    15     0   0      1.0      0.0      0
## V3    12     3   0      0.8      0.2      0

To aggregate by record, use by='record'.

head(aggregate(cf,by='record'))
##   npass nfail nNA  rel.pass  rel.fail rel.NA
## 1     2     1   0 0.6666667 0.3333333      0
## 2     2     1   0 0.6666667 0.3333333      0
## 3     2     1   0 0.6666667 0.3333333      0
## 4     3     0   0 1.0000000 0.0000000      0
## 5     3     0   0 1.0000000 0.0000000      0
## 6     3     0   0 1.0000000 0.0000000      0
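
To see what the two kinds of aggregation count, the same three rules can be evaluated by hand in base R. This is only an illustrative sketch of the idea, not how validate computes its results. The matrix checks and the names npass_rule and npass_record are hypothetical.

```r
data(women)

# one logical column per rule, one row per record
checks <- with(women, cbind(
  V1 = height > 0,
  V2 = weight > 0,
  V3 = height / weight < 0.5
))

# aggregating by rule: count passes down each column
npass_rule <- colSums(checks)
npass_rule

# aggregating by record: count passes across each row
npass_record <- rowSums(checks)
head(npass_record)
```

Column sums correspond to the npass column of aggregate(cf), and row sums to the npass column of aggregate(cf, by='record').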

Aggregated results can be sorted automatically, so that the rules or records with the most violations come first.

# rules with most violations sorting first:
sort(cf)
##    npass nfail nNA rel.pass rel.fail rel.NA
## V3    12     3   0      0.8      0.2      0
## V1    15     0   0      1.0      0.0      0
## V2    15     0   0      1.0      0.0      0

Confrontation objects can be subsetted with single bracket operators (like vectors), to obtain a sub-object pertaining only to the selected rules.

summary(cf[c(1,3)])

Confrontation options

By default, all errors and warnings are caught when validation rules are confronted with data. This behaviour can be changed by setting the raise option to "errors" or "all". In the following example, the error is caught,

v <- validator(hite > 0, weight>0)
summary(confront(women, v))
##   rule items passes fails nNA error warning expression
## 1   V1     0      0     0   0  TRUE   FALSE   hite > 0
## 2   V2    15     15     0   0 FALSE   FALSE weight > 0

while here it is raised immediately.

# this gives an error
confront(women, v, raise='all')
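
When raise is set so that errors propagate, the usual R condition handling applies. The following sketch wraps the call in tryCatch and keeps the error message as a string. The name res is illustrative, and the exact wording of the message is not assumed here.

```r
library(validate)

v <- validator(hite > 0, weight > 0)

# with raise = "all" the misspelled variable stops the confrontation;
# tryCatch turns the error into its message instead of aborting the script
res <- tryCatch(
  confront(women, v, raise = "all"),
  error = function(e) conditionMessage(e)
)
is.character(res)
```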

Linear equalities form an important class of validation rules. To prevent equalities from being tested strictly, there is an option called lin.eq.eps (with default value \(10^{-8}\)) that allows one to add some slack to these tests. The slack is intended to prevent false negatives (unnecessary failures) caused by machine rounding. If you want to check whether a sum rule is satisfied to within one or two units of measurement, it is cleaner to define two inequalities for that.
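
The rounding problem that such a tolerance guards against is easy to reproduce in base R. The sketch below does not call validate. It only illustrates why a strict equality test can fail, and what a check within a fixed number of units looks like when written as two inequalities. The values total, parts and slack are made up for the illustration.

```r
# strict equality fails because 0.1 + 0.2 is not exactly representable
x <- 0.1 + 0.2
x == 0.3

# the same comparison with a tolerance of 1e-8 succeeds
abs(x - 0.3) < 1e-8

# a sum rule allowed to be off by at most 'slack' units,
# expressed as two inequalities rather than one equality
total <- 100
parts <- c(40, 59)
slack <- 2
within_slack <- (sum(parts) - total <= slack) & (sum(parts) - total >= -slack)
within_slack
```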

Investigating validator objects

Validator objects store a set of rules, optionally with some metadata per rule, such as a name, label, description, origin and creation time. Names, for example, can be read and set with names.

Names can be set from the command line when defining a validator object.

v <- validator(rat = height/weight > 0.5, htest=height>0, wtest=weight > 0)
names(v)
## [1] "rat"   "htest" "wtest"

Also try

names(v)[1] <- "ratio"
v
## Object of class 'validator' with 3 elements:
##  ratio: height/weight > 0.5
##  htest: height > 0
##  wtest: weight > 0

Some general information is obtained with summary:

summary(v)
##   block nvar rules linear
## 1     1    2     3      2

Here, some properties per block of rules are given. Two rules occur in the same block if they share a variable. In this case, all rules occur in the same block.

The number of rules can be requested with length:

length(v)

With variables, you can request the variables occurring in each rule, or over all rules.

variables(v)
## [1] "height" "weight"
variables(v,as="matrix")
##        variable
## rule    height weight
##   ratio   TRUE   TRUE
##   htest   TRUE  FALSE
##   wtest  FALSE   TRUE
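
The matrix form is convenient for finding the rules that involve a given variable. This is a small sketch, assuming the matrix has rule names as row names and variable names as column names, as in the output above. The names m and uses_weight are illustrative.

```r
library(validate)

v <- validator(ratio = height/weight > 0.5, htest = height > 0, wtest = weight > 0)
m <- variables(v, as = "matrix")

# names of the rules in which 'weight' occurs
uses_weight <- names(which(m[, "weight"]))
uses_weight
```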

Validator objects can be subsetted as if they were lists using the single and double bracket operators.

v[c(1,3)]
## Object of class 'validator' with 2 elements:
##  ratio: height/weight > 0.5
##  wtest: weight > 0
## Options:
## raise: none; lin.eq.eps: 1e-08; na.value: NA; sequential: TRUE; na.condition: FALSE
v['ratio','wtest']
## Object of class 'validator' with 1 elements:
##  ratio: height/weight > 0.5
## Options:
## raise: none; lin.eq.eps: 1e-08; na.value: NA; sequential: TRUE; na.condition: FALSE

The double bracket can be used to inspect a single rule

v[[1]]
## 
## Object of class rule.
##  expr       : height/weight > 0.5 
##  name       : ratio 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2017-04-07 22:23:48

Validator objects and confrontation objects are reference objects

As simple as that. If you do

w <- v

for a validator object v, then w just points to the same physical object as v. To make an actual copy, you can select everything.

w <- v[]
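
Reference semantics can be surprising if you are used to R’s usual copy-on-modify behaviour. The following toy example uses a base R reference class to illustrate the difference between assigning and copying. It is an analogy only: the class RuleSet and its field rules are hypothetical and are not part of validate.

```r
library(methods)

# toy reference class standing in for a rule container
RuleSet <- setRefClass("RuleSet", fields = list(rules = "character"))

a <- RuleSet$new(rules = "height > 0")
b <- a                   # no copy is made: b refers to the same object as a
b$rules <- "weight > 0"
a$rules                  # the change made through b is visible through a

d <- a$copy()            # an explicit copy is independent of a
d$rules <- "height > 0"
a$rules                  # a is unaffected by the change to d
```

For validator objects, selecting everything with v[] plays the role of the explicit copy.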