A vignette for datacheck package (version 1.0.3)

Reinhard Simon, International Potato Center, Lima, Peru

The library datacheck provides some simple functions to check the consistency of a dataset. It assumes data are available in tabular format - typically a csv file with objects or records in rows and attributes or variables in the columns.

In a database setting the variables would be controlled by the database - at least conformance to types (character, numeric, etc) and allowed min/maximum values. However, often data are gathered in simple spreadsheets or are for other reasons without such constraints. Here, data constraints like allowed types or values, expected values and relationships can be defined using R commands and syntax. This allows much more flexibility and fine grained control. Typically it demands also a lot of domain knowledge from the user. It is therefore often useful to re-use such domain aware rule files across tables with similar content. Therefore this tool is foregiving if rules cannot be executed if a variable is not present in the table to be analyzed allowing the reuse of such rule files.

Using the HTML interface

Use the following commands to copy some example files to your current working directory (uncomment the file.copy command):

atable = system.file("examples/soilsamples.csv", package="datacheck")
srules = system.file("examples/soil_rules.R", package="datacheck")

# Uncomment the next two lines

# file.copy(atable, "soilsamples.csv")
# file.copy(srules, "soil_rules.R")

Then type in the command runDatacheck() in the R editor.

Use the upload buttons to load the respective files in your working directory. Review the results.

Using the command line interface

Assuming you have copied the above mentioned files in your working directory proceed to read in the data.

atable = read.csv(atable, header = TRUE, stringsAsFactors = FALSE)
srules = read.rules(srules)
profil = datadict.profile(atable, srules)

You can inspect a graphical summary of rules per variable:

ruleCoverage(profil)

plot of chunk unnamed-chunk-3

The cumulative number of records with increasing scores.

scoreSum(profil)

plot of chunk unnamed-chunk-4

Or see the tables (only the first 20 records and first 6 columns shown):

xtable(atable[1:20, 1:6])
ID Latitude Longitude Country Adm1 Adm2
1 767 -12.03 -75.24 Peru Junin Huancayo
2 1029 -12.08 -76.95 Peru Lima Lima
3 236 -12.03 -75.24 Peru Junin Huancayo
4 759 -9.28 -77.63 Peru Ancash Carhuaz
5 71 -5.89 -76.11 Peru Loreto Alto Amazonas
6 839 -11.41 -76.34 Peru Junin Yauli
7 1082 -12.08 -76.95 Peru Lima Lima
8 1603 -12.00 -75.22 Peru Junin Huancayo
9 206 -11.12 -75.35 Peru Junin Chanchamayo
10 389 -12.08 -76.95 Peru Lima Lima
11 1317 -15.84 -70.03 Peru Puno Puno
12 922 -12.03 -75.25 Peru Junin Huancayo
13 278 -5.93 -76.13 Peru Loreto Alto Amazonas
14 722 -7.08 -78.34 Peru Cajamarca Cajamarca
15 529 -18.18 -70.47 Peru Tacna Tacna
16 108 -5.89 -76.11 Peru Loreto Alto Amazonas
17 1591 -12.08 -76.95 Peru Lima Lima
18 98 -5.89 -76.11 Peru Loreto Alto Amazonas
19 132 -5.89 -76.11 Peru Loreto Alto Amazonas
20 1598 -12.08 -76.95 Peru Lima Lima

Similarly for the score table; however, this table contains also the total counts of scores by records and variables. In addition, the maximum score by variable.

ps = profil$scores
recs = c(1:10, nrow(ps)-1, nrow(ps))
cols = c(1:4,  ncol(ps))
xtable(ps[recs, cols])
ID Latitude Longitude Country Record.score
1 3.00 2.00 3.00 2.00 35.00
2 3.00 2.00 3.00 2.00 35.00
3 3.00 2.00 3.00 2.00 35.00
4 3.00 2.00 3.00 2.00 35.00
5 3.00 2.00 3.00 2.00 35.00
6 3.00 2.00 3.00 2.00 35.00
7 3.00 2.00 3.00 2.00 35.00
8 3.00 2.00 3.00 2.00 35.00
9 3.00 2.00 3.00 2.00 35.00
10 3.00 2.00 3.00 2.00 35.00
Attribute.score 300.00 200.00 300.00 200.00 3481.00
Rules.per.variable 3.00 2.00 3.00 2.00 35.00

A last visualization is a heatmap of the score table to organize similar records and similar rule profiles to help detect any patterns,

plot of chunk unnamed-chunk-7

Checking tables with data inconsistencies

For comparative purposes we purposely introduce a few errors in our table as below. We also exclude a rule on soil types for better display.

atable$P[1]  = -100
atable$pH[11]= -200
srule1 = srules[-c(33),]
profil = datadict.profile(atable, srule1)

To get a better handle on the data it is always informative to review simple descriptive summaries of the data. A custom summary function is included in the package to display this summary in tabular form:

xtable(shortSummary(atable))
n missing unique value min max Mean sd .05 .10 .25 .50 .75 .90 .95
ID 100 0 100 5 1685 822.6 498.51 96.65 114.70 424.50 838.00 1190.00 1549.60 1605.70
Latitude 100 0 40 -18.1817 -5.8939 -12.23 3.07 -18.182 -18.182 -12.078 -12.028 -11.371 -8.702 -5.894
Longitude 100 0 42 -78.34 -70.0292 -74.87 2.39 -76.95 -76.95 -76.92 -75.35 -75.08 -70.47 -70.03
Country 100 0 1 Peru
Adm1 100 0 11
Adm2 98 2 18
Adm3 99 1 26
pH 100 0 56 -200 8.59 4.402 20.69 3.910 4.390 5.175 7.200 7.500 7.710 8.000
Conductivity 100 0 74 0.05 21.3 1.66 2.58 0.1395 0.1690 0.2400 0.6250 2.3550 4.3460 5.2650
CaCO3 100 0 23 0 47.6 1.573 5.67 0.000 0.000 0.000 0.000 0.220 5.156 10.940
Organic_matter 100 0 70 0.1 36.7 2.681 4.44 0.5475 0.6900 1.0150 1.5000 2.5250 4.6300 8.0150
P 100 0 86 -100 74 17.07 18.56 3.895 4.550 6.800 14.900 22.400 39.240 47.245
Sand 95 5 34 14 98 52.93 16.62 22.0 32.0 43.0 51.0 65.0 72.0 77.2
Lime 95 5 26 2 50 30.14 8.95 18.0 20.0 24.0 30.0 36.0 41.2 44.0
Clay 95 5 22 0 42 16.77 9.37 4.0 6.0 10.0 16.0 22.0 28.0 34.3
Soil_texture 96 4 9
Altitude 100 0 41 78 4417 1722 1603.76 78.0 78.0 235.0 827.5 3299.0 3846.0 3848.0

A summary of the results by rule can be seen from the profil object:

xtable(profil$checks)
Variable Type Rule Comment Execution Error.sum Error.list
1 ID integer sapply(ID, is.integer) None ok 0 none
2 ID integer !duplicated(ID) None ok 0 none
3 ID integer ID > 0 & ID < 1754 None ok 0 none
4 Latitude numeric sapply(Latitude, is.numeric) None ok 0 none
5 Latitude numeric Latitude < 0 None ok 0 none
6 Longitude numeric sapply(Longitude, is.numeric) None ok 0 none
7 Longitude numeric Longitude < 180 & Longitude > -180 None ok 0 none
8 Longitude numeric is.null(Longitude) == is.null(Latitude) None ok 0 none
9 Adm1 character sapply(Adm1, is.character) None ok 0 none
10 Adm2 character sapply(Adm2, is.character) None ok 0 none
11 Adm3 character sapply(Adm3, is.character) None ok 0 none
12 Country character sapply(Country, is.character) None ok 0 none
13 Altitude integer sapply(Altitude, is.integer) None ok 0 none
14 Adm1 character is.null(Adm1) == is.null(Longitude) ok 0 none
15 Adm2 character is.null(Adm2) == is.null(Longitude) None ok 0 none
16 Adm3 character is.null(Adm3) == is.null(Longitude) None ok 0 none
17 Country character is.null(Country) == is.null(Longitude) None ok 0 none
18 Altitude integer is.null(Altitude) == is.null(Longitude) None ok 0 none
19 pH numeric sapply(pH, is.numeric) None ok 0 none
20 pH numeric pH >= 0 pH bigger than ok 1 11
21 pH numeric pH <= 14 pH lesser than ok 0 none
22 Conductivity numeric sapply(Conductivity, is.numeric) None ok 0 none
23 Conductivity numeric Conductivity >= 0 None ok 0 none
24 CaCO3 numeric sapply(CaCO3, is.numeric) None ok 0 none
25 CaCO3 numeric CaCO3 >= 0 None ok 0 none
26 Sand numeric sapply(Sand, is.numeric) None ok 0 none
27 Sand numeric sapply(Sand, is.withinRange, 0, 100) None ok 0 none
28 Lime numeric sapply(Lime, is.numeric) None ok 0 none
29 Lime numeric sapply(Lime, is.withinRange, 0, 100) None ok 0 none
30 Clay numeric sapply(Clay, is.numeric) None ok 0 none
31 Clay numeric sapply(Clay, is.withinRange, 0, 100) None ok 0 none
32 Soil_texture character sapply(Soil_texture, is.character) None ok 0 none
34 P numeric sapply(P, is.numeric) None ok 0 none
35 P numeric P >= 0 None ok 1 1

The checks part lists all erroneous records in the last column for each rule. This may be too long for printing. To this end a custom print report function only displays the first n records where n=5 is the default.

atable$Sand[20:30] = -1
profil = datadict.profile(atable, srule1)
xtable(prep4rep(profil$checks))
Variable Type Rule Comment Execution Error.sum Error.list
1 ID integer sapply(ID, is.integer) None ok 0 none
2 ID integer !duplicated(ID) None ok 0 none
3 ID integer ID > 0 & ID < 1754 None ok 0 none
4 Latitude numeric sapply(Latitude, is.numeric) None ok 0 none
5 Latitude numeric Latitude < 0 None ok 0 none
6 Longitude numeric sapply(Longitude, is.numeric) None ok 0 none
7 Longitude numeric Longitude < 180 & Longitude > -180 None ok 0 none
8 Longitude numeric is.null(Longitude) == is.null(Latitude) None ok 0 none
9 Adm1 character sapply(Adm1, is.character) None ok 0 none
10 Adm2 character sapply(Adm2, is.character) None ok 0 none
11 Adm3 character sapply(Adm3, is.character) None ok 0 none
12 Country character sapply(Country, is.character) None ok 0 none
13 Altitude integer sapply(Altitude, is.integer) None ok 0 none
14 Adm1 character is.null(Adm1) == is.null(Longitude) ok 0 none
15 Adm2 character is.null(Adm2) == is.null(Longitude) None ok 0 none
16 Adm3 character is.null(Adm3) == is.null(Longitude) None ok 0 none
17 Country character is.null(Country) == is.null(Longitude) None ok 0 none
18 Altitude integer is.null(Altitude) == is.null(Longitude) None ok 0 none
19 pH numeric sapply(pH, is.numeric) None ok 0 none
20 pH numeric pH >= 0 pH bigger than ok 1 11
21 pH numeric pH <= 14 pH lesser than ok 0 none
22 Conductivity numeric sapply(Conductivity, is.numeric) None ok 0 none
23 Conductivity numeric Conductivity >= 0 None ok 0 none
24 CaCO3 numeric sapply(CaCO3, is.numeric) None ok 0 none
25 CaCO3 numeric CaCO3 >= 0 None ok 0 none
26 Sand numeric sapply(Sand, is.numeric) None ok 0 none
27 Sand numeric sapply(Sand, is.withinRange, 0, 100) None ok 11 20,21,22,23,24 … more
28 Lime numeric sapply(Lime, is.numeric) None ok 0 none
29 Lime numeric sapply(Lime, is.withinRange, 0, 100) None ok 0 none
30 Clay numeric sapply(Clay, is.numeric) None ok 0 none
31 Clay numeric sapply(Clay, is.withinRange, 0, 100) None ok 0 none
32 Soil_texture character sapply(Soil_texture, is.character) None ok 0 none
34 P numeric sapply(P, is.numeric) None ok 0 none
35 P numeric P >= 0 None ok 1 1

Using rules that can’t be executed

This may happen if the syntax is wrong. Another reason - particularly if re-using rule files across tables - maybe that a particular variable name is not present amongst the column names of the present table. The tool will just ignore it and report a ‘failed’ execution. Let us simply modify an existing rule as below:

srule1$Variable[25] = "caCO3"
srule1$Rule[25] = "caCO3 >= 0"
profil = datadict.profile(atable, srule1)

Now let us just look at an excerpt of the results table:

xtable(prep4rep(profil$checks[20:30,]))
Variable Type Rule Comment Execution Error.sum Error.list
20 pH numeric pH >= 0 pH bigger than ok 1 11
21 pH numeric pH <= 14 pH lesser than ok 0 none
22 Conductivity numeric sapply(Conductivity, is.numeric) None ok 0 none
23 Conductivity numeric Conductivity >= 0 None ok 0 none
24 CaCO3 numeric sapply(CaCO3, is.numeric) None ok 0 none
25 caCO3 numeric caCO3 >= 0 None failed 0 NA
26 Sand numeric sapply(Sand, is.numeric) None ok 0 none
27 Sand numeric sapply(Sand, is.withinRange, 0, 100) None ok 11 20,21,22,23,24 … more
28 Lime numeric sapply(Lime, is.numeric) None ok 0 none
29 Lime numeric sapply(Lime, is.withinRange, 0, 100) None ok 0 none
30 Clay numeric sapply(Clay, is.numeric) None ok 0 none

End of tutorial