User Guide

Sheeja Manchira Krishnan

2020-01-23

library(IPDFileCheck)

IPDFileCheck

IPDFileCheck is a package that can be used to check the data file from a randomised controlled trial (RCT). The standard checks on a data file from an RCT are the following:

1. To check the file exists and is readable
2. To check if the column exists
3. To get the column number if the column name is known
4. To test column contents, i.e. do they contain specific items in a given list?
5. To test whether the column names of a data set differ from those specified
6. To check the format of column ‘age’ in data
7. To check the format of column ‘gender’ in data
8. To check the format of column contents - numeric or string
9. To check the format of a numeric column
10. To return the column number if the pattern is contained in the column names of a data set
11. To return descriptive statistics: sum, number of observations, mean, mode, median, range, standard deviation and standard error
12. To present the mean and SD of a data set in the form Mean (SD)
13. To return a subgroup when a certain variable equals the given value while omitting those with NA
14. To estimate the standard error of the mean and the mode
15. To find the number and percentages of categories
16. To represent categorical data in the form - numbers (percentage)
17. To calculate age from date of birth and year of birth

Data

For demonstration purposes, two simulated data sets (one with valid data and another with invalid data), representing the treatment and control arms of a randomised controlled trial, will be used.

set.seed(17)
# Valid data: plausible ages, dates of birth in dd/mm/yyyy format
rctdata <- data.frame(age = abs(rnorm(10, 60, 20)),
                      sex = c("M", "F", "M", "F", "M", "F", "M", "F", "F", "F"),
                      yob = sample(seq(1930, 2000), 10, replace = TRUE),
                      dob = c("07/12/1969", "16/02/1962", "03/09/1978", "17/02/1969",
                              "25/11/1960", "17/04/1970", "18/03/1997", "30/01/1988",
                              "03/02/1990", "25/09/1978"),
                      arm = c("Control", "Intervention", "Control", "Intervention", "Control",
                              "Intervention", "Control", "Intervention", "Control", "Intervention"),
                      stringsAsFactors = FALSE)

# Invalid data: ages may be negative, dates of birth mix text and numbers
rctdata_error <- data.frame(age = runif(10, -60, 120),
                            sex = c("M", "F", "M", "F", "M", "F", "M", "F", "F", "F"),
                            yob = sample(seq(1930, 2000), 10, replace = TRUE),
                            dob = c("1997 May 28", "1987-June-18", NA, "2015/July/09",
                                    "1997 May 28", "1987-June-18", NA, "2015/July/09",
                                    "1997 May 28", "1987-June-18"),
                            arm = c("Control", "Intervention", "Control", "Intervention", "Control",
                                    "Intervention", "Control", "Intervention", "Control", "Intervention"),
                            stringsAsFactors = FALSE)

Examples - IPDFileCheck

1. To check the file exists and readable

The function test_file_exist_read tests if the user provided file exists and is readable. It returns 0 for success and an error for failure. In the example below, the file “blank.txt” shipped in the “extdata” directory of the package is tested.

  thisfile = system.file("extdata", "blank.txt", package = "IPDFileCheck")
  test_file_exist_read(thisfile)
#> [1] 0
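For comparison, the two conditions being tested (existence and read permission) can be sketched in base R. The helper name file_is_readable is hypothetical and is not part of IPDFileCheck; unlike the package function, this sketch returns a logical value rather than 0 or an error.

```r
# Hypothetical helper, not part of IPDFileCheck: TRUE when the path exists
# and the current user has read permission (file.access mode 4 = read).
file_is_readable <- function(path) {
  file.exists(path) && file.access(path, mode = 4) == 0
}

# A temporary file that certainly exists
tmp <- tempfile(fileext = ".txt")
writeLines("id,age", tmp)

# A path inside a directory that was never created
missing_path <- file.path(tempdir(), "nodir", "missing.txt")

file_is_readable(tmp)
file_is_readable(missing_path)
```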

2. To check if the column exists

The function check_column_exists tests if a column with the user specified column name exists in the data. For example, in the simulated data set ‘rctdata’ above, a column named ‘sex’ exists but ‘gender’ does not. Thus the function returns 0 when ‘sex’ is used and returns an error when ‘gender’ is used.

check_column_exists("sex",rctdata)
#> [1] 0
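Because a failed check raises an error, a script that should carry on after a failure has to trap it. The sketch below uses base R's tryCatch with a hypothetical stand-in function, column_check, that mimics the documented behaviour (0 on success, an error otherwise); it is an illustration and not the package's implementation.

```r
# Hypothetical stand-in for a check that returns 0 or raises an error
column_check <- function(name, data) {
  if (!name %in% colnames(data)) stop("column '", name, "' not found")
  0
}

df <- data.frame(age = c(40, 50), sex = c("M", "F"), stringsAsFactors = FALSE)

# Success: the return value is passed through unchanged
ok <- tryCatch(column_check("sex", df),
               error = function(e) conditionMessage(e))

# Failure: the error is caught and its message is kept instead
failed <- tryCatch(column_check("gender", df),
                   error = function(e) conditionMessage(e))

ok
failed
```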

3. To get the column number if the column name is known

The function get_columnno_fornames returns the column number of the column with the user specified column name in the data. For example, in the simulated data set ‘rctdata’ above, the column named ‘sex’ exists and is the 2nd column, but ‘gender’ does not exist. Thus the function returns column number 2 when ‘sex’ is used and returns an error when ‘gender’ is used.

get_columnno_fornames(rctdata,"sex")
#> [1] 2

4. To test column contents ie do they contain specific items in a given list?

The function test_column_contents tests if the column contents are from a list provided by the user. The user can also give an optional code that corresponds to non-response in the data. In the simulated data shown above, the column ‘sex’ contains ‘M’ and ‘F’ as the entries, and we can test this as shown below. If the entries are all from the given list, the function returns 0; otherwise it indicates an error.

test_column_contents(rctdata,"sex",c("M","F"),NA)
#> [1] 0
test_column_contents(rctdata,"sex",c("M","F"))
#> [1] 0
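When such a check fails, it is often useful to see which values caused the failure. A minimal base R sketch, using a toy column rather than the package internals, is:

```r
# Toy column with one unexpected code ("f") and one missing value
sex <- c("M", "F", "M", "f", NA)
allowed <- c("M", "F")

# Values that are neither in the allowed list nor missing
unexpected <- setdiff(unique(sex[!is.na(sex)]), allowed)
all_valid <- length(unexpected) == 0

unexpected
all_valid
```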

5. To test column names of a data being different from what specified

The function test_columnnames tests if the column names in the data are those provided by the user. In the simulated data ‘rctdata’ shown above, the columns are ‘age’, ‘sex’, ‘yob’, ‘dob’ and ‘arm’, and we can test this as shown below. If the names match, the function returns 0; otherwise it indicates an error. Note that in the example the names are listed in a different order from the data and the function still returns 0.

test_columnnames(c("age","sex","dob","yob","arm"),rctdata)
#> [1] 0
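The difference between an order-insensitive and an order-sensitive comparison of names can be illustrated in base R. This is only a sketch of the two notions; that the package compares names without regard to order is inferred from the output above, not stated in the source.

```r
expected <- c("age", "sex", "dob", "yob", "arm")   # names supplied by the user
actual   <- c("age", "sex", "yob", "dob", "arm")   # column order used in rctdata

same_set      <- setequal(expected, actual)   # TRUE: same names, order ignored
same_order    <- identical(expected, actual)  # FALSE: the order differs
missing_names <- setdiff(expected, actual)    # names expected but absent

same_set
same_order
missing_names
```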

6. To check the format of column ‘age’ in data

The function test_age tests if the contents of the column ‘age’ are valid. The user can provide the name of the column and the optional code for non-response. Age should be numeric and within the limits of 0 and 150. In the simulated data ‘rctdata’ the ‘age’ column contents are valid, so the function returns 0. In the data set ‘rctdata_error’, however, the age can be negative, in which case the function returns an error.

test_age(rctdata,"age",NA)
#> [1] 0
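To locate the offending rows before or after such a check, the range rule described above (0 to 150) can be applied directly in base R. The values below are made up for illustration, and NA is assumed to be the non-response code.

```r
# Made-up ages: entries 2 and 3 are outside the 0 to 150 range
age <- c(34.2, -5, 151, 67, NA)

# Row positions that are not missing and fall outside the limits
out_of_range <- which(!is.na(age) & (age < 0 | age > 150))
out_of_range
```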

7. To check the format of column ‘gender’ in data

The function test_gender tests if the contents of the gender column are valid. The user provides the name of the gender column, how it is coded, and the optional code for non-response. In the simulated data ‘rctdata’ the gender column is named ‘sex’ and coded as ‘M’ and ‘F’, so the function returns 0. If the user states that gender is coded as “Male” and “Female”, the function returns an error.

test_gender(rctdata,c("M","F"),"sex",NA)
#> [1] 0

8. To check the format of column contents -numeric or string

The function test_data_numeric tests if the column contents are numeric. The user provides the minimum and maximum values that the entries in the column can take, along with an optional code for non-response. If the entries are numeric and within the given range, the function returns 0; otherwise it indicates an error. In ‘rctdata’ above, the age lies between 0 and 100, so the function returns 0, while the year of birth column “yob” has values greater than 100 and would return an error for the same range.

test_data_numeric("age",rctdata,NA,0,100)
#> [1] 0

The function test_data_numeric_norange tests if the column contents are numeric, with no range provided. The user can provide an optional code for non-response. If the entries are numeric, the function returns 0; otherwise it indicates an error. In ‘rctdata’ both “age” and “yob” are numeric, so the function returns 0 for each, whereas the column “arm” has no numeric data and would return an error.

test_data_numeric_norange("age",rctdata,NA)
#> [1] 0
test_data_numeric_norange("yob",rctdata,NA)
#> [1] 0
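One common way to decide whether entries are numeric is to attempt a conversion and look for values that fail to convert. The base R sketch below illustrates that idea on a toy vector; it is an assumption about how such a check can be done, not a description of the package internals.

```r
# Toy column read as text: "abc" is not a number, NA is the non-response code
x <- c("12", "7.5", "abc", NA)

# Conversion yields NA for entries that cannot be read as numbers
converted <- suppressWarnings(as.numeric(x))

# Flag entries that were present but did not convert
not_numeric <- !is.na(x) & is.na(converted)
x[not_numeric]
```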

The function test_data_string tests if the column contents are strings. The user can provide an optional code for non-response. If the entries are strings, the function returns 0; otherwise it indicates an error. As the column “arm” has no numeric data in ‘rctdata’, the function returns 0, while ‘yob’, which holds numeric data, indicates an error.

test_data_string(rctdata,"arm",NA)
#> [1] 0

The function test_data_string_restriction tests if the column contents are strings drawn from a given set of allowed values. The user can provide an optional code for non-response. If the entries are strings from the allowed set, the function returns 0; otherwise it indicates an error. As the column “arm” has no numeric data in ‘rctdata’ and contains only the entries specified, the function returns 0. The column ‘sex’ contains “M” and “F”, so it returns 0 when those values are allowed, but would indicate an error if “Male” and “Female” were given instead.

test_data_string_restriction(rctdata,"arm",NA,c("Intervention","Control"))
#> [1] 0
test_data_string_restriction(rctdata,"sex",NA,c("M","F"))
#> [1] 0

9. To return column number if the pattern is contained in the column names of data

The function get_colno_pattern_colname returns the column number of each column whose name contains a user specified pattern. For example, in the simulated data set ‘rctdata’ above, the column names ‘yob’ and ‘dob’ contain the pattern ‘ob’ and they are the 3rd and 4th columns, so the function returns 3 and 4. A pattern such as ‘gender’, which matches no column name in the data, returns an error.

get_colno_pattern_colname("ob",colnames(rctdata))
#> [1] 3 4
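The same lookup can be reproduced with base R's grep, which returns the positions of the matching names. Note that grep treats the pattern as a regular expression unless fixed = TRUE is set; whether the package function does the same is not stated in the source.

```r
cols <- c("age", "sex", "yob", "dob", "arm")   # column names of rctdata

ob_cols     <- grep("ob", cols)                # positions of names containing "ob"
gender_cols <- grep("gender", cols)            # empty: no name matches
literal_dot <- grep(".", cols, fixed = TRUE)   # empty: "." is taken literally

ob_cols
length(gender_cols)
length(literal_dot)
```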

10. To return descriptive statistics, sum, no of observations, mean, mode, median, range, standard deviation and standard error

The function descriptive_stats_col returns the descriptive statistics of the column with the user specified column name. These include the sum, mean, standard deviation, median, mode, standard error of the mean, minimum and maximum values, count, lower and upper quartiles, and the 95% confidence interval limits. If the column contents are not numeric, or an error occurs in calculating any of the quantities, the function returns an error. For example, the column ‘age’ is numeric, so its descriptive statistics can be returned, but the column ‘sex’ is not, so the function returns an error for it.

descriptive_stats_col(rctdata, "age")
#>          Sum     Mean       SD Median     Mode       SE  Minimum  Maximum
#> age 635.4561 63.54561 16.54699 61.756 39.69983 5.232617 39.69983 94.33068
#>     Count       LQ       UQ 95%CI.low 95%CI.high
#> age    10 55.67713 73.41427  40.58966   90.98421
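Some of these quantities can be cross-checked in base R from the ages printed elsewhere in this guide (rounded to five decimals). The SE column agrees with SD / sqrt(n). The 95%CI.low and 95%CI.high values also coincide with the 2.5% and 97.5% sample quantiles under R's default quantile method; that the package computes them this way is an inference from the numbers, not something stated in the source.

```r
# Ages of rctdata as printed in this guide (rounded to 5 decimals)
age <- c(39.69983, 58.40727, 55.34026, 43.65464, 75.44182,
         56.68776, 79.45749, 94.33068, 65.10474, 67.33162)

n  <- sum(!is.na(age))
se <- sd(age, na.rm = TRUE) / sqrt(n)                  # standard error of the mean
ci <- quantile(age, c(0.025, 0.975), na.rm = TRUE)     # 2.5% and 97.5% quantiles
iqr_limits <- quantile(age, c(0.25, 0.75), na.rm = TRUE)

round(c(mean = mean(age), se = se), 4)
round(ci, 4)
round(iqr_limits, 4)
```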

11. To present the mean and sd of a data set in the form Mean (SD)

The function present_mean_sd_rmna_text returns the mean and SD as text in the form mean (SD). If the column contents are not numeric, or an error occurs in the calculation, the function returns an error. For example, the column ‘age’ is numeric, so its mean and SD can be returned, but the column ‘sex’ is not, so the function returns an error for it.

present_mean_sd_rmna_text(rctdata, "age")
#> [1] "63.55 (16.55)"
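The text form can be reproduced in base R with sprintf. The sketch below uses the ages printed in this guide and rounds to two decimals, which matches the output above; the rounding rule used by the package itself is assumed.

```r
# Ages of rctdata as printed in this guide (rounded to 5 decimals)
age <- c(39.69983, 58.40727, 55.34026, 43.65464, 75.44182,
         56.68776, 79.45749, 94.33068, 65.10474, 67.33162)

# Mean and SD with missing values removed, formatted as "mean (SD)"
mean_sd_text <- sprintf("%.2f (%.2f)",
                        mean(age, na.rm = TRUE),
                        sd(age, na.rm = TRUE))
mean_sd_text
```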

12. To return a subgroup when certain variable equals the given value while omitting those with NA

The function return_subgroup_omitna returns the subgroup that satisfies a user defined condition while omitting any non-response values. For example, the first command below returns all the females in the data. The second command asks for the arm “control”; the arm is coded as “Control” in ‘rctdata’, and the output shows that no rows are returned, which suggests the value must match the coding exactly, including case.

return_subgroup_omitna(rctdata, "sex","F")
#>         age sex  yob        dob          arm
#> 2  58.40727   F 1973 16/02/1962 Intervention
#> 4  43.65464   F 1987 17/02/1969 Intervention
#> 6  56.68776   F 1936 17/04/1970 Intervention
#> 8  94.33068   F 1970 30/01/1988 Intervention
#> 9  65.10474   F 1997 03/02/1990      Control
#> 10 67.33162   F 1945 25/09/1978 Intervention
return_subgroup_omitna(rctdata, "arm","control")
#> [1] age sex yob dob arm
#> <0 rows> (or 0-length row.names)
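The effect of letter case on subsetting can be seen in a small base R sketch. The data frame below is made up for illustration, and the filtering shown is ordinary base R indexing rather than the package function.

```r
# Made-up data with one missing arm
trial <- data.frame(arm = c("Control", "Intervention", NA, "Control"),
                    age = c(61, 58, 70, 45),
                    stringsAsFactors = FALSE)

# Exact comparison: "control" does not equal "Control", so no rows match
exact <- trial[!is.na(trial$arm) & trial$arm == "control", ]

# Case-insensitive comparison: both Control rows are returned
anycase <- trial[!is.na(trial$arm) & tolower(trial$arm) == "control", ]

nrow(exact)
nrow(anycase)
```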

13. To find the number and percentages of categorical data

The function represent_categorical_textdata returns the descriptive statistics of a categorical column as number and percentage. The user provides the data, the column name, and the optional non-response code, as in the calls below. For example, it returns the number and percentage of “M” and “F” in the column “sex”, or the number and percentage of “Intervention” and “Control” in the column “arm”. In the output, the category labels are shown in upper case.

represent_categorical_textdata(rctdata, "sex",NA)
#> [1] 1
#> [1] 2
#>        F        M 
#> "6 (60)" "4 (40)"
represent_categorical_textdata(rctdata, "arm",NA)
#> [1] 1
#> [1] 2
#>      CONTROL INTERVENTION 
#>     "5 (50)"     "5 (50)"
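The number (percentage) form can be built in base R from table and prop.table. This is a sketch of the presentation only, using the ‘sex’ values of rctdata; it does not reproduce the extra lines or the upper-case labels printed by the package function.

```r
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "F", "F")   # as in rctdata

counts <- table(sex)                 # F = 6, M = 4 (sorted alphabetically)
pct    <- 100 * prop.table(counts)   # percentages of the total

# Combine into "n (percent)" strings, keeping the category names
summary_text <- setNames(
  sprintf("%d (%.0f)", as.integer(counts), as.numeric(pct)),
  names(counts)
)
summary_text
```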

14. To calculate age from date of birth and year of birth

The function calculate_age_from_dob returns the age calculated from a given date of birth. The user may provide the name of the column containing the date of birth, the format of the date of birth, and an optional non-response code. For example, in the ‘rctdata’ shown above, the ‘dob’ column has dates in the format “%d/%m/%y”. Dates must be supplied in a numeric format. For example, in rctdata_error the dates combine numbers and text, which will return an error.

calculate_age_from_dob(rctdata,"dob","%d/%m/%y",NA)
#>         age sex  yob        dob          arm calc_age_dob
#> 1  39.69983   M 1992 07/12/1969      Control     50.12860
#> 2  58.40727   F 1973 16/02/1962 Intervention     57.93408
#> 3  55.34026   M 1982 03/09/1978      Control     41.38888
#> 4  43.65464   F 1987 17/02/1969 Intervention     50.93134
#> 5  75.44182   M 1994 25/11/1960      Control     59.16393
#> 6  56.68776   F 1936 17/04/1970 Intervention     49.76970
#> 7  79.45749   M 1970 18/03/1997      Control     22.85189
#> 8  94.33068   F 1970 30/01/1988 Intervention     31.98087
#> 9  65.10474   F 1997 03/02/1990      Control     29.96970
#> 10 67.33162   F 1945 25/09/1978 Intervention     41.32860
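The calculation can be sketched in base R as the number of days between the date of birth and a reference date, divided by the length of a year. The reference date (taken here as the date of this guide) and the 365.25-day year are assumptions, so the decimals need not match the output above. In base R's as.Date, a four-digit year is written %Y; how the package interprets the format string it is given is not shown in the source.

```r
dob <- c("07/12/1969", "16/02/1962", "1997 May 28")
ref_date <- as.Date("2020-01-23")   # assumed reference date

# %Y is the four-digit year in base R; unparseable entries become NA
parsed <- as.Date(dob, format = "%d/%m/%Y")

# Age in years, assuming an average year of 365.25 days
age_years <- as.numeric(ref_date - parsed) / 365.25

is.na(parsed)      # the text-style date fails to parse
floor(age_years)   # completed years
```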

The function calculate_age_from_year returns the age calculated from a given year of birth. The user may provide the name of the column containing the year of birth and an optional non-response code. For example, in the ‘rctdata’ shown above, the ‘yob’ column holds the birth year.

calculate_age_from_year(rctdata,"yob",NA)
#>         age sex  yob        dob          arm calc.age.yob
#> 1  39.69983   M 1992 07/12/1969      Control           28
#> 2  58.40727   F 1973 16/02/1962 Intervention           47
#> 3  55.34026   M 1982 03/09/1978      Control           38
#> 4  43.65464   F 1987 17/02/1969 Intervention           33
#> 5  75.44182   M 1994 25/11/1960      Control           26
#> 6  56.68776   F 1936 17/04/1970 Intervention           84
#> 7  79.45749   M 1970 18/03/1997      Control           50
#> 8  94.33068   F 1970 30/01/1988 Intervention           50
#> 9  65.10474   F 1997 03/02/1990      Control           23
#> 10 67.33162   F 1945 25/09/1978 Intervention           75
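The ages in the output above are consistent with subtracting the year of birth from 2020. A base R sketch of that arithmetic follows; the reference year is fixed here for reproducibility, and taking it from the system date is an assumption about how one would normally obtain it.

```r
yob <- c(1992, 1973, 1982)   # first three birth years of rctdata

ref_year <- 2020             # assumed; matches the output shown in this guide
age_from_year <- ref_year - yob
age_from_year

# The current year could be obtained from the system date instead
current_year <- as.integer(format(Sys.Date(), "%Y"))
```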