vtreat package

John Mount, Nina Zumel

2019-07-16

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. A formal article on the method can be found here: arXiv:1611.09477 stat.AP.

A ‘vtreat’ clean data frame:

To achieve this a number of techniques are used. Principally:

For more details see: the ‘vtreat’ article and update.

The use pattern is:

  1. Use designTreatmentsC() or designTreatmentsN() to design a treatment plan.
  2. Use the returned structure with prepare() to apply the plan to data frames.

The main feature of ‘vtreat’ is that all data preparation is “y-aware”: it uses the relation of each effective variable to the dependent (outcome) variable to encode that variable.

The structure returned from designTreatmentsN() or designTreatmentsC() includes a list of “treatments”: objects that encapsulate the transformation process from the original variables to the new “clean” variables.

In addition to the treatment objects, designTreatmentsC() and designTreatmentsN() also return a data frame named scoreFrame, which contains columns:

In all cases we have two undesirable upward biases on the scores:

‘vtreat’ uses a number of cross-training and jackknife style procedures to try to mitigate these effects. The suggested best practice (if you have enough data) is to split your data randomly into at least the following disjoint data sets:

Taking the extra step to perform the designTreatmentsC() or designTreatmentsN() on data disjoint from training makes the training data more exchangeable with test and avoids the issue that ‘vtreat’ may be hiding a large number of degrees of freedom in variables it derives from large categoricals.
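
The following is a minimal sketch of such a cal/train/test split, not a prescribed recipe. The data frame d, the split proportions, and the group labels are hypothetical and chosen only for illustration. The vtreat calls are guarded so that the split itself runs in base R even where the package is not installed.

```r
# Hypothetical example data: two inputs and a binary outcome.
set.seed(2019)
n <- 300
d <- data.frame(
  x = sample(c("a", "b", "c", NA), n, replace = TRUE),
  z = rnorm(n),
  stringsAsFactors = FALSE
)
d$y <- (d$z + rnorm(n)) > 0

# Assign each row to exactly one of three disjoint groups.
# The proportions are an assumption, not a vtreat recommendation.
group <- sample(c("cal", "train", "test"), n, replace = TRUE,
                prob = c(0.3, 0.5, 0.2))
dCal <- d[group == "cal", , drop = FALSE]
dTrain <- d[group == "train", , drop = FALSE]
dTest <- d[group == "test", , drop = FALSE]

if (requireNamespace("vtreat", quietly = TRUE)) {
  # Design the treatment plan on the calibration rows only.
  plan <- vtreat::designTreatmentsC(dCal, c("x", "z"), "y", TRUE,
                                    verbose = FALSE)
  # Apply the same plan to the training and test rows.
  dTrainTreated <- vtreat::prepare(plan, dTrain, pruneSig = NULL)
  dTestTreated <- vtreat::prepare(plan, dTest, pruneSig = NULL)
}
```

The model would then be fit on dTrainTreated and evaluated on dTestTreated, so no row used to design the treatments is also used to fit or score the model.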

Some trivial execution examples (not demonstrating any cal/train/test split) are given below. Variables that do not move during hold-out testing are considered “not to move.”


A Categorical Outcome Example

library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
head(dTrainC)
##      x  z     y
## 1    a  1 FALSE
## 2    a  2 FALSE
## 3    a  3  TRUE
## 4    b  4 FALSE
## 5    b NA  TRUE
## 6 <NA>  6  TRUE
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestC)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "vtreat 1.4.3 inspecting inputs Tue Jul 16 10:02:18 2019"
## [1] "designing treatments Tue Jul 16 10:02:18 2019"
## [1] " have initial level statistics Tue Jul 16 10:02:18 2019"
## [1] " scoring treatments Tue Jul 16 10:02:18 2019"
## [1] "have treatment plan Tue Jul 16 10:02:18 2019"
## [1] "rescoring complex variables Tue Jul 16 10:02:18 2019"
## [1] "done rescoring complex variables Tue Jul 16 10:02:18 2019"
print(treatmentsC)
##   origName   varName  code        rsq       sig extraModelDegrees
## 1        x    x_catP  catP 0.24340634 0.1547700                 2
## 2        x    x_catB  catB 0.05070201 0.5160763                 2
## 3        z         z clean 0.25792985 0.1429977                 0
## 4        z   z_isBAD isBAD 0.19087450 0.2076623                 0
## 5        x  x_lev_NA   lev 0.19087450 0.2076623                 0
## 6        x x_lev_x_a   lev 0.08170417 0.4097258                 0
## 7        x x_lev_x_b   lev 0.00000000 1.0000000                 0
print(treatmentsC$treatments[[1]])
## [1] "vtreat 'Categoric Indicators'('x'(integer,factor)->character->'x_lev_NA','x_lev_x_a','x_lev_x_b')"

Here we demonstrate the optional scaling feature of prepare(), which scales and centers all significant variables to mean 0 and slope 1 with respect to y. In other words, it re-scales the variables to “y-units”. This is useful for downstream principal components analysis. Note: variables perfectly uncorrelated with y necessarily have slope 0 and can’t be “scaled” to slope 1; however, for the same reason these variables will be insignificant and can be pruned by pruneSig.

scale=FALSE by default.
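
To make the “slope 1” idea concrete, here is a minimal base-R sketch for the simplest case: a numeric outcome and a single numeric input. It is not vtreat's internal code. It assumes mean imputation for the missing value and uses the ordinary least-squares slope as the scale factor. The data are the z and y columns of the numeric outcome example later in this document; the categorical case checks slopes with logistic regression instead.

```r
# z and y from the numeric outcome example (one missing z value).
z <- c(1, 2, 3, 4, 5, NA, 7)
y <- c(0, 0, 0, 1, 0, 1, 1)

# Assumption: replace the missing value with the mean of the observed values.
z[is.na(z)] <- mean(z, na.rm = TRUE)

# Slope of y on z from a one-variable linear regression.
slope <- coef(lm(y ~ z))[["z"]]

# Center z, then multiply by the slope so that one unit of z_scaled
# corresponds to one unit of change in the fitted y.
z_scaled <- slope * (z - mean(z))

print(mean(z_scaled))               # approximately 0
print(coef(lm(y ~ z_scaled))[[2]])  # approximately 1
print(z_scaled[1:3])
```

Under these assumptions the first values of z_scaled (about -0.419, -0.262, -0.105) agree with the z column that prepare() prints with scale=TRUE in the numeric example.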

dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
head(dTrainCTreated)
##       x_catP    x_catB          z   z_isBAD  x_lev_NA  x_lev_x_a x_lev_x_b
## 1 -0.9396225 -1.894112 -2.2158976 -3.161922 -3.161922 -0.6931472         0
## 2 -0.9396225 -1.894112 -1.2086714 -3.161922 -3.161922 -0.6931472         0
## 3 -0.9396225 -1.894112 -0.2014452 -3.161922 -3.161922 -0.6931472         0
## 4  0.4698112 -1.196414  0.8057809 -3.161922 -3.161922  0.6931472         0
## 5  0.4698112 -1.196414  0.0000000 15.809611 -3.161922  0.6931472         0
## 6  1.8792449  8.075166  2.8202333 -3.161922 15.809611  0.6931472         0
##       y
## 1 FALSE
## 2 FALSE
## 3  TRUE
## 4 FALSE
## 5  TRUE
## 6  TRUE
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
##       x_catP       x_catB            z      z_isBAD     x_lev_NA 
## 3.700743e-16 7.401487e-17 7.408715e-17 2.220446e-16 0.000000e+00 
##    x_lev_x_a    x_lev_x_b 
## 0.000000e+00 0.000000e+00
# all slopes should be 1 for variables with treatmentsC$scoreFrame$sig<1
sapply(varsC,function(c) { glm(paste('y',c,sep='~'),family=binomial,
   data=dTrainCTreated)$coefficients[[2]]})
##    x_catP    x_catB         z   z_isBAD  x_lev_NA x_lev_x_a x_lev_x_b 
##         1         1         1         1         1         1        NA
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)
head(dTestCTreated)
##       x_catP    x_catB         z   z_isBAD  x_lev_NA  x_lev_x_a x_lev_x_b
## 1 -0.9396225 -1.894112  6.849138 -3.161922 -3.161922 -0.6931472         0
## 2  0.4698112 -1.196414 16.921400 -3.161922 -3.161922  0.6931472         0
## 3  2.5839618 -1.196414 26.993662 -3.161922 -3.161922  0.6931472         0
## 4  1.8792449  8.075166  0.000000 15.809611 15.809611  0.6931472         0
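
The examples above pass pruneSig=c(), which keeps every derived variable. As a rough sketch of what significance pruning selects, the block below filters on the sig column using values transcribed from the scoreFrame printed earlier in this example. The threshold of 0.5 is hypothetical and chosen only so that the tiny example keeps some variables. In a live session the same filter would be applied to treatmentsC$scoreFrame.

```r
# Significances transcribed from the printed scoreFrame of the
# categorical example; in a live session use treatmentsC$scoreFrame.
scoreFrame <- data.frame(
  varName = c("x_catP", "x_catB", "z", "z_isBAD",
              "x_lev_NA", "x_lev_x_a", "x_lev_x_b"),
  sig = c(0.1547700, 0.5160763, 0.1429977, 0.2076623,
          0.2076623, 0.4097258, 1.0000000),
  stringsAsFactors = FALSE
)

# Hypothetical threshold, for illustration only.
sigThreshold <- 0.5

# Keep the variables whose significance is at or below the threshold.
keptVars <- scoreFrame$varName[scoreFrame$sig <= sigThreshold]
print(keptVars)
```

Passing pruneSig=sigThreshold to prepare() is expected to give a similar restriction directly; the exact comparison used at the threshold is an assumption here.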

A Numeric Outcome Example

# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA),
   z=c(1,2,3,4,5,NA,7),y=c(0,0,0,1,0,1,1))
head(dTrainN)
##   x  z y
## 1 a  1 0
## 2 a  2 0
## 3 a  3 0
## 4 a  4 1
## 5 b  5 0
## 6 b NA 1
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestN)
##      x  z
## 1    a 10
## 2    b 20
## 3    c 30
## 4 <NA> NA
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "vtreat 1.4.3 inspecting inputs Tue Jul 16 10:02:18 2019"
## [1] "designing treatments Tue Jul 16 10:02:18 2019"
## [1] " have initial level statistics Tue Jul 16 10:02:18 2019"
## [1] " scoring treatments Tue Jul 16 10:02:18 2019"
## [1] "have treatment plan Tue Jul 16 10:02:18 2019"
## [1] "rescoring complex variables Tue Jul 16 10:02:18 2019"
## [1] "done rescoring complex variables Tue Jul 16 10:02:18 2019"
print(treatmentsN)
##   origName   varName  code         rsq       sig extraModelDegrees
## 1        x    x_catP  catP 0.075853018 0.5499714                 2
## 2        x    x_catN  catN 0.220417468 0.2878285                 2
## 3        x    x_catD  catD 0.173611111 0.3524132                 2
## 4        z         z clean 0.336111111 0.1724763                 0
## 5        z   z_isBAD isBAD 0.222222222 0.2855909                 0
## 6        x  x_lev_NA   lev 0.222222222 0.2855909                 0
## 7        x x_lev_x_a   lev 0.173611111 0.3524132                 0
## 8        x x_lev_x_b   lev 0.008333333 0.8456711                 0
dTrainNTreated <- prepare(treatmentsN,dTrainN,
                          pruneSig=c(),scale=TRUE)
head(dTrainNTreated)
##   x_catP      x_catN     x_catD           z    z_isBAD   x_lev_NA
## 1   -0.2 -0.17857143 -0.1785714 -0.41904762 -0.0952381 -0.0952381
## 2   -0.2 -0.17857143 -0.1785714 -0.26190476 -0.0952381 -0.0952381
## 3   -0.2 -0.17857143 -0.1785714 -0.10476190 -0.0952381 -0.0952381
## 4   -0.2 -0.17857143 -0.1785714  0.05238095 -0.0952381 -0.0952381
## 5    0.2  0.07142857  0.2380952  0.20952381 -0.0952381 -0.0952381
## 6    0.2  0.07142857  0.2380952  0.00000000  0.5714286 -0.0952381
##    x_lev_x_a   x_lev_x_b y
## 1 -0.1785714 -0.02857143 0
## 2 -0.1785714 -0.02857143 0
## 3 -0.1785714 -0.02857143 0
## 4 -0.1785714 -0.02857143 1
## 5  0.2380952  0.07142857 0
## 6  0.2380952  0.07142857 1
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean) 
##        x_catP        x_catN        x_catD             z       z_isBAD 
## -5.551115e-17 -3.965082e-18 -9.515810e-17  4.757324e-17 -3.967986e-18 
##      x_lev_NA     x_lev_x_a     x_lev_x_b 
## -3.965082e-18  0.000000e+00 -2.974054e-18
# all slopes should be 1 for variables with treatmentsN$scoreFrame$sig<1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
   data=dTrainNTreated)$coefficients[[2]]}) 
##    x_catP    x_catN    x_catD         z   z_isBAD  x_lev_NA x_lev_x_a 
##         1         1         1         1         1         1         1 
## x_lev_x_b 
##         1
# prepared frame
dTestNTreated <- prepare(treatmentsN,dTestN,
                         pruneSig=c())
head(dTestNTreated)
##       x_catP      x_catN    x_catD         z z_isBAD x_lev_NA x_lev_x_a
## 1 0.57142857 -0.17857143 0.5000000 10.000000       0        0         1
## 2 0.28571429  0.07142857 0.7071068 20.000000       0        0         0
## 3 0.07142857  0.00000000 0.7071068 30.000000       0        0         0
## 4 0.14285714  0.57142857 0.7071068  3.666667       1        1         0
##   x_lev_x_b
## 1         0
## 2         1
## 3         0
## 4         0
# scaled prepared frame
dTestNTreatedS <- prepare(treatmentsN,dTestN,
                         pruneSig=c(),scale=TRUE)
head(dTestNTreatedS)
##   x_catP        x_catN     x_catD         z    z_isBAD   x_lev_NA
## 1   -0.2 -1.785714e-01 -0.1785714 0.9952381 -0.0952381 -0.0952381
## 2    0.2  7.142857e-02  0.2380952 2.5666667 -0.0952381 -0.0952381
## 3    0.5 -1.586033e-17  0.2380952 4.1380952 -0.0952381 -0.0952381
## 4    0.4  5.714286e-01  0.2380952 0.0000000  0.5714286  0.5714286
##    x_lev_x_a   x_lev_x_b
## 1 -0.1785714 -0.02857143
## 2  0.2380952  0.07142857
## 3  0.2380952 -0.02857143
## 4  0.2380952 -0.02857143

Related work: