Working with rsets

Introduction

rsample can be used to create objects containing resamples of the original data. This page contains examples of how those objects can be used for data analysis.

For illustration, the attrition data is used. From the help file:

These data are from the IBM Watson Analytics Lab. The website describes the data with “Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.” There are 1470 rows.

The data can be accessed using

library(rsample)
data("attrition")
names(attrition)
#>  [1] "Age"                      "Attrition"               
#>  [3] "BusinessTravel"           "DailyRate"               
#>  [5] "Department"               "DistanceFromHome"        
#>  [7] "Education"                "EducationField"          
#>  [9] "EnvironmentSatisfaction"  "Gender"                  
#> [11] "HourlyRate"               "JobInvolvement"          
#> [13] "JobLevel"                 "JobRole"                 
#> [15] "JobSatisfaction"          "MaritalStatus"           
#> [17] "MonthlyIncome"            "MonthlyRate"             
#> [19] "NumCompaniesWorked"       "OverTime"                
#> [21] "PercentSalaryHike"        "PerformanceRating"       
#> [23] "RelationshipSatisfaction" "StockOptionLevel"        
#> [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
#> [27] "WorkLifeBalance"          "YearsAtCompany"          
#> [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
#> [31] "YearsWithCurrManager"
table(attrition$Attrition)
#> 
#>   No  Yes 
#> 1233  237

Model Assessment

Let’s fit a logistic regression model to the data with model terms for the job satisfaction, gender, and monthly income.

If we were fitting the model to the entire data set, we might model attrition using

glm(Attrition ~ JobSatisfaction + Gender + MonthlyIncome, data = attrition, family = binomial)

For convenience, we’ll create a formula object that will be used later:

mod_form <- as.formula(Attrition ~ JobSatisfaction + Gender + MonthlyIncome)

To evaluate this model, we will use 10 repeats of 10-fold cross-validation and estimate the overall accuracy of the model from the 100 holdout samples.

First, let’s make the splits of the data:

library(rsample)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 10, repeats = 10)
rs_obj
#> # 10-fold cross-validation repeated 10 times 
#> # A tibble: 100 x 3
#>          splits       id    id2
#>          <list>    <chr>  <chr>
#>  1 <S3: rsplit> Repeat01 Fold01
#>  2 <S3: rsplit> Repeat01 Fold02
#>  3 <S3: rsplit> Repeat01 Fold03
#>  4 <S3: rsplit> Repeat01 Fold04
#>  5 <S3: rsplit> Repeat01 Fold05
#>  6 <S3: rsplit> Repeat01 Fold06
#>  7 <S3: rsplit> Repeat01 Fold07
#>  8 <S3: rsplit> Repeat01 Fold08
#>  9 <S3: rsplit> Repeat01 Fold09
#> 10 <S3: rsplit> Repeat01 Fold10
#> # ... with 90 more rows

Now let’s write a function that will, for each resample:

  1. obtain the analysis data set (i.e. the 90% used for modeling)
  2. fit a logistic regression model
  3. predict the assessment data (the other 10% not used for the model) using the broom package
  4. determine if each sample was predicted correctly.

Here is our function:

library(broom)
## splits will be the `rsplit` object with the 90/10 partition
holdout_results <- function(splits, ...) {
  # Fit the model to the 90%
  mod <- glm(..., data = analysis(splits), family = binomial)
  # Save the 10%
  holdout <- assessment(splits)
  # `augment` will save the predictions with the holdout data set
  res <- broom::augment(mod, newdata = holdout)
  # Class predictions on the assessment set from the linear predictor
  lvls <- levels(holdout$Attrition)
  predictions <- factor(ifelse(res$.fitted > 0, lvls[2], lvls[1]),
                        levels = lvls)
  # Calculate whether the prediction was correct
  res$correct <- predictions == holdout$Attrition
  # Return the assessment data set with the additional columns
  res
}

For example:

example <- holdout_results(rs_obj$splits[[1]], mod_form)
dim(example)
#> [1] 147  35
dim(assessment(rs_obj$splits[[1]]))
#> [1] 147  31
## newly added columns:
example[1:10, setdiff(names(example), names(attrition))]
#>    .rownames .fitted .se.fit correct
#> 1          7  -1.282   0.182    TRUE
#> 2         14  -1.156   0.187    TRUE
#> 3         32  -3.247   0.353    TRUE
#> 4         54  -0.934   0.193    TRUE
#> 5         72  -1.970   0.193    TRUE
#> 6         74  -1.672   0.188    TRUE
#> 7         76  -2.103   0.171    TRUE
#> 8        138  -1.510   0.165    TRUE
#> 9        142  -1.543   0.149   FALSE
#> 10       147  -1.583   0.150   FALSE

For this model, the .fitted value is the linear predictor in log-odds units.
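Since a log-odds value of zero corresponds to a probability of 0.5, the `.fitted > 0` rule used in the function above is equivalent to predicting the more likely class. As a small base-R illustration, plogis maps the linear predictor back to a probability (the values below are the first few .fitted values shown above):

```r
## Convert log-odds (the glm linear predictor) to probabilities
## using the inverse logit; plogis() is base R's implementation
log_odds <- c(-1.282, -1.156, -3.247)
probs <- plogis(log_odds)
round(probs, 3)
## All are below 0.5, so each row would be predicted as the first
## factor level ("No" attrition)
```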

To compute this data set for each of the 100 resamples, we’ll use the map function from the purrr package:

library(purrr)
rs_obj$results <- map(rs_obj$splits,
                      holdout_results,
                      mod_form)
rs_obj
#> # 10-fold cross-validation repeated 10 times 
#> # A tibble: 100 x 4
#>          splits       id    id2                 results
#>          <list>    <chr>  <chr>                  <list>
#>  1 <S3: rsplit> Repeat01 Fold01 <data.frame [147 x 35]>
#>  2 <S3: rsplit> Repeat01 Fold02 <data.frame [147 x 35]>
#>  3 <S3: rsplit> Repeat01 Fold03 <data.frame [147 x 35]>
#>  4 <S3: rsplit> Repeat01 Fold04 <data.frame [147 x 35]>
#>  5 <S3: rsplit> Repeat01 Fold05 <data.frame [147 x 35]>
#>  6 <S3: rsplit> Repeat01 Fold06 <data.frame [147 x 35]>
#>  7 <S3: rsplit> Repeat01 Fold07 <data.frame [147 x 35]>
#>  8 <S3: rsplit> Repeat01 Fold08 <data.frame [147 x 35]>
#>  9 <S3: rsplit> Repeat01 Fold09 <data.frame [147 x 35]>
#> 10 <S3: rsplit> Repeat01 Fold10 <data.frame [147 x 35]>
#> # ... with 90 more rows

Now we can compute the accuracy values for all of the assessment data sets:

rs_obj$accuracy <- map_dbl(rs_obj$results, function(x) mean(x$correct))
summary(rs_obj$accuracy)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.748   0.816   0.837   0.839   0.864   0.898

Keep in mind that the baseline accuracy to beat is the rate of non-attrition, which is 0.839. Not a great model so far.
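The baseline figure comes straight from the class frequencies shown earlier: a model that always predicts "No" achieves this accuracy. A quick base-R check using the counts from `table(attrition$Attrition)`:

```r
## Majority-class accuracy: always predict "No" attrition
counts <- c(No = 1233, Yes = 237)
baseline <- unname(counts["No"] / sum(counts))
round(baseline, 3)  # roughly 0.839
```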

Using the Bootstrap to Make Comparisons

Traditionally, the bootstrap has been primarily used to empirically determine the sampling distribution of a test statistic. Given a set of samples drawn with replacement, a statistic can be calculated on each analysis set, and the collection of results can be used to make inferences (such as confidence intervals).

For example, are there differences in the median monthly income between genders?

library(ggplot2)
ggplot(attrition, aes(x = Gender, y = MonthlyIncome)) + 
  geom_boxplot() + 
  scale_y_log10()

If we wanted to compare the genders, we could conduct a t-test or rank-based test. Instead, let’s use the bootstrap to see if there is a difference in the median incomes for the two groups. We need a simple function to compute this statistic on the resample:

median_diff <- function(splits) {
  x <- analysis(splits)
  median(x$MonthlyIncome[x$Gender == "Female"]) - 
      median(x$MonthlyIncome[x$Gender == "Male"])     
}

Now we would create a large number of bootstrap samples (say 2000+). For illustration, we’ll only do 500 in this document.

set.seed(353)
bt_resamples <- bootstraps(attrition, times = 500)

This function is then computed across each resample:

bt_resamples$statistic <- map_dbl(bt_resamples$splits, median_diff)

The bootstrap distribution of this statistic is slightly bimodal and skewed:

ggplot(bt_resamples, aes(x = statistic)) + 
  geom_line(stat = "density", adjust = 1.25) + 
  xlab("Difference in Median Monthly Income (Female - Male)")

The variation in this statistic is considerable. One method of computing a confidence interval is to take the percentiles of the bootstrap distribution. A 95% confidence interval for the difference in the medians would be:

quantile(bt_resamples$statistic, 
         probs = c(0.025, 0.500, 0.975))
#>  2.5%   50% 97.5% 
#>  -207   190   618

Since the interval contains zero, there is no evidence of a difference in median monthly income between the genders.

Bootstrap Estimates of Model Coefficients

Unless there is already a column in the resample object that contains the fitted model, a function can be used to fit the model and save all of the model coefficients. The broom package has a tidy function that saves the coefficients in a data frame. Instead of returning a data frame with a row for each model term, we will save a data frame with a single row and a column for each model term. As before, purrr::map can be used to estimate and save these values for each split.

glm_coefs <- function(splits, ...) {
  ## use `analysis` or `as.data.frame` to get the analysis data
  mod <- glm(..., data = analysis(splits), family = binomial)
  as.data.frame(t(coef(mod)))
}
bt_resamples$betas <- map(.x = bt_resamples$splits, 
                          .f = glm_coefs, 
                          mod_form)
bt_resamples
#> # Bootstrap sampling with 500 resamples 
#> # A tibble: 500 x 4
#>          splits           id statistic                betas
#>          <list>        <chr>     <dbl>               <list>
#>  1 <S3: rsplit> Bootstrap001      80.0 <data.frame [1 x 6]>
#>  2 <S3: rsplit> Bootstrap002      68.5 <data.frame [1 x 6]>
#>  3 <S3: rsplit> Bootstrap003      20.0 <data.frame [1 x 6]>
#>  4 <S3: rsplit> Bootstrap004     177.0 <data.frame [1 x 6]>
#>  5 <S3: rsplit> Bootstrap005      18.0 <data.frame [1 x 6]>
#>  6 <S3: rsplit> Bootstrap006     594.5 <data.frame [1 x 6]>
#>  7 <S3: rsplit> Bootstrap007     137.0 <data.frame [1 x 6]>
#>  8 <S3: rsplit> Bootstrap008      55.0 <data.frame [1 x 6]>
#>  9 <S3: rsplit> Bootstrap009      32.5 <data.frame [1 x 6]>
#> 10 <S3: rsplit> Bootstrap010     451.0 <data.frame [1 x 6]>
#> # ... with 490 more rows
bt_resamples$betas[[1]]
#>   (Intercept) JobSatisfaction.L JobSatisfaction.Q JobSatisfaction.C
#> 1      -0.828            -0.297            0.0979            -0.259
#>   GenderMale MonthlyIncome
#> 1      0.175     -0.000135
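Because each element of betas is a single-row data frame with the same columns, the rows can be stacked and percentiles taken to get a bootstrap interval for any coefficient. A minimal base-R sketch with mock values; with the real object, `betas` would be `bt_resamples$betas`:

```r
## Mock list standing in for bt_resamples$betas: one single-row
## data frame of coefficients per bootstrap resample
betas <- list(
  data.frame(GenderMale = 0.175, MonthlyIncome = -0.000135),
  data.frame(GenderMale = 0.120, MonthlyIncome = -0.000150),
  data.frame(GenderMale = 0.210, MonthlyIncome = -0.000120)
)
## Stack the per-resample rows into one data frame
all_betas <- do.call(rbind, betas)
## Percentile interval for one coefficient across resamples
quantile(all_betas$GenderMale, probs = c(0.025, 0.975))
```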

Using recipes

The recipes package contains a data preprocessor that can be used to avoid the potentially expensive formula methods and provides a richer set of data manipulation tools than base R.

To define the design matrix, an initial recipe is created:

library(recipes)

rec <- recipe(Attrition ~ Age + Gender + JobRole + JobSatisfaction + MonthlyIncome, 
              data = attrition) %>%
  ## Collapse rarely occurring job roles into "other"
  step_other(JobRole, threshold = 0.1) %>%
  ## Dummy variables on the qualitative predictors 
  step_dummy(JobRole, Gender) %>%
  step_log(MonthlyIncome) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) 
rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          5
#> 
#> Steps:
#> 
#> Collapsing factor levels for JobRole
#> Dummy variables from JobRole, Gender
#> Log transformation on MonthlyIncome
#> Centering for all_predictors()
#> Scaling for all_predictors()

This recipe recreates the work that the formula method traditionally does, with additional steps that center and scale the predictors. step_other finds infrequently occurring job roles and collapses them into a factor level called "other", avoiding dummy variable columns that are all zeros.
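The collapsing behavior can be mimicked in base R to see what step_other does conceptually: levels whose frequency falls below the threshold are merged into a single "other" level. This is a hypothetical sketch, not the recipes implementation:

```r
## Collapse levels occurring in less than `threshold` of rows
## into a single "other" level (mimicking step_other)
collapse_rare <- function(f, threshold = 0.1) {
  f <- as.factor(f)
  freq <- table(f) / length(f)
  rare <- names(freq)[freq < threshold]
  ## Assigning duplicate level names merges those levels
  levels(f)[levels(f) %in% rare] <- "other"
  f
}

x <- factor(c(rep("a", 9), "b", "c"))
table(collapse_rare(x))
## "b" and "c" each occur in under 10% of rows, so both
## collapse into "other"
```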

While the original data object attrition is used in the call, it is only used to define the variables and their characteristics so a single recipe is valid across all resampled versions of the data. The recipe can be estimated on the analysis component of the resample.

A function to fit the logistic regression models using a recipe would then be:

glm_coefs_rec <- function(splits, rec, ...) {

  # Estimate the parameters using the analysis data
  trained_rec <- prep(rec, training = analysis(splits), 
                      retain = TRUE,
                      verbose = FALSE)
  
  # Apply the transformations to the data set, save the 
  # predictor values, and convert to a matrix
  design_matrix <- juice(trained_rec, all_predictors())
  design_matrix <- as.matrix(design_matrix)
  
  # Get the outcome values and fit the model using `lm.fit`
  y <- juice(trained_rec, all_outcomes())$Attrition
  
  glm.fit(x = design_matrix, y = y, family = binomial(), ...)
}

bt_resamples$fits <- map(bt_resamples$splits, glm_coefs_rec, rec = rec)
bt_resamples
#> # Bootstrap sampling with 500 resamples 
#> # A tibble: 500 x 5
#>          splits           id statistic                betas        fits
#>          <list>        <chr>     <dbl>               <list>      <list>
#>  1 <S3: rsplit> Bootstrap001      80.0 <data.frame [1 x 4]> <list [20]>
#>  2 <S3: rsplit> Bootstrap002      68.5 <data.frame [1 x 4]> <list [20]>
#>  3 <S3: rsplit> Bootstrap003      20.0 <data.frame [1 x 4]> <list [20]>
#>  4 <S3: rsplit> Bootstrap004     177.0 <data.frame [1 x 4]> <list [20]>
#>  5 <S3: rsplit> Bootstrap005      18.0 <data.frame [1 x 4]> <list [20]>
#>  6 <S3: rsplit> Bootstrap006     594.5 <data.frame [1 x 4]> <list [20]>
#>  7 <S3: rsplit> Bootstrap007     137.0 <data.frame [1 x 4]> <list [20]>
#>  8 <S3: rsplit> Bootstrap008      55.0 <data.frame [1 x 4]> <list [20]>
#>  9 <S3: rsplit> Bootstrap009      32.5 <data.frame [1 x 4]> <list [20]>
#> 10 <S3: rsplit> Bootstrap010     451.0 <data.frame [1 x 4]> <list [20]>
#> # ... with 490 more rows

From these objects, we can extract the coefficients, fit statistics, and other values.
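glm.fit returns a plain list rather than a "glm" object, so results are extracted directly from elements such as coefficients and deviance. A self-contained sketch on simulated data; with the real objects, the same extraction would apply to each element of `bt_resamples$fits`:

```r
## glm.fit works on a numeric design matrix and response;
## the result is a list, not a "glm" object
set.seed(101)
x <- cbind(intercept = 1, predictor = rnorm(100))
y <- rbinom(100, size = 1, prob = 0.5)
fit <- glm.fit(x = x, y = y, family = binomial())

length(fit$coefficients)  # one estimate per column of x
fit$deviance              # residual deviance
```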

Keeping Tidy

As previously mentioned, the broom package contains a generic function called tidy that creates tidy representations of objects that can be easily used for analysis, plotting, etc. rsample contains tidy methods for rset and rsplit objects. For example:

first_resample <- bt_resamples$splits[[1]]
class(first_resample)
#> [1] "rsplit"     "boot_split"
tidy(first_resample)
#> # A tibble: 1,470 x 2
#>      Row     Data
#>    <int>    <chr>
#>  1     1 Analysis
#>  2     4 Analysis
#>  3     5 Analysis
#>  4     9 Analysis
#>  5    14 Analysis
#>  6    15 Analysis
#>  7    16 Analysis
#>  8    22 Analysis
#>  9    23 Analysis
#> 10    24 Analysis
#> # ... with 1,460 more rows

and

class(bt_resamples)
#> [1] "bootstraps" "rset"       "tbl_df"     "tbl"        "data.frame"
tidy(bt_resamples)
#> # A tibble: 735,000 x 3
#>      Row     Data     Resample
#>    <int>    <chr>        <chr>
#>  1     1 Analysis Bootstrap001
#>  2     1 Analysis Bootstrap003
#>  3     1 Analysis Bootstrap007
#>  4     1 Analysis Bootstrap008
#>  5     1 Analysis Bootstrap010
#>  6     1 Analysis Bootstrap011
#>  7     1 Analysis Bootstrap012
#>  8     1 Analysis Bootstrap013
#>  9     1 Analysis Bootstrap014
#> 10     1 Analysis Bootstrap015
#> # ... with 734,990 more rows