Introduction to the MachineShop Package

Package Version 1.3.0

Brian J Smith

University of Iowa
brian-j-smith@uiowa.edu

2019-04-23

1 Description

MachineShop is a meta-package for statistical and machine learning with a unified interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes and for resample (bootstrap, cross-validation, and split training-test sets) estimation of model performance. This vignette introduces the package interface with a survival data analysis example, followed by supported methods of variable specification; applications to other response variable types; available performance metrics, resampling techniques, and graphical and tabular summaries; and modeling strategies.

2 Features

3 Getting Started

3.1 Installation

3.2 Documentation

Once installed, the following R commands will load the package and display its help system documentation. Online documentation and examples are available at the MachineShop website.
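
The commands referred to above are not reproduced in this copy of the vignette. A minimal sketch, assuming only the standard R help interface, might look as follows; the guard is an addition so that the snippet also runs where the package is not installed.

```r
## Load the package and display its help index
## (the requireNamespace guard is an illustrative addition)
has_ms <- requireNamespace("MachineShop", quietly = TRUE)
if (has_ms) {
  library(MachineShop)
  print(help(package = "MachineShop"))
}
```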

4 Melanoma Example

The package is illustrated in the following sections with an overall survival analysis example in which the response variable is a time-to-event outcome. Since survival outcomes are a combination of numerical (time to event) and categorical (event) variables, package features for both variable types will be utilized in the example. Outcomes other than survival, including nominal and ordinal factors as well as numeric vectors and matrices, are supported by MachineShop and will be discussed.

Survival analysis is performed with the Melanoma dataset from the MASS package (Andersen et al. 1993). This dataset provides survival time, in days, from disease treatment to (1) death from disease, (2) alive at end of study, or (3) death from other causes for 205 Denmark patients with malignant melanomas. Also provided are potential predictors of the survival outcomes. The analysis begins by loading required packages MachineShop, survival, and MASS as well as magrittr (Bache and Wickham 2014) for its pipe (%>%) operator to simplify some of the code syntax. For the analysis, a binary overall survival outcome is created by combining the two death categories (1 and 3) into one. The dataset is then split into a training set on which a survival model will be fit and a test set on which predictions will be made. A global formula surv_fo is defined to relate the predictors on the right hand side to the overall survival outcome on the left and will be used in all of the survival models presented.

## Analysis libraries
library(MachineShop)
library(survival)
library(MASS)
library(magrittr)

## Malignant melanoma analysis dataset
surv_df <- within(Melanoma, status <- as.numeric(status != 2))

Descriptive summaries of the study variables are given below in Table 1, followed by a plot of estimated overall survival probabilities and 95% confidence intervals.

Table 1. Variable summaries for the Melanoma survival analysis example.

Characteristic          Value
Number of subjects      205
time
  Median (Range)        2005 (10, 5565)
status
  1 = Dead              71 (34.63%)
  0 = Alive             134 (65.37%)
sex
  1 = Male              79 (38.54%)
  0 = Female            126 (61.46%)
age
  Median (Range)        54 (4, 95)
year
  Median (Range)        1970 (1962, 1977)
thickness
  Median (Range)        1.94 (0.1, 17.42)
ulcer
  1 = Presence          90 (43.9%)
  0 = Absence           115 (56.1%)

## Training and test sets
set.seed(123)
train_indices <- sample(nrow(surv_df), nrow(surv_df) * 2 / 3)
surv_train <- surv_df[train_indices, ]
surv_test <- surv_df[-train_indices, ]

## Global formula for the analysis
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

5 Model Fit and Prediction

5.1 Model Information

Model fitting requires user specification of a MachineShop compatible model. A named list of package-supplied models can be obtained interactively with the modelinfo function, and includes a descriptive "label", the source "packages" on which the models depend, supported response variable "types", and "arguments" that can be specified in calls to the model functions. Note that in order to use a model, the source packages must be installed with the install.packages function or by equivalent means, but need not be loaded with the library function. Function modelinfo can be called with one or more model functions, function names, function calls, or observed response variables, and will return information on all models matching the calling arguments.

## All available models
modelinfo() %>% names
#>  [1] "AdaBagModel"         "AdaBoostModel"       "BARTModel"          
#>  [4] "BARTMachineModel"    "BlackBoostModel"     "C50Model"           
#>  [7] "CForestModel"        "CoxModel"            "CoxStepAICModel"    
#> [10] "EarthModel"          "FDAModel"            "GAMBoostModel"      
#> [13] "GBMModel"            "GLMBoostModel"       "GLMModel"           
#> [16] "GLMStepAICModel"     "GLMNetModel"         "KNNModel"           
#> [19] "LARSModel"           "LDAModel"            "LMModel"            
#> [22] "MDAModel"            "NaiveBayesModel"     "NNetModel"          
#> [25] "PDAModel"            "PLSModel"            "POLRModel"          
#> [28] "QDAModel"            "RandomForestModel"   "RangerModel"        
#> [31] "RPartModel"          "StackedModel"        "SuperModel"         
#> [34] "SurvRegModel"        "SurvRegStepAICModel" "SVMModel"           
#> [37] "SVMANOVAModel"       "SVMBesselModel"      "SVMLaplaceModel"    
#> [40] "SVMLinearModel"      "SVMPolyModel"        "SVMRadialModel"     
#> [43] "SVMSplineModel"      "SVMTanhModel"        "TreeModel"          
#> [46] "XGBModel"            "XGBDARTModel"        "XGBLinearModel"     
#> [49] "XGBTreeModel"

## Survival-specific models
modelinfo(Surv(0)) %>% names
#>  [1] "BARTModel"           "BlackBoostModel"     "CForestModel"       
#>  [4] "CoxModel"            "CoxStepAICModel"     "GAMBoostModel"      
#>  [7] "GBMModel"            "GLMBoostModel"       "GLMNetModel"        
#> [10] "RangerModel"         "RPartModel"          "StackedModel"       
#> [13] "SuperModel"          "SurvRegModel"        "SurvRegStepAICModel"

## Model-specific information
modelinfo(GBMModel)
#> $GBMModel
#> $GBMModel$label
#> [1] "Generalized Boosted Regression"
#> 
#> $GBMModel$packages
#> [1] "gbm"
#> 
#> $GBMModel$types
#> [1] "factor"  "numeric" "Surv"   
#> 
#> $GBMModel$arguments
#> function (distribution = NULL, n.trees = 100, interaction.depth = 1, 
#>     n.minobsinnode = 10, shrinkage = 0.1, bag.fraction = 0.5) 
#> NULL
#> 
#> $GBMModel$grid
#> [1] TRUE
#> 
#> $GBMModel$varimp
#> [1] TRUE

Information is displayed above for the GBMModel function, which implements generalized boosted regression, a tree-based ensemble method that can be applied to survival outcomes.

5.2 Fit Function

Package models, like GBMModel, can be specified in the model argument of the fit function to estimate a relationship (surv_fo) between predictors and an outcome based on a set of data (surv_train). Argument specifications may be in terms of the model function, function name, or a function call.
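
The three forms of model specification can be sketched as follows. The data preparation repeats the Melanoma example so that the snippet stands alone; the object names surv_fit_name and surv_fit_call and the argument values passed to GBMModel are illustrative choices, and the fits run only if MachineShop and gbm are installed.

```r
library(survival)  # Surv()
library(MASS)      # Melanoma dataset

## Training data and formula as in the Melanoma example
surv_df <- within(Melanoma, status <- as.numeric(status != 2))
set.seed(123)
train_indices <- sample(nrow(surv_df), nrow(surv_df) * 2 / 3)
surv_train <- surv_df[train_indices, ]
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

if (requireNamespace("MachineShop", quietly = TRUE) &&
    requireNamespace("gbm", quietly = TRUE)) {
  library(MachineShop)

  ## Model function
  surv_fit <- fit(surv_fo, data = surv_train, model = GBMModel)

  ## Model function name
  surv_fit_name <- fit(surv_fo, data = surv_train, model = "GBMModel")

  ## Model function call (argument values are illustrative)
  surv_fit_call <- fit(surv_fo, data = surv_train,
                       model = GBMModel(n.trees = 100, interaction.depth = 1))
}
```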

5.3 Dynamic Model Parameters

Dynamic model parameters are model function arguments defined as expressions to be evaluated at the time of model fitting. As such, their values can change based on the number of observations in the dataset supplied to the fit function. Expressions for dynamic parameters are specified within the package-supplied quote operator .() and can include the following objects:

nobs
number of observations in data.
nvars
number of predictor variables in data.
y
response variable.

In the example below, Bayesian information criterion (BIC) based stepwise variable selection is performed by creating a CoxStepAICModel with dynamic parameter k to be calculated as the log number of observations in the fitted dataset.
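
A sketch of the idea follows. The first part uses coxph and stepAIC directly as a stand-in to show the penalty value that log(nobs) takes on the training set; it is not the package's internal code. The guarded call at the end follows the description in the text, with surv_bic_fit as an illustrative name.

```r
library(survival)  # coxph() and Surv()
library(MASS)      # Melanoma dataset and stepAIC()

## Training data as in the Melanoma example
surv_df <- within(Melanoma, status <- as.numeric(status != 2))
set.seed(123)
train_indices <- sample(nrow(surv_df), nrow(surv_df) * 2 / 3)
surv_train <- surv_df[train_indices, ]

## Value that the expression log(nobs) takes for this dataset
k_bic <- log(nrow(surv_train))

## Stand-in illustration: stepwise selection on a Cox model with BIC penalty
cox_full <- coxph(Surv(time, status) ~ sex + age + year + thickness + ulcer,
                  data = surv_train)
cox_bic <- stepAIC(cox_full, k = k_bic, trace = FALSE)

## Package call with the dynamic parameter (run only if installed)
if (requireNamespace("MachineShop", quietly = TRUE)) {
  library(MachineShop)
  surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer
  surv_bic_fit <- fit(surv_fo, data = surv_train,
                      model = CoxStepAICModel(k = .(log(nobs))))
}
```

Because the expression is quoted with .(), it is evaluated against whatever dataset is passed to fit rather than being fixed when the model is created.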

5.4 Predict Function

A predict function is supplied for application to model fit results to obtain predicted values on a dataset specified with its newdata argument or on the original dataset if not specified. Survival means are predicted for survival outcomes by default. Estimates of the associated survival distributions are needed to calculate the means. For models, like GBMModel, that perform semi- or non-parametric survival analysis, Weibull approximations to the survival distributions are the default for mean estimation. Other choices of distributional approximations are exponential, Rayleigh, and empirical. Empirical distributions are applicable to Cox proportional hazards-based models and can be calculated with the method of Breslow (1972), Efron (1977, default), or Fleming and Harrington (1984). Note, however, that empirical survival means are undefined mathematically if an event does not occur at the longest follow-up time. In such situations, a restricted survival mean is calculated by changing the longest follow-up time to an event, as suggested by Efron (1967); the resulting estimate will be negatively biased.

In addition to survival means, predicted survival probabilities (type = "prob") or 0-1 survival events (default: type = "response") can be obtained by supplying follow-up times to the times argument. The cutoff probability for classification of survival events (or other binary responses) can be set optionally with the cutoff argument (default: cutoff = 0.5). As with mean estimation, distributional approximations to the survival functions may be specified for the predictions, with the default for survival probabilities being the empirical distribution.
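
The prediction types described above can be sketched as below. The follow-up times of 5 and 10 years expressed in days (surv_times) and the names of the result objects are illustrative assumptions; the calls run only if MachineShop and gbm are installed.

```r
library(survival)  # Surv()
library(MASS)      # Melanoma dataset

## Training and test sets as in the Melanoma example
surv_df <- within(Melanoma, status <- as.numeric(status != 2))
set.seed(123)
train_indices <- sample(nrow(surv_df), nrow(surv_df) * 2 / 3)
surv_train <- surv_df[train_indices, ]
surv_test <- surv_df[-train_indices, ]
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

## Follow-up times in days (illustrative: 5 and 10 years)
surv_times <- 365 * c(5, 10)

if (requireNamespace("MachineShop", quietly = TRUE) &&
    requireNamespace("gbm", quietly = TRUE)) {
  library(MachineShop)
  surv_fit <- fit(surv_fo, data = surv_train, model = GBMModel)

  ## Predicted survival means (the default for survival outcomes)
  pred_means <- predict(surv_fit, newdata = surv_test)

  ## Predicted survival probabilities at the follow-up times
  pred_probs <- predict(surv_fit, newdata = surv_test,
                        times = surv_times, type = "prob")

  ## Predicted 0-1 survival events at the follow-up times
  pred_events <- predict(surv_fit, newdata = surv_test,
                         times = surv_times, cutoff = 0.5)
}
```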

6 Variable Specifications

Variable specification defines the relationship between response and predictor variables as well as the data used to estimate the relationship. Four main types of specifications are supported by the fit, resample, and tune functions: traditional formula, design matrix, model frame, and recipe.

6.1 Traditional Formula

Variables may be specified with a traditional formula and data frame pair, as was done at the start of the survival example. With this specification, formula operators, such as interaction (:) and crossing (*), as well as . substitution of variables not already appearing in the formula may be used.

6.2 Design Matrix

Variables stored separately in a design matrix of predictors and an object of responses can be supplied to the fit functions directly. Fitting with design matrices has less computational overhead than traditional formulas and allows for greater numbers of predictor variables in some models, including GBMModel, GLMNetModel, and RandomForestModel.
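
One way to build the two objects is sketched below with base R. The names surv_x and surv_y are illustrative, and the final call assumes that fit accepts the matrix and response as its first two arguments; it runs only if MachineShop and gbm are installed.

```r
library(survival)  # Surv()
library(MASS)      # Melanoma dataset

surv_df <- within(Melanoma, status <- as.numeric(status != 2))
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

## Design matrix of predictors, dropping the intercept column
surv_x <- model.matrix(surv_fo, data = surv_df)[, -1]

## Separate response object
surv_y <- with(surv_df, Surv(time, status))

dim(surv_x)

if (requireNamespace("MachineShop", quietly = TRUE) &&
    requireNamespace("gbm", quietly = TRUE)) {
  library(MachineShop)
  surv_fit_xy <- fit(surv_x, surv_y, model = GBMModel)
}
```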

6.3 Model Frame

A ModelFrame class is provided by the package for specification of predictor and response variables along with other attributes to control model fitting. Model frames can be specified in calls to the ModelFrame constructor function with a syntax similar to the traditional formula or design matrix.

The model frame approach has a few advantages over model fitting directly with a traditional formula. One is that cases with missing values on any of the response or predictor variables are excluded from the model frame by default. This is often desirable for models that do not handle missing values. Conversely, missing values can be retained in the model frame by setting its argument na.rm = FALSE for models, like GBMModel, that do handle them. A second advantage is that case weights can be included in the model frame to be passed on to the model fitting functions.

A third, which will be illustrated later, is user-specification of a variable for stratified resampling via the constructor’s strata argument.

6.4 Preprocessing Recipe

The recipes package (Kuhn and Wickham 2018) provides a flexible framework for defining predictor and response variables as well as preprocessing steps to be applied to them prior to model fitting. Using recipes helps ensure that estimation of predictive performance accounts for all modeling steps. They are also a convenient way of consistently applying preprocessing to new data. A basic recipe is given below in terms of the formula and data frame ingredients needed for the analysis.

Case weights and stratified resampling are also supported for recipes via the designations of "case_weight" and "case_strata" roles, respectively.

7 Response Variable Types

The R class types of response variables play a central role in their analysis with the package. They determine, for example, the specific models that can be fit, fitting algorithms employed, predicted values produced, and applicable performance metrics and analyses. As described below, factors, ordered factors, numeric vectors and matrices, and survival responses are supported by the package.

7.1 Factors

Categorical responses with two or more levels should be coded as factor variables for analysis.

7.2 Ordered Factors

Ordinal categorical responses should be coded as ordered factor variables. For categorical vectors, this can be accomplished with the factor function and its argument ordered = TRUE or more simply with the ordered function. Numeric vectors can be converted to ordered factors with the cut function.
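
The three conversions can be illustrated with base R alone. The data values and level labels here are made up for the example.

```r
## Categorical vector with a natural ordering (illustrative data)
severity <- c("low", "high", "medium", "low")
lev <- c("low", "medium", "high")

## factor() with ordered = TRUE and ordered() give the same result
f1 <- factor(severity, levels = lev, ordered = TRUE)
f2 <- ordered(severity, levels = lev)
identical(f1, f2)

## Numeric vector converted to an ordered factor with cut()
age <- c(23, 47, 65, 81)
age_group <- cut(age, breaks = c(0, 40, 60, Inf),
                 labels = c("young", "middle", "older"),
                 ordered_result = TRUE)
age_group
```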

7.3 Numeric Vectors

Univariate numerical responses should be coded as numeric variables.

7.4 Numeric Matrices

Multivariate numerical responses should be given as numeric matrix variables for model fitting with traditional formulas or model frames.

For recipes, the multiple response variables should be given on the left hand side of the formula specification.

7.5 Survival Objects

Survival responses should be coded as Surv variables for model fitting with traditional formulas or model frames.

For recipes, survival outcomes should be specified with the individual survival time and event variables given on the left hand side of the formula and with their roles designated as "surv_time" and "surv_event".

8 Model Performance Metrics

8.1 Performance Function

Performance metrics quantify associations between observed and predicted responses and provide a means of assessing and comparing the predictive performances of models. Metrics can be computed with the performance function applied to observed responses and responses predicted with the predict function. In the case of observed versus predicted survival probabilities or events, metrics will be calculated at each survival time and returned along with their time-integrated mean.

Function performance computes a default set of metrics according to the observed and predicted response types, as indicated in the table below.

Table 2. Default performance metrics by response types.

Response                    Default Metrics
Factor                      Brier Score, Accuracy, Cohen’s Kappa
Binary Factor               Brier Score, Accuracy, Cohen’s Kappa, Area Under ROC Curve, Sensitivity, Specificity
Numeric Vector or Matrix    Root Mean Squared Error, R2, Mean Absolute Error
Survival Means              Concordance Index
Survival Probabilities      Area Under ROC Curve, Brier Score, Accuracy
Survival Events             Accuracy

These defaults may be changed by specifying one or more package-supplied metric functions to the metrics argument of performance. A named list of supplied metrics can be obtained interactively with the metricinfo function, and includes a descriptive "label", whether to "maximize" the metric for better performance, the function "arguments", and supported response variable "types" for each. Function metricinfo may be called with one or more metric functions, function names, an observed response variable, or an observed and predicted response variable pair; and will return information on all matching metrics.

Specification of the metrics argument can be in terms of a single metric function, function name, or list of metric functions. List names, if specified, will be displayed as metric labels in graphical and tabular summaries; otherwise, the function names will be used as labels for unnamed lists.

Metrics based on classification of two-level class probabilities, like sensitivity and specificity, optionally allow for specification of the classification cutoff probability (default: cutoff = 0.5).

8.2 Factors

Metrics applicable to multi-level factor response variables are summarized below.

accuracy
Proportion of correctly classified responses.
brier
Brier score.
cross_entropy
Cross entropy loss averaged over the number of cases.
kappa2
Cohen’s kappa statistic measuring relative agreement between observed and predicted classifications.
weighted_kappa2
Weighted Cohen’s kappa. This metric is only available for ordered factor responses.

Brier score and cross entropy loss are computed directly on predicted class probabilities. The other metrics are computed on predicted class membership, defined as the factor level with the highest predicted probability.
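
The distinction between probability-based and membership-based metrics can be sketched in base R. The data are made up, and the Brier score is written here as the squared difference between class indicators and predicted probabilities, summed over classes and averaged over cases, which is an assumed form and may differ from the package's exact scaling.

```r
## Observed classes and predicted class probabilities (illustrative data)
obs <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.6, 0.3),
              c(0.2, 0.5, 0.3),
              c(0.5, 0.3, 0.2))
colnames(prob) <- levels(obs)

## 0-1 indicator matrix of the observed classes
indicator <- outer(as.integer(obs), seq_len(nlevels(obs)), "==") * 1

## Metrics computed directly on the probabilities
brier_score <- mean(rowSums((indicator - prob)^2))
cross_entropy <- -mean(log(prob[cbind(seq_along(obs), as.integer(obs))]))

## Metric computed on class membership (level with highest probability)
pred_class <- factor(levels(obs)[max.col(prob, ties.method = "first")],
                     levels = levels(obs))
accuracy <- mean(pred_class == obs)

c(brier = brier_score, cross_entropy = cross_entropy, accuracy = accuracy)
```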

8.3 Binary Factors

Table 3. Confusion matrix of observed and predicted response classifications.

                      Observed Response
Predicted Response    Negative               Positive
Negative              True Negative (TN)     False Negative (FN)
Positive              False Positive (FP)    True Positive (TP)

Metrics for binary factors include those given for multi-level factors as well as the following.

auc
Area under a performance curve.
cindex
Concordance index computed as rank order agreement between predicted probabilities for paired event and non-event cases. This metric can be interpreted as the probability that a randomly selected event case will have a higher predicted value than a randomly selected non-event case, and is the same as area under the ROC curve.
f_score
F score, \(F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}\). F1 score \((\beta = 1)\) is the package default.
fnr
False negative rate, \(FNR = \frac{FN}{TP + FN} = 1 - TPR\).
fpr
False positive rate, \(FPR = \frac{FP}{TN + FP} = 1 - TNR\).
npv
Negative predictive value, \(NPV = \frac{TN}{TN + FN}\).
ppv, precision
Positive predictive value, \(PPV = \frac{TP}{TP + FP}\).
pr_auc, auc
Area under a precision recall curve.
roc_auc, auc
Area under an ROC curve.
roc_index
A tradeoff function of sensitivity and specificity as defined by the f argument in this function (default: sensitivity + specificity). The function allows for specification of tradeoffs (Perkins and Schisterman 2006) other than the default of Youden’s J statistic (Youden 1950).
rpp
Rate of positive prediction, \(RPP = \frac{TP + FP}{TP + FP + TN + FN}\).
sensitivity, recall, tpr
True positive rate, \(TPR =\frac{TP}{TP + FN} = 1 - FNR\).
specificity, tnr
True negative rate, \(TNR = \frac{TN}{TN + FP} = 1 - FPR\).

Area under the ROC and precision-recall curves as well as the concordance index are computed directly on predicted class probabilities. The other metrics are computed on predicted class membership. Memberships are defined to be in the second factor level if predicted probabilities are greater than the cutoff value set in the performance function.
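
As a worked illustration of the confusion-matrix formulas, the base R sketch below classifies made-up predicted probabilities at the default cutoff and computes several of the metrics by hand. It does not call the package functions.

```r
## Observed binary factor and predicted probabilities of the second level
obs <- factor(c("no", "no", "yes", "yes", "yes", "no", "yes", "no"),
              levels = c("no", "yes"))
prob <- c(0.1, 0.6, 0.8, 0.4, 0.9, 0.3, 0.7, 0.2)
cutoff <- 0.5

## Membership in the second level if the probability exceeds the cutoff
pred_pos <- prob > cutoff
obs_pos <- obs == levels(obs)[2]

## Confusion matrix cells
TP <- sum(pred_pos & obs_pos)
FP <- sum(pred_pos & !obs_pos)
FN <- sum(!pred_pos & obs_pos)
TN <- sum(!pred_pos & !obs_pos)

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv <- TP / (TP + FP)
npv <- TN / (TN + FN)
f1 <- 2 * ppv * sensitivity / (ppv + sensitivity)

c(sensitivity = sensitivity, specificity = specificity,
  ppv = ppv, npv = npv, f1 = f1)
```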

8.4 Numerics

Performance metrics are defined below for numeric vector responses. If applied to a numeric matrix response, the metrics are computed separately for each column and then averaged to produce a single value.

gini
Gini coefficient.
mae
Mean absolute error, \(MAE = \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|\), where \(y_i\) and \(\hat{y}_i\) are the \(n\) observed and predicted responses.
mse
Mean squared error, \(MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2\).
msle
Mean squared log error, \(MSLE = \frac{1}{n}\sum_{i=1}^n(\log(1 + y_i) - \log(1 + \hat{y}_i))^2\).
r2
One minus residual divided by total sums of squares, \(R^2 = 1 - \sum_{i=1}^n(y_i - \hat{y}_i)^2 / \sum_{i=1}^n(y_i - \bar{y})^2\).
rmse
Square root of mean squared error.
rmsle
Square root of mean squared log error.
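
The formulas above translate directly into base R, as in the following sketch on made-up values. The Gini coefficient is omitted because its formula is not given here.

```r
## Observed and predicted numeric responses (illustrative data)
y <- c(3, 5, 2, 7)
y_hat <- c(2.5, 5, 3, 8)

mae <- mean(abs(y - y_hat))
mse <- mean((y - y_hat)^2)
rmse <- sqrt(mse)
msle <- mean((log1p(y) - log1p(y_hat))^2)  # log1p(x) is log(1 + x)
rmsle <- sqrt(msle)
r2 <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

c(mae = mae, mse = mse, rmse = rmse, msle = msle, rmsle = rmsle, r2 = r2)
```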

8.5 Survival Objects

All previously described metrics for binary factor responses, plus accuracy, Brier score, and Cohen’s kappa, are applicable to survival probabilities predicted at specified follow-up times. Metrics are evaluated separately at each follow-up time and reported along with a time-integrated mean. The survival concordance index is computed with the method of Harrell (1982) and the Brier score according to Graf et al. (1999), whereas the others are computed according to the confusion matrix probabilities in Table 4, in which \(\hat{S}(t)\) is the predicted survival probability at follow-up time \(t\) and \(T\) is the survival time (Heagerty, Lumley, and Pepe 2004).

Table 4. Confusion matrix of observed and predicted survival response classifications.

                      Observed Response
Predicted Response    Non-Event                                                  Event
Non-Event             \(TN = \Pr(\hat{S}(t) > \text{cutoff} \cap T \ge t)\)      \(FN = \Pr(\hat{S}(t) > \text{cutoff} \cap T < t)\)
Event                 \(FP = \Pr(\hat{S}(t) \le \text{cutoff} \cap T \ge t)\)    \(TP = \Pr(\hat{S}(t) \le \text{cutoff} \cap T < t)\)

In addition, all of the metrics described for numeric vector responses are applicable to predicted survival means and are computed using only those cases with observed (non-censored) events.

9 Resample Performance Estimation

9.1 Algorithms

Model performance can be estimated with resampling methods that simulate repeated training and test set fits and predictions. With these methods, performance metrics are computed on each resample to produce an empirical distribution for inference. Resampling is controlled in MachineShop with the following functions:

BootControl
Simple bootstrap resampling. Models are repeatedly fit with bootstrap resampled training sets and used to predict the full data set.
CVControl
Repeated K-fold cross-validation. The full data set is repeatedly partitioned into K folds. For a given partitioning, prediction is performed on each of the K folds with models fit on all remaining folds.
OOBControl
Out-of-bootstrap resampling. Models are fit with bootstrap resampled training sets and used to predict the unsampled cases.
SplitControl
Split training and test sets. The data are randomly partitioned into a training and test set.
TrainControl
Training resubstitution. A model is fit on and used to predict the full training set in order to estimate training, or apparent, error.
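
To make the difference between the schemes concrete, the base R sketch below constructs the case indices for one K-fold partition and one out-of-bootstrap resample. It is an illustration of the general techniques on a made-up sample size, not the package's internal implementation.

```r
set.seed(123)
n <- 20  # number of cases (illustrative)
K <- 5   # number of folds

## K-fold cross-validation: each case is assigned to exactly one test fold
fold_id <- sample(rep(seq_len(K), length.out = n))
cv_test_sets <- split(seq_len(n), fold_id)

## Training cases for the first fold are all cases outside of it
cv_train_1 <- setdiff(seq_len(n), cv_test_sets[[1]])

## Out-of-bootstrap: train on cases sampled with replacement,
## test on the cases that were never sampled
boot_train <- sample(n, replace = TRUE)
oob_test <- setdiff(seq_len(n), boot_train)
```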

For the survival example, repeated cross-validation control structures are defined to estimate model performance in predicting survival means and 5 and 10-year survival probabilities. In addition to arguments controlling the resampling algorithms, a seed can be set to ensure reproducibility of resampling results obtained with the structures.
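
A sketch of such a control structure is given below. The numbers of folds and repeats and the seed are illustrative assumptions, not values taken from the original analysis. How the follow-up times are attached to a control structure for survival probabilities is not shown in this copy, so the commented line is an assumed form only.

```r
## Follow-up times in days (illustrative: 5 and 10 years)
surv_times <- 365 * c(5, 10)

if (requireNamespace("MachineShop", quietly = TRUE)) {
  library(MachineShop)

  ## Control structure for resample estimation of survival means
  ## (folds, repeats, and seed are illustrative values)
  surv_means_control <- CVControl(folds = 5, repeats = 3, seed = 123)

  ## Assumed form for survival probabilities at the follow-up times:
  ## surv_probs_control <- CVControl(folds = 5, repeats = 3,
  ##                                 times = surv_times, seed = 123)
}
```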

9.2 Parallel Processing

Resampling is implemented with the foreach package (Microsoft and Weston 2017) and will run in parallel if a compatible backend is loaded, such as that provided by the doParallel package (Microsoft and Weston 2018).
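
A backend can be registered before resampling along the following lines. The choice of two cores is arbitrary, and the guard is an addition so that the snippet runs where doParallel is not installed.

```r
n_workers <- 1L  # number of workers when no parallel backend is registered

if (requireNamespace("doParallel", quietly = TRUE)) {
  library(doParallel)

  ## Register a parallel backend with two cores (arbitrary choice)
  registerDoParallel(cores = 2)
  n_workers <- getDoParWorkers()

  ## Release the workers when resampling is finished
  stopImplicitCluster()
}
n_workers
```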

9.3 Resample Function

Resampling is performed by calling the resample function with a variable specification, model, and control structure. Like the fit function, variables may be specified in terms of a traditional formula, design matrix, model frame, or recipe. Summary statistics and plots of resample output can be obtained with the summary and plot functions.
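
A call of this kind might look as follows, using the formula specification from the Melanoma example. The control settings and the name res_gbm are illustrative, and the code runs only if MachineShop and gbm are installed.

```r
library(survival)  # Surv()
library(MASS)      # Melanoma dataset

surv_df <- within(Melanoma, status <- as.numeric(status != 2))
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

if (requireNamespace("MachineShop", quietly = TRUE) &&
    requireNamespace("gbm", quietly = TRUE)) {
  library(MachineShop)

  ## Illustrative control structure for survival means
  surv_means_control <- CVControl(folds = 5, repeats = 3, seed = 123)

  ## Resample estimation with a formula, model, and control structure
  res_gbm <- resample(surv_fo, data = surv_df, model = GBMModel,
                      control = surv_means_control)

  ## Summary statistics and plot of the resampled metrics
  print(summary(res_gbm))
  print(plot(res_gbm))
}
```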

The summary function when applied directly to output from resample computes default performance metrics as described in the Performance Function section. Likewise, the metricinfo and performance functions can be applied to the output in order to list and compute applicable metrics.

9.4 Stratified Resampling

Stratification of cases for the construction of resampled training and test sets can be employed to help achieve balance across the sets. Stratified resampling is automatically performed if variable specification is in terms of a traditional formula and will be done according to the response variable if a numeric vector or factor, the event variable if survival, and the first variable if a numeric matrix. For model frames and recipes, stratification variables must be defined explicitly with the strata argument to the ModelFrame constructor or with the "case_strata" role designation in a recipe step.
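
The effect of stratification can be seen in a small base R sketch in which folds are assigned separately within each level of an event indicator. This illustrates the general technique on made-up data and is not the package's own algorithm.

```r
set.seed(123)

## Event indicator with unbalanced levels (illustrative data)
status <- rep(c(0, 1), times = c(12, 8))
K <- 4

## Assign folds within each stratum so that every fold
## receives the same share of events and non-events
fold_id <- integer(length(status))
for (s in unique(status)) {
  idx <- which(status == s)
  fold_id[idx] <- sample(rep(seq_len(K), length.out = length(idx)))
}

fold_table <- table(fold = fold_id, status = status)
fold_table
```

Each of the four folds contains three non-events and two events, whereas unstratified assignment could leave a fold with few or no events.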

9.5 Dynamic Model Parameters

As discussed previously in the Model Fit and Prediction section, dynamic model parameters are evaluated at the time of model fitting and can depend on the number of observations in the fitted dataset. In the context of resampling, dynamic parameters are repeatedly evaluated at each fit of the resampled datasets. As such, their values can change based on the observations selected for training at each iteration of the resampling algorithm.

9.6 Model Comparisons

Resampled metrics from different models can be combined for comparison with the Resamples function. Optional names given on the left hand side of equal operators within calls to Resamples will be used as labels in output from the summary and plot functions. For comparisons of resampled output, the same control structure must be used in all associated calls to resample to ensure that resulting model metrics are computed on the same resampled training and test sets.

Pairwise model differences for each metric can be calculated with the diff function applied to results from a call to Resamples. Resulting differences can be summarized descriptively with the summary and plot functions and assessed for statistical significance with the t.test function.

10 Performance Analyses

10.1 Variable Importance

The importance of variables in a model fit is estimated with the varimp function and plotted with plot. Variable importance is a measure of the relative importance of predictors in a model and has a default range of 0 to 100, where 0 denotes the least important variables and 100 the most.

10.2 Calibration Curves

Agreement between model-predicted and observed values can be visualized with calibration curves. In the construction of these curves, cases are partitioned into bins according to their (resampled) predicted responses. Mean observed responses are then calculated within each of the bins and plotted on the vertical axis against the bin midpoints on the horizontal axis. An option to produce curves smoothed over the individual predicted values is also provided. Calibration curves that are close to the 45-degree line indicate close agreement between observed and predicted responses and a model that is said to be well calibrated.
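
The binning construction can be sketched in base R on simulated data. The ten equal-width bins and the simulation settings are illustrative choices rather than package defaults.

```r
set.seed(123)

## Simulated predicted probabilities and 0-1 outcomes that follow them
n <- 200
pred <- runif(n)
obs <- rbinom(n, size = 1, prob = pred)

## Partition cases into bins by predicted value
breaks <- seq(0, 1, by = 0.1)
bins <- cut(pred, breaks = breaks, include.lowest = TRUE)

## Mean observed response per bin against the bin midpoints
calib <- data.frame(
  midpoint = head(breaks, -1) + diff(breaks) / 2,
  mean_observed = as.vector(tapply(obs, bins, mean))
)

plot(calib$midpoint, calib$mean_observed, type = "b",
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted (bin midpoint)", ylab = "Mean observed")
abline(0, 1, lty = 2)  # 45-degree line of perfect calibration
```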

10.3 Confusion Matrices

Confusion matrices of cross-classified observed and predicted factor or survival probabilities are available with the confusion function. They can be constructed with predicted class membership or with predicted class probabilities. In the latter case, predicted class membership is derived from predicted probabilities according to a probability cutoff value for binary factors and according to the class with highest probability for factors with more than two levels. Performance metrics, such as those described earlier for binary factors, can be computed with the performance function and summarized with summary and plot.

10.4 Partial Dependence Plots

Partial dependence plots display the marginal effects of predictors on the response variable. The response scale displayed in the plots will depend on the response type: probability for predicted factors and survival probabilities, original scale for numerics, and survival time for predicted survival means.

10.5 Performance Curves

Tradeoffs between correct and incorrect classifications of binary outcomes, across the range of possible cutoff probabilities, can be studied with performance curves.

10.5.1 ROC

Receiver operating characteristic (ROC) curves are one example in which true positive rates (sensitivity) are plotted against false positive rates (1 - specificity). Area under resulting ROC curves can be computed as an overall measure of model predictive performance and interpreted as the probability that a randomly selected event case will have a higher predicted value than a randomly selected non-event case.
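
The probabilistic interpretation of the area under the ROC curve can be checked numerically. The base R sketch below computes it on made-up data in two equivalent ways: as the proportion of event and non-event pairs in which the event case has the higher predicted value (ties counted as one half), and with the rank-sum formula.

```r
## Observed 0-1 outcomes and predicted probabilities (illustrative data)
obs <- c(0, 0, 1, 1, 1, 0, 1, 0)
prob <- c(0.1, 0.6, 0.8, 0.4, 0.9, 0.3, 0.7, 0.2)

pos <- prob[obs == 1]
neg <- prob[obs == 0]

## Proportion of concordant event/non-event pairs
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

## Equivalent rank-sum (Mann-Whitney) computation
n1 <- length(pos)
n0 <- length(neg)
auc_rank <- (sum(rank(prob)[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(pairs = auc_pairs, rank = auc_rank)
```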

10.5.2 Precision Recall

In general, any two binary response metrics may be specified for the construction of a performance curve. Precision recall curves are another example.

10.5.3 Lift

Lift curves depict the rate at which observed binary responses are identifiable from (resampled) predicted response probabilities. In particular, they plot the true positive findings (sensitivity) against the positive test rates for all possible classification probability cutoffs. Accordingly, a lift curve can be interpreted as the rate at which positive responses are found as a function of the positive test rate among cases.

11 Modeling Strategies

11.1 Model Tuning

Many of the modeling functions have arguments, or parameters, that control aspects of their model fitting algorithms. For example, GBMModel parameters n.trees and interaction.depth control the number of decision trees to fit and the maximum depth of variable interactions. The tune function performs model fitting over a grid of parameter values and returns the model with the optimal values. Optimality is determined based on the first performance metric supplied to the metrics argument of tune. Furthermore, argument grid controls the construction of grid values and can be a single numeric value giving the grid length in each parameter dimension, a call to Grid with the grid length and number of grid points to sample at random, or a user-specified data frame of grid points. Summary statistics and plots of resulting performances across all metrics and tuning parameters can be obtained with the summary and plot functions.

## Tune over automatic grid of model parameters
(surv_tune <- tune(surv_fo, data = surv_df, model = GBMModel,
                   grid = 3,
                   control = surv_means_control,
                   metrics = c("CIndex" = cindex, "RMSE" = rmse)))
#> Object of class "MLModelTune"
#> 
#> Model name: GBMModel
#> Label: Generalized Boosted Regression
#> Packages: gbm
#> Response types: factor, numeric, Surv
#> 
#> Parameters:
#> $n.trees
#> [1] 50
#> 
#> $interaction.depth
#> [1] 1
#> 
#> $n.minobsinnode
#> [1] 10
#> 
#> $shrinkage
#> [1] 0.1
#> 
#> $bag.fraction
#> [1] 0.5
#> 
#> Grid:
#>   n.trees interaction.depth
#> 1      50                 1
#> 2     100                 1
#> 3     150                 1
#> 4      50                 2
#> 5     100                 2
#> 6     150                 2
#> 7      50                 3
#> 8     100                 3
#> 9     150                 3
#> 
#> Object of class "Performance"
#> 
#> Metrics: CIndex, RMSE
#> Models: GBMModel.1, GBMModel.2, GBMModel.3, GBMModel.4, GBMModel.5, GBMModel.6, GBMModel.7, GBMModel.8, GBMModel.9 
#> 
#> Selected (CIndex): GBMModel.1

summary(surv_tune)
#> , , CIndex
#> 
#>                 Mean    Median         SD       Min       Max NA
#> GBMModel.1 0.7058983 0.6883562 0.06256998 0.6300000 0.8359788  0
#> GBMModel.2 0.6941198 0.6762295 0.05933365 0.6244541 0.8227513  0
#> GBMModel.3 0.6850918 0.6745843 0.06864062 0.5805085 0.8174603  0
#> GBMModel.4 0.6961474 0.6886792 0.05489630 0.6207627 0.8028504  0
#> GBMModel.5 0.6828937 0.6792453 0.06191098 0.5677966 0.8015873  0
#> GBMModel.6 0.6785267 0.6933962 0.06486174 0.5868644 0.7885986  0
#> GBMModel.7 0.6925677 0.6721311 0.05600214 0.6165254 0.8004751  0
#> GBMModel.8 0.6778869 0.6650943 0.05688398 0.5889831 0.7838480  0
#> GBMModel.9 0.6709213 0.6745283 0.06440546 0.5805085 0.8028504  0
#> 
#> , , RMSE
#> 
#>                Mean   Median       SD      Min       Max NA
#> GBMModel.1 4622.523 4620.080 1536.488 2625.016  8292.560  0
#> GBMModel.2 4828.093 4905.842 1455.894 2847.789  7503.051  0
#> GBMModel.3 4984.213 5181.052 1548.834 2124.061  7327.174  0
#> GBMModel.4 4785.732 4823.784 1644.107 2669.812  7365.786  0
#> GBMModel.5 4396.219 4235.812 1723.906 1886.119  7491.740  0
#> GBMModel.6 3810.610 4116.190 1476.657 1772.266  6017.007  0
#> GBMModel.7 5013.930 4504.273 1766.908 2760.226  9407.171  0
#> GBMModel.8 4984.817 4944.546 2052.461 1831.480  8473.961  0
#> GBMModel.9 5213.172 4556.378 2745.462 2307.150 12983.578  0

plot(surv_tune, type = "line")

The return value of tune is a model object with the optimal tuning parameters and not a model fit object. The returned model can be fit subsequently to a set of data with the fit function.
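
The grid and the selection reported in the tune output above can be reproduced by hand in base R, which may help in reading that output. The sketch assumes that selection is by the largest mean of the first metric, and copies the mean CIndex values from the summary table.

```r
## Grid of three values in each parameter dimension, as printed by tune
grid <- expand.grid(n.trees = c(50, 100, 150), interaction.depth = 1:3)

## Mean CIndex for GBMModel.1 to GBMModel.9, copied from summary(surv_tune)
cindex_mean <- c(0.7058983, 0.6941198, 0.6850918,
                 0.6961474, 0.6828937, 0.6785267,
                 0.6925677, 0.6778869, 0.6709213)

## CIndex is maximized, so the selected grid point has the largest mean
best <- which.max(cindex_mean)
grid[best, ]
```

The selected row, with n.trees = 50 and interaction.depth = 1, agrees with the parameters and the "Selected (CIndex): GBMModel.1" line shown in the output.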

11.2 Model Selection

Model selection can be performed with the tune function to select from any combination of models and model parameters. It has as a special case the just-discussed tuning of a single model over a grid of parameter values. In general, a list containing any combination of model functions, function names, and function calls can be supplied to the models argument of tune to perform model selection. An expand.model helper function is additionally provided to expand a model over a grid of tuning parameters for inclusion in the list if so desired. In this general form of model selection, the grid argument discussed previously for grid tuning is not used.
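The general form might look like the following sketch, which mixes an expanded GBMModel grid, a model function call, and two model functions in one candidate list. The particular candidates, the parameter values, and the object names surv_df, model_list, and surv_select are illustrative assumptions. To keep the chunk short it resamples the full assumed dataset, whereas the Melanoma example works with a training set.

```r
library(MachineShop)
library(survival)
library(MASS)

## Assumed setup mirroring the description of the Melanoma example
surv_df <- within(Melanoma, status <- as.numeric(status != 2))
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

## Candidates: a GBM expanded over a parameter grid, one model call,
## and two model functions
model_list <- c(
  expand.model(GBMModel, n.trees = c(50, 100), interaction.depth = 1:2),
  list(GLMNetModel(lambda = 0.01), CoxModel, SurvRegModel)
)

## Select among the candidates by repeated cross-validation
surv_select <- tune(surv_fo, data = surv_df, models = model_list,
                    control = CVControl(folds = 5, repeats = 3))
summary(surv_select)
```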

11.3 Ensemble Models

Ensemble methods combine multiple base learning algorithms as a strategy to improve predictive performance. Two ensemble methods implemented in MachineShop are stacked regression (Breiman 1996) and super learners (van der Laan and Hubbard 2007). Stacked regression fits a linear combination of resampled predictions from specified base learners, whereas super learners fit a specified model, such as GBMModel, to the base learner predictions and optionally also to the original predictor variables. Illustrated below is a performance evaluation of stacked regression and a super learner fit to gradient boosted, random forest, and Cox regression base learners. In the second case, a separate gradient boosted model is used as the super learner.
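A sketch of such an evaluation is given here under stated assumptions. CForestModel is taken as the random forest base learner because it is one of the forest constructors listed as supporting Surv responses. The object names and the use of the full assumed dataset with the default resampling control are illustrative choices.

```r
library(MachineShop)
library(survival)
library(MASS)

## Assumed setup mirroring the description of the Melanoma example
surv_df <- within(Melanoma, status <- as.numeric(status != 2))
surv_fo <- Surv(time, status) ~ sex + age + year + thickness + ulcer

## Stacked regression over three base learners
stackedmodel <- StackedModel(GBMModel, CForestModel, CoxModel)
res_stacked <- resample(surv_fo, data = surv_df, model = stackedmodel)
summary(res_stacked)

## Super learner: a separate GBM fit to the base learner predictions
supermodel <- SuperModel(GBMModel, CForestModel, CoxModel,
                         model = GBMModel)
res_super <- resample(surv_fo, data = surv_df, model = supermodel)
summary(res_super)
```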

12 Package Extensions

Custom models and metrics can be defined with the MLModel and MLMetric constructors for use with the model fitting, prediction, and performance assessment tools provided by the package.

## Logistic regression model
LogisticModel <- MLModel(
  name = "LogisticModel",
  types = "binary",
  fit = function(formula, data, weights, ...) {
    glm(formula, data = data, weights = weights, family = binomial, ...)
  },
  predict = function(object, newdata, ...) {
    predict(object, newdata = newdata, type = "response")
  },
  varimp = function(object, ...) {
    pchisq(coef(object)^2 / diag(vcov(object)), 1)
  }
)

## F2 score metric
f2_score <- MLMetric(
  function(observed, predicted, ...) {
    f_score(observed, predicted, beta = 2, ...)
  },
  name = "f2_score",
  label = "F2 Score",
  maximize = TRUE
)

## Estimate resampled performance of the custom model with the custom metric
library(MASS)
res <- resample(type ~ ., data = Pima.tr, model = LogisticModel)
summary(performance(res, metric = f2_score))
#>               Mean    Median        SD       Min       Max NA
#> f2_score 0.5697769 0.6060606 0.1924873 0.1666667 0.8571429  0
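The custom metric above calls f_score with beta = 2. As a rough illustration of what that choice implies, the standard F-beta formula can be written in base R. The helper f_beta and the counts are hypothetical, and the code is a textbook sketch rather than the package's implementation.

```r
## F-beta from counts of true positives, false positives, and false negatives
f_beta <- function(tp, fp, fn, beta = 1) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

## Hypothetical counts: precision = 0.75, recall = 0.60
f1 <- f_beta(tp = 30, fp = 10, fn = 20, beta = 1)
f2 <- f_beta(tp = 30, fp = 10, fn = 20, beta = 2)
c(F1 = f1, F2 = f2)
```

With beta = 2 recall is weighted more heavily than precision, so when recall is the lower of the two, as in these counts, the F2 score falls below the F1 score.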

13 Model Constructor Functions

Package-supplied model constructor functions and supported response variable types.

                                                                  Response Variable Types
Function             Label                                        Categorical¹  Continuous²  Survival³
AdaBagModel          Bagging with Classification Trees            f
AdaBoostModel        Boosting with Classification Trees           f
BARTModel            Bayesian Additive Regression Trees           f             n            S
BARTMachineModel     Bayesian Additive Regression Trees           b             n
BlackBoostModel      Gradient Boosting with Regression Trees      b             n            S
C50Model             C5.0 Classification                          f
CForestModel         Conditional Random Forests                   f             n            S
CoxModel             Cox Regression                                                          S
CoxStepAICModel      Cox Regression (Stepwise)                                               S
EarthModel           Multivariate Adaptive Regression Splines     f             n
FDAModel             Flexible Discriminant Analysis               f
GAMBoostModel        Gradient Boosting with Additive Models       b             n            S
GBMModel             Generalized Boosted Regression               f             n            S
GLMBoostModel        Gradient Boosting with Linear Models         b             n            S
GLMModel             Generalized Linear Models                    b             n
GLMStepAICModel      Generalized Linear Models (Stepwise)         b             n
GLMNetModel          Lasso and Elastic-Net                        f             m, n         S
KNNModel             K-Nearest Neighbors Model                    f, o          n
LARSModel            Least Angle Regression                                     n
LDAModel             Linear Discriminant Analysis                 f
LMModel              Linear Model                                 f             m, n
MDAModel             Mixture Discriminant Analysis                f
NaiveBayesModel      Naive Bayes Classifier                       f
NNetModel            Feed-Forward Neural Networks                 f             n
PDAModel             Penalized Discriminant Analysis              f
PLSModel             Partial Least Squares                        f             n
POLRModel            Ordered Logistic Regression                  o
QDAModel             Quadratic Discriminant Analysis              f
RandomForestModel    Random Forests                               f             n
RangerModel          Fast Random Forests                          f             n            S
RPartModel           Recursive Partitioning and Regression Trees  f             n            S
StackedModel         Stacked Regression                           f, o          m, n         S
SuperModel           Super Learner                                f, o          m, n         S
SurvRegModel         Parametric Survival                                                     S
SurvRegStepAICModel  Parametric Survival (Stepwise)                                          S
SVMModel             Support Vector Machines                      f             n
SVMANOVAModel        Support Vector Machines (ANOVA)              f             n
SVMBesselModel       Support Vector Machines (Bessel)             f             n
SVMLaplaceModel      Support Vector Machines (Laplace)            f             n
SVMLinearModel       Support Vector Machines (Linear)             f             n
SVMPolyModel         Support Vector Machines (Poly)               f             n
SVMRadialModel       Support Vector Machines (Radial)             f             n
SVMSplineModel       Support Vector Machines (Spline)             f             n
SVMTanhModel         Support Vector Machines (Tanh)               f             n
TreeModel            Regression and Classification Trees          f             n
XGBModel             Extreme Gradient Boosting                    f             n
XGBDARTModel         Extreme Gradient Boosting (DART)             f             n
XGBLinearModel       Extreme Gradient Boosting (Linear)           f             n
XGBTreeModel         Extreme Gradient Boosting (Tree)             f             n

¹ b = binary factor, f = factor, o = ordered factor
² m = matrix, n = numeric
³ S = Surv

14 Metric Functions

Package-supplied performance metric functions and supported response variable types.

                                                                  Response Variable Types
Function             Label                                        Categorical¹  Continuous²  Survival³
accuracy             Accuracy                                     f                          S
auc                  Area Under Performance Curve                 b                          S
brier                Brier Score                                  f                          S
cindex               Concordance Index                            b                          S
cross_entropy        Cross Entropy                                f
f_score              F Score                                      b                          S
fnr                  False Negative Rate                          b                          S
fpr                  False Positive Rate                          b                          S
gini                 Gini Coefficient                                           m, n         S
kappa2               Cohen’s Kappa                                f                          S
mae                  Mean Absolute Error                                        m, n         S
mse                  Mean Squared Error                                         m, n         S
msle                 Mean Squared Log Error                                     m, n         S
npv                  Negative Predictive Value                    b                          S
ppv                  Positive Predictive Value                    b                          S
pr_auc               Area Under Precision-Recall Curve            b                          S
precision            Precision                                    b                          S
r2                   Coefficient of Determination                               m, n         S
recall               Recall                                       b                          S
rmse                 Root Mean Squared Error                                    m, n         S
rmsle                Root Mean Squared Log Error                                m, n         S
roc_auc              Area Under ROC Curve                         b                          S
roc_index            ROC Index                                    b                          S
rpp                  Rate of Positive Prediction                  b                          S
sensitivity          Sensitivity                                  b                          S
specificity          Specificity                                  b                          S
tnr                  True Negative Rate                           b                          S
tpr                  True Positive Rate                           b                          S
weighted_kappa2      Weighted Cohen’s Kappa                       o

¹ b = binary factor, f = factor, o = ordered factor
² m = matrix, n = numeric
³ S = Surv

References

Andersen, Per K, Ornulf Borgan, Richard D Gill, and Niels Keiding. 1993. Statistical Models Based on Counting Processes. New York: Springer.

Bache, Stefan Milton, and Hadley Wickham. 2014. magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.

Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24: 49–64.

Breslow, Norman E. 1972. “Discussion of Professor Cox’s Paper.” Journal of the Royal Statistical Society, Series B 34: 216–17.

Efron, Bradley. 1967. “The Two Sample Problem with Censored Data.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 4: Biology and Problems of Health, 831–53. Berkeley, California: University of California Press.

———. 1977. “The Efficiency of Cox’s Likelihood Function for Censored Data.” Journal of the American Statistical Association 72 (359): 557–65.

Fleming, Thomas R, and David P Harrington. 1984. “Nonparametric Estimation of the Survival Distribution in Censored Data.” Communications in Statistics - Theory and Methods 13 (20): 2469–86.

Graf, Erika, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. 1999. “Assessment and Comparison of Prognostic Classification Schemes for Survival Data.” Statistics in Medicine 18 (17–18): 2529–45.

Harrell, Frank E, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247 (18): 2543–6.

Heagerty, Patrick J, Thomas Lumley, and Margaret S Pepe. 2004. “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44.

Kuhn, Max, and Hadley Wickham. 2018. recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.

Microsoft, and Steve Weston. 2017. foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

———. 2018. doParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.

Perkins, Neil J, and Enrique F Schisterman. 2006. “The Inconsistency of ‘Optimal’ Cutpoints Obtained Using Two Criteria Based on the Receiver Operating Characteristic Curve.” American Journal of Epidemiology 163 (7): 670–75.

van der Laan, Mark J, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

Youden, William J. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35.