Introduction to the MachineShop Package

Brian J Smith

2019-01-02

The MachineShop Package

MachineShop is a meta-package for statistical and machine learning with a common interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes and for resample (bootstrap, cross-validation, and split training-test sets) estimation of model performance. This vignette introduces the package interface with a survival data analysis example, followed by applications to other types of response variables, supported methods of model specification and data preprocessing, and a list of all currently available models.

Model Fitting and Prediction

The Melanoma dataset from the MASS package (Andersen et al. 1993) contains time, in days, to (1) death from disease, (2) alive at end of study, or (3) death from other causes for 205 Denmark patients with malignant melanomas. Also provided are potential predictors of the survival outcomes. We begin by loading the MachineShop, survival, and MASS packages required for the analysis as well as the magrittr package (Bache and Wickham 2014) for its pipe (%>%) operator to simplify some of the code syntax. The dataset is split into a training set to which a survival model will be fit and a test set on which to make predictions. A global formula fo relates the predictors on the right hand side to the overall survival outcome on the left and will be used in all of the survival models in this vignette example.

## Load libraries for the survival analysis
library(MachineShop)
library(survival)
library(MASS)
library(magrittr)

## Malignant melanoma cancer dataset
head(Melanoma)
#>   time status sex age year thickness ulcer
#> 1   10      3   1  76 1972      6.76     1
#> 2   30      3   1  56 1968      0.65     0
#> 3   35      2   1  41 1977      1.34     0
#> 4   99      3   0  71 1968      2.90     0
#> 5  185      1   1  52 1965     12.08     1
#> 6  204      1   1  28 1971      4.84     1

## Create training and test sets
n <- nrow(Melanoma) * 2 / 3
train <- head(Melanoma, n)
test <- head(Melanoma, -n)

## Global formula for the analysis
fo <- Surv(time, status != 2) ~ sex + age + year + thickness + ulcer

Generalized boosted regression models are a tree-based ensemble method that can applied to survival outcomes. They are available in the MachineShop with the function GBMModel. A call to the function creates an instance of the model containing any user-specified model parameters and internal machinery for model fitting, prediction, and performance assessment. Created models can be supplied to the fit function to estimate a relationship (fo) between predictors and an outcome based on a set of data (train). The importance of variables in a model fit is estimated with the varimp function and plotted with plot. Variable importance is a measure of the relative importance of predictors in a model and has a default range of 0 to 100, where 0 denotes the least important variables and 100 the most.

## Fit a generalized boosted model
gbmfit <- fit(fo, data = train, model = GBMModel)

## Predictor variable importance
(vi <- varimp(gbmfit))
#>             Overall
#> year      100.00000
#> thickness  63.17344
#> age        33.19405
#> ulcer      13.76512
#> sex         0.00000

plot(vi)

From the model fit, predictions are obtained at 2, 5, and 10 years as survival probabilities (type = "prob") and as 0-1 death indicators (default: type = "response").

## Predict survival probabilities and outcomes at specified follow-up times
times <- 365 * c(2, 5, 10)
predict(gbmfit, newdata = test, times = times, type = "prob") %>% head
#>           [,1]       [,2]        [,3]
#> [1,] 0.8659876 0.67458405 0.539432251
#> [2,] 0.8285838 0.59782467 0.446353240
#> [3,] 0.9837047 0.95604527 0.931946921
#> [4,] 0.8241337 0.58908117 0.436160014
#> [5,] 0.3119026 0.04127331 0.006752073
#> [6,] 0.8470514 0.63498855 0.490621484

predict(gbmfit, newdata = test, times = times) %>% head
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    0    0    1
#> [3,]    0    0    0
#> [4,]    0    0    1
#> [5,]    1    1    1
#> [6,]    0    0    1

A call to performance with observed and predicted outcomes will produce model performance metrics. The metrics produced will depend on the type of the observed variable. In this case of a Surv variable, the metrics are area under the ROC curve (Heagerty, Lumley, and Pepe 2004) and Brier score (Graf et al. 1999) at the specified times and overall time-integrated averages.

## Model performance metrics
obs <- response(fo, test)
pred <- predict(gbmfit, newdata = test, times = times, type = "prob")
performance(obs, pred, times = times)
#>    ROC.mean   ROC.time1   ROC.time2   ROC.time3  Brier.mean Brier.time1 
#>   0.9322072   0.7293209   0.9829287   0.9829287         NaN   0.2124870 
#> Brier.time2 Brier.time3 
#>         NaN         NaN

Resample Estimation of Model Performance

Performance of a model can be estimated with resampling methods that simulate repeated training and test set fits and prediction. Performance metrics are computed on each resample to produce an empirical distribution for inference. Resampling is controlled in the MachineShop with the functions:

BootControl
Simple bootstrap resampling. Models are fit with bootstrap resampled training sets and used to predict the full data set.
CVControl
Repeated K-fold cross-validation. The full data set is repeatedly partitioned into K-folds. Within a partitioning, prediction is performed on each of the K folds with models fit on all remaining folds.
OOBControl
Out-of-bootstrap resampling. Models are fit with bootstrap resampled training sets and used to predict the unsampled cases.
SplitControl
Split training and test sets. The data are randomly partitioned into a training and test set.
TrainControl
Training resubstitution. A model is fit on and used to predict the full training set to estimate training, or apparent, error.

In our example, performance of models to predict survival at 2, 5, and 10 years will be estimated with five repeats of 10-fold cross-validation. Variable metrics is defined for the purpose of reducing the printed and plotted output in this vignette to only the time-integrated ROC and Brier metrics. Such subsetting of output would not be done in practice if there is interest in seeing all metrics.

## Control parameters for repeated K-fold cross-validation
control <- CVControl(
  folds = 10,
  repeats = 5,
  surv_times = 365 * c(2, 5, 10)
)

## Metrics of interest
metrics <- c("ROC.mean", "Brier.mean")

Parallel Computing

Resampling is implemented with the foreach package (Microsoft and Weston 2017b) and will run in parallel if a compatible backend is loaded, such as that provided by the doParallel package (Microsoft and Weston 2017a).

Resampling for One Model

Resampling of a single model is performed with the resample function applied to a model object (e.g. GBMModel) and a control object like the one defined previously (control). Summary statistics and plots can be obtained with the summary and plot functions.

Resampling for Model Comparisons

Resampled metrics from different models can be combined for comparison with the Resamples function. Names given on the left hand side of the equal operators in the call to Resamples will be used as labels in output from the summary and plot functions. For these types of model comparisons, the same control structure should be used in all associated calls to resample to ensure that resulting model metrics are computed on the same resampled training and test sets.

Pairwise model differences for each metric can be calculated with the diff function applied to results from a call to Resamples. The differences can be summarized descriptively with the summary and plot functions and assessed for statistical significance with the t.test function.

Model Tuning

Modelling functions may have arguments that define parameters in their model fitting algorithms. For example, GBMModel has arguments n.trees, interaction.dept, and n.minobsinnode that defined the number of decision trees to fit, the maximum depth of variable interactions, and the minimum number of observations in the trees terminal nodes. The tune function is available to fit a model over a grid of parameters and return the model whose parameters provide the optimal fit. Note that the function name GBMModel, and not the function call GBMModel(), is supplied as the first argument to tune. Summary statistics and plots of performance across all tuning parameters are available with the summary and plot functions.

## Tune over a grid of model parameters
(gbmtune <- tune(fo, data = Melanoma, model = GBMModel,
                 grid = expand.grid(n.trees = c(25, 50, 100),
                                    interaction.depth = 1:3,
                                    n.minobsinnode = c(5, 10)),
                 control = control))
#> An object of class "MLModelTune"
#> 
#> Model name: GBMModel
#> Label: Generalized Boosted Regression
#> Packages: gbm
#> Response types: factor, numeric, Surv
#> 
#> Parameters:
#> $n.trees
#> [1] 25
#> 
#> $interaction.depth
#> [1] 1
#> 
#> $n.minobsinnode
#> [1] 10
#> 
#> $shrinkage
#> [1] 0.1
#> 
#> $bag.fraction
#> [1] 0.5
#> 
#> Grid:
#>    n.trees interaction.depth n.minobsinnode
#> 1       25                 1              5
#> 2       50                 1              5
#> 3      100                 1              5
#> 4       25                 2              5
#> 5       50                 2              5
#> 6      100                 2              5
#> 7       25                 3              5
#> 8       50                 3              5
#> 9      100                 3              5
#> 10      25                 1             10
#> 11      50                 1             10
#> 12     100                 1             10
#> 13      25                 2             10
#> 14      50                 2             10
#> 15     100                 2             10
#> 16      25                 3             10
#> 17      50                 3             10
#> 18     100                 3             10
#> 
#> An object of class "Performance"
#> 
#> Metrics: ROC.mean, ROC.time1, ROC.time2, ROC.time3, Brier.mean, Brier.time1, Brier.time2, Brier.time3 
#> Models: GBMModel.1, GBMModel.2, GBMModel.3, GBMModel.4, GBMModel.5, GBMModel.6, GBMModel.7, GBMModel.8, GBMModel.9, GBMModel.10, GBMModel.11, GBMModel.12, GBMModel.13, GBMModel.14, GBMModel.15, GBMModel.16, GBMModel.17, GBMModel.18 
#> 
#> Selected (ROC.mean): GBMModel.10

summary(gbmtune)[, , metrics]
#> , , ROC.mean
#> 
#>                  Mean    Median        SD       Min       Max NA
#> GBMModel.1  0.7084099 0.7334550 0.1317115 0.2912007 0.9378315  0
#> GBMModel.2  0.7044935 0.7216624 0.1280673 0.3786458 0.9209873  0
#> GBMModel.3  0.6914334 0.7205510 0.1229936 0.4113761 0.9116389  0
#> GBMModel.4  0.6920790 0.7160629 0.1221498 0.3727638 0.9132602  0
#> GBMModel.5  0.6848302 0.6918460 0.1250106 0.3704309 0.9528189  0
#> GBMModel.6  0.6691489 0.6816788 0.1391156 0.2993490 0.9158057  0
#> GBMModel.7  0.6879942 0.6914624 0.1282246 0.3758064 0.9396974  0
#> GBMModel.8  0.6769710 0.6873528 0.1358296 0.3204427 0.9218421  0
#> GBMModel.9  0.6527250 0.6528089 0.1456248 0.2634115 0.9004983  0
#> GBMModel.10 0.7126959 0.7249419 0.1303829 0.3231771 0.9539048  0
#> GBMModel.11 0.7078389 0.7232105 0.1289528 0.3888021 0.9324742  0
#> GBMModel.12 0.6860424 0.7087417 0.1311274 0.3040365 0.9243192  0
#> GBMModel.13 0.6994989 0.7266897 0.1234906 0.3975260 0.9485726  0
#> GBMModel.14 0.6921657 0.7104928 0.1336868 0.3523811 0.9336789  0
#> GBMModel.15 0.6814255 0.6918727 0.1373640 0.3334635 0.9053087  0
#> GBMModel.16 0.6956087 0.7147426 0.1366645 0.3521014 0.9636788  0
#> GBMModel.17 0.6929889 0.6900073 0.1289404 0.3296225 0.9262851  0
#> GBMModel.18 0.6744827 0.6672509 0.1368483 0.3111367 0.9409702  0
#> 
#> , , Brier.mean
#> 
#>                  Mean    Median         SD        Min       Max   NA
#> GBMModel.1  0.1902605 0.1813742 0.04592658 0.11447004 0.3193480 0.06
#> GBMModel.2  0.1961846 0.1886240 0.04975919 0.10223360 0.3154243 0.06
#> GBMModel.3  0.2078215 0.1938710 0.05463432 0.10857945 0.3326337 0.06
#> GBMModel.4  0.2000988 0.1889826 0.05218615 0.11031448 0.3438735 0.06
#> GBMModel.5  0.2083254 0.1956902 0.05685070 0.10189786 0.4286941 0.06
#> GBMModel.6  0.2208186 0.2032064 0.06503518 0.11316272 0.4389034 0.06
#> GBMModel.7  0.2064209 0.1902088 0.06189791 0.11349287 0.4031801 0.06
#> GBMModel.8  0.2158404 0.2015343 0.06175487 0.12184607 0.4338749 0.06
#> GBMModel.9  0.2298594 0.2177718 0.06491117 0.13337052 0.4328804 0.06
#> GBMModel.10 0.1860707 0.1760822 0.04476337 0.10706487 0.2978184 0.06
#> GBMModel.11 0.1892243 0.1814450 0.04603588 0.09965699 0.2994825 0.06
#> GBMModel.12 0.1984230 0.1878544 0.05065880 0.11119681 0.3191915 0.06
#> GBMModel.13 0.1913985 0.1792700 0.04994485 0.10789194 0.3220089 0.06
#> GBMModel.14 0.1934831 0.1834812 0.04891230 0.10869662 0.3102272 0.06
#> GBMModel.15 0.2048976 0.1832971 0.05809369 0.12257865 0.3389919 0.06
#> GBMModel.16 0.1923398 0.1777437 0.05219178 0.11132775 0.3530243 0.06
#> GBMModel.17 0.1967562 0.1889953 0.05172487 0.11615526 0.3188928 0.06
#> GBMModel.18 0.2100014 0.1980636 0.05540928 0.12776140 0.3323675 0.06

plot(gbmtune, type = "line", metrics = metrics)

The value returned by tune contains an object produced by a call to the modelling function with the the optimal tuning parameters. Thus, the value can be passed on to the fit function for model fitting to a set of data.

Ensemble Models

Ensemble methods combine multiple base learning algorithms as a strategy to improve predictive performance. Two ensemble methods implemented in Machineshop are stacked regression (Breiman 1996) and super learners (Lann and Hubbard 2007). Stacked regression fits a linear combination of resampled predictions from specified base learners; whereas, super learners fit a specified model, such as GBMModel, to the base learner predictions and optionally also to the original predictor variables. Illustrated below is a performance evaluation of stacked regression and a super learner fit to gradient boosted, conditional forest, and Lasso-based Cox regression base learners. In the latter case, a separate gradient boosted model is used as the super learner by default.

Partial Dependence Plots

Partial dependence plots display the marginal effects of predictors on the response variable. The response scale displayed in the plots will depend on the response type; i.e. probabilities for factor, original scale for numeric, and proportional risk for Surv types.

Calibration Curves

Agreement between model-predicted and observed values can be visualized with calibration curves. In the construction of these curves, cases are partitioned into bins according to their (resampled) predicted responses. Mean observed responses are then calculated within each of the bins and plotted on the vertical axis against the bin midpoints on the horizontal axis. Calibration curves that are close to the 45-degree line indicate close agreement between observed and predicted responses and a model that is said to be well calibrated.

Lift Curves

Lift curves depict the rate at which observed binary responses are identifiable from (resampled) predicted response probabilities. They are constructed by first sorting predicted responses in descending order. Then, the cumulative percent of positive responses (true positive findings) is plotted against the cumulative number of cases (positive test rates) in the ordering. Accordingly, the curve represents the rate at which positive responses are found as a function of the positive test rate among cases.

Response Variable Types

Categorical

Categorical responses with two or more levels should be code as a factor variable for analysis. The metrics returned will depend on the number of factor levels. Metrics for factors with two levels are as follows.

Accuracy
Proportion of correctly classified responses.
Kappa
Cohen’s kappa statistic measuring relative agreement between observed and predicted classifications.
Brier
Brier score.
ROCAUC
Area under the ROC curve.
PRAUC
Area under the precision-recall curve.
Sensitivity
Proportion of correctly classified values in the second factor level.
Specificity
Proportion of correctly classified values in the first factor level.
Index
A tradeoff function of sensitivity and specificity as defined by cutoff_index in the resampling control functions (default: Sensitivity + Specificity). The function allows for specification of tradeoffs (Perkins and Schisterman 2006) other than the default of Youden’s J statistic (Youden 1950).

Brier, ROCAUC, and PRAUC are computed directly on predicted class probabilities. The others are computed on predicted class membership. Memberships are defined to be in the second factor level if predicted probabilities are greater than a cutoff value defined in the resampling control functions (default: cutoff = 0.5).

Metrics for factors with more than two levels are as described below.

Accuracy
Proportion of correctly classified responses.
Kappa
Cohen’s kappa statistic measuring relative agreement between observed and predicted classifications.
WeightedKappa
Weighted Cohen’s kappa with equally spaced weights. This metric is only computed for ordered factor responses.
Brier
Multi-class Brier score.
CrossEntropy
Average cross entropy loss.

Brier and CrossEntropy are computed directly on predicted class probabilities. The others are computed on predicted class membership, defined as the factor level with the highest predicted probability.

Numerical

Numerical responses should be coded as a numeric variable. Associated performance metrics are as defined below and illustrated with Boston housing price data (Venables and Ripley 2002).

R2
One minus residual divided by total sums of squares, \[R^2 = 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2},\] where \(y_i\) and \(\hat{y}_i\) are the \(n\) observed and predicted responses.
RMSE
Root mean square error, \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2}.\]
MAE
Median absolute error, \[MAE = \operatorname{median}(|y_i - \operatorname{median}(y)|).\]

Survival

Survival responses should be coded as a Surv variable. In addition to the ROC and Brier survival metrics described earlier in the vignette, the concordance index (Harrell et al. 1982) can be obtained if follow-up times are not specified for the prediction.

Model Specifications

Model specification here refers to the relationship between the response and predictor variables and the data used to estimate it. Three main types of specification are supported by the fit, resample, and tune functions: formulas, model frames, and recipes.

Formulas

Models may be specified with the traditional formula and data frame pair, as was done in the previous examples. In this specification, in-line functions, interactions, and . substitution of variables not already appearing in the formula may be include.

Model Frames

The second specification is similar to the first, except the formula and data frame pair are give in a ModelFrame. The model frame approach has a few subtle advantages. One is that cases with missing values on any of the response or predictor variables are excluded from the model frame by default. This is often desirable for models that cannot handle missing values. Note, however, that some models like GBMModel do accommodate missing values. For those, missing values can be retained in the model frame by setting its argument na.action = na.pass. A second advantage is user-specification of a factor for stratified resampling via the strata argument.

A third is that case weights can be included in the model frame and will be passed on to the model fitting functions.

Preprocessing Recipes

The recipes package (Kuhn and Wickham 2018) provides a framework for defining predictor and response variables and preprocessing steps to be applied to them prior to model fitting. Using recipes helps to ensure that estimation of predictive performance accounts for all modeling step. They are also a very convenient way of consistently applying preprocessing to new data. Stratified resampling and case weights are supported for recipes via the designations of "case_strata" and "case_weight" roles, respectively. Recipes currently support factor and numeric responses, but not generally Surv.

Available Models

Currently available model functions are summarized in the table below according to the types of response variables with which each model can be used. The package additionally supplies a generic MLModel function for users to create their own custom models.

Response Variable Types
Method Constructor Categorical1 Continuous2 Survival3
Bagging with Classification Trees AdaBagModel f
Boosting with Classification Trees AdaBoostModel f
Bayesian Additive Regression Trees BARTMachineModel b n
Gradient Boosting with Regression Trees BlackBoostModel b n S
C5.0 Classification C50Model f
Conditional Random Forests CForestModel f n S
Cox Regression CoxModel S
Cox Regression (Stepwise) CoxStepAICModel S
Multivariate Adaptive Regression Splines EarthModel f n
Flexible Discriminant Analysis FDAModel f
Gradient Boosting with Additive Models GAMBoostModel b n S
Generalized Boosted Regression GBMModel f n S
Gradient Boosting with Linear Models GLMBoostModel b n S
Generalized Linear Models GLMModel b n
Generalized Linear Models (Stepwise) GLMStepAICModel b n
Lasso and Elastic-Net GLMNetModel f m, n S
K-Nearest Neighbors Model KNNModel f, o n
Least Angle Regression LARSModel n
Linear Discriminant Analysis LDAModel f
Linear Model LMModel f m, n
Mixture Discriminant Analysis MDAModel f
Naive Bayes Classifier NaiveBayesModel f
Feed-Forward Neural Networks NNetModel f n
Penalized Discriminant Analysis PDAModel f
Partial Least Squares PLSModel f n
Ordered Logistic Regression POLRModel o
Quadratic Discriminant Analysis QDAModel f
Random Forests RandomForestModel f n
Fast Random Forests RangerModel f n S
Recursive Partitioning and Regression Trees RPartModel f n S
Stacked Regression StackedModel f, o m, n S
Super Learner SuperModel f, o m, n S
Parametric Survival SurvRegModel S
Parametric Survival (Stepwise) SurvRegStepAICModel S
Support Vector Machines SVMModel f n
Support Vector Machines (ANOVA) SVMANOVAModel f n
Support Vector Machines (Bessel) SVMBesselModel f n
Support Vector Machines (Laplace) SVMLaplaceModel f n
Support Vector Machines (Linear) SVMLinearModel f n
Support Vector Machines (Poly) SVMPolyModel f n
Support Vector Machines (Radial) SVMRadialModel f n
Support Vector Machines (Spline) SVMSplineModel f n
Support Vector Machines (Tanh) SVMTanhModel f n
Regression and Classification Trees TreeModel f n
Extreme Gradient Boosting XGBModel f n
Extreme Gradient Boosting (DART) XGBDARTModel f n
Extreme Gradient Boosting (Linear) XGBLinearModel f n
Extreme Gradient Boosting (Tree) XGBTreeModel f n
1 b = binary, f = factor, o = ordered
2 m = matrix, n = numeric
3 S = Surv

References

Andersen, PK, O Borgan, RD Gill, and N Keiding. 1993. Statistical Models Based on Counting Processes. New York: Springer.

Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.

Breiman, L. 1996. “Stacked Regression.” Machine Learning 24: 49–64.

Graf, E, C Schmoor, W Sauerbrei, and M Schumacher. 1999. “Assessment and Comparison of Prognostic Classification Schemes for Survival Data.” Statistics in Medicine 18 (17–18): 2529–45.

Harrell, FE, RM Califf, DB Pryor, KL Lee, and RA Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247 (18): 2543–6.

Heagerty, PJ, T Lumley, and MS Pepe. 2004. “Time-Dependent Roc Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44.

Kuhn, Max, and Hadley Wickham. 2018. Recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.

Lann, MJ van der, and AE Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

Microsoft, and Steve Weston. 2017a. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.

———. 2017b. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

Perkins, Neil J., and Enrique F. Schisterman. 2006. “The Inconsistency of "Optimal" Cutpoints Obtained Using Two Criteria Based on the Receiver Operating Characteristic Curve.” American Journal of Epidemiology 163 (7): 670–75.

Venables, WN, and BD Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

Youden, WJ. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35.