A new model using the randomForest and inTrees packages called
rfRules was added. A basic random forest model is used and then is decomposed into rules (of user-specified complexity). The inTrees package is used to prune and optimize the rules. Thanks to Mirjam Jenny who suggested the workflow.
When specifying your own resampling indices, a value of
method = "custom" can be used with
trainControl for better printing.
Tim Lucas fixed a bug in
bag = TRUE
Fixed a bug found by
method = "dnn" with classification.
A new option called
sampling was added to
trainControl that allows users to subsample their data in the case of a class imbalance. Another help page was added to explain the features.
Class probabilities can be computed for
extraTrees models now.
When PCA pre-processing is conducted, the variance trace is saved in an object called
More error traps were added for common mistakes (e.g. bad factor levels in classification).
An internal function (
class2ind) that can be used to make dummy variables for a single factor vector is now documented and exported.
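As a brief illustration of the entry above, the sketch below calls class2ind on a small factor. It assumes caret is installed; the factor grp and its levels are made up for the example.

```r
library(caret)

# A single factor vector with three levels (illustrative data)
grp <- factor(c("low", "high", "mid", "low"), levels = c("low", "mid", "high"))

# One indicator column per level, one row per observation
ind <- class2ind(grp)
print(ind)
```

Each row contains a single 1 in the column that matches the observed level.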
A bug was fixed in the
xyplot.lift where the reference line was incorrectly computed. Thanks to Einat Sitbon for finding this.
A bug related to calculating the Box-Cox transformation found by John Johnson was fixed.
EdwinTh developed a faster version of
findCorrelation and found a bug in the original code.
findCorrelation has two new arguments, one of which is called
exact which defaults to use the original (fixed) function. Using
exact = FALSE uses the faster version. The fixed version of the "exact" code is, on average, 26-fold slower than the current version (for 250x250 matrices), although the average time for matrices of this size was only 26s. The exact version yields subsets that are, on average, 2.4 percent smaller than the other versions. This difference will be more significant for smaller matrices. The faster ("approximate") version of the code is 8-fold faster than the current version.
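A hedged usage sketch of the two code paths, assuming caret is installed. The mtcars data and the 0.75 cutoff are arbitrary choices for illustration, and exact is passed explicitly rather than relying on its default.

```r
library(caret)

# Correlation matrix of the numeric columns in a built-in data set
cor_mat <- cor(mtcars)

# Original (fixed) algorithm
drop_exact <- findCorrelation(cor_mat, cutoff = 0.75, exact = TRUE)

# Faster approximate algorithm
drop_fast <- findCorrelation(cor_mat, cutoff = 0.75, exact = FALSE)

# Column names flagged for removal by each version
colnames(cor_mat)[drop_exact]
colnames(cor_mat)[drop_fast]
```

The two results need not be identical, which is consistent with the subset-size difference noted above.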
slyuee found a bug in the
gam model fitting code.
Chris Kennedy fixed a bug in the
bartMachine variable importance code.
CHAID from the R-Forge package CHAID
xgbLinear from the
xgboost package were added. That package is not on CRAN and can be installed from github using the devtools package and
rbf models for regression.
A summary function for the multinomial likelihood called
mnLogLoss was added.
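To make the quantity concrete, here is a base-R sketch of a multinomial log loss. It is not the package's implementation: the data, the clipping constant eps, and the sign convention (smaller is better) are assumptions made for this example.

```r
# Observed classes and predicted class probabilities (rows sum to 1)
obs <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.2, 0.2, 0.6,
    0.5, 0.3, 0.2),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, levels(obs))
)

# Probability assigned to the class that was actually observed
p_obs <- probs[cbind(seq_along(obs), as.integer(obs))]

# Keep probabilities away from 0 so that log() stays finite
eps <- 1e-15
p_obs <- pmin(pmax(p_obs, eps), 1 - eps)

# Average negative log-likelihood
log_loss <- -mean(log(p_obs))
log_loss
```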
The total object size for
preProcess objects that used bagged imputation was reduced almost 5-fold.
A new option to
trim was added that, if implemented, will reduce the model's footprint. However, features beyond simple prediction may not work.
A rarely occurring bug in
gbm model code was fixed (thanks to Wade Cooper)
splom.resamples now respects the
A new argument to
cuts was added to allow more control over what thresholds are used to calculate the curve.
cuts argument of
calibration now accepts a vector of cut points.
Jason Schadewald noticed and fixed a bug in the man page for
Call objects were removed from the following models:
An argument was added to
createTimeSlices to thin the number of resamples
The RFE-related functions
gamFuncs were updated so that
rfe accepts a matrix
Using the default grid generation with
glmnet, an initial
glmnet fit is created with
alpha = 0.50 to define the
train models for
"gamLoess" now allow their respective arguments for the outcome probability distribution to be passed to the underlying function.
A bug in
print.varImp.train was fixed.
train now returns an additional column called
rowIndex that is exposed when calling the summary function during resampling.
The ability to compute class probabilities was removed from the
rpartCost model since they are unlikely to agree with the class predictions.
extractProb no longer redundantly calls
extractPrediction to generate the class predictions.
A new function called
var_seq was added that finds a sequence of integers that can be useful for some tuning parameters, such as the random forest parameter
mtry. Model modules were updated to use the new function.
n.minobsinnode was added as a tuning parameter to
For models using out-of-bag resampling,
train now properly checks the
metric argument against the names of the measured outcomes.
createFolds were modified to better handle cases where one or more classes have very low numbers of data points.
The license was changed to GPL (>= 2) to accommodate new code from the GA package.
New feature selection functions
safs were added, along with helper functions and objects. The package HTML was updated to say more about feature selection.
From the adabag package, two new models were added:
Weighted subspace random forests from the wsrf package were added.
Additional bagged FDA and MARS models (model codes
bagEarthGCV) were added that use the GCV statistic to prune the model. This leads to memory reductions during training.
The model code for
ada had a bug fix applied and the code was adapted to use the "sub-model trick" so it should train faster.
A bug was fixed related to imputation when the formula method is used with
drop = FALSE bug was fixed in
A bug was fixed for custom models with no labels.
A bug fix was made for bagged MARS models when predicting probabilities.
train, the argument
last was being incorrectly set for the last model.
Reynald Lescarbeau refactored
findCorrelation to make it faster.
The apparent performance values are not reported by
print.train when the bootstrap 632 estimate is used.
When a required package is missing, the code stops earlier with a more explicit error message.
Brenton Kenkel added ordered logistic or probit regression to
method = "polr" from MASS
LPH07_1 now encodes the noise variables as binary
sbf get arguments for
indexOut for their control functions.
A reworked version of
nearZeroVar, based on code from Michael Benesty, was added that uses less memory and can be used in parallel. The old version is now called
nzv.
The multi-class discriminant model using binary predictors in the binda package was added.
Ensembles of partial least squares models (via the enpls package) were added.
A bug using
gbm with Poisson data was fixed (thanks to user eriklampa)
sbfControl now has a
multivariate option where all the predictors are exposed to the scoring function at once.
compare_models was added that is a simple comparison of models via
The row names for the
variables component of
rfe objects were simplified.
Philipp Bergmeir found a bug that was fixed where
bag would not run in parallel.
predictionBounds was not implemented during resampling.
A few bug fixes to
preProcess were made related to KNN imputation.
The parameter labels for polynomial SVM models were fixed
The tags for
dnn models were fixed.
The following functions were removed from the package:
PLS. The original code and the man files can be found at https://github.com/topepo/caret/tree/master/deprecated.
A number of changes to comply with section 1.1.3.1 of "Writing R Extensions" were made.
For the input data
train, we now respect the class of the input value to accommodate other data types (such as sparse matrices). There are some complications though; for pre-processing we throw a
warning if the data are not simple matrices or data frames since there is some infrastructure that does not exist for other classes (e.g.
complete.cases). We also throw a warning if
returnData = TRUE and it cannot be converted to a data frame. This allows sparse matrices and text corpora to be used as inputs into that function.
plsRglm was added.
From the frbs package, the following rule-based models were added:
WM. Thanks to Lala Riza for suggesting these and facilitating their addition to the package.
From the kernlab package, SVM models using string kernels were added:
update.rfe was added.
cluster.resamples was added to the namespace.
An option to choose the
metric was added to
prcomp.resamples now passed
prcomp. Also the call to
prcomp uses the formula method so that
na.action can be used.
resamples was enhanced so that
rfe models that used
returnResamp="all" subsets the resamples to get the appropriate values and issues a warning. The function also fills in missing model names if one or more are not given.
Several regression simulation functions were added:
print.train was re-factored so that
format.data.frame is now used. This should behave better when using knitr.
The error message in
train.formula was improved to provide more helpful feedback in cases where there is at least one missing value in each row of the data set.
ggplot.train was modified so that groups are distinguished by color and shape.
Options were added to
nameInStrip that will print the name and value of any tuning parameters shown in panels.
A bug was fixed by Jia Xu within the knn imputation code used by
A missing piece of documentation in
trainControl for adaptive models was filled in.
A warning was added to
ggplot.train to note that the relationship between the resampled performance measures and the tuning parameters can be deceiving when using adaptive resampling.
A check was added to
trainControl to make sure that a value of
min makes sense when using adaptive resampling.
A man page with the list of models available via
train was added back into the package. See
Thoralf Mildenberger found and fixed a bug in the variable importance calculation for neural network models.
The output of
pamr models was updated to clarify the ordering of the importance scores.
getModelInfo was updated to generate a more informative error message if the user looks for a model that is not in the package's model library.
A bug was fixed related to how seeds were set inside of
"parRF" (parallel random forest) was added back into the library.
When case weights are specified in
train, the hold-out weights are exposed when computing the summary function.
A check was made to convert a
data.table given to
train to a data frame (see http://stackoverflow.com/questions/23256177/r-caret-renames-column-in-data-table-after-training).
Changes were made that stopped execution of
train if there are no rows in the data (changes suggested by Andrew Ziem)
Andrew Ziem also helped improve the documentation.
Several models were updated to work with case weights.
A bug in
rfe was found where the largest subset size had the same results as the full model. Thanks to Jose Seoane for reporting the bug.
For some parallel processing technologies, the package now exports more internal functions.
A bug was fixed in
rfe that occurred when LOO CV was used.
Another bug was fixed that occurred for some models when
tuneGrid contained only a single model.
A new system for user-defined models has been added. See http://caret.r-forge.r-project.org/custom_models.html.
When creating the grid of tuning parameter values, the column names no longer need to be preceded by a period. Periods can still be used as before but are not required. This isn't guaranteed to break backwards compatibility but it may in some cases.
trainControl now has a
method = "none" resampling
option that bypasses model tuning and fits the model to the entire
training set. Note that if more than one model is specified, an error will occur.
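A minimal sketch of the method = "none" option, assuming caret is installed. The k-nearest-neighbours model, the iris data, and the value k = 5 are illustrative choices; the tuning grid has exactly one row because only a single model can be specified.

```r
library(caret)

# No resampling: the model is fit once to the whole training set
ctrl <- trainControl(method = "none")

# Exactly one candidate model, so tuneGrid has a single row
fit <- train(
  Species ~ ., data = iris,
  method = "knn",
  tuneGrid = data.frame(k = 5),
  trControl = ctrl
)

head(predict(fit, newdata = iris))
```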
logicForest models were removed since the package is
RSimca models from the rrcovHD
package were added.
elm from the elmNN
package was added.
rknnBel from the rknn
package were added
brnn from the brnn
package was added.
xyplot.lift now have an argument
values that show the percentages of samples found for
the specified percentages of samples tested.
sbf should no longer throw
a warning that "executing
ggplot method for
train was added.
Imputation via medians was added to
preProcess by Zachary Mayer.
A small change was made to
rpart models. Previously, when the
final model is determined, it would be fit by specifying the model using the
cp argument of
rpart.control. This could lead to duplicated Cp
values in the final list of possible Cp values. The current version fits the
final model slightly differently. An initial model is fit using
cp = 0
then it is pruned using
prune.rpart to the desired depth. This
shouldn't be different for the vast majority of data sets. Thanks to Jeff
Evans for pointing this out.
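The two-step fit described above can be sketched directly with the rpart package. This is an illustration of the idea rather than the code used inside train; the iris data and the pruning value of 0.1 are arbitrary.

```r
library(rpart)

# Step 1: grow the full tree with no complexity penalty
full_tree <- rpart(Species ~ ., data = iris,
                   control = rpart.control(cp = 0))

# Step 2: prune back to the chosen complexity parameter
# (0.1 is an arbitrary value used for illustration)
final_tree <- prune(full_tree, cp = 0.1)

# The pruned tree has no more nodes than the full tree
c(full = nrow(full_tree$frame), pruned = nrow(final_tree$frame))
```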
The method for estimating sigma for SVM and RVM models was slightly
changed to make them consistent with how
rvm does the
The default behavior for
sbfControl is now
returnResamp = "final".
cluster was added as a general class with a specific method
The refactoring of model code resulted in a number of packages being eliminated from the depends field. Additionally, a few were moved to exports.
A bug in
spatialSign was fixed for data frames with
a single column.
Pre-processing was not applied to the training data set prior to grid creation. This is now done but only for models that use the data when defining the grid. Thanks to Brad Buchsbaum for finding the bug.
Some code was added to
rfe to truncate the subset
sizes in case the user over-specified them.
A bug was fixed in
gamFuncs for the
sbfControl were added so that the user can set the
seed at each resampling iteration (most useful for parallel
processing). Thanks to Allan Engelhardt for the recommendation.
Some internal refactoring of the data was done to prepare for some upcoming resampling options.
predict.train now has an explicit
argument defaulted to
na.omit. If imputation is used in
na.action = na.pass is recommended.
A bug was fixed in
dummyVars that occurred when
missing data were in
newdata. The function
contr.dummy is now deprecated and
should be used (if you are using it at all). Thanks to
stackexchange user mchangun for finding the bug.
A check is now done inside
levelsOnly = TRUE to see if any predictors share common
A new option
fullRank was added to
contr.treatment is used. Otherwise,
contr.ltfr is used.
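A short sketch of the difference between the two encodings, assuming caret is installed and that fullRank is an argument of dummyVars (as it is in current releases). The iris data are used only for illustration.

```r
library(caret)

# Less-than-full-rank encoding: one column per factor level
dv_all <- dummyVars(~ Species, data = iris)
x_all <- predict(dv_all, newdata = iris)

# Full-rank encoding: the first level is dropped
dv_fr <- dummyVars(~ Species, data = iris, fullRank = TRUE)
x_fr <- predict(dv_fr, newdata = iris)

c(all_levels = ncol(x_all), full_rank = ncol(x_fr))
```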
A bug in
train was fixed with
(thanks to stackoverflow user screechOwl for finding it).
protoclass function in the protoclass
package was added. The model uses a distance matrix as input and
train method also uses the proxy package to
compute the distance using the Minkowski distance. The two tuning
parameters are the neighborhood size (
eps) and the Minkowski
distance parameter (
A bug was (hopefully) fixed that occurred when some type of
parallel processing was used with
train. The problem is
methods package was not being loaded in the workers.
While reproducible, it is unknown why this occurs and why it is
only for some technologies and systems. The
is now a formal dependency and we coerce the workers to load it
A bug was fixed where some calls were printed twice.
versions of these models for two classes were added to
The method values are
The prediction code for the
ksvm models was changed. There
are some cases where the class predictions and the predicted class
probabilities disagree. This usually happens when the probabilities are
close to 0.50 (in the two class case). A kernlab bug has been
filed. In the meantime, if the
ksvm model uses a probability
model, the class probabilities are generated first and the predicted
class is assigned to the probability with the largest value. Thanks to
Kjell Johnson for finding that one.
print.train was changed so that tune parameters that are
logicals are printed well.
Added a few exemptions to the logic that determines whether a model call should be scrubbed.
An error trap was created to catch issues with missing importance scores in
twoClassSim was added for benchmarking classification models.
A bug was fixed in
predict.nullModel related to predicted class probabilities.
The version requirement for gbm was updated.
getTrainPerf was made visible.
The automatic tuning grid for
sda models from the sda package was changed to include
randomForests is used with
tuneLength == 1, the
randomForests default value for
mtry is used.
Maximum uncertainty linear discriminant analysis (
Mlda) and factor-based linear discriminant analysis (
RFlda) from the HiDimDA package were added to
Added the Yeo-Johnson power transformation from the car
package to the
train bug was fixed for the
rrlda model (found
by Tiago Branquinho Oliveira).
extraTrees model in the extraTrees package was
kknn.train model in the kknn package was
A bug was fixed in
lrFuncs where the class threshold was
improperly set (thanks to David Meyer).
The old function
getTrainPerf was finally made visible.
Some models are created using "do.call" and may contain the entire data set in the call object. A function to "scrub" some model call objects was added to reduce their size.
The tuning process for
sda:::sda models was changed to
A bug in
predictors.earth, discovered by Katrina Bennett,
A bug induced by version 5.15-052 for the bootstrap 632 rule was fixed.
The DESCRIPTION file as of 5.15-048 should have used a version-specific lattice dependency.
lift can compute gain and lift charts (and defaults to
The gbm model was updated to handle 3 or more classes.
For bagged trees using ipred, the code in
keepX = FALSE to save space. Pass in
TRUE to use out-of-bag sampling for this model.
Changes were made to support vector machines for classification
models due to bugs with class probabilities in the latest version of
prob.model will default to the value of
classProbs in the
trControl function. If
prob.model is passed in as an argument to
specification over-rides the default. In other words, to avoid
generating a probability model, set either
classProbs = FALSE
prob.model = FALSE.
bayesglm from the arm package.
A few bugs were fixed in
bag, thanks to Keith
Woolner. Most notably, out-of-bag estimates are now computed when the
prediction function includes a column called
Parallel processing was implemented in
avNNet, which can be turned off using an optional argument.
avNNet were given an additional argument in their respective
control files called
allowParallel that defaults to
Code, the code will be executed in parallel
if a parallel backend (e.g. doMC) is registered. When
allowParallel = FALSE, the parallel backend is always
ignored. The use case is when
train. If a parallel backend with P processors is being used,
the combination of these functions will create P^2 processes. Since
some operations benefit more from parallelization than others, the
user has the ability to concentrate computing resources for specific
A new resampling function called
contributed by Tony Cooper that generates cross-validation indices for
time series data.
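A hedged sketch of time-series resampling indices, assuming caret is installed and that the function referred to above is createTimeSlices. The series length, window, and horizon are arbitrary.

```r
library(caret)

y <- 1:10

# Rolling-origin slices: train on 5 points, test on the next 2
slices <- createTimeSlices(y, initialWindow = 5, horizon = 2,
                           fixedWindow = TRUE)

slices$train[[1]]  # indices used for fitting in the first slice
slices$test[[1]]   # indices held out in the first slice
length(slices$train)
```

With fixedWindow = TRUE the training window slides forward at constant width; with FALSE it would grow from the first observation.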
A few more options were added to
fixedWindow are applicable for when
indexOut is an optional list of
resampling indices for the hold-out set. By default, these values are
the unique set of data points not in the training set.
A bug was fixed in multiclass
glmnet models when
generating class probabilities (thanks to Bradley Buchsbaum for
The three vignettes were removed and two things were added: a smaller vignette and a large collection of help pages at http://caret.r-forge.r-project.org/.
Minkoo Seo found a bug where
na.action was not being properly
set with train.formula().
parallel.resamples was changed to properly account for
Some testing code was removed from
Fixed a bug in
sbf exposed by a new version of plyr.
To be more consistent with recent versions of lattice,
parallel.resamples function was changed to
ksvm now allows probabilities when class weights
are used, the default behavior in
train is to set
prob.model = TRUE unless the user explicitly sets it to
FALSE. However, I have reported a bug in
ksvm that gives
inconsistent results with class weights, so this is not advised at
this point in time.
Bugs were fixed in
rfeControl(saveDetails = TRUE) or
sbfControl(saveDetails = TRUE) an additional column is
rowIndex. This indicates the
row from the original data that is being held-out.
A bug was fixed that induced
NA values in SVM model predictions.
Many examples are wrapped in dontrun to speed up CRAN checking.
scrda methods were removed from the package (on
6/30/12, R Core sent an email that "since we haven't got fixes for
long standing warnings of the rda packages since more than half a year
now, we set the package to ORPHANED.")
C50 was added (model codes
Fixed a bug in
train with NaiveBayes when
fL != 0
The output of
verboseIter = TRUE was
modified to show the resample label as well as logging when the worker
started and stopped the task (better when using parallel processing).
Added a long-hidden function
downSample for class imbalances
upSample function was added for class imbalances.
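The sketch below shows both helpers on a simulated imbalanced data set. It assumes caret is installed; the predictors, class labels, and class counts are invented for the example.

```r
library(caret)

# Imbalanced two-class problem: 90 "no" vs 10 "yes"
set.seed(1)
x <- data.frame(a = rnorm(100), b = rnorm(100))
y <- factor(rep(c("no", "yes"), times = c(90, 10)))

down <- downSample(x, y)  # shrink the majority class
up   <- upSample(x, y)    # resample the minority class with replacement

table(down$Class)
table(up$Class)
```

Both functions return a data frame with the outcome in a column called Class by default.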
A new file, aaa.R, was added to be compiled first that tries to eliminate the dreaded 'no visible binding for global variable' false positives. Specific namespaces were used with several functions to avoid similar warnings.
A bug was fixed with
icr.formula that was so ridiculous,
I now know that nobody has ever used that function.
Fixed a bug when using
method = "oob" with
Some exceptions were added to
plot.train so that some
tuning parameters are better labeled.
bwplot.resamples now order
the models using the first metric.
A few of the lattice plots for the
resamples class were
changed such that when only one metric is shown: the strip is not
shown and the x-axis label displays the metric
trainControl(savePredictions = TRUE) an
additional column is added to
rowIndex. This indicates the row from the original data that is
A variable importance function for
nnet objects was
created based on Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review
and comparison of methods to study the contribution of variables in
artificial neural network models. Ecological Modelling, 160(3),
predictor function for
glmnet was updated and a
variable importance function was also added.
Raghu Nidagal found a bug in
predict.avNNet that was
specificity were given an
A first attempt at fault tolerance was added to
a model fit fails, the predictions are set to
NA and a warning
is issued (eg "model fit failed for Fold04: sigma=0.00392,
verboseIter = TRUE, the warning is also printed
to the log. Resampled performance is calculated on only the
non-missing estimates. This can also be done during predictions, but
must be done on a model by model basis. Fault tolerance was added for
kernlab models only at this time.
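The general pattern (catch the failure, record NA, warn, and summarise the remaining estimates) can be sketched in base R. This is not the code used inside train; fit_and_predict is a hypothetical stand-in that fails on purpose for one fold.

```r
# Hypothetical fitting routine that fails for one resample
fit_and_predict <- function(fold) {
  if (fold == "Fold04") stop("simulated fitting failure")
  runif(1, 0.7, 0.9)  # stand-in for a performance estimate
}

folds <- sprintf("Fold%02d", 1:5)

set.seed(2)
perf <- vapply(folds, function(fold) {
  tryCatch(
    fit_and_predict(fold),
    error = function(e) {
      # Report the failure but keep going with a missing value
      warning("model fit failed for ", fold, call. = FALSE)
      NA_real_
    }
  )
}, numeric(1))

# Summarise using only the non-missing estimates
mean(perf, na.rm = TRUE)
```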
lift was modified in two ways. First,
cuts is no
longer an argument. The function always uses cuts based on the number
of unique probability estimates. Second, a new argument called
label is available to use alternate names for the models
(e.g. names that are not valid R variable names).
A bug in
print.bag was fixed.
Class probabilities were not being generated for sparseLDA models.
Bugs were fixed in the new varImp methods for PART and RIPPER
A set of functions for RFE and logistic regression
lrFuncs) was added.
A bug in
method="glmStepAIC" was fixed
direction and other
stepAIC arguments were
A bug was fixed in
preProcess where the number of ICA
components was not specified. (thanks to Alexander Lebedev)
Another bug was fixed for oblique random forest methods in
train. (thanks to Alexander Lebedev)
The list of models that can accept factor inputs directly was
expanded to include the RWeka models,
cforest and custom models.
lda2, which tunes by the number of functions
used during prediction.
predict.train allows probability predictions for custom
models now (thanks to Peng Zhang)
confusionMatrix.train was updated to use the default
confusionMatrix code when
norm = "none" and only a
single hold-out was used.
Added variable importance metrics for PART and RIPPER in the RWeka package.
vignettes were moved from /inst/doc to /vignettes
The model details in
?train was changed to be more
Added two models from the RRF package.
RRF uses a
penalty for each predictor based on the scaled variable importance
scores from a prior random forest fit.
RRFglobal sets a common,
global penalty across all predictors.
Added two models from the KRLS package:
krlsPoly. Both have kernel parameters (
degree) and a common regularization parameter
lambda. The default for
NA, letting the
krls function estimate it internally.
lambda can also be
twoClassSummary was modified to wrap the call to
pROC:::roc in a
try command. In cases where the hold-out
data are only from one class, this produced an error. Now it generates
NA values for the AUC when this occurs and a general warning is
The underlying workflows for
train were modified so that
missing values for performance measures would not throw an error (but
will issue a warning).
rbfDDA were added from RSNNS.
met their end. The cake was a lie.
This NEWS file was converted over to Rd format.
lift was expanded into
for calculating the plot points and
create the plot.
The package vignettes were altered to stop loading external RData files.
match.call changes were made to pass new R CMD
xyplot.calibration were created to make probability
bdk from the kohonen
package were added.
update.train was added so that tuning parameters
can be manually set if the automated approach to setting their
values is insufficient.
method = "pls" in
plsr function used the default PLS algorithm
("kernelpls"). Now, the full orthogonal scores method is used. This
results in the same model, but a more extensive set of values is
calculated that enable VIP calculations (without much of a loss in
A check was added to
preProcess to ensure valid
method were used.
A new method,
kernelpls, was added.
summary methods were added to
train objects that pass the final model to their
Bugs were fixed that prevented hold-out predictions from being returned.
A bug in
roc was found when the classes were completely
The ROC calculations for
filterVarImp were changed to use the pROC
package. This, and other changes, have increased efficiency. For
filterVarImp on the cell segmentation data led to a
54-fold decrease in execution time. For the Glass data in the
mlbench package, the speedup was 37-fold. Warnings were
rocPoint regarding their deprecation.
random ferns (package rFerns) were added
Another sparse LDA model (from the penalizedLDA) was also added
Fixed a bug which occurred when
plsda models were used with class
As of 8/15/11, the
glmnet function was
updated to return a character vector. Because of this,
train required modification and a version requirement
was put in the package description file.
Shea X made a suggestion and provided code to improve the speed
of prediction when sequential parameters are used for
Andrew Ziem suggested an error check with
metric = "ROC" and
classProbs = FALSE.
Andrew Ziem found a bug in how
earth class probabilities
Andrew Ziem found another small bug with parallel processing and
train (functions in the caret namespace cannot be found).
Ben Hoffman found a bug in
pickSizeTolerance that was fixed.
Jiaye Yu found (and fixed) a bug in getting predictions back from
saveDetails = TRUE in
rfeControl will save the predictions on the hold-out
sets (Jiaye Yu wins the prize for finding that one).
trainControl now has a logical to save the hold-out predictions.
type = "prob" was added for
A bug was fixed where the extrapolation limits were being
predict.train but not in
extractPrediction. Thanks to Antoine Stevens for
changed calls to
predict.mvr since the pls package now has a
a beta version of custom models with
train is included. The
"caretTrain" vignette was updated with a new section that defines
how to make custom models.
laying some of the groundwork for custom models
updates to get away from deprecated (mean and sd on data frames)
The pre-processing in
train bug of the last
version was not entirely squashed. Now it is.
panel.lift was moved out of the examples in
?lift and into the
package along with another function,
lift now uses
panel.lift2 by default
Added robust regularized linear discriminant analysis from the rrlda package
evtree from evtree
A weird bug was fixed that occurred when some models were run with sequential parameters that were fixed to single values (thanks to Antoine Stevens for finding this issue).
Another bug was fixed where pre-processing with
train could fail
train did not occur for the final model fit
lift, was added to create lattice
objects for lift plots.
Several models were added from the obliqueRF package: 'ORFridge' (linear combinations created using L2 regularization), 'ORFpls' (using partial least squares), 'ORFsvm' (linear support vector machines), and 'ORFlog' (using logistic regression). As of now, the package only supports classification.
Added regression models
widekernelpls. These are new models since both
plsr have an argument
method, so the computational algorithm could not be
passed through using the three dots.
rpart was added that uses
cp as the tuning
parameter. To make the model codes more consistent,
ctree correspond to the nominal tuning parameters
mincriterion, respectively) and
ctree2 are the alternate versions using
The text for
ctree's tuning parameter was changed to '1 -
controls was not being properly passed
through in models
controls was not being set properly for
The print methods for
sbf did not recognize LOOCV
avNNet sometimes failed with categorical outcomes with
bag = FALSE
A bug in
preProcess was fixed that was triggered by matrices without
dimnames (found by Allan Engelhardt)
bagged MARS models with factor outcomes now work
cforest was using the argument
control instead of
A few bugs for class probabilities were fixed for
When looping over models and resamples, the foreach
package is now being used. Now, when using parallel processing, the
caret code stays the same and parallelism is invoked using
one of the "do" packages (e.g. doMC, doMPI, etc.). This
sbf. Their respective man pages have been revised to
illustrate this change.
The order of the results produced by
defaultSummary was changed
so that the ROC AUC is first
A few man and C files were updated to eliminate R CMD check warnings
Now that we are using foreach, the verbose option in
sbfControl are now defaulted to
rfe now returns the variable ranks in a single data frame (previously
there were data frames in lists of lists) for ease of use. This
will break code from previous versions. The built-in RFE functions
were also modified
confusionMatrix methods for
sbf were added
NULL values of 'method' in
preProcess are no longer allowed
a model for ridge regression was added (
method = 'ridge') based on
A bug was fixed in a few of the bagging aggregation functions (found by Harlan Harris).
Fixed a bug spotted by Richard Marchese Robinson in createFolds
when the outcome was numeric. The issue is that
createFolds is trying to randomize
k folds. With fewer than 40 samples, it could not
always do this and would generate fewer than
k folds in some
cases. The change will adjust the number of groups based on
k. For small sample sizes, it will not use
stratification. For larger data sets, it will at most group the
data into quartiles.
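A small usage sketch with a numeric outcome, assuming caret is installed. The simulated outcome and the choice of k = 5 are arbitrary; the grouping of the outcome into quantile-based strata happens inside createFolds and is not shown here.

```r
library(caret)

set.seed(3)
y <- rnorm(100)  # numeric outcome

# Returns a list of hold-out row indices, one element per fold
folds <- createFolds(y, k = 5)

lengths(folds)
```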
confusionMatrix.train was added to get an average
confusion matrix across resampled hold-outs when using the
train function for classification.
Added another model,
avNNet, that fits several neural networks
via the nnet package using different seeds, then averages the
predictions of the networks. There is an additional bagging
The default value of the 'var' argument of
bag was changed.
As requested, most options can be passed from
trainControl function was re-factored and several
thresh) were combined into a single
list option called
preProcOptions. The default is consistent
with the original configuration:
preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5)
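For reference, a minimal sketch of passing the combined list through trainControl, assuming caret is installed. The resampling settings (5-fold CV) are arbitrary; the option values repeat the defaults quoted above.

```r
library(caret)

ctrl <- trainControl(
  method = "cv",
  number = 5,
  # Same values as the defaults quoted above
  preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5)
)

ctrl$preProcOptions
```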
Another option was added to
option can be used to set exactly how many components are used
(as opposed to just a threshold). It defaults to
NULL so that
the threshold method is still used by default, but a non-null
When created within
train, the call for
preProcess is now
modified to be a text string ("scrubbed") because the call could
be very large.
Removed two deprecated functions:
A new version of the cell segmentation data was saved and the
original version was moved to the package website (see
segmentationData for location). First, several
discrete versions of some of the predictors (with the suffix
"Status") were removed. Second, there are several skewed
predictors with minimum values of zero (that would benefit from
some transformation, such as the log). A constant value of 1 was
added to these fields:
Some tweaks were made to
plot.train in an effort to get the group
key to look less horrid.
now able to estimate the time that these models take to predict new
samples. Their respective control objects have a new option,
timingSamps, that indicates how many of the training set samples
should be used for prediction (the default of zero means do not
estimate the prediction time).
xyplot.resamples was modified. A new argument,
what, has values:
"scatter" plots the resampled
performance values for two models;
"BlandAltman" plots the
difference between two models by the average (aka an MA plot) for two models;
"pTime" plots the total
model building and tuning time (
"t") or the final model
building time (
"m") or the time to produce predictions (
"p") against a confidence interval for the average
performance. 2+ models can be used.
Three new model types were added to
regsubsets in the leaps package:
nvmax, is the maximum number of terms in the
The seed was accidentally set when
preProcess used ICA (spotted
by Allan Engelhardt)
preProcess was always being called (even to do nothing)
(found by Guozhu Wen)
Added a few new models associated with the bst package: bstTree, bstLs and bstSm.
A model denoted as
"M5" that combines M5P and M5Rules from the
RWeka package. This new model uses either of these functions
depending on the tuning parameter
Fixed a bug with
method = "penalized". Thanks to
Fedor for finding it.
A new tuning parameter was added for
M5Rules controlling smoothing.
The Laplace correction value for Naive Bayes was also added as a tuning parameter.
varImp.RandomForest was updated to work. It now requires a recent
version of the party package.
A variable importance method was created for Cubist models.
Altered the earth/MARS/FDA labels to be more exact.
Added cubist models from the Cubist package.
A new option to
trainControl was added to allow
users to constrain the possible predicted values of the model to the
range seen in the training set or a user-defined range. One-sided
ranges are also allowed.
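A minimal base R sketch of what such a constraint does to a vector of predictions. The helper `clamp_predictions` and its arguments are hypothetical and are not the trainControl interface.

```r
# Hypothetical helper: clamp predictions to [lower, upper]; NA means "no bound".
clamp_predictions <- function(pred, lower = NA, upper = NA) {
  if (!is.na(lower)) pred <- pmax(pred, lower)
  if (!is.na(upper)) pred <- pmin(pred, upper)
  pred
}

train_y <- c(2.1, 3.4, 5.0, 7.8)
pred <- c(1.5, 4.2, 9.9)

# Two-sided: use the range seen in the training outcomes.
two_sided <- clamp_predictions(pred, lower = min(train_y), upper = max(train_y))
two_sided

# One-sided: only a lower bound of zero.
one_sided <- clamp_predictions(pred, lower = 0)
one_sided
```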
Two typos fixed in
print.sbf (thanks to Jan Lammertyn)
dummyVars failed with formulas using
(all.vars does not handle this well)
tree2 was failing for some classification models
When SVM classification models are used with
prob.model is automatically set to
FALSE (otherwise, it
is always set to
TRUE). A warning is issued that the model will
not be able to create class probabilities.
Also for SVM classification models, there are cases when the probability model generates negative class probabilities. In these cases, we assign a probability of zero then coerce the probabilities to sum to one.
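The correction amounts to two steps, sketched here in base R on a made-up probability matrix (the values are illustrative only):

```r
# Rows are samples, columns are classes; the first row has a negative entry.
probs <- matrix(c(-0.05, 0.60, 0.45,
                   0.20, 0.30, 0.50),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("a", "b", "c")))

fixed <- pmax(probs, 0)           # negative values become zero
fixed <- fixed / rowSums(fixed)   # each row is rescaled to sum to one
fixed
```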
Several typos in the help pages were fixed (thanks to Andrew Ziem).
Added a new model,
svmRadialCost, that fits the SVM model
and estimates the
sigma parameter for each resample (to
properly capture the uncertainty).
preProcess has a new method called
"range" that scales the predictors
to [0, 1] (which is approximate for new samples if the training set
range is narrow in comparison).
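Why the [0, 1] range is only approximate for new samples can be seen in a small base R sketch. The minimum and maximum come from the training set alone, so new values outside that range land outside [0, 1]. The data are made up.

```r
train_x <- c(10, 12, 15, 20)
new_x <- c(8, 14, 25)

rng <- range(train_x)   # learned from the training set only
scale_range <- function(x) (x - rng[1]) / (rng[2] - rng[1])

scaled_train <- scale_range(train_x)   # training values fall in [0, 1]
scaled_new <- scale_range(new_x)       # new values can fall below 0 or above 1
scaled_train
scaled_new
```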
A check was added to
train to make sure that, when the user passes
a data frame to
tuneGrid, the names are correct and complete.
print.train prints the number of classes and levels for classification
Added a few bagging modules. See ?bag.
Added basic timings of the entire call to
as well as the fit time of the final model. These are stored in an element
The data files were updated to use better compression, which added a higher R version dependency.
plot.train was pretty much re-written to more effectively use trellis theme
defaults and to allow arguments (e.g. axis labels, keys, etc) to be passed
in to over-ride the defaults.
Bug fix for lda bagging function
Bug fix for
predict.BoxCoxTrans would go all klablooey if there were missing
varImp.rpart was failing with some models (thanks to Maria Delgado)
A new class was added for estimating and applying the Box-Cox
transformation to data called BoxCoxTrans. This is also included as an
option to transform predictor variables. Although the Box-Tidwell
transformation was invented for this purpose, the Box-Cox transformation
is more straightforward, less prone to numerical issues and just as
effective. This method was also added to
Fixed mis-labelled x axis in
plot.train when a
transformation is applied for models with three tuning parameters.
When plotting a
train object with
"gbm" and multiple values of the shrinkage parameter, the ordering of
panels was improved.
Fixed bugs for regression prediction using
Another bug, reported by Jan Lammertyn, related to
extractPrediction with a single predictor was also fixed.
Fixed a bug where linear SVM models were not working for classification
'gcvEarth' which is the basic MARS model. The pruning procedure
is the nominal one based on GCV; only the degree is tuned by
'qrnn' for quantile regression neural networks from the qrnn package.
'Boruta' for random forest models with feature selection via the
Some changes to
print.train: the call is not automatically
printed (but can be when
print.train is explicitly invoked); the
"Selected" column is also not automatically printed (but can be);
non-table text now respects
options("width"); only significant
digits are now printed when tuning parameters are kept at a
Bug fixes to
preProcess related to complete.cases and a single predictor.
For knn models (knn3 and knnreg), added automatic conversion of data frames to matrices
A new function for
rfe with gam was added.
"Down-sampling" was implemented with
bag so that, for
classification models, each class has the same number of samples
as the smallest class.
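Down-sampling itself is simple to sketch in base R. This shows the general technique on a made-up outcome and is not the code used inside bag.

```r
set.seed(2)
y <- factor(c(rep("major", 20), rep("minor", 5)))
n_min <- min(table(y))   # size of the smallest class

# Sample n_min row indices from every class, without replacement.
keep <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

table(y[keep])
```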
Added a new class,
dummyVars, that creates an entire set of
binary dummy variables (instead of the reduced, full rank set).
The initial code was suggested by Gabor Grothendieck on R-Help.
The predict method is used to create dummy variables for any
RMSE functions for evaluating regression models
varImp.gam failed to recognize objects from mgcv
a small fix to test a logical vector
diff.resamples calculated the number of comparisons,
"models" argument was ignored.
predict.bag was ignoring
type = "prob"
Minor updates to conform to R 2.13.0
Added a warning to
train when class levels are not
valid R variable names.
Fixed a bug in the variable importance function for
Added p-value adjustments to
summary.diff.resamples. Confidence intervals in
dotplot.diff.resamples are adjusted accordingly if the
Bonferroni correction is used.
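For reference, the standard Bonferroni adjustment looks like this in base R. The p-values are made up, and the interval line shows the usual way of splitting alpha across comparisons, which is assumed here to match what the plotting method does.

```r
p_raw <- c(0.010, 0.030, 0.200)   # made-up p-values for three comparisons
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj

# The matching interval adjustment: split alpha across the comparisons.
alpha <- 0.05
conf_level <- 1 - alpha / length(p_raw)
conf_level
```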
dotplot.resamples, no point was plotted when
the upper and/or lower interval values were NaN. Now, the point is
plotted but without the interval bars.
print.rfe to correctly describe new
Fixed a bug in
predict.rfe where an error was
thrown even though the required predictors were in
preProcess so that centering and scaling are both automatic
when PCA or ICA are requested.
Added two functions,
checkConditionalX that identify predictor data with
degenerate distributions when conditioned on a factor.
Added a high content screening data set (
segmentedData) from Hill et
al. Impact of image segmentation on high-content screening data quality
for SK-BR-3 cells. BMC bioinformatics (2007) vol. 8 (1) pp. 340.
Fixed bugs in how
sbf objects were printed (when using repeated
CV) and classification models with earth and
classProbs = TRUE.
Added imputation using bagged regression trees to
Fixed bug in
varImp.rfe that caused incorrect
results (thanks to Lawrence Mosley for the find).
Fixed a bug where
train would not allow knn imputation.
roc now check for missing values and
use complete data for each predictor (instead of case-wise
deletion across all predictors).
Fixed bug introduced in the last version with
createDataPartition(... list = FALSE).
Fixed a bug predicting class probabilities when using earth/glm models
Fixed a bug that occurred when
train was used with
Fixed bugs in
sbf when running in
parallel; not all the resampling results were saved
A p-value from McNemar's test was added to
print.train so that constant parameters are not
shown in the table (but a note is written below the table
instead). Also, the output was changed slightly to be
more easily read (I hope)
Expanded the tuning parameters for
Some of the examples in the Model Building vignette were changed
Added bootstrap 632 rule and repeated cross-validation
A new function,
used to generate indices for repeated CV.
The various resampling functions now have *named* lists as output (with prefixes "Fold" for cv and repeated cv and "Resample" otherwise)
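A base R sketch of what a named list of repeated CV indices can look like. The "Fold" prefix follows the note above; the ".Rep" suffix and the loop itself are assumptions made for illustration and are not the package code.

```r
# Sketch of repeated 3-fold CV indices as a named list of training rows.
set.seed(5)
n <- 12; k <- 3; times <- 2

index <- list()
for (r in seq_len(times)) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    name <- paste0("Fold", f, ".Rep", r)   # naming pattern is an assumption
    index[[name]] <- which(fold_id != f)   # rows used for training
  }
}
names(index)
```

Named elements make it easier to match a resampled performance value back to the hold-out that produced it.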
Pre-processing has been added to
train with the
preProcess argument. This has been tested when caret
functions are used with
preProcess(method = "spatialSign"), centering and
scaling is done automatically too. Also, a bug was fixed
that stopped the transformation from being executed.
knn imputation was added to
preProcess. The RANN package
is used to find the neighbors (the knn impute function in
the impute library was consistently generating segmentation
faults, so we wrote our own).
Changed the behavior of
preProcess in situations where
scaling is requested but there is no variation in the
predictor. Previously, the method would fail. Now a
warning is issued and the value of the standard
deviation is coerced to be one (so that scaling has no effect).
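The new behavior can be imitated in a few lines of base R. This is a sketch of the technique, not the preProcess source, and the data are made up.

```r
x <- cbind(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5))   # column b is constant

sds <- apply(x, 2, sd)
if (any(sds == 0)) {
  warning("zero-variance predictors found; their sd is set to 1")
  sds[sds == 0] <- 1
}

scaled <- sweep(x, 2, sds, "/")   # dividing by 1 leaves column b unchanged
scaled
```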
gam from mgcv (with smoothing splines and feature
gam from gam (with basic splines and loess)
smoothers. For these models, a formula is derived
from the data where "near zero variance" predictors (see
nearZeroVar) are excluded and predictors with
less than 10 distinct values are entered as linear
(i.e. unsmoothed) terms.
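A rough base R sketch of how such a formula could be derived. The helper `make_gam_formula` is hypothetical, and it only drops constant columns, which is a cruder rule than a real near-zero-variance filter.

```r
# Hypothetical helper: build a gam-style formula from a data frame.
make_gam_formula <- function(dat, outcome, min_distinct = 10) {
  preds <- setdiff(names(dat), outcome)
  n_distinct <- vapply(dat[preds], function(v) length(unique(v)), integer(1))
  # Simplification: only constant columns are excluded here.
  preds <- preds[n_distinct > 1]
  n_distinct <- n_distinct[preds]
  # Few distinct values: linear term. Otherwise: smooth term s(x).
  terms <- ifelse(n_distinct < min_distinct, preds, paste0("s(", preds, ")"))
  as.formula(paste(outcome, "~", paste(terms, collapse = " + ")))
}

set.seed(3)
dat <- data.frame(y = rnorm(50), x1 = rnorm(50),
                  x2 = rep(0:1, 25), x3 = rep(1, 50))
f <- make_gam_formula(dat, "y")
f
```

Here x1 is continuous and gets a smooth term, x2 has two distinct values and enters linearly, and the constant x3 is left out.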
Changed earth fit for classification models to use the
glm argument with a binomial family.
varImp.multinom, which is based on the absolute
values of the model coefficients
The feature selection vignette was updated slightly (again).
sbf to include class probabilities
in performance calculations.
Also, the names of the resampling indices were harmonized
The feature selection vignette was updated slightly.
Added the ability to include class probabilities in
performance calculations. See
Updated and restructured the main vignette.
Internal changes related to how predictions from models are stored and summarized. With the exception of loo, the model performance values are calculated by the workers instead of the main program. This should reduce i/o and lay some groundwork for upcoming changes.
The default grid for relaxo models was changed based on an initial model fit.
partDSA model predictions were modified; there were cases where the user might request X partitions, but the model only produced Y < X. In these cases, the partitions for missing models were replaced with the largest model that was fit.
modelLookup was put in the namespace and
a man file was added.
The names of the resample indices are automatically reset, even if the user specified them.
Fixed a bug generated a few versions ago where
fda objects crashed.
When computing the scale parameter for RBF kernels, the
option to automatically scale the data was changed to
logic.bagging in logicFT with
method = "logicBag"
Fixed a bug in
varImp.train related to nearest shrunken
Added logic regression and logic forests
Added an option to
splom.resamples so that the variables in the
scatter plots are models or metrics.
dotplot.resamples plus acknowledgements to Hothorn et al.
(2005) and Eugster et al. (2008)
tuneGrid option to allow a function
to be passed in.
prcomp method for the
resamples to work with
Cleaned up some of the man files for the resamples class
Fixed a bug in
not being passed to the test statistic function.
Added more log messages in
train when running verbose.
Added the German credit data set.
Added a general framework for bagging models via the
bag function. Also, model type
"hdda" from the
HDclassif package was added.
The resampling estimate of the standard deviation given by
train since v 4.39 was wrong
A new field was added to
"estimate". In cases where the mvr model had multiple
estimates of performance (e.g. training set, CV, etc) the user can
now select which estimate they want to be used in the importance
calculation (thanks to Sophie Bréand for finding this)
predict.sbf and modified the structure of
sbf helper functions. The
only computes the metric used to filter and the filter function does
the actual filtering. This was changed so that FDR corrections or
other operations that use all of the p-values can be computed.
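The benefit of separating the two steps can be sketched in base R. The simulated data, the t-test score and the 0.05 cutoff are all assumptions for the example and are not the sbf helper functions themselves.

```r
# Step 1 ("score"): one p-value per predictor; nothing is filtered yet.
set.seed(4)
y <- factor(rep(c("a", "b"), each = 20))
x <- data.frame(signal = rnorm(40) + 2 * (y == "b"),
                noise1 = rnorm(40),
                noise2 = rnorm(40))
p_values <- vapply(x, function(col) t.test(col ~ y)$p.value, numeric(1))

# Step 2 ("filter"): sees all p-values at once, so an FDR correction works.
keep <- p.adjust(p_values, method = "BH") < 0.05
names(x)[keep]
```

If scoring and filtering happened one predictor at a time, the correction in step 2 would have no access to the other p-values.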
Also, the formatting of p-values in
An argument was added to
so that the variable name is returned instead of the index.
Independent component analysis was added to the list of pre-processing operations and a new model ("icr") was added to fit a pcr-like model with the ICA components.
hda and cleaned up the caret training vignette
Added several classes for examining the resampling results. There are methods for estimating pair-wise differences and lattice functions for visualization. The training vignette has a new section describing the new features.
Added partDSA and
stepAIC for linear models and
generalized linear models
Fixed a new bug in how resampling results are exported
Added penalized linear models from the foba package
rocc classification and fixed a typo.
Added two new data sets:
Added GAMens (ensembles using gams)
Fixed a bug in
roc that, for some data cases, would reverse the "positive"
class and report sensitivity as specificity and vice-versa.
Added a parallel random forest method in
train using the foreach package.
Also added penalized logistic regression using the
plr function in the
Added a new feature selection function,
sbf (for selection by filter).
Fixed bug in
rfe that did not affect the results, but did produce
A new model function,
nullModel, was added. This model fits either the
mean only model for regression or the majority class model for classification.
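What such a baseline predicts can be shown with a small base R stand-in. The function `null_predict` is hypothetical and is not the nullModel interface.

```r
# Hypothetical helper: predictions from an intercept-only baseline.
null_predict <- function(y, n_new) {
  if (is.factor(y)) {
    majority <- names(which.max(table(y)))   # most frequent class
    factor(rep(majority, n_new), levels = levels(y))
  } else {
    rep(mean(y), n_new)                      # mean of the outcome
  }
}

null_predict(c(1, 2, 6), n_new = 2)
null_predict(factor(c("no", "yes", "no")), n_new = 2)
```

A model that cannot beat these predictions is not extracting anything useful from the predictors.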
Also, ldaFuncs had a bug fixed.
Minor changes to Rd files
For whatever reason, there is now a function in the spls package by the name of splsda that does the same thing. A few functions and a man page were changed to ensure backwards compatibility.
Added stepwise variable selection for
qda using the
stepclass function in klaR
Added robust linear and quadratic discriminant analysis functions from rrcov.
Also added another column to the output of
saves the name of the model object so that you can have multiple
models of the same type and tell which predictions came from which
Changes were made to
plotClassProbs: new parameters were added
and densityplots can now be produced.
Fixed a bug in
caretFunc that led to NaN variable rankings, so
that the first k terms were always selected.
Added parallel processing functionality for
Added the ability to use custom metrics with
Many Rd changes to work with updated parser.
Re-saved data in more compressed format
pcr as a method
Weights argument was added to
train for models that accept weights
Also, a bug was fixed for lasso regression (wrong lambda specification) and another for prediction in naive Bayes models with a single predictor.
Fixed bug in new
nearZeroVar and updated
format.earth so that it
does not automatically print the formula
Added a new version of
nearZeroVar from Allan Engelhardt that is
Fixed bugs in
extractProb (for glmnet) and
For glmnet, the user can now pass in their own value of family to
train will set it depending on the mode of the
outcome). However, glmnet doesn't have much support for families at
this time, so you can't change links or try other distributions.
Fixed bug in
createFolds when the smallest y value is more than 25%
of the data
Fixed bug in
Added vbmp from vbmp package
Added additional error check to
Fixed an absurd typo in
Added: linear kernels for svm, rvm and Gaussian processes;
rlm from MASS; a knn regression model, knnreg
A set of functions (class "
classDist") to compute the class
centroids and covariance matrix for a training set for
determining Mahalanobis distances of samples to each class
centroid was added
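The underlying computation can be sketched with base R's mahalanobis function and the built-in iris data. This illustrates the technique only and does not use the classDist functions; note that mahalanobis returns squared distances.

```r
# Squared distance of every iris sample to each class centroid, using
# that class's own covariance matrix.
x <- as.matrix(iris[, 1:4])
classes <- iris$Species

dists <- sapply(levels(classes), function(cl) {
  xc <- x[classes == cl, , drop = FALSE]
  mahalanobis(x, center = colMeans(xc), cov = cov(xc))
})

dim(dists)   # one row per sample, one column per class
head(dists, 3)
```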
A set of functions (
rfe) for doing recursive feature selection
(aka backwards selection). A new vignette was added for more
PART from RWeka
Fixed error in documentation for
confusionMatrix. The old doc had
"Detection Prevalence = A/(A+B)" and the new one has
"Detection Prevalence = (A+B)/(A+B+C+D)". The underlying code was correct.
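A worked example of the corrected formula in base R, assuming the 2x2 layout in which A and B are the samples predicted to be events (A correctly, B incorrectly) and C and D are the samples predicted to be non-events. The counts are made up.

```r
A <- 40; B <- 10; C <- 5; D <- 45
total <- A + B + C + D

detection_prevalence <- (A + B) / total   # share predicted as events
detection_rate <- A / total               # share correctly predicted as events
c(detection_prevalence = detection_prevalence,
  detection_rate = detection_rate)
```

With these counts, half of the samples are predicted as events, so the detection prevalence is 0.5, while the old documented formula A/(A+B) would have given 0.8.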
step as parameters)
bagEarth to allow
for classification models
Added glmnet models
Added code for sparse PLS classification.
Fix a bug in prediction for
Updated again for more stringent R CMD check tests in R-devel 2.9
Updated for more stringent R CMD check tests in R-devel 2.9
Significant internal changes were made to how the models are
fit. Now, the function used to compute the models is passed in as a
parameter (defaulting to
lapply). In this way, users can use
their own parallel processing software without new versions of
caret. Examples are given in
Also, fixed a bug where the MSE (instead of RMSE) was reported for random forest OOB resampling
There are more examples in
specificity and the predictive value functions: each was made
more generic with default and
confusionMatrix "extractor" functions for matrices and tables
were added; the pos/neg predicted value computations were changed to
incorporate prevalence; prevalence was added as an option to several
functions; detection rate and prevalence statistics were added to
confusionMatrix; and the examples were expanded in the help
This version of caret will break compatibility with caretLSF and caretNWS. However, these packages will not be needed now and will be deprecated.
Updated the man files and manuals.
Fixed bug in
resampleHist. Also added a check in the
that error trapped with
glm models and > 2 classes
glms. Also, added
varImp.bagEarth to the
sda from the sda package. There was a naming
method value for
sparseLDA was changed from "sda" to
spls from the spls package
Added caching of RWeka objects so that they can be saved to the file system and used in other sessions. (changes per Kurt Hornik on 2008-10-05)
sda from the sparseLDA package (not on
Also, a bug was fixed where the ellipses were not passed into a
few of the newer models (such as
Added the penalized model from the penalized package. In caret, it is regression only although the package allows for classification via glm models. However, it does not allow the user to pass the classes in (just an indicator matrix). Because of this, it doesn't really work with the rest of the classification tools in the package.
Added a little more formatting to
gbm, let the user over-ride the default value of the
distribution argument (brought to us by Peter Tait via RHelp).
predict.preProcess so that it doesn't crash if
newdata does not have all of the variables used to originally
pre-process *unless* PCA processing was requested.
Fixed bug in
varImp.rpart when the model had only primary
Minor changes to the Affy normalization code
Changed typo in
predictors man page
Added a new class called
predictors that returns the
names of the predictors that were used in the final model.
ppr from the
Minor update to the project web page to deal with IE issues
Added the ability of
train to use custom made performance
functions so that the tuning parameters can be chosen on the basis of
things other than RMSE/R-squared and Accuracy/Kappa.
A new argument was added to
"summaryFunction" that is used to specify the function used to
compute performance metrics. The default function preserves the
functionality prior to this new version
A new argument to
train is "maximize" which is a logical
for whether the performance measure specified in the "metric"
train should be maximized or minimized.
The selection function specified in
the maximize argument with it so that customized performance
metrics can be used.
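A sketch of a custom performance function, written against the interface used by current caret versions (a data frame with "obs" and "pred" columns goes in, a named numeric vector comes out). The name `mae_summary` and the data are made up, and the commented call at the end is an untested outline of how the pieces would be wired together.

```r
# Hypothetical custom summary function: mean absolute error.
mae_summary <- function(data, lev = NULL, model = NULL) {
  c(MAE = mean(abs(data$obs - data$pred)))
}

held_out <- data.frame(obs = c(1, 2, 3), pred = c(1.5, 2, 2))
mae_summary(held_out)

# With caret loaded, usage would look roughly like this. MAE is an
# error measure, so it should be minimized:
# ctrl <- trainControl(summaryFunction = mae_summary)
# fit <- train(x, y, method = "lm", metric = "MAE", maximize = FALSE,
#              trControl = ctrl)
```

The name of the returned element ("MAE" here) is what the metric argument refers to.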
A bug was fixed in
confusionMatrix (thanks to Gabor
Another bug was fixed related to predictions from least squares SVMs
superpc from the superpc package. One note:
data argument that is passed to
superpc is saved in
the object that results from
superpc.train. This is used later
in the prediction function.
slda from ipred.
Fixed a few bugs related to the lattice plots from version 3.33.
Also added the ripper (aka
JRip) and logistic model trees
stripplot.train. These are all
functions to plot the resampling points. There is some overlap between
plot.train gives the average metrics only
while these plot all of the resampled performance
resampleHist could plot all of the points, but only
for the final optimal set of predictors.
To use these functions, there is a new argument in
returnResamp which should be one of
"none", "final" or "all". The default is "final" to be
consistent with previous versions, but "all" should be specified to
use these new functions to their fullest.
added to use as alternatives to the
Added C4.5 (aka
J48) and rules-based models (M5 prime) from
logitBoost from the caTools
package. This package doesn't have a namespace and RWeka has a
function with the same name. It was suggested to use the "::" prefix
to differentiate them (but we'll see how this works).