Learning algorithms create models that relate input data to the outcome. Development data is used to train models, after which they can be used to predict outcomes for new data. Familiar implements several commonly used algorithms. This vignette first describes the learners and their hyperparameters. Then, hyperparameter optimisation is described in more detail.
learner | tag | binomial | multinomial | continuous | count | survival |
---|---|---|---|---|---|---|
generalised linear models | ||||||
general^{a} | glm |
× | × | × | × | × |
logistic | glm_logistic |
× | ||||
cauchy | glm_cauchy |
× | ||||
complementary log-log | glm_loglog |
× | ||||
normal | glm_probit |
× | ||||
multinomial | glm_multinomial |
× | ||||
log-normal | glm_log ,
glm_log_gaussian |
× | ||||
normal (gaussian) | glm_gaussian |
× | ||||
inverse gaussian | glm_inv_gaussian |
× | ||||
log-poisson | glm_log_poisson |
x | × | |||
poisson | glm_poisson |
x | × | |||
cox | cox |
× | ||||
exponential | survival_regr_exponential |
× | ||||
gaussian | survival_regr_gaussian |
× | ||||
logistic | survival_regr_logistic |
× | ||||
log-logistic | survival_regr_loglogistic |
× | ||||
log-normal | survival_regr_lognormal |
× | ||||
survival regression^{a} | survival_regr |
× | ||||
weibull | survival_regr_weibull |
× | ||||
lasso regression models | ||||||
general^{a} | lasso |
× | × | × | × | × |
logistic | lasso_binomial |
× | ||||
multi-logistic | lasso_multinomial |
× | ||||
normal (gaussian) | lasso_gaussian |
× | ||||
poisson | lasso_poisson |
× | ||||
cox | lasso_cox |
× | ||||
ridge regression models | ||||||
general^{a} | ridge |
× | × | × | × | × |
logistic | ridge_binomial |
× | ||||
multi-logistic | ridge_multinomial |
× | ||||
normal (gaussian) | ridge_gaussian |
× | ||||
poisson | ridge_poisson |
× | ||||
cox | ridge_cox |
× | ||||
elastic net regression models | ||||||
general^{a} | elastic_net |
× | × | × | × | × |
logistic | elastic_net_binomial |
× | ||||
multi-logistic | elastic_net_multinomial |
× | ||||
normal (gaussian) | elastic_net_gaussian |
× | ||||
poisson | elastic_net_poisson |
× | ||||
cox | elastic_net_cox |
× | ||||
boosted linear models | ||||||
general^{a} | boosted_glm |
× | × | × | × | |
area under the curve | boosted_glm_auc |
× | ||||
cauchy | boosted_glm_cauchy |
× | ||||
complementary log-log | boosted_glm_loglog |
× | ||||
logistic | boosted_glm_logistic |
× | ||||
log-logistic | boosted_glm_log |
× | ||||
normal | boosted_glm_probit |
× | ||||
gaussian | boosted_glm_gaussian |
× | ||||
huber loss | boosted_glm_huber |
× | ||||
laplace | boosted_glm_laplace |
× | ||||
poisson | boosted_glm_poisson |
× | ||||
concordance index | boosted_glm_cindex |
× | ||||
cox | boosted_glm_cox |
× | ||||
log-logistic | boosted_glm_loglog |
× | ||||
log-normal | boosted_glm_lognormal |
× | ||||
rank-based estimation | boosted_glm_gehan |
× | ||||
survival regression^{a} | boosted_glm_surv |
× | ||||
weibull | boosted_glm_weibull |
× | ||||
extreme gradient boosted linear models | ||||||
general^{a} | xgboost_lm |
× | × | × | × | × |
logistic | xgboost_lm_logistic |
× | × | × | ||
gamma | xgboost_lm_gamma |
× | ||||
gaussian | xgboost_lm_gaussian |
× | x | |||
poisson | xgboost_lm_poisson |
× | ||||
cox | xgboost_lm_cox |
× | ||||
random forests | ||||||
random forest (RFSRC) | random_forest_rfsrc |
× | × | × | × | × |
random forest (ranger) | random_forest_ranger |
× | × | × | × | × |
boosted regression trees | ||||||
general^{a} | boosted_tree |
× | × | × | × | |
area under the curve | boosted_tree_auc |
× | ||||
cauchy | boosted_tree_cauchy |
× | ||||
complementary log-log | boosted_tree_loglog |
× | ||||
logistic | boosted_tree_logistic |
× | ||||
log-logistic | boosted_tree_log |
× | ||||
normal | boosted_tree_probit |
× | ||||
gaussian | boosted_tree_gaussian |
× | ||||
huber loss | boosted_tree_huber |
× | ||||
laplace | boosted_tree_laplace |
× | ||||
poisson | boosted_tree_poisson |
× | ||||
concordance index | boosted_tree_cindex |
× | ||||
cox | boosted_tree_cox |
× | ||||
log-logistic | boosted_tree_loglog |
× | ||||
log-normal | boosted_tree_lognormal |
× | ||||
rank-based estimation | boosted_tree_gehan |
× | ||||
survival regression^{a} | boosted_tree_surv |
× | ||||
weibull | boosted_tree_weibull |
× | ||||
extreme gradient boosted trees | ||||||
general^{a} | xgboost_tree |
× | × | × | × | × |
logistic | xgboost_tree_logistic |
× | × | × | ||
gamma | xgboost_tree_gamma |
× | ||||
gaussian | xgboost_tree_gaussian |
× | x | |||
poisson | xgboost_tree_poisson |
× | ||||
cox | xgboost_tree_cox |
× | ||||
extreme gradient boosted DART trees | ||||||
general^{a} | xgboost_dart |
× | × | × | × | × |
logistic | xgboost_dart_logistic |
× | × | × | ||
gamma | xgboost_dart_gamma |
× | ||||
gaussian | xgboost_dart_gaussian |
× | x | |||
poisson | xgboost_dart_poisson |
× | ||||
cox | xgboost_dart_cox |
× | ||||
bayesian models | ||||||
naive bayes | naive_bayes |
× | × | |||
nearest neighbour models | ||||||
k-nearest neighbours^{b} | k_nearest_neighbours_* ,
knn_* |
× | × | x | x | |
support vector machines | ||||||
\(C\)-classification^{c} | svm_c_* |
× | × | |||
\(\nu\)-classification/ regression^{c} | svm_nu_* |
× | × | × | × | |
\(\epsilon\)-regression^{c} | svm_eps_* |
× | × | |||
linear kernel^{d} | svm_*_linear |
× | × | × | × | |
polynomial kernel^{d} | svm_*_polynomial |
× | × | × | × | |
radial kernel^{d} | svm_*_radial |
× | × | × | × | |
sigmoid kernel^{d} | svm_*_sigmoid |
× | × | × | × |
Learners, their hyperparameters, and parallelisation options can be specified using the tags or arguments below:
tag / argument | description | default |
---|---|---|
learner |
The desired learner. Multiple learners may be provided at the same time. This setting has no default and must be provided. | – (required) |
hyperparameter |
Each learner has one or more hyperparameters which can be specified. Sequential model-based optimisation is used to identify hyperparameter values from the data unless these are specifically specified here. See the section on hyperparameter optimisation for more details. | – (optional) |
parallel_model_development |
Enables parallel processing for model development.
Ignored if parallel=FALSE . |
TRUE |
Generalised linear models (GLM) are easy to understand, use and share. In many situations, GLM may be preferred over more complex models as they are easier to report and validate. The most basic GLM is the linear model. The linear model for an outcome variable \(y_j\) and a single predictor variable \(x_j\) for sample \(j\) is:
\[y_j=\beta_0 + \beta_1 x_j + \epsilon_j\] Here, \(\beta_0\) and \(\beta_1\) are both model coefficients. \(\beta_0\) is also called the model intercept. \(\epsilon_j\) is an error term that describes the difference between the predicted and the actual response for sample \(j\). When a linear model is developed, the \(\beta\)-coefficients are set in such manner that the mean-squared error over the sample population is minimised.
The above model is a univariate model as it includes only a single predictor. A multivariate linear model includes multiple predictors \(\mathbf{X_j}\) and is denoted as:
\[y_j= \mathbf{\beta}\mathbf{X_j} + \epsilon_j\]
\(\mathbf{\beta}\mathbf{X_j}\) is the linear predictor. The GLM generalises this linear predictor through a linking function \(g\) (Nelder and Wedderburn 1972):
\[y_j=g\left(\mathbf{\beta}\mathbf{X}\right) + \epsilon_j\]
For example, binomial outcomes are commonly modelled using logistic
regression with the logit
linking function. Multiple
linking functions are available in familiar and are detailed below.
Linear models for binomial outcomes derive from the
stats
package which is part of the R core distribution
(R Core Team 2019). Hyperparameters for
these models include linkage functions and are shown in the table
below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
linking function | family |
logistic , probit ,
loglog , cauchy |
glm only |
The linking function is not optimised when it is
specified, e.g. glm_logistic . |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples .
Instances with the majority class receive less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
The linear model for multinomial outcomes is implemented using the
VGAM::vglm
function (T. W. Yee and
Wild 1996; T. Yee 2010; Thomas W. Yee 2015). Hyperparameters for
this model are shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
linking function | family |
multinomial |
no | There is only one linking function available. |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples .
Instances with the majority class receive less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
Linear models for continuous and count-type outcomes derive from the
stats
package of the R core distribution (R Core Team 2019). Hyperparameters for these
models include linkage functions and are shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
linking function | family |
gaussian , log_gaussian ,
inv_gaussian , poisson ,
log_poisson |
glm only |
The linking function is not optimised when it is
specified, e.g. glm_poisson . gaussian ,
log_gaussian , inv_gaussian linking functions
are not available for count-type outcomes. |
Linear models for survival outcomes are divided into semi-parametric
and parametric models. The semi-parametric Cox proportional hazards
model (Cox 1972) is based on the
survival::coxph
function (Therneau
and Grambsch 2000). Tied survival times are resolved using the
default method by Efron (1977).
Various fully parametric models are also available, which differ in
the assumed distribution of the outcome, i.e. the linking function. The
parametric models are all based on survival::survreg
(Therneau and Grambsch 2000).
Hyperparameters for semi-parametric and parametric survival models are shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
linking function | family |
weibull , exponential ,
gaussian , logistic , lognormal ,
loglogistic |
survival_regr only |
The linking function is not optimised when it is
specified, e.g. survival_regr_weibull . The non-parametric
cox learner does not have a linking function. |
Generalised linear models can be problematic as there is no inherent limit to model complexity, which can easily lead to overfitting. Penalised regression, or shrinkage, methods address this issue by penalising model complexity.
Three shrinkage methods are implemented in familiar
,
namely ridge regression, lasso and elastic net, which are all
implemented using the glmnet
package (Hastie, Tibshirani, and Friedman 2009; Simon et al.
2011). The set of hyperparameters is shown in the table below.
The optimal lambda
parameter is determined by
cross-validation as part of the glmnet::cv.glmnet
function,
and is not directly determined using hyperparameter optimisation.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
elastic net penalty | alpha |
\(\mathbb{R} \in \left[0,1\right]\) | elastic net | This penalty is fixed for ridge regression
(alpha = 0 ) and lasso (alpha = 1 ). |
optimal lambda | lambda_min |
lambda.1se , lambda.min |
no | Default is lambda.min . |
number of CV folds | n_folds |
\(\mathbb{Z} \in \left[3,n\right]\) | no | Default is \(3\) if \(n<30\), \(\lfloor n/10\rfloor\) if \(30\leq n \leq 200\) and \(20\) if \(n>200\). |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples for
binomial and multinomial outcomes, and
none otherwise. Instances with the majority class receive
less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
normalisation | normalise |
FALSE , TRUE |
no | Default is FALSE , as normalisation is part
of pre-processing in familiar. |
Boosting is a procedure which combines many weak learners to form a
more powerful panel (Schapire 1990; Hastie,
Tibshirani, and Friedman 2009). Boosting learners are implemented
using the mboost
package (Bühlmann
and Hothorn 2007; Hothorn et al. 2010; Hofner, Boccuto, and Göker
2015). Both linear regression (mboost::glmboost
) and
regression trees (mboost::blackboost
) are implemented. The
hyperparameters are shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
family | family |
logistic , probit ,
bin_loglog , cauchy , log ,
auc , gaussian , huber ,
laplace , poisson , cox ,
weibull , lognormal , surv_loglog ,
gehan , cindex |
general tags | The family is not optimised when it is specified,
e.g. boosted_tree_gaussian . |
boosting iterations | n_boost |
\(\mathbb{R} \in \left[0,\infty\right)\) | yes | This parameter is expressed on the \(\log_{10}\) scale, i.e. the actual input value will be \(10^\texttt{n_boost}\). The default range is \(\left[0, 3\right]\). |
learning rate | learning_rate |
\(\mathbb{R} \in \left[-\infty,0\right]\) | yes | This parameter is expressed on the \(\log_{10}\) scale. The default range is \(\left[-5,0\right]\). |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples for
binomial and multinomial outcomes, and
none otherwise. Instances with the majority class receive
less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
maximum tree depth | tree_depth |
\(\mathbb{Z} \in \left[1,n\right]\) | trees only | Maximum depth to which trees are allowed to grow. |
minimum sum of instance weight | min_child_weight |
\(\mathbb{R} \in \left[0,\infty\right)\) | trees only | Minimal instance weight required for further branch partitioning, or the number of instances required in each node. This parameter is expressed on the \(\log_{10}\) scale with a \(-1\) offset. The default range is \(\left[0, 2\right]\). |
significance split threshold | alpha |
\(\mathbb{R} \in \left(0.0, 1.0\right]\) | trees only | Minimum significance level for further splitting. The default range is \(\left[10^{-6}, 1.0\right]\) |
Optimising the number of boosting iterations may be slow as models
with a large number of boosting iterations take longer to learn and
assess. If this is an issue, the n_boost
parameter should
be set to a smaller range, or provided with a fixed value.
Low learning rates may require an increased number of boosting iterations.
Also note that hyperparameter optimisation time depends on the type
of family chosen, e.g. cindex
is considerably slower than
cox
.
Boosting is a procedure which combines many weak learners to form a
more powerful panel (Schapire 1990; Hastie,
Tibshirani, and Friedman 2009). Extreme gradient boosting is a
gradient boosting implementation that was highly successful in several
machine learning competitions. Learners are implemented using the
xgboost
package (Chen and Guestrin
2016). Three types are implemented: regression based boosting
(xgboost_lm
), regression-tree based boosting
(xgboost_tree
) and regression-tree based boosting with
drop-outs (xgboost_dart
).
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
family | learn_objective |
gaussian,
continuous_logistic , multinomial_logistic ,
binomial_logistic , poisson ,
gamma , cox |
general tags | The family is not optimised when it is specified,
e.g. xgboost_lm_linear . |
boosting iterations | n_boost |
\(\mathbb{R} \in \left[0,\infty \right)\) | yes | This parameter is expressed on the \(\log_{10}\) scale, i.e. the actual value will be \(10^\texttt{n_boost}\). The default range is \(\left[0, 3 \right]\). |
learning rate | learning_rate |
\(\mathbb{R} \in \left(-\infty,0\right]\) | yes | This parameter is expressed on the \(\log_{10}\) scale. The default range is \(\left[-5, 0 \right]\). |
L1 regularisation | alpha |
\(\mathbb{R} \in \left[-6, \infty\right)\) | yes | This parameter is expressed on the \(\log_{10}\) scale with a \(10^{-6}\) offset. The default range is \(\left[-6, 3\right]\). |
L2 regularisation | lambda |
\(\mathbb{R} \in \left[-6, \infty\right)\) | yes | This parameter is expressed on the \(\log_{10}\) scale with a \(10^{-6}\) offset. The default range is \(\left[-6, 3\right]\). |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples for
binomial and multinomial outcomes, and
none otherwise. Instances with the majority class receive
less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
maximum tree depth | tree_depth |
\(\mathbb{Z} \in \left[1,\infty\right)\) | gbtree , dart |
Maximum depth to which trees are allowed to grow. The default range is \(\left[1, 10\right]\). |
subsampling fraction | sample_size |
\(\mathbb{R} \in \left(0, 1.0\right]\) | gbtree , dart |
Fraction of available data that is used for to create a single tree. The default range is \(\left[2 / m, 1.0\right]\), with \(m\) the number of samples. |
minimum sum of instance weight | min_child_weight |
\(\mathbb{R} \in \left[0,\infty\right)\) | gbtree , dart |
Minimal instance weight required for further branch partitioning, or the number of instances required in each node. This parameter is expressed on the \(\log_{10}\) scale with a \(-1\) offset. The default range is \(\left[0, 2\right]\). |
min. splitting error reduction | gamma |
\(\mathbb{R} \in \left[-6, \infty\right)\) | gbtree , dart |
Minimum error reduction required to allow splitting.
This parameter is expressed on the \(\log_{10}\) scale with a \(10^{-6}\) offset. The default range is
\(\left[-6, 3\right]\).
continuous and count -type outcomes are
normalised to the \(\left[0, 1\right]\)
range prior to model fitting to deal with scaling issues. Values are
converted back to the original scale after prediction. |
DART sampling algorithm | sample_type |
uniform , weighted |
dart |
– |
drop-out rate | rate_drop |
\(\mathbb{R} \in \left[0,1\right)\) | dart |
– |
Random forests (explain) (Breiman
2001). The random forest learner is implemented through the
randomForestSRC
package, which provides a unified interface
for different types of forests (Ishwaran et al.
2008, 2011).
An outcome variable that represents count-type data is first transformed using a \(\log(x+1)\) transformation. Predicted responses are then transformed to the original scale using the inverse \(\exp(x)-1\) transformation.
Hyperparameters for random forest learners are shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
number of trees | n_tree |
\(\mathbb{Z} \in \left[0,\infty\right)\) | yes | This parameter is expressed on the \(\log_{2}\) scale, i.e. the actual input value will be \(2^\texttt{n_tree}\) (Oshiro, Perez, and Baranauskas 2012). The default range is \(\left[4, 10\right]\). |
subsampling fraction | sample_size |
\(\mathbb{R} \in \left(0, 1.0\right]\) | yes | Fraction of available data that is used for to create a single tree. The default range is \(\left[2 / m, 1.0\right]\), with \(m\) the number of samples. |
number of features at each node | m_try |
\(\mathbb{R} \in \left[0.0, 1.0\right]\) | yes | Familiar ensures that there is always at least one candidate feature. |
node size | node_size |
\(\mathbb{Z} \in \left[1, \infty\right)\) | yes | Minimum number of unique samples in terminal nodes. The default range is \(\left[5, \lfloor m / 3\rfloor\right]\), with \(m\) the number of samples. |
maximum tree depth | tree_depth |
\(\mathbb{Z} \in \left[1,\infty\right)\) | yes | Maximum depth to which trees are allowed to grow. The default range is \(\left[1, 10\right]\). |
number of split points | n_split |
\(\mathbb{Z} \in \left[0, \infty\right)\) | no | By default, splitting is deterministic and has one split point (\(0\)). |
splitting rule | split_rule |
gini , auc ,
entropy , mse , quantile.regr ,
la.quantile.regr , logrank ,
logrankscore , bs.gradient |
no | Default splitting rules are gini for
binomial and multinonial outcomes,
mse for continuous and count
outcomes and logrank for survival
outcomes. |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples for
binomial and multinomial outcomes, and
none otherwise. Instances with the majority class receive
less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
Note that optimising that optimising the number of trees can be slow,
as big forests take longer to construct and perform more computations
for predictions. Hence hyperparameter optimisation may be sped up by
limiting the range of the n_tree
parameter, or setting a
single value.
The second implementation of random forests comes from the
ranger
package (Wright and Ziegler
2017). It is generally faster than the implementation in
randomForestSRC
and allows for different splitting rules,
such as maximally selected rank statistics (Lausen and Schumacher 1992; Wright, Dankowski, and
Ziegler 2017) and concordance index-based splitting (Schmid, Wright, and Ziegler 2016).
Hyperparameters for the random forest are shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
number of trees | n_tree |
\(\mathbb{Z} \in \left[0,\infty\right)\) | yes | This parameter is expressed on the \(\log_{2}\) scale, i.e. the actual input value will be \(2^\texttt{n_tree}\) (Oshiro, Perez, and Baranauskas 2012). The default range is \(\left[4, 10\right]\). |
subsampling fraction | sample_size |
\(\mathbb{R} \in \left(0, 1.0\right]\) | yes | Fraction of available data that is used for to create a single tree. The default range is \(\left[2 / m, 1.0\right]\), with \(m\) the number of samples. |
number of features at each node | m_try |
\(\mathbb{R} \in \left[0.0, 1.0\right]\) | yes | Familiar ensures that there is always at least one candidate feature. |
node size | node_size |
\(\mathbb{Z} \in \left[1, \infty\right)\) | yes | Minimum number of unique samples in terminal nodes. The default range is \(\left[5, \lfloor m / 3\rfloor\right]\), with \(m\) the number of samples. |
maximum tree depth | tree_depth |
\(\mathbb{Z} \in \left[1,\infty\right)\) | yes | Maximum depth to which trees are allowed to grow. The default range is \(\left[1, 10\right]\). |
splitting rule | split_rule |
gini , extratrees ,
variance , logrank , C ,
maxstat |
no | Default splitting rules are gini for
binomial and multinomial outcomes and
maxstat for continuous , count and
survival outcomes. |
significance split threshold | alpha |
\(\mathbb{R} \in \left(0.0, 1.0\right]\) | maxstat |
Minimum significance level for further splitting. The default range is \(\left[10^{-6}, 1.0\right]\) |
sample weighting | sample_weighting |
inverse_number_of_samples ,
effective_number_of_samples , none |
no | Sample weighting allows for mitigating class
imbalances. The default is inverse_number_of_samples for
binomial and multinomial outcomes, and
none otherwise. Instances with the majority class receive
less weight. |
beta | sample_weighting_beta |
\(\mathbb{Z} \in \left[-6,-1\right]\) | effective_number_of_samples only |
Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\). |
Note that optimising the number of trees can be slow, as big forests
take longer to construct and perform more computations for predictions.
Hence hyperparameter optimisation may be sped up by limiting the range
of the n_tree
parameter or setting a single value.
The naive Bayes classifier uses Bayes rule to predict posterior
probabilities. The naive Bayes classifier uses the
e1071::naiveBayes
function (Meyer et
al. 2021). The hyperparameters of the classifier are shown in the
table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
laplace smoothing | laplace |
\(\mathbb{R} \in \left[0.0, \infty \right)\) | yes | The default range is \(\left[0, 10\right]\). |
k-nearest neighbours is a simple clustering algorithm that
classifies samples based on the classes of their k nearest
neighbours in feature space. The e1071::gknn
function is
used to implement k-nearest neighbours (Meyer et al. 2021). The hyperparameters are
shown in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
number of nearest neighbours | k |
\(\mathbb{Z} \in \left[1,m\right]\) | yes | The default range is \(\left[1, \lceil 2 m^{1/3} \rceil \right]\), with \(m\) the number of samples. |
distance metric | distance_metric |
all metrics supported by
proxy::dist |
yes | The default set of metrics is gower ,
euclidean and manhattan . |
Support vector machines were originally defined to find optimal margins between classes for classification problems. Support vector machines were popularized after the use of the kernel trick was described. Using the kernel trick the calculations are performed in an implicit high-dimensional feature space, which is considerably more efficient than explicit calculations (Boser, Guyon, and Vapnik 1992).
Familiar implements SVM using e1071::svm
. We tried
kernlab::ksvm
, but this function would reproducibly freeze
during unit testing. By default, both features and outcome data are
scaled internally. This was used to derive the default hyperparameter
ranges specified in the table below.
parameter | tag | values | optimised | comments |
---|---|---|---|---|
signature size | sign_size |
\(\mathbb{Z} \in \left[1,n\right]\) | yes | – |
SVM kernel | kernel |
linear , polynomial ,
radial , sigmoid |
no | Default is radial , unless specified as
part of the learner’s name. |
\(C\) | c |
\(\mathbb{R}\) | yes | The cost for violation of constraints is expressed on a \(\log_{10}\) scale, i.e. the actual value is \(10^\texttt{c}\). The default range is \(\left[-5, 3\right]\). |
\(\epsilon\) | epsilon |
\(\mathbb{R}\) | continuous and count
outcomes |
The error tolerance \(\epsilon\) is only used for regression SVM. \(\epsilon\) is expressed on a \(\log_{10}\) scale. The default range is \(\left[-5, 1\right]\). |
\(\nu\) | nu |
\(\mathbb{R}\) | nu SVM type |
nu provides upper bounds for the fraction
of training errors and lower bounds for the fraction of support vectors
(Chang and Lin 2011). It is expressed on a
\(\log_{10}\) scale. The default range
is \(\left[-5, 1\right]\). |
inverse kernel width | gamma |
\(\mathbb{R}\) | non-linear kernels | This parameter specifies the inverse of the kernel width. It is expressed on the \(\log_{10}\) scale. The default range is \(\left[-9, 3\right]\). |
polynomial degree | degree |
\(\mathbb{Z} \in \left[1, \infty\right)\) | polynomial kernel | The default range is \(\left[1, 5\right]\). |
kernel offset | offset |
\(\mathbb{R} \in \left[0, \infty\right)\) | polynomial and sigmoid kernels | Negative values are not allowed. The default range is \(\left[0, 1 \right]\). |
Several types of SVM algorithms exist for classification and regression. The following SVM types are implemented:
c
nu
nu
eps
The following kernels are implemented:
linear
polynomial
radial
sigmoid
Hyperparameter optimisation is conducted to select model parameters that are more likely to lead to generalisable results. The main hyperparameter optimisation framework used by familiar is based on sequential model-based optimisation (SMBO) (Hutter, Hoos, and Leyton-Brown 2011), but with significant updates and extensions.
Overall, the following steps are conducted to optimise hyperparameters for a given learner and dataset:
A set of bootstraps of the development dataset is made. This allows for training the models using the in-bag data and assessing performance using out-of-bag data. This is independent of any bootstraps that may have been defined as part of the experimental design. In case a subsampling method is specified in the experimental design, the corresponding internal development dataset is bootstrapped.
Variable importance is determined for each of the in-bag datasets to avoid positively biasing the optimisation process through information leakage.
An initial exploration of the hyperparameter space is conducted. A model is trained using each hyperparameter set and a bootstrap subsample. It is then assessed by computing an optimisation score based on model performance in the out-of-bag, and optionally, in-bag data.
Exploration of the hyperparameter space then continues by iteratively comparing the best known hyperparameter set against challenger hyperparameter sets in so-called intensify steps. The challengers are selected by predicting the expected model performance for a new hyperparameter set and computing its utility (Shahriari et al. 2016). In addition, we predict model run times, and actively prune potential slow challengers. Utility is computed using an acquisition function, of which several are available in familiar. The set of challenger hyperparameter sets may moreover be iteratively pruned, dependent on the exploration method.
Exploration stops if no significant improvement to model performance could be established by further exploring the hyperparameter space, the total number of iterations has been reached, or bootstraps samples have been exhausted.
Models take a certain time to train. Familiar is actively measuring this time during hyperparameter optimisation for two reasons: first, to optimise assignment of jobs to parallel nodes; and secondly, to prune potential challenger sets that produce models that tend to run longer than the best-known model. The strictness increases with increased exploration.
Let \(m\) be the number of bootstraps used to quantify the most visited hyperparameter sets, and \(M\) the total number of bootstrap samples that can be visited. Then let \(t_{\texttt{opt}}\) be the runtime of the best known model. The maximum time that a challenger hyperparameter set is allowed for training is then empirically set to:
\[t_{\texttt{max}} = \left(5 - 4 \frac{m}{M} \right) t_{\texttt{opt}} \]
Hence, \(t_{\texttt{max}}\) converges to \(t_{\texttt{opt}}\) for \(m \rightarrow M\).
If maximum runtime is relatively insignificant, i.e. \(t_{\texttt{max}} < 10.0\) seconds, a threshold of \(t_{\texttt{max}}= 10.0\) is used.
If the maximum runtime is not known, i.e. none of the hyperparameter sets evaluated so far produced a valid model, the maximum time threshold is set to infinite. In effect, no hyperparameter sets are pruned based on expected run time.
A random forest is trained to infer runtimes for (new) hyperparameter sets, based on the runtimes observed for visited hyperparameter sets. The random forest subsequently infers runtime for a challenger hyperparameter set. The runtime estimate is compared against \(t_{\texttt{max}}\), and if it exceeds the threshold, it is rejected and not evaluated.
A summary score \(s\) determines how good a set of hyperparameters \(\mathbf{x}\) is. This score is computed in two steps. First an objective score is computed to assess model performance for a hyperparameter set in a bootstrap subsample. Subsequently an optimisation score is computed within the subsample. Finally a summary score is determined from the optimisation score over all subsamples.
An objective score \(s''\) is computed from the performance metric value for a specific hyperparameter set for in-bag (IB; \(s''_{\textrm{IB}}\)) and out-of-bag data (OOB; \(s''_{\textrm{OOB}}\)). The objective score always lies in the interval \([-1.0, 1.0]\). An objective score of \(1.0\) always indicates the best possible score. A score of \(0.0\) indicates that the hyperparameter set leads to the same degree of performance as the best educated guess, i.e. the majority class (for binomial and multinomial outcomes), the median value (for continuous and count outcomes), or tied risk or survival times (for survival outcomes). Objective scores that are either missing or below \(-1.0\) are truncated to the value of \(-1.0\). In case multiple metrics are used to assess model performance, the mean of the respective objective scores is used.
The optimisation scores \(s'\)
and summary score \(s\) of each
hyperparameter set are subsequently computed using one of the following
functions, which can be set using the optimisation_function
parameter:
validation
, max_validation
(default):
The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The
summary score \(s\) is then the average
of optimisation scores \(s'\). This
is a commonly used criterion that tries to maximise the score on the OOB
validation data.
balanced
: The optimisation score for each bootstrap
subsample is \(s'=s''_{\textrm{OOB}} - \left|
s''_{\textrm{OOB}} - s''_{\textrm{IB}} \right|\).
The summary score \(s\) is then the
average of optimisation scores \(s'\). A variation on
max_validation
with a penalty for differences in
performance between the IB and OOB data. The underlying idea is that a
good set of hyperparameters should lead to models that perform well on
both development and validation data.
strong_balance
: The optimisation score for each
bootstrap subsample is \(s'=s''_{\textrm{OOB}} - 2\left|
s''_{\textrm{OOB}}-s''_{\textrm{IB}} \right|\).
The summary score \(s\) is then the
average of optimisation scores \(s'\). A variant of
balanced
with a stronger penalty term.
validation_minus_sd
: The optimisation score for each
bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The
summary score \(s\) is then the average
of optimisation scores \(s'\) minus
the standard deviation of \(s'\).
This penalises hyperparameter sets that lead to wide variance on OOB
data.
validation_25th_percentile
: The optimisation score
for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The
summary score \(s\) is the 25th
percentile of optimisation scores \(s'\). Like
validation_minus_sd
this penalises hyperparameter sets that
lead to wide variance on OOB data.
model_estimate
: The optimisation score for each
bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The
model used to infer utility of new hyperparameter sets (see the next
section Predicting
optimisation score for new hyperparameter sets) is used to predict
the expected optimisation score \(\mu(\mathbf{x})\). This score is used as
summary score \(s\). Not available for
random search.
model_estimate_minus_sd
: The optimisation score for
each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The
model used to infer utility of new hyperparameter sets (see the next
section Predicting
optimisation score for new hyperparameter sets) is used to predict
the expected optimisation score \(\mu(\mathbf{x})\) and its standard
deviation \(\sigma(\mathbf{x})\). Then
\(s = \mu(\mathbf{x}) -
\sigma(\mathbf{x})\). This penalises hyperparameter sets for
which a large variance is expected. Not available for random
search.
The summary score is then used to select the best known
hyperparameter set, i.e. the set that maximises the summary score.
Moreover, the optimisation scores are used to identify candidate
hyperparameter sets as described in the following section. The
optimisation function is set using the
optimisation_function
argument.
Any point in the hyperparameter space has a single, scalar, optimisation score value that is a priori unknown. During the optimisation process, the algorithm samples from the hyperparameter space by selecting hyperparameter sets and computing the optimisation score for one or more bootstraps. For each hyperparameter set the resulting scores are distributed around the actual value.
A key point of Bayesian optimisation is the ability to estimate the usefulness or utility of new hyperparameter sets. This ability is facilitated by modelling the optimisation score of new hyperparameter sets using the optimisation scores of observed hyperparameter sets.
The following models can be used for this purpose:
gaussian_process
(default): Creates a localised
approximate Gaussian deterministic Gaussian Process. This is implemented
using the laGP
package (Gramacy
2016).
bayesian_additive_regression_trees
or
bart
: Uses Bayesian Additive Regression Trees for
inference. Unlike standard random forests, BART allows for estimating
posterior distributions directly and can extrapolate. BART is
implemented using the BART
package (Sparapani, Spanbauer, and McCulloch
2021).
random_forest
: Creates a random forest for
inference. A random forest was originally used in the SMBO algorithm by
Hutter et al. (Hutter, Hoos, and Leyton-Brown
2011). A weakness of random forests is their lack of
extrapolation beyond observed values, which limits their usefulness in
exploiting promising areas of hyperparameter space somewhat.
In addition, familiar can perform random search (random
or random_search
). This forgoes the use of models to steer
optimisation. Instead, the hyperparameter space is sampled at
random.
The learner used to predict optimisation scores can be specified
using the hyperparameter_learner
parameter.
The expected values and the posterior values of optimisation scores from the models are used to compute utility of new hyperparameter sets using an acquisition function. The following acquisition functions are available in familiar (Shahriari et al. 2016). Let \(\nu=f(\mathbf{x})\) be the posterior distribution of hyperparameter set \(\mathbf{x}\), and \(\tau\) the best observed optimisation score. Let also \(\mu(\mathbf{x})\) and \(\sigma(\mathbf{x})\) be the sample mean and sample standard deviation for set \(\mathbf{x}\) (implicitly for round \(m\)):
improvement_probability
: The probability of
improvement quantifies the probability that the expected optimisation
score for a set \(\mathbf{x}\) is
better than the best observed optimisation score \(\tau\):
\[\alpha(\mathbf{x}) = P(\nu > \tau) = \Phi \left(\frac{\mu(\mathbf{x}) - \tau} {\sigma(\mathbf{x})}\right)\]
Here \(\Phi\) is the cumulative distribution function of a normal distribution. Note that this acquisition function is typically prone to convergence to local minima, although in our algorithm this may be mitigated by the alternating random search strategy.
improvement_empirical_probability
: Similar to
improvement_probability
, but based directly on optimisation
scores predicted by the individual decision trees:
\[\alpha(\mathbf{x}) = P(\nu > \tau) = \frac{1}{N} \sum_{i=1}^N \left[\nu_i > \tau\right]\]
with \(\left[\ldots\right]\) denoting an Iverson bracket, which takes the value 1 if the condition specified by \(\ldots\) is true, and 0 otherwise.
expected_improvement
(default): Expected improvement
is based on an improvement function, and is be computed as follows:
\[\alpha(\mathbf{x})=(\mu(\mathbf{x})-\tau) \Phi \left(\frac{\mu(\mathbf{x}) - \tau}{\sigma(\mathbf{x})}\right) + \sigma(\mathbf{x}) \phi\left(\frac{\mu(\mathbf{x}) - \tau}{\sigma(\mathbf{x})}\right)\]
with \(\phi\) denoting the normal probability density function. If \(\sigma(\mathbf{x}) = 0\), \(\alpha(\mathbf{x})=0\).
upper_confidence_bound
: This acquisition function is
based on the upper confidence bound of the distribution, and is defined
as (Srinivas et al. 2012):
\[\alpha(\mathbf{x}, t)=\mu(\mathbf{x}) + \beta_t \sigma(\mathbf{x})\]
Here the parameter \(\beta_t\) depends on the current round \(t\), which in our implementation would be roughly equivalent to the maximum number of bootstraps performed for any of the hyperparameter sets, minus \(1\). The \(\beta_t\) parameter is then defined as \(\beta_t= 2 \log(|D|(t+1)^2\pi^2/6\delta)\) (Srinivas et al. 2012). Here, \(\delta \in [0,1]\) and \(|D|\) the dimensionality of the hyperparameter space. For their experiments Srinivas et al. used \(\delta=0.1\), and noted that scaling down \(\beta_t\) by a factor \(5\) improves the algorithm. Hence, we use \(\beta_t= \frac{2}{5} \log\left(\frac{5}{3} |D|(t+1)^2\pi^2\right)\).
bayes_upper_confidence_bound
: Proposed by Kaufmann
et al. (Kaufmann, Cappé, and Garivier
2012), the Bayesian upper confidence bound is defined as:
\[\alpha(\mathbf{x}, t)=Q \left(1-\frac{1}{(t + 1) \log(n)^c}, \nu_t\right)\]
with \(Q\) being a quantile function. Note that we here deviate from the original definition by substituting the original \(t\) by \(t+1\), and not constraining the posterior distribution \(\nu\) to a known function. In the original, \(n\) defines the number of rounds, with \(t\) the current round. In our algorithm, this is interpreted as \(n\) being the total number of bootstraps, and \(t\) the maximum number of bootstraps already performed for any of the hyperparameter sets. Hence our implementation is the same, aside from the lack of constraint on the posterior distribution. Based on the simulations and recommendations by Kaufmann et al. we choose parameter \(c=0\).
The acquisition function can be specified using the
acquisition_function
parameter.
After selecting challenger hyperparameter sets, the optimisation score of models trained using these hyperparameter sets are compared against the best-known score in a run-off intensify step. During this step, models will be trained on new bootstrap data, compared, and then trained on another set of bootstraps, and so on. Pruning the set of challenger hyperparameters after each iteration reduces computational load. Familiar implements the following methods:
successive_halving
(default): The set of alternative
parameter sets is pruned by removing the worst performing half of the
sets after each step (Jamieson and Talwalkar
2016). The set of investigated parameter sets gets progressively
smaller.
stochastic_reject
: The set of alternative parameter
sets is pruned by comparing the performance of each parameter set with
that of the incumbent best parameter set using a paired Wilcoxon
test.
none
: The set of alternative parameter sets is not
pruned.
The method used to steer exploration can be set using the
exploration_method
parameter.
It is possible to set hyperparameters manually. This can be used to change hyperparameters that are fixed by default, to set a fixed value for randomised hyperparameters, or to provide a different search range for randomised hyperparameters.
Hyperparameters can be provided using the hyperparameter
tag. For the glm_logistic
learner an example tag may for
example look as follows:
<hyperparameter>
<glm_logistic>
<sign_size>5</sign_size>
</glm_logistic>
</hyperparameter>
Or as a nested list passed as the hyperparameter
argument to summon_familiar
:
hyperparameter = list("glm_logistic"=list("sign_size"=5))
More than one value can be provided. The behaviour changes depending on whether the hyperparameter is categorical or numeric variable. In case of categorical hyperparameters, the provided values define the search range. For numerical hyperparameters, providing two values sets the bounds of the search range, whereas providing more than two values will define the search range itself.
Hyperparameter optimization may be configured using the tags/arguments in the table below.
tag / argument | description | default |
---|---|---|
optimisation_bootstraps |
Maximum number of bootstraps created for optimisation. | 20 |
optimisation_determine_vimp |
If TRUE , compute variable importance for
each bootstrap. If FALSE use variable importance computed
during the feature selection step. |
TRUE |
smbo_random_initialisation |
If random initial parameter sets are
generated randomly from default ranges. If fixed or
fixed_subsample , the initial parameter sets are based on a
grid in parameter space. |
fixed_subsample |
smbo_n_random_sets |
Sets the number of hyperparameter sets drawn for
initialisation. Ignored if smbo_random_initialisation is
fixed . If smbo_random_initialisation is
fixed_subsample , the number of selected hyperparameters may
be lower. |
100 |
max_smbo_iterations |
Maximum number of intensify iterations of the SMBO algorithm. | 20 |
smbo_stop_convergent_iterations |
Number of subsequent convergent SMBO iterations required to stop hyperparameter optimisation early. | 3 |
smbo_stop_tolerance |
Tolerance for recent optimisation scores to determine convergence. | between 0.01 and 0.001 |
smbo_time_limit |
Time limit (in minutes) for the optimisation process. | NULL |
smbo_initial_bootstraps |
Number of bootstraps assessed initially. | 1 |
smbo_step_bootstraps |
Number of bootstraps used within each step of an intensify iteration. | 3 |
smbo_intensify_steps |
Number of intensify steps within each intensify iteration. | 5 |
optimisation_metric |
The metric used for optimisation,
e.g. auc . See the vignette on performance metrics for
available options. More than one metric may be specified. |
auc_roc (binomial ,
multinomial ); mse (continuous );
msle (count ); and
concordance_index (survival ) |
optimisation_function |
The optimisation function used (see Assessing goodness of hyperparameter sets). | balanced |
hyperparameter_learner |
Learner used to predict optimisation scores for new hyperparameter sets (see Predicting optimisation score for new hyperparameter sets). | gaussian_process |
acquisition_function |
The function used to quantify utility of hyperparameter sets (see Acquisition functions for utility of hyperparameter sets). | expected_improvement |
exploration_method |
Method used to explore challenger hyperparameter sets (see Exploring challenger sets). | successive_halving |
smbo_stochastic_reject_p_value |
The p-value level for stochastic pruning. | 0.05 |
parallel_hyperparameter_optimisation |
Enables parallel processing for hyperparameter
optimisation. Ignored if parallel=FALSE . Can be
outer to process multiple optimisation processes
simultaneously. |
TRUE |
Even though learners may be good at discrimination, model calibration can be lacklustre. Some learners are therefore recalibrated as follows:
3-fold cross-validation is performed. Responses are predicted using a new model trained on each training fold.
Three recalibration models are trained using the responses for the training folds as input (Niculescu-Mizil and Caruana 2005).
The three recalibration models are then applied to (new) responses of the full model, and the resulting values averaged for each sample.
The following learners currently undergo recalibration:
learner | outcome | recalibration model |
---|---|---|
xgboost_lm ,
xgboost_lm_logistic |
binomial, multinomial | glm_logistic |
xgboost_tree ,
xgboost_tree_logistic |
binomial, multinomial | glm_logistic |
xgboost_lm_cox |
survival | glm_cox |
xgboost_tree_cox |
survival | glm_cox |
boosted_glm_cindex ,
boosted_glm_gehan |
survival | glm_cox |
boosted_tree_cindex ,
boosted_tree_gehan |
survival | glm_cox |
Familiar currently does not recalibrate models by inverting the model calibration curves, but may do so in the future.