The BT package implements (Adaptive) Boosting Trees for Poisson distributed response variables, using the log-link function. When presented with data, the BT package offers the user the ability to build predictive models and to explore the influence of the different variables on the response, akin to a data mining or exploration task. The package is based on the original idea proposed by D. Hainaut, J. Trufin and M. Denuit; for the theoretical details, we refer the interested reader to their work.
To get started with the package, a user must:

- import their database into the R environment, as a data.frame;
- define the different parameters of the desired BT model.

Once these steps have been completed, a user can fit their BT model by a call to BT and subsequently evaluate its performance, make predictions, fit additional trees and plot the relative influence of the various predictor variables.
library(BT)
To demonstrate the BT package's usage, we'll use a simulated database where the response variable is Poisson distributed. The simulated database contains \(n = 50000\) records as well as the following variables (with their interpretation in an insurance MTPL context):
- Y, the Poisson distributed response variable (e.g. the number of claims);
- Gender, Age and Sport, which are directly linked to the response variable (e.g., resp., the policyholder's gender and age, and the vehicle type);
- Split, which is not linked to the response variable;
- ExpoR, which corresponds to the time-exposure (e.g. the well-known exposure-to-risk);
- Y_normalized = Y/ExpoR.
set.seed(4)
n <- 50000

Gender <- factor(sample(c("male", "female"), n, replace = TRUE))
Age <- sample(c(18:65), n, replace = TRUE)
Split <- factor(sample(c("yes", "no"), n, replace = TRUE))
Sport <- factor(sample(c("yes", "no"), n, replace = TRUE))

lambda <- 0.1 * ifelse(Gender == "male", 1.1, 1)
lambda <- lambda * (1 + 1 / (Age - 17)^0.5)
lambda <- lambda * ifelse(Sport == "yes", 1.15, 1)

ExpoR <- runif(n)

Y <- rpois(n, ExpoR * lambda)
Y_normalized <- Y / ExpoR

dataset <- data.frame(Y, Gender, Age, Split, Sport, ExpoR, Y_normalized)
BT parameters

Once the database has been imported, the user needs to define the different parameters that will be passed to the algorithm. These do not have to be stored upfront in a specific object; a simple call using the arguments suffices. We list below all the available parameters:
- formula: a symbolic description of the model to be fit. We emphasize that offsets are not supported with our approach.
- data: the database on which the computations will be performed.
- tweedie.power: experimental parameter: the Tweedie power referring to the response variable's distribution. Always set to 1, referring to the Poisson distribution.
- ABT: boolean value defining whether we fit a Boosting Tree (=FALSE) or an Adaptive Boosting Tree (=TRUE).
- n.iter: the number of iterations to use in the fitting.
- train.fraction: the percentage of the data that will be used as training set. The remaining part will be used as validation set.
- interaction.depth: the maximum number of splits in each tree of the expansion.
- shrinkage: acts as a regularization for additional iterations; generally, the smaller the shrinkage, the better the performance of the fit. However, a smaller shrinkage implies that the number of trees may need to be increased to achieve a given performance.
- bag.fraction: the fraction of the training observations randomly sub-sampled to fit a tree at each iteration. This has the effect of reducing the variance of the boosted model.
- colsample.bytree: the number of variables randomly sampled that will be used to build the next tree in the expansion.
- keep.data: allows the user to save the different databases used during the algorithm in a BTData object.
- is.verbose: whether the algorithm's evolution has to be printed out.
- cv.folds: the number of cross-validation folds to create. If set to 1 (the default), no cross-validation is performed.
- folds.id: used in case of cross-validation. If defined, it allows the user to specify the fold each observation belongs to. If cv.folds is greater than 1 and folds.id is well defined, the latter takes precedence.
- n.cores: in case of cross-validation, the number of cores used to perform the parallelization. Please note that in the cross-validation context, a call to the parLapply function is made (whatever the number of cores). This parameter is originally set to cv.folds.
- tree.control: the proposed algorithm is based on the rpart package, and this parameter is used to build each tree in the expansion. We emphasize that if interaction.depth is set to NULL, each tree in the expansion will be built with these parameters, with no further treatment. We recommend this option for advanced users only.
- weights: a vector representing the weight given to each observation. By default, each observation has the same weight (=1).
- seed: some of the parameters bring randomness into the algorithm. This parameter allows the user to replicate the results.

We emphasize that performing a cross-validation will produce a first global model trained on the full training set as well as the different cv-related BT models. The former will generally be used afterwards, while the latter help to e.g. determine the performance.
Let us now build a first model. Let us note/suppose that:

- Y follows a Poisson distribution. This first step can normally be checked thanks to a short database analysis.
- ExpoR acts on Y. In usual MTPL pricing, the ExpoR variable allows to account for the time exposure-to-risk (e.g. the policyholder contract's duration), and is therefore generally set as an offset. In a Tweedie framework with log-link function, we emphasize that working with an offset is equivalent to building a model on the rate response variable with adjusted weights; a short GLM sketch after this list illustrates the equivalence. More precisely, the variable Y_normalized will be used as response variable, with weights given by ExpoR (instead of the original values, i.e. 1). This modelling choice was made by the author.
- 300 boosting iterations will be performed (n.iter).
- The interaction.depth and shrinkage parameters will respectively be set to 4 and 0.01.
- train.fraction will be set to 0.8, meaning that the first 0.8*nrow(data) records will be considered as training set and the remaining records as validation set.
- Moreover, random effects such as bagging and column sampling for each tree will be used: the bag.fraction and colsample.bytree parameters will respectively be set to 0.5 and 3.
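This offset/weights equivalence can be illustrated with a plain Poisson GLM (a minimal sketch using base R's glm; note that glm warns about non-integer responses in the rate formulation, which is harmless here):

# Offset formulation: E[Y] = ExpoR * exp(X beta).
fit_offset <- glm(Y ~ Gender + Age + Sport, data = dataset,
                  family = poisson(link = "log"), offset = log(ExpoR))
# Rate formulation: response Y/ExpoR with weights ExpoR.
fit_rate <- glm(Y_normalized ~ Gender + Age + Sport, data = dataset,
                family = poisson(link = "log"), weights = ExpoR)
all.equal(coef(fit_offset), coef(fit_rate))  # same coefficients, up to tolerance

The fitted coefficients coincide, which justifies modelling Y_normalized with ExpoR as weights.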
BT_algo <- BT(formula = as.formula("Y_normalized ~ Age + Sport + Split + Gender"),
              data = dataset,
              ABT = F,
              n.iter = 300,
              train.fraction = 0.8,
              interaction.depth = 4,
              shrinkage = 0.01,
              bag.fraction = 0.5,
              colsample.bytree = 3,
              keep.data = T,
              is.verbose = F,
              cv.folds = 3,
              folds.id = NULL,
              n.cores = 1,
              weights = ExpoR,
              seed = 44)
Now that the first fit is made, we’ll focus on the different results/functions that are available within the package.
BT outputs

Multiple results are stored in the fitted BT_algo object (namely, a BTFit object).

First of all, almost all the parameters that have been used during the call are stored in the resulting object:
# Fitting parameters.
BT_algo$call
## BT(formula = as.formula("Y_normalized ~ Age + Sport + Split + Gender"),
## data = dataset, ABT = F, n.iter = 300, train.fraction = 0.8,
## interaction.depth = 4, shrinkage = 0.01, bag.fraction = 0.5,
## colsample.bytree = 3, keep.data = T, is.verbose = F, cv.folds = 3,
## folds.id = NULL, n.cores = 1, weights = ExpoR, seed = 44)
BT_algo$distribution
## [1] 1
BT_algo$BTParams
## $ABT
## [1] FALSE
##
## $train.fraction
## [1] 0.8
##
## $shrinkage
## [1] 0.01
##
## $interaction.depth
## [1] 4
##
## $bag.fraction
## [1] 0.5
##
## $n.iter
## [1] 300
##
## $colsample.bytree
## [1] 3
##
## $tree.control
## $tree.control$minsplit
## [1] 2
##
## $tree.control$minbucket
## [1] 1
##
## $tree.control$cp
## [1] -Inf
##
## $tree.control$maxcompete
## [1] 4
##
## $tree.control$maxsurrogate
## [1] 5
##
## $tree.control$usesurrogate
## [1] 2
##
## $tree.control$surrogatestyle
## [1] 0
##
## $tree.control$maxdepth
## [1] 4
##
## $tree.control$xval
## [1] 0
##
##
## attr(,"class")
## [1] "BTParams"
BT_algo$keep.data
## [1] TRUE
BT_algo$is.verbose
## [1] FALSE
BT_algo$seed
## [1] 44
BT_algo$cv.folds # number of used folds.
## [1] 3
head(BT_algo$folds, 10) # folds created.
## [1] 2 2 1 1 3 3 3 1 3 1
# Names of the weights, response and explanatory variables used.
BT_algo$w
## [1] "w"
BT_algo$response
## [1] "Y_normalized"
BT_algo$var.name
## [1] "Age" "Sport" "Split" "Gender"
One can then continue by having a look at the initialization that is performed as well as the related errors:
# Tweedie with intercept only.
BT_algo$BTInit$initFit
## [1] 0.1480054
# Init training error
BT_algo$BTInit$training.error
## [1] 0.3685759
# Init validation error
BT_algo$BTInit$validation.error
## [1] 0.3528488
If keep.data = TRUE, the created training and validation sets are also returned:
head(BT_algo$BTData$training.set)
## Y_normalized Age Sport Split Gender w currTrainScore
## 1 0 28 yes no female 0.10197573 -1.878222
## 2 0 62 no yes male 0.71179081 -2.021452
## 3 0 37 no no male 0.37298030 -1.915680
## 4 0 56 yes no male 0.62134765 -1.895872
## 5 0 36 yes no male 0.10604182 -1.846164
## 6 0 24 yes yes female 0.05905927 -1.866964
head(BT_algo$BTData$validation.set, 5)
## Y_normalized Age Sport Split Gender w currValScore
## 40001 0.00000 49 yes no female 0.9998001 -1.876190
## 40002 0.00000 39 no yes male 0.7711231 -1.991724
## 40003 0.00000 29 yes yes male 0.1610786 -1.834477
## 40004 0.00000 33 no yes female 0.5680789 -2.038986
## 40005 2.98261 40 yes no female 0.3352768 -1.895784
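As a quick sanity check on the initialization shown earlier: for a Poisson (Tweedie power 1) intercept-only fit, the optimum is the weighted mean of the response. A sketch, assuming initFit is reported on the response scale:

# Weighted training mean of the response; expected to match BT_algo$BTInit$initFit.
with(BT_algo$BTData$training.set, weighted.mean(Y_normalized, w))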
One can also have a look at the fitted values (on the score scale) as well as the errors computed across the different iterations (on training set, validation set, the OOB improvement and the cross-validation error):
# Obtained fitting results at the end - Full BT and CV.
head(BT_algo$fitted.values)
## [1] -1.878222 -2.021452 -1.915680 -1.895872 -1.846164 -1.866964
head(BT_algo$cv.fitted) # cv fitted.
## [1] -1.892785 -1.934626 -1.995363 -1.854360 -1.845289 -1.889145
# Errors computed.
head(BT_algo$BTErrors$training.error)
## [1] 0.3637171 0.3691590 0.3623403 0.3644938 0.3650795 0.3699250
head(BT_algo$BTErrors$validation.error)
## [1] 0.3528371 0.3528110 0.3528009 0.3527770 0.3527524 0.3527456
head(BT_algo$BTErrors$oob.improvement)
## [1] 2.336664e-05 3.498181e-05 9.592938e-06 2.194806e-05 2.780747e-05
## [6] 1.158397e-05
head(BT_algo$BTErrors$cv.error)
## [1] 0.3685543 0.3685281 0.3685146 0.3684967 0.3684822 0.3684594
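These stored error vectors lend themselves to a quick visual inspection; a minimal base-R sketch:

# Plot the per-iteration training (in-bag) and validation errors.
errs <- BT_algo$BTErrors
plot(errs$training.error, type = "l", xlab = "Iteration", ylab = "Deviance",
     ylim = range(c(errs$training.error, errs$validation.error)))
lines(errs$validation.error, col = "red")
legend("topright", legend = c("training (in-bag)", "validation"),
       col = c("black", "red"), lty = 1)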
Finally, we note that all the trees fitted during the algorithm are
stored in the BTIndivFits
list:
BT_algo$BTIndivFits[[1]] # First tree in the expansion.
## n= 20000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20000 7274.6970 0.9851696
## 2) Age>=18.5 19600 7038.6580 0.9661252
## 4) Age>=38.5 11193 3896.8360 0.9131932 *
## 5) Age< 38.5 8407 3136.2120 1.0370160
## 10) Gender=female 4225 1474.3530 0.9393184
## 20) Age>=19.5 4026 1380.0720 0.9104576 *
## 21) Age< 19.5 199 89.4214 1.5041700 *
## 11) Gender=male 4182 1656.0760 1.1354260 *
## 3) Age< 18.5 400 215.7846 1.8509420 *
# One can also access the usual tree outputs.
head(BT_algo$BTIndivFits[[1]]$frame)
## var n wt dev yval complexity ncompete nsurrogate
## 1 Age 20000 1478.9331 7274.697 0.9851696 0.0027842524 2 0
## 2 Age 19600 1448.0725 7038.658 0.9661252 0.0007830279 2 0
## 4 <leaf> 11193 829.0393 3896.836 0.9131932 0.0007353703 0 0
## 5 Gender 8407 619.0332 3136.212 1.0370160 0.0007830279 2 2
## 10 Age 4225 310.9132 1474.353 0.9393184 0.0006680935 1 0
## 20 <leaf> 4026 296.6375 1380.072 0.9104576 -Inf 0 0
## yval2.1 yval2.2
## 1 0.9851696 1457.0000000
## 2 0.9661252 1399.0000000
## 4 0.9131932 757.0000000
## 5 1.0370160 642.0000000
## 10 0.9393184 292.0000000
## 20 0.9104576 270.0000000
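Since each element of BTIndivFits is an rpart fit, the usual rpart tooling applies to the individual trees. For instance (a sketch relying on standard rpart components):

# Standard rpart accessors work on each tree of the expansion.
first_tree <- BT_algo$BTIndivFits[[1]]
first_tree$variable.importance  # per-tree variable importance, as computed by rpart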
BT_perf function

This function allows the user to determine the best number of iterations to perform. This depends on the type of errors that are available, i.e. that have been computed during the training phase:

- training.error is automatically computed. Please note that in case bagging is used, this corresponds to the in-bag errors (i.e. computed on a random sub-selection of the original training set).
- If train.fraction has properly been defined, a validation.error will be computed on the validation set.
- If bag.fraction has properly been defined, an oob.improvement vector will be computed.
- If cross-validation has been performed, a cv.error will be computed.

These values are stored in the BTErrors object, as previously shown.
Depending on the chosen approach, the following methods can be applied to compute the best number of iterations (a short sketch after this list shows the corresponding computations):

- If one has a validation.error, argmin(BT$BTErrors$validation.error) will be returned as the optimal iteration.
- If one has an oob.improvement, argmin(-cumsum(BT$BTErrors$oob.improvement)) will be returned as the optimal iteration. To be precise, the oob.improvement values are not used as such; a smoothed version of them is used instead.
- If one has a cv.error, argmin(BT$BTErrors$cv.error) will be returned as the optimal iteration.
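For illustration, these optima can be recomputed by hand from the stored BTErrors vectors (a minimal sketch; note that BT_perf works with a smoothed version of the OOB improvements, which we skip here):

# Recompute the optimal iterations directly from the stored error vectors.
val_optimum <- which.min(BT_algo$BTErrors$validation.error)
oob_optimum <- which.min(-cumsum(BT_algo$BTErrors$oob.improvement))  # raw, unsmoothed
cv_optimum  <- which.min(BT_algo$BTErrors$cv.error)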
We now present the function arguments:

- BTFit_object: a BT algorithm result; in our case, BT_algo.
- method: allows the user to specify the method that has to be applied to compute the best number of iterations. This can be set to validation, OOB or cv, depending on whether the user wants to use validation.error, oob.improvement or cv.error, as previously explained. We emphasize that without specifying the method argument, a best-guess approach will be performed.
- plot.it, oobag.curve, overlay and main: plot-related parameters. If desired, the BT_perf function plots the computed errors alongside returning the optimal iteration.

optimal_perf_validation <- BT_perf(BT_algo, method='validation')
optimal_perf_oob <- BT_perf(BT_algo, method='OOB', oobag.curve = TRUE, overlay = TRUE)
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive.
## Using cv_folds>1 when calling BT usually results in improved predictive performance.
optimal_perf_cv <- BT_perf(BT_algo, method='cv')
optimal_perf_best_guest <- BT_perf(BT_algo, plot.it = FALSE)
## Using validation method...
summary function

This function allows the user to compute the relative influence and to plot it. It is in fact a wrapper for the BT_relative_influence function, which is not intended to be used by the end-user.

Up to now, the computation of the relative influence isn't available for the permutation approach; this one still has to be developed.

Regarding the currently available method, we use the results furnished by rpart.object. In fact, each tree in the expansion is built thanks to this package and is stored in the BT$BTIndivFits list. Moreover, note that this algorithm has been adapted from rpart itself and therefore covers special cases (e.g. the need to rescale for the anova method).
One can then present the function’s arguments:
object
: a BTFit
object, i.e. the result of
the BT
call.n.iter
: the number of iterations to use to compute the
relative influence. This parameter is often set to the optimal number of
iterations. By default, all the built trees will be used.method
: the function that has to be called to compute
the relative influence. As previously mentioned, only one approach is
currently available. This parameter should therefore remains set to its
default value.normalize
: if the user wants to normalize the relative
influence such that the sum over all normalized relative influence sum
up to 100.order_it
: indicates whether the user wants to sort the
relative influence or not.cBars
and plot_it
: relative influence plot
related parameters, respectively the number of bars to plot in the
barplot and a boolean specifying whether the plot is expected or
not.summary(BT_algo)
## var rel_inf
## Age Age 70.162422
## Sport Gender 13.477682
## Gender Sport 9.855048
## Split Split 6.504847
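Using the arguments described above, one can for instance restrict the computation to the trees up to the validation optimum, sorted and without the barplot (a sketch):

summary(BT_algo, n.iter = optimal_perf_validation, plot_it = FALSE, order_it = TRUE)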
predict function

Using the global fit on the training set (the unique fit if no cross-validation was performed), one can predict (S3 method) on a new database. Here are the function arguments:
- object: a BTFit object.
- newdata: the new data set used to compute the predictions.
- n.iter: the number of boosting iterations (i.e. the number of trees) used to perform the predictions. Usually, all the iterations (i.e. the trees) up to the best one are considered to build the predictions. Please note that this parameter can be a vector; in such a case, a matrix containing the predictions for each element in n.iter will be returned.
- type: specifies whether one wants to predict on the 'response' or the 'link' scale.
- single.iter: if set to TRUE, only the n.iter-th tree will be used to predict (i.e. not all the trees up to n.iter).

Please note that if the keep.data argument was set to TRUE and if newdata is not specified, the prediction will be performed on the original training set.
# Predict (response scale) using all trees up to best validation iteration.
head(predict(BT_algo, n.iter = optimal_perf_validation, type = 'response'), 10)
## As newdata is missing or is not a data frame, the training set has been used thanks to the keep.data = TRUE parameter.
## [1] 0.1512079 0.1374772 0.1479626 0.1501892 0.1551795 0.1532813 0.1334080
## [8] 0.1412066 0.1363647 0.1244193
# Predict (link scale) using all trees up to the best OOB/CV iteration.
head(predict(BT_algo, newdata = dataset, n.iter = c(optimal_perf_oob, optimal_perf_cv), type = 'link'), 10)
## [,1] [,2]
## [1,] -1.886696 -1.881140
## [2,] -1.987535 -2.013851
## [3,] -1.911317 -1.910444
## [4,] -1.895761 -1.901736
## [5,] -1.859751 -1.853420
## [6,] -1.858016 -1.864961
## [7,] -2.027736 -2.075813
## [8,] -1.962980 -1.977309
## [9,] -2.000500 -2.043417
## [10,] -2.098513 -2.209579
# Predict using only the 40th tree.
head(predict(BT_algo, n.iter = 40, type = 'response', single.iter = TRUE), 10)
## As newdata is missing or is not a data frame, the training set has been used thanks to the keep.data = TRUE parameter.
## [1] 1.0726613 0.7460608 1.0726613 0.9975383 1.0726613 1.0726613 0.7460608
## [8] 1.0726613 0.7460608 0.9109593
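One can of course predict on genuinely new records as well. A minimal sketch for a single hypothetical profile (the factor levels are assumed to mirror the simulation code above):

# Hypothetical new policyholder; factor levels match the training data.
new_obs <- data.frame(Age = 30,
                      Sport = factor("yes", levels = c("no", "yes")),
                      Split = factor("no", levels = c("no", "yes")),
                      Gender = factor("male", levels = c("female", "male")))
predict(BT_algo, newdata = new_obs, n.iter = optimal_perf_validation, type = "response")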
print function

One can also print the BTFit object (i.e. the result of the BT call) using the print S3 method. The latter will display the call, the relative influence and the best iteration for each of the available methods (i.e. validation, OOB and cv).
print(BT_algo)
## BT(formula = as.formula("Y_normalized ~ Age + Sport + Split + Gender"),
## data = dataset, ABT = F, n.iter = 300, train.fraction = 0.8,
## interaction.depth = 4, shrinkage = 0.01, bag.fraction = 0.5,
## colsample.bytree = 3, keep.data = T, is.verbose = F, cv.folds = 3,
## folds.id = NULL, n.cores = 1, weights = ExpoR, seed = 44)
## A boosting tree model with Tweedie parameter : 1 has been fitted.
## 300 iterations were performed.
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive.
## Using cv_folds>1 when calling BT usually results in improved predictive performance.
## The best out-of-bag iteration was 136.
## The best cross-validation iteration was 243.
## The best validation-set iteration was 115.
## There were 4 predictors of which 4 had non-zero influence.
BT_more function

Sometimes it might be useful to continue the training for further iterations. This can happen e.g. if the initial n.iter parameter was not set high enough and the computed best iteration corresponds to this value, meaning that the minimal error (and the related iteration) has yet to be found. This training continuation can be performed thanks to the BT_more function, which has the following arguments:
- BTFit_object: the result of an initial BT call on which we want to continue the training/perform more iterations.
- new.n.iter: the number of new boosting/tree iterations to compute. In total, the BT object will end up with n.iter + new.n.iter iterations.
- is.verbose: whether or not the user wants to display the algorithm's evolution.
- seed: optional parameter that allows reproducible results.

It will then return a BTFit object (as the BT function does), augmented by the new boosting iterations.
We emphasize that a call to this function can only be made if the original BT call:

- has no cross-validation;
- has been computed with the keep.data parameter set to TRUE.
We thus need to re-fit the first example (with cv.folds set to 1 and keep.data = TRUE) to show the usage of this function.
BT_algo <- BT(formula = as.formula("Y_normalized ~ Age + Sport + Split + Gender"),
              data = dataset,
              ABT = F,
              n.iter = 300,
              train.fraction = 0.8,
              interaction.depth = 4,
              shrinkage = 0.01,
              bag.fraction = 0.5,
              colsample.bytree = 3,
              keep.data = T,
              is.verbose = F,
              cv.folds = 1,
              folds.id = NULL,
              n.cores = 1,
              weights = ExpoR,
              seed = 44)
We can then call the BT_more function to perform 100 new iterations and use the new object as a usual BTFit one.
BT_algo_contd <- BT_more(BT_algo, new.n.iter = 100, seed = 404)
# See parameters, predict, ...
BT_algo_contd$BTParams$n.iter
## [1] 400
head(predict(BT_algo_contd, n.iter = 350, type='link'), 10)
## As newdata is missing or is not a data frame, the training set has been used thanks to the keep.data = TRUE parameter.
## [1] -1.877675 -2.013065 -1.912127 -1.891241 -1.844825 -1.867253 -2.096570
## [8] -1.993146 -2.056822 -2.284128
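In practice, one would typically continue the training only when the estimated optimum sits at the boundary, i.e. when the error curve may still be decreasing beyond the current number of trees. A minimal sketch of that pattern (BT_algo_ext is a hypothetical name):

# Extend the fit only if the best iteration equals the current n.iter.
best_iter <- BT_perf(BT_algo, method = "validation", plot.it = FALSE)
if (best_iter == BT_algo$BTParams$n.iter) {
  BT_algo_ext <- BT_more(BT_algo, new.n.iter = 100, seed = 404)
}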
In our first example, we decided to fit a Boosting Tree. However, one can be interested in fitting an Adaptive Boosting Tree instead. We note that cross-validation is relevant in the former case while it might not be needed in the latter one. In fact, by construction, the Adaptive Boosting Tree will naturally end up with a root node, which does not bring more information; this acts as a natural stopping criterion.

This task can easily be achieved by taking the same example as before and setting ABT to TRUE instead of FALSE, as sketched below; the analysis of the results is left to the interested reader.
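A minimal sketch of the corresponding call (same parameters as before, only ABT changes):

ABT_algo <- BT(formula = as.formula("Y_normalized ~ Age + Sport + Split + Gender"),
               data = dataset,
               ABT = T,
               n.iter = 300,
               train.fraction = 0.8,
               interaction.depth = 4,
               shrinkage = 0.01,
               bag.fraction = 0.5,
               colsample.bytree = 3,
               keep.data = T,
               is.verbose = F,
               cv.folds = 1,
               folds.id = NULL,
               n.cores = 1,
               weights = ExpoR,
               seed = 44)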