Introduction to vimp

Introduction

vimp is a package that computes nonparametric estimates of variable importance. Currently, the package supports flexible estimation of an ANOVA-derived measure of variable importance, interpreted as the additional proportion of variability in the outcome explained by including a group of covariates in estimation of the conditional mean. This quantity is a nonparametric generalization of the usual $$R^2$$ measure.

Variable importance estimates may be computed quickly, depending on the techniques used to estimate the underlying conditional means — if these techniques are slow, then the variable importance procedure will be slow. However, you may save computation time by estimating the conditional means separately, and plugging these estimates into the variable importance procedure.

The code can handle arbitrary dimensions of features, and may be used to estimate the importance of any single feature or group of features for predicting the outcome. The package also includes functions for cross-validated importance and plotting.

The author of the vimp package is Brian Williamson.

Installation

Currently, the package may only be downloaded and installed from GitHub using the devtools package. Type the following command in your R console:

# only run if you don't have devtools
# previously installed
# install.packages("devtools")
devtools::install_github("bdwilliamson/vimp")

Quick Start

This section should serve as a quick guide to using the vimp package — we will cover the main functions using a simulated data example. More details are given in the next section.

First, load the vimp package:

library("vimp")

Next, create some data:

## -------------------------------------------------------------
## problem setup
## -------------------------------------------------------------
## set up the data
n <- 100
p <- 2
s <- 1 # desire importance for X_1
x <- data.frame(replicate(p, runif(n, -1, 1)))
y <- (x[,1])^2*(x[,1]+7/5) + (25/9)*(x[,2])^2 + rnorm(n, 0, 1) 

This creates a matrix of covariates x with two columns, and a vector y of normally-distributed outcome values, for a sample of n = 100 study participants.

The workhorse function of vimp, for ANOVA-based variable importance, is vimp_regression. The basic arguments are

• Y: the outcome (in this example, y)
• X: the covariates (in this example, x)
• indx: the covariate(s) of interest for evaluating importance (here, either 1 or 2)
• run_regression: a logical value telling vimp_regression whether or not to run a regression of Y on X
• SL.library: a “library” of learners to pass to the function SuperLearner, if run_regression = TRUE

This final argument, SL.library, determines the estimators you want to use for the conditional mean of Y given X. Estimates of variable importance rely on good estimators of the conditional mean, so we suggest using flexible estimators and model stacking to do so. One option for this is the SuperLearner package; load that package using

library("SuperLearner")
## Loading required package: nnls
## Super Learner
## Version: 2.0-25
## Package created on 2019-08-05
## load specific algorithms
library("gam")
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16
library("xgboost")

The code

est_1 <- vimp_regression(Y = y, X = x, indx = 1, run_regression = TRUE,
SL.library = c("SL.gam", "SL.xgboost", "SL.mean"))
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored

uses the Super Learner to fit the required regression functions, and computes an estimate of variable importance. We can visualize the estimate, standard error, and confidence interval by printing or typing the object name:

est_1
## Call:
## vimp_regression(Y = y, X = x, indx = 1, run_regression = TRUE,
##     SL.library = c("SL.gam", "SL.xgboost", "SL.mean"))
##
## Variable importance estimates:
##       Estimate  SE         95% CI
## s = 1 0.3666897 0.05224283 [0.2642956, 0.4690838]
print(est_1)
## Call:
## vimp_regression(Y = y, X = x, indx = 1, run_regression = TRUE,
##     SL.library = c("SL.gam", "SL.xgboost", "SL.mean"))
##
## Variable importance estimates:
##       Estimate  SE         95% CI
## s = 1 0.3666897 0.05224283 [0.2642956, 0.4690838]

This output shows that we have estimated the importance of $$X_2$$ to be 0.51, with a 95% confidence interval of [0.37, 0.65].

Detailed guide

Often when working with data we attempt to estimate the conditional mean of the outcome $$Y$$ given features $$X$$, defined as $$\mu_P(x) = E_P(Y \mid X = x)$$.

There are many tools for estimating this conditional mean. We might choose a classical parametric tool such as linear regression. We might also want to be model-agnostic and use a more nonparametric approach to estimate the conditional mean. However,

• This involves using some nonparametric smoothing technique, which requires: (1) choosing a technique, and (2) selecting tuning parameters
• Naive optimal tuning balances out the bias and variance of the smoothing estimator. Is this the correct trade-off for estimating the conditional mean?

Once we have a good estimate of the conditional mean, it is often of scientific interest to understand which features contribute the most to the variation in $$\mu_P$$. Specifically, we might consider $\mu_{P, s}(x) = E_P(Y \mid X_{(-s)} = x_{(-s)}),$ where for a vector $$v$$ and a set of indices $$s$$, $$v_{-(s)}$$ denotes the elements of $$v$$ with index not in $$s$$. By comparing $$\mu_{P, s}$$ to $$\mu_P$$ we can evaluate the importance of the $$s$$th element (or group of elements).

Assume that our data are generated according to the mechanism $$P_0$$. We can then define a nonparametric measure of variable importance, $\psi_{0, s} = \frac{\int [\mu_{P_0}(x) - \mu_{P_0, s}(x)]^2dP_0(x)}{\text{Var}_{P_0}(Y)},$ which is the proportion of the variability in the outcome explained by including $$X_j$$ in our chosen estimation technique.

This document introduces you to the basic tools in vimp and how to apply them to a dataset. I will explore the two different ways of obtaining variable estimates using vimp:

1. You only specify a library of candidate estimators for the conditional means $$\mu_{P_0}$$ and $$\mu_{P_0, s}$$; you allow vimp to obtain the optimal estimates of these quantities using the SuperLearner (van der Laan, Polley, and Hubbard 2007), and use these estimates to obtain variable importance estimates
2. You have a favorite estimator for the conditional means; you simply want vimp to obtain variable importance estimates using this estimator

A look at the Boston housing study data

Throughout this document I will use the Boston housing study data (Harrison and Rubinfeld 1978), freely available from the MASS package. Use ?Boston to see documentation for these data.

## load the library, view the data
library("MASS")
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

In addition to the median house value medv, the outcome of interest, there are measurements on four groups of variables. First are accessibility features: the weighted distances to five employment centers in the Boston region dis, where housing prices are expected to increase with decreased distance to employment centers; and an index of accessibility to radial highways rad, where housing prices are expected to increase with increased access to highways. Second are neighborhood features: the proportion of black residents in the population black; the proportion of population that is lower status lstat, which denotes adults without some high school education and male workers classified as laborers; the crime rate crim; the proportion of a town’s residential land zoned for lots greater than 25,000 square feet zn; the proportion of nonretail business acres per town indus; the full value property tax rate tax; the pupil-teacher ratio by school district ptratio; and an indicator of whether the tract of land borders the Charles River chas. The third group are structural features: the average number of rooms in owner units rm; and the proportion of owner units built prior to 1940 age. The final group is the nitrogen oxide concentration nox.

Since there are 13 features and four groups, it is of interest to determine variable importance both for the 13 features separately and for the four groups of features.

A first approach: linear regression

Suppose that I believe that a linear model truly holds in the Boston housing data. In that case, I would be justified in only fitting a linear regression to estimate the conditional means; this means that in my importance analysis, I should also use only linear regression. This is achieved by the following:

## estimate the full conditional mean using linear regression
full.mod <- lm(medv ~ ., data = Boston)
full.fit <- predict(full.mod)

## estimate the reduced conditional means for each of the individual variables
X <- as.matrix(Boston[, -14]) # remove the outcome for the predictor matrix

red.mod.crim <- lm(full.fit ~ X[,-1])
red.fit.crim <- predict(red.mod.crim)

red.mod.zn <- lm(full.fit ~ X[,-2])
red.fit.zn <- predict(red.mod.zn)

red.mod.indus <- lm(full.fit ~ X[,-3])
red.fit.indus <- predict(red.mod.indus)

red.mod.chas <- lm(full.fit ~ X[,-4])
red.fit.chas <- predict(red.mod.chas)

red.mod.nox <- lm(full.fit ~ X[,-5])
red.fit.nox <- predict(red.mod.nox)

red.mod.rm <- lm(full.fit ~ X[, -6])
red.fit.rm <- predict(red.mod.rm)

red.mod.age <- lm(full.fit ~ X[,-7])
red.fit.age <- predict(red.mod.age)

red.mod.dis <- lm(full.fit ~ X[, -8])
red.fit.dis <- predict(red.mod.dis)

red.mod.tax <- lm(full.fit ~ X[,-10])
red.fit.tax <- predict(red.mod.tax)

red.mod.ptratio <- lm(full.fit ~ X[,-11])
red.fit.ptratio <- predict(red.mod.ptratio)

red.mod.black <- lm(full.fit ~ X[,-12])
red.fit.black <- predict(red.mod.black)

red.mod.lstat <- lm(full.fit ~ X[,-13])
red.fit.lstat <- predict(red.mod.lstat)

library("vimp")

## plug these into vim
lm.vim.crim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.crim, indx = 1, run_regression = FALSE) lm.vim.zn <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.zn, indx = 2, run_regression = FALSE)
lm.vim.indus <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.indus, indx = 3, run_regression = FALSE) lm.vim.chas <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.chas, indx = 4, run_regression = FALSE)
lm.vim.nox <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.nox, indx = 5, run_regression = FALSE) lm.vim.rm <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.rm, indx = 6, run_regression = FALSE)
lm.vim.age <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.age, indx = 7, run_regression = FALSE) lm.vim.dis <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.dis, indx = 8, run_regression = FALSE)
lm.vim.rad <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.rad, indx = 9, run_regression = FALSE) lm.vim.tax <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.tax, indx = 10, run_regression = FALSE)
lm.vim.ptratio <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.ptratio, indx = 11, run_regression = FALSE) lm.vim.black <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.black, indx = 12, run_regression = FALSE)
lm.vim.lstat <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit.lstat, indx = 13, run_regression = FALSE) ## make a table with the estimates using the merge_vim() function lm.mat <- merge_vim(lm.vim.crim, lm.vim.zn, lm.vim.indus, lm.vim.chas, lm.vim.nox, lm.vim.rm, lm.vim.age, lm.vim.dis, lm.vim.rad, lm.vim.tax, lm.vim.ptratio, lm.vim.black, lm.vim.lstat) ## print out the matrix lm.mat$mat
##                         est           se           cil          ciu
## lm.vim.lstat   5.643838e-02 2.279363e-02  0.0117636818 0.1011130864
## lm.vim.rm      4.380820e-02 1.694131e-02  0.0106038484 0.0770125548
## lm.vim.dis     2.885111e-02 7.474835e-03  0.0142007022 0.0435015157
## lm.vim.ptratio 2.795733e-02 6.834094e-03  0.0145627519 0.0413519065
## lm.vim.nox     1.140445e-02 4.755140e-03  0.0020845435 0.0207243480
## lm.vim.rad     1.121712e-02 4.348758e-03  0.0026937144 0.0197405309
## lm.vim.black   6.335620e-03 3.702814e-03 -0.0009217619 0.0135930026
## lm.vim.zn      6.027980e-03 3.608834e-03 -0.0010452051 0.0131011654
## lm.vim.crim    5.693839e-03 4.275991e-03 -0.0026869488 0.0140746263
## lm.vim.tax     5.671312e-03 2.484119e-03  0.0008025278 0.0105400962
## lm.vim.chas    5.126155e-03 4.834211e-03 -0.0043487238 0.0146010341
## lm.vim.indus   5.891587e-05 2.840487e-04 -0.0004978094 0.0006156412
## lm.vim.age     1.447559e-06 6.784973e-05 -0.0001315355 0.0001344306

Building a library of learners

In general, we don’t believe that a linear model truly holds. Thinking about potential model misspecification leads us to consider other algorithms. Suppose that I prefer to use generalized additive models (Hastie and Tibshirani 1990) to estimate $$\mu_{P_0}$$ and $$\mu_{P_0, s}$$, so I am planning on using the gam package. Suppose that you prefer to use the elastic net (Zou and Hastie 2005), and are planning to use the glmnet package.

The choice of either method is somewhat subjective, and I also will have to use a technique like cross-validation to determine an optimal tuning parameter in each case. It is also possible that neither additive models nor the elastic net will do a good job estimating the true conditional means!

This motivates using SuperLearner to allow the data to determine the optimal combination of base learners from a library that I define. These base learners are a combination of different methods (e.g., generalized additive models and elastic net) and instances of the same method with different tuning parameter values (e.g., additive models with 3 and 4 degrees of freedom). The Super Learner is an example of model stacking, or model aggregation — these approaches use a data-adaptive combination of base learners to make predictions.

For instance, my library could include additive models, elastic net , random forests (Breiman 2001), and gradient boosted trees (Friedman 2001) as follows:

## load the library
library(SuperLearner)

## create a function for boosted stumps
SL.gbm.1 <- function(..., interaction.depth = 1) SL.gbm(..., interaction.depth = interaction.depth)

## create GAMs with different degrees of freedom
SL.gam.3 <- function(..., deg.gam = 3) SL.gam(..., deg.gam = deg.gam)
SL.gam.4 <- function(..., deg.gam = 4) SL.gam(..., deg.gam = deg.gam)
SL.gam.5 <- function(..., deg.gam = 5) SL.gam(..., deg.gam = deg.gam)

## add more levels of alpha for glmnet
create.SL.glmnet <- function(alpha = c(0.25, 0.5, 0.75)) {
for (mm in seq(length(alpha))) {
eval(parse(file = "", text = paste('SL.glmnet.', alpha[mm], '<- function(..., alpha = ', alpha[mm], ') SL.glmnet(..., alpha = alpha)', sep = '')), envir = .GlobalEnv)
}
invisible(TRUE)
}
create.SL.glmnet()

## add tuning parameters for randomForest
create.SL.randomForest <- function(tune = list(mtry = c(1, 5, 7), nodesize = c(1, 5, 10))) {
tuneGrid <- expand.grid(tune, stringsAsFactors = FALSE)
for (mm in seq(nrow(tuneGrid))) {
eval(parse(file = "", text = paste("SL.randomForest.", mm, "<- function(..., mtry = ", tuneGrid[mm, 1], ", nodesize = ", tuneGrid[mm, 2], ") SL.randomForest(..., mtry = mtry, nodesize = nodesize)", sep = "")), envir = .GlobalEnv)
}
invisible(TRUE)
}
create.SL.randomForest()

## create the library
learners <- c("SL.gam", "SL.gam.3", "SL.gam.4", "SL.gam.5",
"SL.glmnet", "SL.glmnet.0.25", "SL.glmnet.0.5", "SL.glmnet.0.75",
"SL.randomForest", "SL.randomForest.1", "SL.randomForest.2", "SL.randomForest.3",
"SL.randomForest.4", "SL.randomForest.5", "SL.randomForest.6", "SL.randomForest.7",
"SL.randomForest.8", "SL.randomForest.9",
"SL.gbm.1")

Now that I have created the library of learners, I can move on to estimating variable importance.

Estimating variable importance for a single variable

The main function for ANOVA-derived variable importance in the vimp package is the vimp_regression() function. There are five main arguments to vimp_regression():

• Y, the outcome
• f1 and f2, the fitted values from a sequential regression procedure; or X, the covariates
• indx, which determines the feature I want to estimate variable importance for
• run_regression, which determines whether or not the sequential regression procedure is run on Y and X

There are two ways to compute importance:

1. Supply outcome Y, covariates X, and a library of learners (e.g., learners above) with run_regression = TRUE
2. Supply outcome Y and fitted values for estimates of $$\mu_{P_0}$$ and $$\mu_{P_0, s}$$ with run_regression = FALSE

I will illustrate each of these choices in order below, but in practice I use (3), since it allows the most flexibility, and all time-intensive computation occurs before calling vimp_regression(); however, this involves more work on your end.

Suppose that the first feature that I want to estimate variable importance for is nitrogen oxide, nox. Since this is the first feature, say I choose (1) above. Then supplying vimp_regression() with

• Y = Boston$medv • X = X • indx = 5 • run_regression = TRUE means that: • I want to use SuperLearner() to estimate the conditional means $$\mu_{P_0}$$ and $$\mu_{P_0,s}$$ • I want to estimate variable importance for the fifth column of the Boston covariates, which is nox The call to vimp_regression() looks like this: ## load the library library("vimp") ## first re-order the data so that the outcome is in the first column Boston2 <- Boston[, 1:13] ## now estimate variable importance vimp_regression(Y = Boston$medv, X = Boston2[, -1, drop = FALSE],
indx = 5, run_regression = TRUE, SL.library = learners)

While this is the preferred method for estimating variable importance, using a large library of learners may cause the function to take time to run. Usually this is okay — in general, you took a long time to collect the data, so letting an algorithm run for a few hours should not be an issue.

However, for the sake of illustration, I can estimate varibable importance for nitrogen oxide only using only two base learners as follows:

## load the library
library("vimp")

## first re-order the data so that the outcome is in the first column
Boston2 <- Boston[, 1:13]

## new learners library, with only one learner for illustration only
learners.2 <- c("SL.xgboost", "SL.glmnet")

## now estimate variable importance
nox.vim <- vimp_regression(Y = Boston$medv, X = Boston2[, -1, drop = FALSE], indx = 5, run_regression = TRUE, SL.library = learners.2) This code takes approximately 11 seconds to run on a (not very fast) PC. Under the hood, vimp_regression() fits the SuperLearner() function with the specified library, and then returns fitted values and variable importance estimates. This is most suitable for estimating variable importance for the first feature on a given dataset. I can display these estimates: nox.vim ## Call: ## vimp_regression(Y = Boston$medv, X = Boston2[, -1, drop = FALSE],
##     indx = 5, run_regression = TRUE, SL.library = learners.2)
##
## Variable importance estimates:
##       Estimate   SE          95% CI
## s = 5 0.03065431 0.002376398 [0.02599666, 0.03531197]

The object returned by vimp_regression() also contains fitted values from using SuperLearner(); I access these using $full_fit and $red_fit. For example,

head(nox.vim$full_fit) ## [,1] ## 1 24.61884 ## 2 21.85287 ## 3 34.33884 ## 4 33.01635 ## 5 35.33925 ## 6 28.42642 head(nox.vim$red_fit)
##       [,1]
## 1 26.19936
## 2 22.47551
## 3 33.11445
## 4 31.87244
## 5 32.71269
## 6 28.47166

I said earlier that I want to obtain estimates of all individual features in these data, so let’s choose average number of rooms (rm) next. Now that I have estimated variable importance for nitrogen oxide, the full.fit object contains our estimate of $$\mu_{P_0}$$. Since I have spent the time to estimate this using SuperLearner(), there is no reason to estimate this function again. This leads me to choose (2) above, since I have already estimated variable importance on one feature in this dataset. Using the small learners library (again only for illustration) yields

## specify that full.fit doesn't change
full.fit <- nox.vim$full_fit ## estimate variable importance for the average number of rooms reduced_fit <- SuperLearner::SuperLearner(Y = full.fit, X = Boston2[, -c(1, 6), drop = FALSE], SL.library = learners.2) red.fit <- predict(reduced_fit)$pred
rm.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit, indx = 6, run_regression = FALSE) rm.vim ## Call: ## vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = red.fit,
##     indx = 6, run_regression = FALSE)
##
## Variable importance estimates:
##       Estimate   SE         95% CI
## s = 6 0.03266402 0.00251907 [0.02772673, 0.03760131]

This takes approximately 5 seconds — now rather than estimating both conditional means, I am only estimating one.

If I choose (2), then I have to use a single method from the library, or call SuperLearner() myself, prior to estimating variable importance. Then vimp_regression() returns variable importance estimates based on these fitted values. For example, let’s estimate variable importance for the distance to radial highways using this approach.

## set up the data
x <- Boston[, -c(8, 14)] # this removes dis and medv

## fit a GAM and glmnet using SuperLearner
reduced.mod <- SuperLearner(Y = full.fit, X = x, SL.library = learners.2)
reduced.fit <- predict(reduced.mod)$pred ## this takes 2 seconds ## estimate variable importance dis.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced.fit, indx = 8, run_regression = FALSE)

I can obtain estimates for the remaining individual features in the same way (again using only two base learners for illustration):

reduced_crim <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(1, 14)], SL.library = learners.2))$pred crim.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_crim,
indx = 1, run_regression = FALSE)

reduced_zn <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(2, 14)], SL.library = learners.2))$pred zn.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_zn,
indx = 2, run_regression = FALSE)

reduced_indus <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(3, 14)], SL.library = learners.2))$pred indus.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_indus,
indx = 3, run_regression = FALSE)

reduced_chas <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(4, 14)], SL.library = learners.2))$pred chas.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_chas,
indx = 4, run_regression = FALSE)

reduced_age <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(7, 14)], SL.library = learners.2))$pred age.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_age,
indx = 7, run_regression = FALSE)

reduced_rad <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(9, 14)], SL.library = learners.2))$pred rad.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_rad,
indx = 9, run_regression = FALSE)

reduced_tax <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(10, 14)], SL.library = learners.2))$pred tax.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_tax,
indx = 10, run_regression = FALSE)

reduced_ptratio <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(11, 14)], SL.library = learners.2))$pred ptratio.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_ptratio,
indx = 11, run_regression = FALSE)

reduced_black <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(12, 14)], SL.library = learners.2))$pred black.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_black,
indx = 12, run_regression = FALSE)

reduced_lstat <- predict(SuperLearner(Y = full.fit, X = Boston[, -c(13, 14)], SL.library = learners.2))$pred lstat.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_lstat,
indx = 13, run_regression = FALSE)

Now that I have estimates of each of individual feature’s variable importance, I can view them all simultaneously by plotting:

## combine the objects together
ests <- merge_vim(crim.vim, zn.vim, indus.vim, chas.vim,
tax.vim, ptratio.vim, black.vim, lstat.vim)

## create a vector of names; must be in the same order as the
## mat object in ests
nms <- c("Prop. lower status", "Avg. num. rooms", "Pupil-teacher ratio", "Nitrogen oxide", "Distance", "Crime", "Access to radial hwys", "Property tax rate", "Charles riv.", "Prop. black", "Prop. business", "Prop. large zoned", "Age")

## plot
plot(ests, nms, pch = 16, ylab = "", xlab = "Estimate", main = "Estimated variable importance for individual features", xlim = c(0, 0.2))

Estimating variable importance for a group of variables

Now that I have estimated variable importance for each of the individual features, I can estimate variable importance for each of the groups that I mentioned above: accessibility features, structural features, nitrogen oxide, and neighborhood features.

The only difference between estimating variable importance for a group of features rather than an individual feature is that now I specify a vector for s; I can use any of the options listed in the previous section to compute these estimates.

## get the estimates
reduced_struct <- predict(SuperLearner(Y = Boston$medv, X = Boston[, -c(6, 7, 14)], SL.library = learners.2))$pred
structure.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_struct, indx = c(6, 7), run_regression = FALSE) reduced_access <- predict(SuperLearner(Y = Boston$medv, X = Boston[, -c(8, 9, 14)], SL.library = learners.2))$pred access.vim <- vimp_regression(Y = Boston$medv, f1 = full.fit, f2 = reduced_access, indx = c(8, 9), run_regression = FALSE)

reduced_neigh <- predict(SuperLearner(Y = Boston$medv, X = Boston[, -c(1, 2, 3, 4, 10, 11, 12, 13, 14)], SL.library = learners.2))$pred
neigh.vim <- vimp_regression(Y = Boston\$medv, f1 = full.fit, f2 = reduced_access,
indx = c(1, 2, 3, 4, 10, 11, 12, 13), run_regression = FALSE)

## combine and plot
groups <- merge_vim(structure.vim, access.vim, neigh.vim, nox.vim)
nms.2 <- c("Neighborhood", "Structure", "Accessibility", "Nitrogen oxide")
plot(groups, nms.2, pch = 16, ylab = "", xlab = "Estimate", main = "Estimated variable importance for groups", xlim = c(0, 0.5))

References

Breiman, L. 2001. “Random Forests” 45. Machine Learning.

Friedman, JH. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Applied Statistics.

Harrison, D, and DL Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air” 5. Journal of Environmental Economics and Management.

Hastie, TJ, and RJ Tibshirani. 1990. Generalized Additive Models. Vol. 43. CRC Press.

van der Laan, MJ, EC Polley, and AE Hubbard. 2007. “Super Learner” 6. Statistical Applications in Genetics and Molecular Biology.

Zou, H, and TJ Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).