Guide to grpsel

Ryan Thompson

Introduction

grpsel is an R package for group subset selection (see https://arxiv.org/abs/2105.12081). For a response vector \(\mathbf{y}=(y_1,\ldots,y_n)^\top\), predictor matrix \(\mathbf{X}=(\mathbf{x}_1,\ldots,\mathbf{x}_n)^\top\), and a set of \(g\) groups, grpsel is capable of approximately solving problems of the form: \[ \min_\beta \sum_{i=1}^n\ell(\mathbf{x}_i^\top\beta,y_i)+\lambda\sum_{k=1}^g1(\|\beta_k\|_2\neq0)+\gamma\sum_{k=1}^g\|\beta_k\|_2^q,\quad q\in\{1,2\} \] where the \(\ell\) is a loss function (square or logistic), the second term is a group subset selection penalty, and the third term is a shrinkage penalty (specifically, a group lasso penalty if \(q=1\) and a ridge penalty if \(q=2\)). The notation \(\beta_k\) denotes the coefficients belonging to the \(k\)th group.

Group-sparse regression arises in numerous settings in modern data analytic work, including selection with categorical predictors, multitask (multiresponse) learning, hierarchical selection, and nonparametric additive regression. We demonstrate some examples of these applications below.

The grpsel package provides a simple set of functions for handling grouped selection in R. The two main functions provided by the package are grpsel() and cv.grpsel(), responsible for model fitting and cross-validation, respectively.

The grpsel() function provides a convenient way of performing group subset selection for a path of \(\lambda\). To demonstrate this functionality, let’s simulate some grouped data.

set.seed(123)
n <- 100 # Number of observations
p <- 10 # Number of predictors
g <- 5 # Number of groups
group <- rep(1:g, each = p / g) # Group structure
beta <- numeric(p)
beta[which(group %in% 1:2)] <- 1 # First two groups are nonzero
x <- matrix(rnorm(n * p), n, p)
y <- x %*% beta + rnorm(n)

The first two groups explain the response while the rest are unimportant.

library(grpsel)
fit <- grpsel(x, y, group)

The group argument is optional, and if left unspecified each predictor will be assigned to its own group (leading to ungrouped variable selection).

The values of \(\lambda\) are automatically computed from the data, providing a path of solutions from the null model to the full model. These solutions can be extracted using the coef() function.

coef(fit)
#>            [,1]      [,2]      [,3]        [,4]        [,5]        [,6]
#>  [1,] 0.1870419 0.1409381 0.1363218  0.14577730  0.14818753  0.13692499
#>  [2,] 0.0000000 0.0000000 1.0738569  1.06824451  1.05330980  1.06460402
#>  [3,] 0.0000000 0.0000000 0.9734314  0.98498488  0.98057885  0.99572109
#>  [4,] 0.0000000 0.7399182 0.8432187  0.83820250  0.84769571  0.85286255
#>  [5,] 0.0000000 1.1879370 1.1940502  1.15017331  1.14165639  1.13813606
#>  [6,] 0.0000000 0.0000000 0.0000000  0.00000000  0.00000000  0.07018109
#>  [7,] 0.0000000 0.0000000 0.0000000  0.00000000  0.00000000 -0.03730132
#>  [8,] 0.0000000 0.0000000 0.0000000  0.00000000  0.09499780  0.09295309
#>  [9,] 0.0000000 0.0000000 0.0000000  0.00000000  0.09697182  0.10964882
#> [10,] 0.0000000 0.0000000 0.0000000 -0.06559595 -0.05491543 -0.04950011
#> [11,] 0.0000000 0.0000000 0.0000000  0.13294028  0.13484070  0.13700323

Each of the columns above correspond to a set of estimated coefficients for a particular value of \(\lambda\), with the first row containing the intercept terms. These coefficients can be visualised via the plot() function.

plot(fit)

The plot above omits the intercept terms and displays the coefficients by the number of selected groups rather than \(\lambda\).

The predict() function is available to make predictions for new data.

x.new <- matrix(rnorm(10 * p), 10, p)
predict(fit, x.new)
#>            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
#>  [1,] 0.1870419  0.2474752  0.5547334  0.4828203  0.5505736  0.5159036
#>  [2,] 0.1870419 -1.0364443  0.6014125  0.9496742  1.0179243  1.0540782
#>  [3,] 0.1870419  1.7299342  3.2266514  3.5362925  3.4558820  3.4516341
#>  [4,] 0.1870419 -1.1671613 -3.5933709 -3.4910646 -3.4566146 -3.4945966
#>  [5,] 0.1870419 -1.4625161 -0.3259062 -0.4122339 -0.4798003 -0.5027243
#>  [6,] 0.1870419  1.6731375  2.5022259  2.4825533  2.4170991  2.3979009
#>  [7,] 0.1870419 -0.8084087 -1.9495458 -1.9342494 -1.9434742 -2.0254337
#>  [8,] 0.1870419  1.3557895  0.9618110  1.1240601  1.1827956  1.2321711
#>  [9,] 0.1870419  2.2387394  2.3907195  2.1533349  1.8768550  1.8623886
#> [10,] 0.1870419 -1.2059456 -3.1525815 -2.9586341 -2.9207081 -2.9396107

Again, the columns represent predictions for different values of \(\lambda\).

By default, grpsel sets \(\gamma=0\). In certain situations, the shrinkage induced by the setting \(\gamma>0\) is desirable (e.g., when the level of noise is high). To fit the model for both \(\lambda\) and \(\gamma\), use the argument penalty='grSubset+grLasso' or penalty='grSubset+Ridge'. To extract the coefficients for specific values of \(\lambda\) and \(\gamma\), the lambda and gamma arguments of coef() are available.

fit <- grpsel(x, y, group, penalty = 'grSubset+grLasso')
coef(fit, lambda = 0.05, gamma = 0.1)
#>            [,1]
#>  [1,] 0.1470187
#>  [2,] 0.9070640
#>  [3,] 0.8487866
#>  [4,] 0.7228066
#>  [5,] 1.0426958
#>  [6,] 0.0000000
#>  [7,] 0.0000000
#>  [8,] 0.0000000
#>  [9,] 0.0000000
#> [10,] 0.0000000
#> [11,] 0.0000000

fit <- grpsel(x, y, group, penalty = 'grSubset+Ridge')
coef(fit, lambda = 0.05, gamma = 0.1)
#>            [,1]
#>  [1,] 0.1437223
#>  [2,] 0.9682111
#>  [3,] 0.8967337
#>  [4,] 0.7617122
#>  [5,] 1.0913344
#>  [6,] 0.0000000
#>  [7,] 0.0000000
#>  [8,] 0.0000000
#>  [9,] 0.0000000
#> [10,] 0.0000000
#> [11,] 0.0000000

Similar arguments exist for predict().

In practice, \(\lambda\) and \(\gamma\) usually need to be cross-validated. The cv.grpsel() function provides a convenient way to perform group subset selection with cross-validation.

cvfit <- cv.grpsel(x, y, group, penalty = 'grSubset+Ridge', nfold = 10) # 10-fold cross-validation

The cross-validation results are easily visualised using the plot() function.

plot(cvfit)

The plot above shows the cross-validated loss (square or logistic) as a function of the number of selected groups for the best cross-validated value of \(\gamma\). Plots for other values of \(\gamma\) can be produced using the gamma argument of plot(). From the plot, it is clear that the minimum loss is attained when two groups are selected, which so happens to be the truth.

The coef() and predict() functions applied to the output of cv.grpsel() return results corresponding to the values of \(\lambda\) and \(\gamma\) that minimise the cross-validated mean square error.

coef(cvfit)
#>            [,1]
#>  [1,] 0.1363400
#>  [2,] 1.0736047
#>  [3,] 0.9732522
#>  [4,] 0.8430242
#>  [5,] 1.1938079
#>  [6,] 0.0000000
#>  [7,] 0.0000000
#>  [8,] 0.0000000
#>  [9,] 0.0000000
#> [10,] 0.0000000
#> [11,] 0.0000000
predict(cvfit, x.new)
#>             [,1]
#>  [1,]  0.5546389
#>  [2,]  0.6013601
#>  [3,]  3.2260865
#>  [4,] -3.5925602
#>  [5,] -0.3257680
#>  [6,]  2.5017141
#>  [7,] -1.9490707
#>  [8,]  0.9616608
#>  [9,]  2.3903475
#> [10,] -3.1519016

Note that grpsel() does not need to be run after using cv.grpsel(), as the latter calls the former and saves the result as cvfit$fit.

Finally, to perform a logistic regression fit, simply set loss='logistic'.

y <- pmax(sign(y), 0)

fit <- cv.grpsel(x, y, group, loss = 'logistic')
coef(fit)
#>            [,1]
#>  [1,] -0.189807
#>  [2,]  2.989586
#>  [3,]  2.618572
#>  [4,]  2.560171
#>  [5,]  4.112287
#>  [6,]  0.000000
#>  [7,]  0.000000
#>  [8,]  0.000000
#>  [9,]  0.000000
#> [10,]  0.000000
#> [11,]  0.000000

Overlapping groups

It is straightforward to model overlapping groups using grpsel. To demonstrate, suppose there are five predictors spread among two groups: \(\{1,2,3\}\) and \(\{3,4,5\}\), where \(x_3\) belongs to both groups.

p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rowSums(x) + rnorm(n)
group <- list(c(1, 2, 3), c(3, 4, 5))
fit <- grpsel(x, y, group)

The object group is now a list of groups rather than a vector.

coef(fit)
#>           [,1]      [,2]       [,3]
#> [1,] 0.2896566 0.1925735 0.07833032
#> [2,] 0.0000000 0.0000000 0.82991726
#> [3,] 0.0000000 0.0000000 0.83024233
#> [4,] 0.0000000 0.9211489 0.95257093
#> [5,] 0.0000000 1.0932975 1.06188207
#> [6,] 0.0000000 1.2015011 1.09929582

Under the hood, the overlapping groups are handled using a latent coefficient approach. See https://arxiv.org/abs/2105.12081 for more information.

Algorithms

The primary algorithm driving grpsel is coordinate descent. Sometimes when the groups are strongly correlated the estimate produced by coordinate descent can be improved using local search, a custom algorithm that runs on top of coordinate descent. To use local search, set ls=T.

p <- 10
group <- rep(1:g, each = p / g)
Sigma <- 0.9 ^ t(sapply(1:p, function(i, j) abs(i - j), 1:p))
x <- matrix(rnorm(n * p), n, p)
x <- t(chol(Sigma) %*% t(x))
beta[which(group %in% 1:2)] <- 1 # First two groups are nonzero
y <- x %*% beta + rnorm(n)
fit <- cv.grpsel(x, y, group)
coef(fit)
#>                [,1]
#>  [1,]  0.1560208707
#>  [2,]  1.0701733192
#>  [3,]  1.0187848193
#>  [4,]  1.0497911636
#>  [5,]  0.5578552551
#>  [6,] -0.2591440135
#>  [7,]  0.4552621485
#>  [8,]  0.0000000000
#>  [9,]  0.0000000000
#> [10,] -0.0007391213
#> [11,]  0.3424805030
fit <- cv.grpsel(x, y, group, ls = T)
coef(fit)
#>            [,1]
#>  [1,] 0.1704756
#>  [2,] 1.0332436
#>  [3,] 1.1197523
#>  [4,] 1.0806977
#>  [5,] 0.6096263
#>  [6,] 0.0000000
#>  [7,] 0.0000000
#>  [8,] 0.0000000
#>  [9,] 0.0000000
#> [10,] 0.0000000
#> [11,] 0.0000000

In this high-correlation example, the correct groups are not selected without local search.

Demonstrations

In this section we show how grpsel can be used to fit a variety of statistical models.

Multitask learning

Multitask learning is useful where the response is a matrix and it is suspected that each of its columns can be explained by the same set of predictors. In this case, we would like to perform a single fit rather than individual fits on each column.

In this example, we will simulate ten response variables, each of which depend on the first five predictors.

n <- 100
p <- 10
m <- 10 # Number of response variables
beta <- matrix(0, p, m)
beta[1:5, ] <- 1
x <- matrix(rnorm(n * p), n, p)
y <- x %*% beta + matrix(rnorm(n * m), n, m)
x <- scale(x)
y <- scale(y)

y <- matrix(y, ncol = 1)
x <- diag(m) %x% x
group <- rep(1:p, m)

cvfit <- cv.grpsel(x, y, group)
matrix(coef(cvfit)[- 1, , drop = F], ncol = m)
#>            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
#>  [1,] 0.4793839 0.4123002 0.3941389 0.3875573 0.4335534 0.4307025 0.4252054
#>  [2,] 0.4113696 0.3305919 0.3313272 0.3760008 0.3508267 0.3636038 0.2994856
#>  [3,] 0.3648536 0.3709711 0.4294716 0.4297163 0.3739155 0.3957188 0.4578894
#>  [4,] 0.4318631 0.4476063 0.4637400 0.3890519 0.4551207 0.5145601 0.4700297
#>  [5,] 0.3376399 0.4189209 0.3526391 0.4256793 0.4073170 0.3274007 0.3687804
#>  [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#>  [7,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#>  [8,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#>  [9,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [10,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#>            [,8]      [,9]     [,10]
#>  [1,] 0.3547322 0.4269889 0.3895732
#>  [2,] 0.3473706 0.3616040 0.3648181
#>  [3,] 0.4663693 0.3894830 0.4444560
#>  [4,] 0.4386384 0.4175312 0.4313189
#>  [5,] 0.3335061 0.4316870 0.4195322
#>  [6,] 0.0000000 0.0000000 0.0000000
#>  [7,] 0.0000000 0.0000000 0.0000000
#>  [8,] 0.0000000 0.0000000 0.0000000
#>  [9,] 0.0000000 0.0000000 0.0000000
#> [10,] 0.0000000 0.0000000 0.0000000

The groups have enforced the constraint that all ten response variables must share the same set of predictors.

Nonparametric additive regression

Group selection can be used to fit sparse nonparametric additive models of the form \[y=\sum_{j=1}^pf_j(x_j)+\epsilon.\] To fit these models we can approximate \(f_1,\ldots,f_p\) using regression splines. In this example, the predictors are uniform on \([0,1]\) and the generating functions are \[ \begin{aligned} f_1(x) &= \sin(2\pi x), \\ f_2(x) &= \cos(2\pi x), \\ f_3(x) &= 0,\,\text{and} \\ &~~\vdots \\ f_p(x) &= 0. \end{aligned} \] To fit this model we will use natural cubic splines with five basis functions. The basis functions of each spline form a group, and we have one such group for each predictor.

n <- 100
p <- 10
x <- matrix(runif(n * p), n, p)
y <- sinpi(2 * x[, 1]) + cospi(2 * x[, 2]) + rnorm(n, sd = 0.1)
df <- 5
splines <- lapply(1:p, function(j) splines::ns(x[, j], df = df))
x.s <- do.call(cbind, splines)
group <- rep(1:p, each = df)
fit <- cv.grpsel(x.s, y, group)

Let’s check that the first two predictors have been selected correctly.

unique(group[coef(fit)[- 1] != 0])
#> [1] 1 2

The fitted functions can be plotted as follows.

library(ggplot2)

beta0 <- coef(fit)[1]
beta <- coef(fit)[- 1]
int.x1 <- (beta0 + colMeans(x.s[, - (1:df)]) %*% beta[- (1:df)])[, ]
int.x2 <- (beta0 + colMeans(x.s[, - (1:df + df)]) %*% beta[- (1:df + df)])[, ]

ggplot(x = seq(- 1, 1, length.out = 101)) +
  stat_function(fun = function(x) sinpi(2 * x), aes(linetype = 'True function')) +
  stat_function(fun = function(x) int.x1 + predict(splines[[1]], x) %*% beta[1:df], aes(linetype = 'Fitted function')) +
  xlab('x') +
  ylab('f(x)')


ggplot(x = seq(- 1, 1, length.out = 101)) +
  stat_function(fun = function(x) cospi(2 * x), aes(linetype = 'True function')) +
  stat_function(fun = function(x) int.x2 + predict(splines[[2]], x) %*% beta[1:df + df], aes(linetype = 'Fitted function')) +
  xlab('x') +
  ylab('f(x)')

Hierarchical interactions

When modelling interactions between predictors it is often desirable to enforce the condition that an interaction can be selected only when its constituent predictors are selected, i.e., the coefficient on \(x_1x_2\) can be nonzero only when the coefficients on \(x_1\) and \(x_2\) are nonzero. It is straightforward to enforce this hierarchical constraint using group selection.

In this example, the main effects \(x_1\), \(x_2\), and \(x_3\) are nonzero, as well as the interaction \(x_1x_2\).

n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + x[, 2] + x[, 3] + x[, 1] * x[, 2] + rnorm(n)

x.int <- model.matrix(~ - 1 + . ^ 2, data = as.data.frame(x))
group <- c(1:p, mapply(c, combn(1:p, 2, simplify = F), 1:choose(p, 2) + p, SIMPLIFY = F))
fit <- cv.grpsel(x.int, y, group)
colnames(x.int)[which(coef(fit)[- 1] != 0)]
#> [1] "V1"    "V2"    "V3"    "V1:V2"

The fitted model respects the hierarchy constraint.