# Introduction

Sparse-group SLOPE (SGS) is a penalised regression approach that performs bi-level selection with FDR control under orthogonal designs. SGS is described in detail in .

The method is implemented in the sgs R package. The package has implementations for Gaussian and Binomial responses, both of which are demonstrated here.

# Gaussian response

### Data

For this example, a $$400\times 500$$ input matrix is used with a simple grouping structure, sampled from a multivariate Gaussian distribution with no correlation.

library(sgs)
groups = c(rep(1:20, each=3),
rep(21:40, each=4),
rep(41:60, each=5),
rep(61:80, each=6),
rep(81:100, each=7))

data = generate_toy_data(p=500, n=400, groups = groups, seed_id=3)

### Fitting an SGS model

We now fit an SGS model to the data using linear regression. The SGS model has many different hyperparameters which can be tuned/selected. Of particular importance is the $$\lambda$$ parameter, which defines the level of sparsity in the model. First, we select this manually and then next use cross-validation to tune it. The other parameters we leave as their default values, although they can easily be changed.

model = fit_sgs(X = data$X, y = data$y, groups = groups, type="linear", lambda = 1, alpha=0.95, vFDR=0.1, gFDR=0.1, standardise = "l2", intercept = TRUE, verbose=FALSE)

Note: we have fit an intercept and applied $$\ell_2$$ standardisation. This is the recommended usage when applying SGS.

### Output of model

The package provides several useful outputs after fitting a model. The vector shows the fitted values (note the intercept). We can also recover the indices of the non-zero variables and groups, which are indexed from the first variable, not the intercept.

model$beta[model$selected_var+1,]
##       v100       v136       v234       v334
##  1.2211478  3.0393493  2.4211141 -0.4108451
model$group.effects[model$selected_group,]
##       G30       G39       G59       G76
## 1.2211478 3.0393493 2.4211141 0.4108451
model$selected_var ## [1] 100 136 234 334 model$selected_group
## [1] 30 39 59 76

Defining a function that lets us calculate various metrics (including the FDR and sensitivity):

fdr_sensitivity = function(fitted_ids, true_ids,num_coef){
# calculates FDR, FPR, and sensitivity
num_true = length(intersect(fitted_ids,true_ids))
num_false = length(fitted_ids) - num_true
num_missed = length(true_ids) - num_true
num_true_negatives = num_coef - length(true_ids)
out=c()
out$fdr = num_false / (num_true + num_false) if (is.nan(out$fdr)){out$fdr = 0} out$sensitivity = num_true / length(true_ids)
if (length(true_ids) == 0){
out$sensitivity = 1 } out$fpr = num_false / num_true_negatives
out$f1 = (2*num_true)/(2*num_true + num_false + num_missed) if (is.nan(out$f1)){out$f1 = 1} return(out) } Calculating relevant metrics give fdr_sensitivity(fitted_ids = model$selected_var, true_ids = data$true_var_id, num_coef = 500) ##$fdr
## [1] 0
##
## $sensitivity ## [1] 0.1428571 ## ##$fpr
## [1] 0
##
## $f1 ## [1] 0.25 fdr_sensitivity(fitted_ids = model$selected_group, true_ids = data$true_grp_id, num_coef = 100) ##$fdr
## [1] 0
##
## $sensitivity ## [1] 0.4 ## ##$fpr
## [1] 0
##
## $f1 ## [1] 0.5714286 The model is currently too sparse, as our choice of $$\lambda$$ is too high. We can instead use cross-validation. ### Cross validation Cross-validation is used to fit SGS models along a $$\lambda$$ path of length $$20$$. The first value, $$\lambda_\text{max}$$, is chosen to give the null model and the path is terminated at $$\lambda_\text{min} = min_frac \dot \lambda_\text{max}$$. The 1se rule (as in the package) is used to choose the optimal model. cv_model = fit_sgs_cv(X = data$X, y = data$y, groups=groups, type = "linear", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=TRUE,verbose=TRUE) ## [1] "Lambda 1/20 done. Lambda: 1.8388. Number of non-zero: 0. Error: 9522.47623165794. Avg iter: 7" ## [1] "Lambda 2/20 done. Lambda: 1.5706. Number of non-zero: 1. Error: 9121.87427456379. Avg iter: 26" ## [1] "Lambda 3/20 done. Lambda: 1.3415. Number of non-zero: 2. Error: 8348.10001181521. Avg iter: 31" ## [1] "Lambda 4/20 done. Lambda: 1.1458. Number of non-zero: 3. Error: 7525.33322733818. Avg iter: 35" ## [1] "Lambda 5/20 done. Lambda: 0.9787. Number of non-zero: 8. Error: 6706.42167334536. Avg iter: 35" ## [1] "Lambda 6/20 done. Lambda: 0.8359. Number of non-zero: 14. Error: 5509.29608346792. Avg iter: 44" ## [1] "Lambda 7/20 done. Lambda: 0.714. Number of non-zero: 15. Error: 4256.09986924448. Avg iter: 37" ## [1] "Lambda 8/20 done. Lambda: 0.6098. Number of non-zero: 15. Error: 3281.14251906525. Avg iter: 37" ## [1] "Lambda 9/20 done. Lambda: 0.5209. Number of non-zero: 15. Error: 2559.14833615298. Avg iter: 37" ## [1] "Lambda 10/20 done. Lambda: 0.4449. Number of non-zero: 19. Error: 2011.70593862728. Avg iter: 36" ## [1] "Lambda 11/20 done. Lambda: 0.38. Number of non-zero: 22. Error: 1552.91875758507. Avg iter: 39" ## [1] "Lambda 12/20 done. Lambda: 0.3246. Number of non-zero: 22. Error: 1179.20560203183. Avg iter: 33" ## [1] "Lambda 13/20 done. Lambda: 0.2772. Number of non-zero: 22. Error: 892.913519340801. Avg iter: 32" ## [1] "Lambda 14/20 done. Lambda: 0.2368. Number of non-zero: 22. Error: 682.816068341828. Avg iter: 34" ## [1] "Lambda 15/20 done. Lambda: 0.2022. Number of non-zero: 24. Error: 525.918317926705. Avg iter: 36" ## [1] "Lambda 16/20 done. Lambda: 0.1727. Number of non-zero: 24. Error: 405.623243053464. Avg iter: 37" ## [1] "Lambda 17/20 done. Lambda: 0.1475. Number of non-zero: 26. Error: 314.722794039507. Avg iter: 39" ## [1] "Lambda 18/20 done. Lambda: 0.126. Number of non-zero: 26. Error: 244.495442693249. Avg iter: 39" ## [1] "Lambda 19/20 done. Lambda: 0.1076. Number of non-zero: 27. Error: 192.736115731821. Avg iter: 40" ## [1] "Lambda 20/20 done. Lambda: 0.0919. Number of non-zero: 27. Error: 154.39418403602. Avg iter: 41" The fitting verbose contains useful information, showing the error for each $$\lambda$$ values, as well as the number of non-zero parameters. Aside from the fitting verbose, we can see a more succinct summary by using the function print(cv_model) ## ## regression type: linear ## ## lambda error estimated_non_zero ## [1,] 1.8387781 9522.4762 0 ## [2,] 1.5705583 9121.8743 1 ## [3,] 1.3414633 8348.1000 2 ## [4,] 1.1457860 7525.3332 3 ## [5,] 0.9786519 6706.4217 8 ## [6,] 0.8358975 5509.2961 14 ## [7,] 0.7139663 4256.0999 15 ## [8,] 0.6098211 3281.1425 15 ## [9,] 0.5208674 2559.1483 15 ## [10,] 0.4448893 2011.7059 19 ## [11,] 0.3799940 1552.9188 22 ## [12,] 0.3245648 1179.2056 22 ## [13,] 0.2772210 892.9135 22 ## [14,] 0.2367832 682.8161 22 ## [15,] 0.2022440 525.9183 24 ## [16,] 0.1727430 405.6232 24 ## [17,] 0.1475452 314.7228 26 ## [18,] 0.1260230 244.4954 26 ## [19,] 0.1076402 192.7361 27 ## [20,] 0.0919389 154.3942 27 The best model is found to be the one at the end of the path. Checking the metrics again, we see how CV has generated a model with the correct amount of sparsity that gives FDR levels below the specified values. fdr_sensitivity(fitted_ids = cv_model$fit$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr ## [1] 0.03703704 ## ##$sensitivity
## [1] 0.9285714
##
## $fpr ## [1] 0.002118644 ## ##$f1
## [1] 0.9454545
fdr_sensitivity(fitted_ids = cv_model$fit$selected_group, true_ids = data$true_grp_id, num_coef = 100) ##$fdr
## [1] 0.09090909
##
## $sensitivity ## [1] 1 ## ##$fpr
## [1] 0.01111111
##
## $f1 ## [1] 0.952381 ### Plot We can visualise the solution using the plot function plot(cv_model,how_many = 10) ### Prediction The package has an implemented predict function, to allow for easy prediction predict(model,data$X,type="linear")[1:5]
## [1]  2.79322803  0.03135326  5.00188815  4.00179438 -2.74987113

# Logistic regression

As mentioned, the package can also be used to fit SGS to a Binomial response. First, we generate some Binomial data. We can use the same input matrix, $$X$$, and true $$\beta$$ as before.

sigmoid = function(x) {
1 / (1 + exp(-x))
}
y = ifelse(sigmoid(data$X %*% data$true_beta + rnorm(400))>0.5,1,0)
train_y = y[1:350]
test_y = y[351:400]
train_X = data$X[1:350,] test_X = data$X[351:400,]

### Fitting and prediction

We can again apply CV.

cv_model = fit_sgs_cv(X = train_X, y = train_y, groups=groups, type = "logistic", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=FALSE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 0.0486. Number of non-zero: 0. Misclassification error: 0.437142857142857. Avg iter: 2"
## [1] "Lambda 2/20 done. Lambda: 0.0415. Number of non-zero: 1. Misclassification error: 0.351428571428571. Avg iter: 4"
## [1] "Lambda 3/20 done. Lambda: 0.0355. Number of non-zero: 2. Misclassification error: 0.331428571428571. Avg iter: 5"
## [1] "Lambda 4/20 done. Lambda: 0.0303. Number of non-zero: 2. Misclassification error: 0.314285714285714. Avg iter: 7"
## [1] "Lambda 5/20 done. Lambda: 0.0259. Number of non-zero: 9. Misclassification error: 0.28. Avg iter: 11"
## [1] "Lambda 6/20 done. Lambda: 0.0221. Number of non-zero: 11. Misclassification error: 0.222857142857143. Avg iter: 9"
## [1] "Lambda 7/20 done. Lambda: 0.0189. Number of non-zero: 13. Misclassification error: 0.208571428571429. Avg iter: 11"
## [1] "Lambda 8/20 done. Lambda: 0.0161. Number of non-zero: 23. Misclassification error: 0.177142857142857. Avg iter: 15"
## [1] "Lambda 9/20 done. Lambda: 0.0138. Number of non-zero: 34. Misclassification error: 0.14. Avg iter: 19"
## [1] "Lambda 10/20 done. Lambda: 0.0118. Number of non-zero: 43. Misclassification error: 0.134285714285714. Avg iter: 20"
## [1] "Lambda 11/20 done. Lambda: 0.0101. Number of non-zero: 57. Misclassification error: 0.131428571428571. Avg iter: 21"
## [1] "Lambda 12/20 done. Lambda: 0.0086. Number of non-zero: 80. Misclassification error: 0.142857142857143. Avg iter: 21"
## [1] "Lambda 13/20 done. Lambda: 0.0073. Number of non-zero: 102. Misclassification error: 0.14. Avg iter: 22"
## [1] "Lambda 14/20 done. Lambda: 0.0063. Number of non-zero: 113. Misclassification error: 0.148571428571429. Avg iter: 26"
## [1] "Lambda 15/20 done. Lambda: 0.0054. Number of non-zero: 120. Misclassification error: 0.145714285714286. Avg iter: 30"
## [1] "Lambda 16/20 done. Lambda: 0.0046. Number of non-zero: 124. Misclassification error: 0.145714285714286. Avg iter: 36"
## [1] "Lambda 17/20 done. Lambda: 0.0039. Number of non-zero: 132. Misclassification error: 0.145714285714286. Avg iter: 45"
## [1] "Lambda 18/20 done. Lambda: 0.0033. Number of non-zero: 136. Misclassification error: 0.137142857142857. Avg iter: 55"
## [1] "Lambda 19/20 done. Lambda: 0.0028. Number of non-zero: 154. Misclassification error: 0.148571428571429. Avg iter: 67"
## [1] "Lambda 20/20 done. Lambda: 0.0024. Number of non-zero: 164. Misclassification error: 0.148571428571429. Avg iter: 81"

and again, use the predict function

predictions = predict(cv_model$fit,test_X,type="logistic") In the Binomial case, the function returns both the predicted class probabilities (response) and the predicted class (class). We can use this to check the prediction accuracy, given as $$84\%$$. predictions$response[1:5]
## [1] 0.4085523 0.6480921 0.1706632 0.1489437 0.3894489
predictions$class[1:5] ## [1] 0 1 0 0 0 sum(predictions$class == test_y)/length(test_y)
## [1] 0.82