# Forward Stepwise Search

### Background

A common problem in building statistical models is deciding which features to include. The statistical literature offers several approaches, but there is no consensus. Two examples are the lasso and simply trying all possible combinations of predictors. For large data, both require extensive computation time: the lasso needs cross-validation to find a good value of lambda, and trying all possible combinations of predictors requires fitting many models.

With a multithreaded BLAS, forward stepwise search is a computationally lightweight feature selection method. No resampling is needed, and the feature space is searched efficiently. This vignette tests the method in a variety of situations.

### Mathematical Background

The more parameters a model has, the better it fits the training data. However, a model that is too complex performs worse on unseen data. AIC (the Akaike information criterion) strikes a balance between fitting the training data well and keeping the model simple.
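For reference, the standard definition of AIC can be written as follows, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized value of the likelihood (the symbols are introduced here for illustration and do not appear elsewhere in this vignette):

```latex
$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$
```

Lower values are better. Adding one parameter raises the penalty term $2k$ by 2, so the extra parameter lowers AIC only if it increases $2\ln\hat{L}$ by more than 2.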

Using AIC, a forward search starts with no features. Each feature is then considered in turn: with 10 features, there are 10 one-feature models under consideration. AIC is calculated for each model, and the model with the lowest AIC is selected. After the first feature is selected, the remaining 9 features are considered. Of these, the one giving the lowest AIC is added, creating a 2-feature model. The procedure stops when adding another feature no longer improves AIC.
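The procedure just described can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by `stepAIC`: the helper `forward_aic()`, the toy data frame `dat`, and its column names are all hypothetical, and a Gaussian GLM is used only to keep the example self-contained.

```r
# Toy data (hypothetical): two real predictors and two noise columns.
set.seed(42)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

# Hypothetical helper: pure forward selection by AIC (no removal step).
forward_aic <- function(data, response, candidates) {
  selected <- character(0)
  # Start from the intercept-only model.
  best_aic <- AIC(glm(reformulate("1", response), data = data,
                      family = gaussian()))
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # AIC of every model that adds one more candidate to the current set.
    aics <- vapply(remaining, function(v) {
      AIC(glm(reformulate(c(selected, v), response), data = data,
              family = gaussian()))
    }, numeric(1))
    # Stop when no single addition lowers AIC.
    if (min(aics) >= best_aic) break
    best_aic <- min(aics)
    selected <- c(selected, names(which.min(aics)))
  }
  list(selected = selected, aic = best_aic)
}

result <- forward_aic(dat, "y", c("x1", "x2", "noise1", "noise2"))
print(result$selected)
```

Note that the `stepAIC` traces in this vignette also list removal candidates (rows starting with `-`); the sketch above leaves that step out and only ever adds features.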

### Easy Problem: Large N And Only A Few Unrelated Variables

library(GlmSimulatoR)
library(ggplot2)
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select

set.seed(1)
simdata <- simulate_inverse_gaussion(N = 100000, link = "1/mu^2",
weights = c(1, 2, 3), unrelated = 3)

# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x=Y)) +
geom_histogram(bins = 30)


scopeArg <- list(
lower = Y ~ 1,
upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3
)

startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start:  AIC=-209832
#> Y ~ 1
#>
#>              Df Deviance     AIC
#> + X3          1    33541 -211190
#> + X2          1    33792 -210458
#> + X1          1    33956 -209982
#> <none>             34008 -209832
#> + Unrelated3  1    34008 -209830
#> + Unrelated1  1    34008 -209830
#> + Unrelated2  1    34008 -209830
#>
#> Step:  AIC=-211211.7
#> Y ~ X3
#>
#>              Df Deviance     AIC
#> + X2          1    33327 -211844
#> + X1          1    33489 -211366
#> <none>             33541 -211212
#> + Unrelated3  1    33541 -211210
#> + Unrelated1  1    33541 -211210
#> + Unrelated2  1    33541 -211210
#> - X3          1    34008 -209830
#>
#> Step:  AIC=-211849.4
#> Y ~ X3 + X2
#>
#>              Df Deviance     AIC
#> + X1          1    33273 -212009
#> <none>             33327 -211849
#> + Unrelated3  1    33327 -211848
#> + Unrelated1  1    33327 -211847
#> + Unrelated2  1    33327 -211847
#> - X2          1    33541 -211212
#> - X3          1    33792 -210460
#>
#> Step:  AIC=-212009.4
#> Y ~ X3 + X2 + X1
#>
#>              Df Deviance     AIC
#> <none>             33273 -212009
#> + Unrelated3  1    33273 -212008
#> + Unrelated1  1    33273 -212008
#> + Unrelated2  1    33273 -212007
#> - X1          1    33327 -211850
#> - X2          1    33489 -211367
#> - X3          1    33739 -210616
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1, family = inverse.gaussian(link = "1/mu^2"),
#>     data = simdata)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -2.6856  -0.4742  -0.0887   0.2979   2.3616
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2.81192    0.20843   13.49   <2e-16 ***
#> X3           3.03191    0.08116   37.36   <2e-16 ***
#> X2           2.05731    0.08101   25.40   <2e-16 ***
#> X1           1.02594    0.08067   12.72   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3335926)
#>
#>     Null deviance: 34008  on 99999  degrees of freedom
#> Residual deviance: 33273  on 99996  degrees of freedom
#> AIC: -212009
#>
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

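The summary shows that the search selected exactly X1, X2, and X3, and the estimated coefficients (about 1.03, 2.06, and 3.03) are close to the true weights of 1, 2, and 3. Forward stepwise search worked perfectly!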

### Medium Problem: Large N And A Lot Of Unrelated Variables

set.seed(2)
simdata <- simulate_inverse_gaussion(N = 100000, link = "1/mu^2",
weights = c(1, 2, 3), unrelated = 20)

# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x=Y)) +
geom_histogram(bins = 30)


scopeArg <- list(
lower = Y ~ 1,
upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3 +
Unrelated4 + Unrelated5 + Unrelated6 + Unrelated7 + Unrelated8 + Unrelated9 +
Unrelated10 + Unrelated11 + Unrelated12 + Unrelated13 + Unrelated14 + Unrelated15 +
Unrelated16 + Unrelated17 + Unrelated18 + Unrelated19 + Unrelated20
)

startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start:  AIC=-210348.5
#> Y ~ 1
#>
#>               Df Deviance     AIC
#> + X3           1    33551 -211691
#> + X2           1    33817 -210909
#> + X1           1    33955 -210505
#> + Unrelated4   1    34007 -210353
#> + Unrelated9   1    34008 -210350
#> + Unrelated14  1    34008 -210349
#> + Unrelated19  1    34009 -210349
#> <none>              34009 -210349
#> + Unrelated18  1    34009 -210348
#> + Unrelated5   1    34009 -210348
#> + Unrelated20  1    34009 -210348
#> + Unrelated17  1    34009 -210347
#> + Unrelated1   1    34009 -210347
#> + Unrelated3   1    34009 -210347
#> + Unrelated6   1    34009 -210347
#> + Unrelated2   1    34009 -210347
#> + Unrelated11  1    34009 -210347
#> + Unrelated13  1    34009 -210347
#> + Unrelated16  1    34009 -210347
#> + Unrelated7   1    34009 -210347
#> + Unrelated15  1    34009 -210347
#> + Unrelated10  1    34009 -210347
#> + Unrelated8   1    34009 -210347
#> + Unrelated12  1    34009 -210347
#>
#> Step:  AIC=-211704.9
#> Y ~ X3
#>
#>               Df Deviance     AIC
#> + X2           1    33357 -212281
#> + X1           1    33496 -211865
#> + Unrelated4   1    33548 -211710
#> + Unrelated9   1    33550 -211706
#> + Unrelated14  1    33550 -211706
#> + Unrelated19  1    33550 -211706
#> + Unrelated18  1    33550 -211705
#> <none>              33551 -211705
#> + Unrelated17  1    33550 -211704
#> + Unrelated5   1    33550 -211704
#> + Unrelated20  1    33550 -211704
#> + Unrelated1   1    33550 -211704
#> + Unrelated3   1    33550 -211703
#> + Unrelated6   1    33550 -211703
#> + Unrelated2   1    33551 -211703
#> + Unrelated13  1    33551 -211703
#> + Unrelated8   1    33551 -211703
#> + Unrelated15  1    33551 -211703
#> + Unrelated11  1    33551 -211703
#> + Unrelated7   1    33551 -211703
#> + Unrelated12  1    33551 -211703
#> + Unrelated10  1    33551 -211703
#> + Unrelated16  1    33551 -211703
#> - X3           1    34009 -210338
#>
#> Step:  AIC=-212282
#> Y ~ X3 + X2
#>
#>               Df Deviance     AIC
#> + X1           1    33303 -212442
#> + Unrelated4   1    33355 -212287
#> + Unrelated14  1    33356 -212283
#> + Unrelated9   1    33356 -212283
#> + Unrelated18  1    33356 -212282
#> + Unrelated19  1    33356 -212282
#> <none>              33357 -212282
#> + Unrelated17  1    33356 -212281
#> + Unrelated20  1    33356 -212281
#> + Unrelated5   1    33357 -212281
#> + Unrelated1   1    33357 -212281
#> + Unrelated6   1    33357 -212281
#> + Unrelated3   1    33357 -212280
#> + Unrelated13  1    33357 -212280
#> + Unrelated2   1    33357 -212280
#> + Unrelated8   1    33357 -212280
#> + Unrelated15  1    33357 -212280
#> + Unrelated16  1    33357 -212280
#> + Unrelated10  1    33357 -212280
#> + Unrelated12  1    33357 -212280
#> + Unrelated11  1    33357 -212280
#> + Unrelated7   1    33357 -212280
#> - X2           1    33551 -211702
#> - X3           1    33817 -210899
#>
#> Step:  AIC=-212441.3
#> Y ~ X3 + X2 + X1
#>
#>               Df Deviance     AIC
#> + Unrelated4   1    33301 -212446
#> + Unrelated14  1    33302 -212442
#> + Unrelated18  1    33302 -212442
#> + Unrelated9   1    33302 -212442
#> + Unrelated19  1    33302 -212442
#> <none>              33303 -212441
#> + Unrelated20  1    33303 -212440
#> + Unrelated17  1    33303 -212440
#> + Unrelated5   1    33303 -212440
#> + Unrelated1   1    33303 -212440
#> + Unrelated6   1    33303 -212440
#> + Unrelated3   1    33303 -212440
#> + Unrelated13  1    33303 -212439
#> + Unrelated2   1    33303 -212439
#> + Unrelated15  1    33303 -212439
#> + Unrelated8   1    33303 -212439
#> + Unrelated16  1    33303 -212439
#> + Unrelated10  1    33303 -212439
#> + Unrelated12  1    33303 -212439
#> + Unrelated7   1    33303 -212439
#> + Unrelated11  1    33303 -212439
#> - X1           1    33357 -212281
#> - X2           1    33496 -211861
#> - X3           1    33764 -211055
#>
#> Step:  AIC=-212446.4
#> Y ~ X3 + X2 + X1 + Unrelated4
#>
#>               Df Deviance     AIC
#> + Unrelated18  1    33300 -212447
#> + Unrelated14  1    33300 -212447
#> + Unrelated9   1    33300 -212447
#> + Unrelated19  1    33300 -212447
#> <none>              33301 -212446
#> + Unrelated20  1    33300 -212446
#> + Unrelated17  1    33300 -212445
#> + Unrelated5   1    33300 -212445
#> + Unrelated1   1    33300 -212445
#> + Unrelated6   1    33301 -212445
#> + Unrelated3   1    33301 -212445
#> + Unrelated13  1    33301 -212445
#> + Unrelated2   1    33301 -212445
#> + Unrelated15  1    33301 -212444
#> + Unrelated8   1    33301 -212444
#> + Unrelated16  1    33301 -212444
#> + Unrelated10  1    33301 -212444
#> + Unrelated12  1    33301 -212444
#> + Unrelated7   1    33301 -212444
#> + Unrelated11  1    33301 -212444
#> - Unrelated4   1    33303 -212441
#> - X1           1    33355 -212286
#> - X2           1    33494 -211867
#> - X3           1    33761 -211060
#>
#> Step:  AIC=-212447.3
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18
#>
#>               Df Deviance     AIC
#> + Unrelated14  1    33299 -212448
#> + Unrelated9   1    33299 -212448
#> + Unrelated19  1    33299 -212448
#> <none>              33300 -212447
#> + Unrelated20  1    33299 -212446
#> - Unrelated18  1    33301 -212446
#> + Unrelated17  1    33299 -212446
#> + Unrelated5   1    33299 -212446
#> + Unrelated1   1    33299 -212446
#> + Unrelated6   1    33300 -212446
#> + Unrelated3   1    33300 -212446
#> + Unrelated13  1    33300 -212445
#> + Unrelated2   1    33300 -212445
#> + Unrelated15  1    33300 -212445
#> + Unrelated8   1    33300 -212445
#> + Unrelated16  1    33300 -212445
#> + Unrelated10  1    33300 -212445
#> + Unrelated12  1    33300 -212445
#> + Unrelated7   1    33300 -212445
#> + Unrelated11  1    33300 -212445
#> - Unrelated4   1    33302 -212442
#> - X1           1    33354 -212286
#> - X2           1    33493 -211867
#> - X3           1    33761 -211060
#>
#> Step:  AIC=-212448.1
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14
#>
#>               Df Deviance     AIC
#> + Unrelated9   1    33298 -212449
#> + Unrelated19  1    33298 -212449
#> <none>              33299 -212448
#> - Unrelated14  1    33300 -212447
#> - Unrelated18  1    33300 -212447
#> + Unrelated20  1    33298 -212447
#> + Unrelated17  1    33298 -212447
#> + Unrelated5   1    33298 -212447
#> + Unrelated1   1    33299 -212447
#> + Unrelated6   1    33299 -212447
#> + Unrelated3   1    33299 -212447
#> + Unrelated13  1    33299 -212446
#> + Unrelated2   1    33299 -212446
#> + Unrelated15  1    33299 -212446
#> + Unrelated8   1    33299 -212446
#> + Unrelated16  1    33299 -212446
#> + Unrelated10  1    33299 -212446
#> + Unrelated12  1    33299 -212446
#> + Unrelated7   1    33299 -212446
#> + Unrelated11  1    33299 -212446
#> - Unrelated4   1    33301 -212443
#> - X1           1    33353 -212287
#> - X2           1    33492 -211868
#> - X3           1    33760 -211061
#>
#> Step:  AIC=-212448.9
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 + Unrelated9
#>
#>               Df Deviance     AIC
#> + Unrelated19  1    33297 -212450
#> <none>              33298 -212449
#> - Unrelated9   1    33299 -212448
#> - Unrelated18  1    33299 -212448
#> - Unrelated14  1    33299 -212448
#> + Unrelated20  1    33298 -212448
#> + Unrelated17  1    33298 -212448
#> + Unrelated5   1    33298 -212448
#> + Unrelated1   1    33298 -212448
#> + Unrelated6   1    33298 -212447
#> + Unrelated3   1    33298 -212447
#> + Unrelated13  1    33298 -212447
#> + Unrelated2   1    33298 -212447
#> + Unrelated15  1    33298 -212447
#> + Unrelated8   1    33298 -212447
#> + Unrelated16  1    33298 -212447
#> + Unrelated10  1    33298 -212447
#> + Unrelated12  1    33298 -212447
#> + Unrelated7   1    33298 -212447
#> + Unrelated11  1    33298 -212447
#> - Unrelated4   1    33300 -212444
#> - X1           1    33352 -212288
#> - X2           1    33491 -211869
#> - X3           1    33759 -211062
#>
#> Step:  AIC=-212449.5
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 + Unrelated9 +
#>     Unrelated19
#>
#>               Df Deviance     AIC
#> <none>              33297 -212450
#> - Unrelated19  1    33298 -212449
#> - Unrelated9   1    33298 -212449
#> - Unrelated18  1    33298 -212449
#> - Unrelated14  1    33298 -212449
#> + Unrelated20  1    33297 -212449
#> + Unrelated17  1    33297 -212449
#> + Unrelated5   1    33297 -212449
#> + Unrelated1   1    33297 -212448
#> + Unrelated6   1    33297 -212448
#> + Unrelated3   1    33297 -212448
#> + Unrelated13  1    33297 -212448
#> + Unrelated2   1    33297 -212448
#> + Unrelated15  1    33297 -212448
#> + Unrelated8   1    33297 -212448
#> + Unrelated16  1    33297 -212448
#> + Unrelated10  1    33297 -212448
#> + Unrelated12  1    33297 -212448
#> + Unrelated7   1    33297 -212448
#> + Unrelated11  1    33297 -212448
#> - Unrelated4   1    33299 -212444
#> - X1           1    33351 -212288
#> - X2           1    33490 -211870
#> - X3           1    33758 -211062
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 +
#>     Unrelated9 + Unrelated19, family = inverse.gaussian(link = "1/mu^2"),
#>     data = simdata)
#>
#> Deviance Residuals:
#>      Min        1Q    Median        3Q       Max
#> -2.58085  -0.46828  -0.08548   0.29831   2.38619
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.09437    0.34081   9.080  < 2e-16 ***
#> X3           3.02170    0.08107  37.274  < 2e-16 ***
#> X2           1.95173    0.08092  24.120  < 2e-16 ***
#> X1           1.03102    0.08070  12.776  < 2e-16 ***
#> Unrelated4   0.21603    0.08082   2.673  0.00752 **
#> Unrelated18 -0.13565    0.08087  -1.677  0.09347 .
#> Unrelated14  0.13661    0.08081   1.691  0.09092 .
#> Unrelated9  -0.13463    0.08080  -1.666  0.09567 .
#> Unrelated19 -0.13307    0.08086  -1.646  0.09982 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3318398)
#>
#>     Null deviance: 34009  on 99999  degrees of freedom
#> Residual deviance: 33297  on 99991  degrees of freedom
#> AIC: -212450
#>
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

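Five unrelated variables (Unrelated4, Unrelated18, Unrelated14, Unrelated9, and Unrelated19) made it into the final model, although their estimated coefficients are small. At least all three related features were selected.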
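A rough calculation suggests why this happens. AIC charges a penalty of 2 per added parameter, so a truly unrelated feature still lowers AIC whenever chance alone improves twice the log-likelihood by more than 2. The sketch below assumes that this improvement is approximately chi-squared with 1 degree of freedom and treats the 20 unrelated candidates as independent; both are simplifications of the actual stepwise path, so the result is a ballpark figure only.

```r
# Probability that one truly unrelated feature lowers AIC when added.
# Assumption: the log-likelihood gain (times 2) is roughly chi-squared, 1 df.
p_spurious <- pchisq(2, df = 1, lower.tail = FALSE)

# Rough expected count among 20 unrelated candidates, treating each
# candidate independently (a simplification of the stepwise search).
expected_spurious <- 20 * p_spurious

print(round(p_spurious, 3))        # 0.157
print(round(expected_spurious, 1)) # 3.1
```

Under these assumptions, a handful of unrelated variables entering a 20-candidate search is what AIC's penalty would lead one to expect, regardless of sample size.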

### Hard Problem: Small N And Only A Few Unrelated Variables

set.seed(3)
simdata <- simulate_inverse_gaussion(N = 1000, link = "1/mu^2",
weights = c(1, 2, 3), unrelated = 3)

# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x=Y)) +
geom_histogram(bins = 30)


scopeArg <- list(
lower = Y ~ 1,
upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3
)

startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start:  AIC=-2091.87
#> Y ~ 1
#>
#>              Df Deviance     AIC
#> + X3          1   344.37 -2100.2
#> + X2          1   346.42 -2094.4
#> + X1          1   347.08 -2092.6
#> <none>            348.05 -2091.9
#> + Unrelated1  1   347.48 -2091.5
#> + Unrelated3  1   347.86 -2090.4
#> + Unrelated2  1   348.05 -2089.9
#>
#> Step:  AIC=-2100.52
#> Y ~ X3
#>
#>              Df Deviance     AIC
#> + X2          1   342.77 -2103.1
#> + X1          1   343.29 -2101.6
#> <none>            344.37 -2100.5
#> + Unrelated1  1   343.80 -2100.2
#> + Unrelated3  1   344.24 -2098.9
#> + Unrelated2  1   344.35 -2098.6
#> - X3          1   348.05 -2092.0
#>
#> Step:  AIC=-2103.17
#> Y ~ X3 + X2
#>
#>              Df Deviance     AIC
#> + X1          1   341.61 -2104.5
#> <none>            342.77 -2103.2
#> + Unrelated1  1   342.23 -2102.7
#> + Unrelated3  1   342.68 -2101.4
#> + Unrelated2  1   342.74 -2101.2
#> - X2          1   344.37 -2100.6
#> - X3          1   346.42 -2094.7
#>
#> Step:  AIC=-2104.55
#> Y ~ X3 + X2 + X1
#>
#>              Df Deviance     AIC
#> <none>            341.61 -2104.6
#> + Unrelated1  1   341.07 -2104.1
#> - X1          1   342.77 -2103.2
#> + Unrelated3  1   341.48 -2102.9
#> + Unrelated2  1   341.58 -2102.6
#> - X2          1   343.29 -2101.7
#> - X3          1   345.36 -2095.8
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1, family = inverse.gaussian(link = "1/mu^2"),
#>     data = simdata)
#>
#> Deviance Residuals:
#>      Min        1Q    Median        3Q       Max
#> -1.75911  -0.49424  -0.08638   0.30464   1.85673
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   2.8866     2.1941   1.316  0.18860
#> X3            2.7355     0.8310   3.292  0.00103 **
#> X2            1.8694     0.8491   2.202  0.02792 *
#> X1            1.5190     0.8304   1.829  0.06767 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3466483)
#>
#>     Null deviance: 348.05  on 999  degrees of freedom
#> Residual deviance: 341.61  on 996  degrees of freedom
#> AIC: -2104.6
#>
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

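The correct model was found, although with only 1,000 observations the coefficient estimates are far less precise (standard errors near 0.83, compared with about 0.08 for the large sample). Again, forward stepwise search worked perfectly!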

### Very Hard Problem: Small N And A Lot Of Unrelated Variables

set.seed(4)
simdata <- simulate_inverse_gaussion(N = 1000, link = "1/mu^2",
weights = c(1, 2, 3), unrelated = 20)

# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x=Y)) +
geom_histogram(bins = 30)


scopeArg <- list(
lower = Y ~ 1,
upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3 +
Unrelated4 + Unrelated5 + Unrelated6 + Unrelated7 + Unrelated8 + Unrelated9 +
Unrelated10 + Unrelated11 + Unrelated12 + Unrelated13 + Unrelated14 + Unrelated15 +
Unrelated16 + Unrelated17 + Unrelated18 + Unrelated19 + Unrelated20
)

startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start:  AIC=-2125.88
#> Y ~ 1
#>
#>               Df Deviance     AIC
#> + X3           1   340.71 -2136.2
#> + X2           1   343.81 -2127.4
#> + X1           1   343.86 -2127.3
#> + Unrelated8   1   343.88 -2127.2
#> + Unrelated20  1   344.22 -2126.3
#> + Unrelated15  1   344.32 -2126.0
#> + Unrelated19  1   344.32 -2126.0
#> <none>             345.07 -2125.9
#> + Unrelated7   1   344.59 -2125.2
#> + Unrelated11  1   344.74 -2124.8
#> + Unrelated6   1   344.81 -2124.6
#> + Unrelated4   1   344.82 -2124.6
#> + Unrelated1   1   344.84 -2124.5
#> + Unrelated2   1   344.89 -2124.4
#> + Unrelated10  1   344.95 -2124.2
#> + Unrelated12  1   344.97 -2124.2
#> + Unrelated14  1   345.01 -2124.0
#> + Unrelated13  1   345.02 -2124.0
#> + Unrelated3   1   345.06 -2123.9
#> + Unrelated18  1   345.06 -2123.9
#> + Unrelated5   1   345.06 -2123.9
#> + Unrelated9   1   345.06 -2123.9
#> + Unrelated17  1   345.06 -2123.9
#> + Unrelated16  1   345.07 -2123.9
#>
#> Step:  AIC=-2136.6
#> Y ~ X3
#>
#>               Df Deviance     AIC
#> + X2           1   339.40 -2138.4
#> + Unrelated8   1   339.50 -2138.1
#> + X1           1   339.54 -2138.0
#> + Unrelated15  1   339.88 -2137.0
#> + Unrelated20  1   339.91 -2136.9
#> + Unrelated19  1   339.94 -2136.8
#> <none>             340.71 -2136.6
#> + Unrelated7   1   340.18 -2136.1
#> + Unrelated11  1   340.30 -2135.8
#> + Unrelated2   1   340.48 -2135.2
#> + Unrelated1   1   340.53 -2135.1
#> + Unrelated4   1   340.54 -2135.1
#> + Unrelated6   1   340.56 -2135.0
#> + Unrelated12  1   340.58 -2135.0
#> + Unrelated10  1   340.62 -2134.8
#> + Unrelated16  1   340.66 -2134.7
#> + Unrelated13  1   340.66 -2134.7
#> + Unrelated17  1   340.67 -2134.7
#> + Unrelated5   1   340.68 -2134.7
#> + Unrelated9   1   340.69 -2134.6
#> + Unrelated18  1   340.70 -2134.6
#> + Unrelated14  1   340.70 -2134.6
#> + Unrelated3   1   340.70 -2134.6
#> - X3           1   345.07 -2126.0
#>
#> Step:  AIC=-2138.46
#> Y ~ X3 + X2
#>
#>               Df Deviance     AIC
#> + Unrelated8   1   338.21 -2139.9
#> + X1           1   338.28 -2139.7
#> + Unrelated19  1   338.43 -2139.2
#> + Unrelated15  1   338.58 -2138.8
#> + Unrelated20  1   338.58 -2138.8
#> <none>             339.40 -2138.5
#> + Unrelated7   1   338.90 -2137.9
#> + Unrelated11  1   339.01 -2137.6
#> + Unrelated2   1   339.16 -2137.1
#> + Unrelated4   1   339.20 -2137.0
#> + Unrelated1   1   339.22 -2136.9
#> + Unrelated12  1   339.25 -2136.9
#> + Unrelated6   1   339.25 -2136.9
#> + Unrelated10  1   339.29 -2136.8
#> - X2           1   340.71 -2136.7
#> + Unrelated16  1   339.34 -2136.6
#> + Unrelated17  1   339.34 -2136.6
#> + Unrelated13  1   339.35 -2136.6
#> + Unrelated9   1   339.38 -2136.5
#> + Unrelated5   1   339.38 -2136.5
#> + Unrelated18  1   339.39 -2136.5
#> + Unrelated14  1   339.39 -2136.5
#> + Unrelated3   1   339.39 -2136.5
#> - X3           1   343.81 -2127.7
#>
#> Step:  AIC=-2139.96
#> Y ~ X3 + X2 + Unrelated8
#>
#>               Df Deviance     AIC
#> + X1           1   337.03 -2141.4
#> + Unrelated19  1   337.29 -2140.6
#> + Unrelated20  1   337.45 -2140.2
#> + Unrelated15  1   337.46 -2140.1
#> <none>             338.21 -2140.0
#> + Unrelated7   1   337.73 -2139.3
#> + Unrelated11  1   337.78 -2139.2
#> + Unrelated4   1   337.96 -2138.7
#> + Unrelated2   1   338.00 -2138.6
#> - Unrelated8   1   339.40 -2138.5
#> + Unrelated1   1   338.03 -2138.5
#> + Unrelated12  1   338.08 -2138.3
#> + Unrelated6   1   338.10 -2138.3
#> + Unrelated10  1   338.10 -2138.3
#> - X2           1   339.50 -2138.2
#> + Unrelated17  1   338.15 -2138.1
#> + Unrelated13  1   338.15 -2138.1
#> + Unrelated16  1   338.16 -2138.1
#> + Unrelated5   1   338.18 -2138.0
#> + Unrelated14  1   338.19 -2138.0
#> + Unrelated9   1   338.20 -2138.0
#> + Unrelated3   1   338.20 -2138.0
#> + Unrelated18  1   338.20 -2138.0
#> - X3           1   342.65 -2129.1
#>
#> Step:  AIC=-2141.47
#> Y ~ X3 + X2 + Unrelated8 + X1
#>
#>               Df Deviance     AIC
#> + Unrelated19  1   336.14 -2142.1
#> + Unrelated20  1   336.23 -2141.8
#> + Unrelated15  1   336.29 -2141.6
#> <none>             337.03 -2141.5
#> + Unrelated11  1   336.53 -2140.9
#> + Unrelated7   1   336.57 -2140.8
#> + Unrelated2   1   336.78 -2140.2
#> + Unrelated4   1   336.82 -2140.1
#> - X1           1   338.21 -2140.0
#> + Unrelated1   1   336.85 -2140.0
#> + Unrelated12  1   336.87 -2139.9
#> - X2           1   338.26 -2139.9
#> - Unrelated8   1   338.28 -2139.8
#> + Unrelated6   1   336.92 -2139.8
#> + Unrelated10  1   336.94 -2139.7
#> + Unrelated16  1   336.95 -2139.7
#> + Unrelated13  1   336.96 -2139.7
#> + Unrelated17  1   336.96 -2139.7
#> + Unrelated9   1   337.00 -2139.6
#> + Unrelated5   1   337.01 -2139.5
#> + Unrelated14  1   337.01 -2139.5
#> + Unrelated3   1   337.01 -2139.5
#> + Unrelated18  1   337.02 -2139.5
#> - X3           1   341.43 -2130.6
#>
#> Step:  AIC=-2142.1
#> Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19
#>
#>               Df Deviance     AIC
#> + Unrelated15  1   335.37 -2142.3
#> + Unrelated20  1   335.38 -2142.3
#> <none>             336.14 -2142.1
#> + Unrelated7   1   335.62 -2141.6
#> + Unrelated11  1   335.63 -2141.6
#> - Unrelated19  1   337.03 -2141.5
#> + Unrelated2   1   335.87 -2140.9
#> + Unrelated4   1   335.91 -2140.8
#> - X1           1   337.29 -2140.7
#> + Unrelated1   1   335.96 -2140.6
#> - Unrelated8   1   337.35 -2140.6
#> + Unrelated12  1   336.00 -2140.5
#> + Unrelated6   1   336.03 -2140.4
#> + Unrelated17  1   336.06 -2140.3
#> + Unrelated16  1   336.07 -2140.3
#> + Unrelated13  1   336.07 -2140.3
#> + Unrelated10  1   336.09 -2140.2
#> + Unrelated5   1   336.11 -2140.2
#> + Unrelated9   1   336.11 -2140.2
#> + Unrelated14  1   336.13 -2140.1
#> + Unrelated3   1   336.13 -2140.1
#> + Unrelated18  1   336.14 -2140.1
#> - X2           1   337.55 -2140.0
#> - X3           1   340.60 -2131.1
#>
#> Step:  AIC=-2142.38
#> Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15
#>
#>               Df Deviance     AIC
#> + Unrelated20  1   334.59 -2142.7
#> <none>             335.37 -2142.4
#> - Unrelated15  1   336.14 -2142.1
#> + Unrelated7   1   334.78 -2142.1
#> + Unrelated11  1   334.89 -2141.8
#> - Unrelated19  1   336.29 -2141.7
#> + Unrelated2   1   335.12 -2141.1
#> - X1           1   336.50 -2141.1
#> - Unrelated8   1   336.51 -2141.0
#> + Unrelated1   1   335.18 -2140.9
#> + Unrelated4   1   335.19 -2140.9
#> + Unrelated12  1   335.21 -2140.9
#> + Unrelated6   1   335.27 -2140.7
#> + Unrelated17  1   335.30 -2140.6
#> + Unrelated16  1   335.30 -2140.6
#> + Unrelated13  1   335.31 -2140.6
#> + Unrelated10  1   335.33 -2140.5
#> + Unrelated9   1   335.35 -2140.4
#> + Unrelated5   1   335.35 -2140.4
#> + Unrelated14  1   335.36 -2140.4
#> + Unrelated3   1   335.36 -2140.4
#> + Unrelated18  1   335.37 -2140.4
#> - X2           1   336.77 -2140.2
#> - X3           1   339.91 -2131.0
#>
#> Step:  AIC=-2142.72
#> Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15 + Unrelated20
#>
#>               Df Deviance     AIC
#> <none>             334.59 -2142.7
#> + Unrelated7   1   333.95 -2142.6
#> - Unrelated20  1   335.37 -2142.4
#> - Unrelated15  1   335.38 -2142.4
#> + Unrelated11  1   334.09 -2142.2
#> - Unrelated19  1   335.48 -2142.1
#> - Unrelated8   1   335.66 -2141.6
#> + Unrelated2   1   334.31 -2141.5
#> + Unrelated4   1   334.38 -2141.3
#> - X1           1   335.75 -2141.3
#> + Unrelated1   1   334.40 -2141.3
#> + Unrelated6   1   334.44 -2141.2
#> + Unrelated12  1   334.44 -2141.2
#> + Unrelated17  1   334.49 -2141.0
#> + Unrelated13  1   334.53 -2140.9
#> + Unrelated16  1   334.53 -2140.9
#> + Unrelated10  1   334.54 -2140.9
#> + Unrelated14  1   334.56 -2140.8
#> + Unrelated9   1   334.56 -2140.8
#> + Unrelated5   1   334.58 -2140.8
#> + Unrelated3   1   334.58 -2140.7
#> + Unrelated18  1   334.58 -2140.7
#> - X2           1   335.99 -2140.6
#> - X3           1   339.08 -2131.5
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15 +
#>     Unrelated20, family = inverse.gaussian(link = "1/mu^2"),
#>     data = simdata)
#>
#> Deviance Residuals:
#>      Min        1Q    Median        3Q       Max
#> -2.36161  -0.48520  -0.09361   0.29986   1.65164
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   3.2105     3.3207   0.967 0.333882
#> X3            3.0476     0.8385   3.635 0.000293 ***
#> X2            1.7017     0.8362   2.035 0.042112 *
#> Unrelated8   -1.4773     0.8320  -1.776 0.076118 .
#> X1            1.5461     0.8362   1.849 0.064758 .
#> Unrelated19   1.3528     0.8358   1.618 0.105880
#> Unrelated15  -1.3128     0.8585  -1.529 0.126531
#> Unrelated20   1.2960     0.8523   1.520 0.128707
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3392529)
#>
#>     Null deviance: 345.07  on 999  degrees of freedom
#> Residual deviance: 334.59  on 992  degrees of freedom
#> AIC: -2142.7
#>
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

Four unrelated features (Unrelated8, Unrelated19, Unrelated15, and Unrelated20) made it into the model, and Unrelated8 was even selected before X1. Still, all true predictors are in the model.

### Summary

Forward stepwise search provides a computationally fast way to select features. When the related features made up half of the candidate features, stepwise search selected exactly the correct model for both small and large N. When there were many unrelated features, it still found all related features but also erroneously selected a few unrelated ones (five with large N, four with small N).