Variable selection using automatic methods

May 22nd, 2010

When a data set has a small number of variables, we can easily take a manual approach to identifying a good set of variables and the form they take in our statistical model. When there are a large number of potentially important variables, manual variable selection soon becomes time-consuming. In this case we may consider using automatic subset selection tools to remove some of the burden of the task.

It should be noted that there is some disagreement about whether it is desirable to use an automated method but this post will focus on the mechanics of doing it rather than the debate about whether to be doing it at all.

The R package leaps has a function regsubsets that can be used for best subsets, forward selection and backwards elimination depending on which approach is considered most appropriate for the application under consideration.

In a previous post we considered using data on CPU performance to illustrate the variable selection process. We load the required packages (the cpus data set is provided by MASS):

> require(leaps)
> require(MASS)

First we consider best subsets selection. We specify the largest possible model, which in this example has six variables, and use the nvmax argument to limit the search to subsets of at most four variables for illustrative purposes:

> reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)

A summary of the output from this function is shown here:

> summary(reg1)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, nvmax = 4)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  " "  "*"  " "  " "   " "  
2  ( 1 ) " "  " "  "*"  "*"  " "   " "  
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"

The function regsubsets identifies the variables mmin, mmax, cach and chmax as the best four.
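To work with the selected model rather than just read it off the table, the summary object can be queried directly. The following is a minimal sketch, assuming the leaps and MASS packages are installed; the object names s1, best4 and crit are made up for illustration and do not appear in the original post:

```r
library(leaps)
library(MASS)   # provides the cpus data

reg1 <- regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
                   data = cpus, nvmax = 4)
s1 <- summary(reg1)

# s1$which is a logical matrix: one row per subset size,
# TRUE where a variable is included (first column is the intercept)
best4 <- names(which(s1$which[4, -1]))
best4

# Coefficients of the best four-variable model
coef(reg1, 4)

# Fit criteria for each subset size, useful when choosing between sizes
crit <- data.frame(size = 1:4, adjr2 = s1$adjr2, cp = s1$cp, bic = s1$bic)
crit
```

Which criterion to use when choosing between sizes is a modelling decision; the table simply puts the usual candidates next to each other.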

Alternatively, we could perform backward elimination, in which case the function reports the subset retained at each size, from one to six variables in this example:

> reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = "backward")
> summary(reg2)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, method = "backward")
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  "*"  " "  " "  " "   " "  
2  ( 1 ) " "  "*"  " "  " "  " "   "*"  
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"  
5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"  
6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"

The subset of four variables is the same for this example as with the best subsets approach. The third approach is forward selection:

> reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = "forward")
> summary(reg3)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, method = "forward")
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: forward
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  " "  "*"  " "  " "   " "  
2  ( 1 ) " "  " "  "*"  "*"  " "   " "  
3  ( 1 ) " "  "*"  "*"  "*"  " "   " "  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"  
5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"  
6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"

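For this data set, with only six candidate variables, the three methods agree on the four-variable subset, although the outputs above show that they differ for some of the smaller subsets (for example, backward elimination keeps mmin as its single variable where the other two methods choose mmax).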
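To check where the three methods agree, the chosen subsets can be tabulated side by side. This is a rough sketch, assuming leaps and MASS are installed; the names f, fits and chosen are hypothetical helpers introduced here, and the default nvmax is used so that every method reports all six sizes:

```r
library(leaps)
library(MASS)   # provides the cpus data

f <- perf ~ syct + mmin + mmax + cach + chmin + chmax
methods <- c("exhaustive", "backward", "forward")

# Fit one regsubsets object per selection method
fits <- lapply(methods, function(m) regsubsets(f, data = cpus, method = m))
names(fits) <- methods

# For each method, describe the subset chosen at each size as a string
chosen <- sapply(fits, function(fit) {
  w <- summary(fit)$which[, -1, drop = FALSE]   # drop the intercept column
  apply(w, 1, function(r) paste(colnames(w)[r], collapse = " + "))
})

# Rows are subset sizes, columns are methods
chosen

# Sizes at which all three methods pick the same variables
agree <- apply(chosen, 1, function(r) length(unique(r)) == 1)
agree
```

With many more candidate variables the exhaustive search becomes expensive, so in practice the comparison may be limited to the two stepwise methods.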

Posted by Ralph at 12:15 pm

3 responses to “Variable selection using automatic methods”

Ben Bolker says:

May 22, 2010 at 5:59 pm

Second the comment about the danger of these methods, e.g.

http://www.stata.com/support/faqs/stat/stepwise.html

Can you point to any relatively modern sources that suggest that this is a *good* idea … ? (Although http://dx.doi.org/10.1111/j.1461-0248.2009.01361.x shows that for some ecological data sets it’s not too bad. If we’re just going to count votes, though, the anti-selectionists are much louder than the pro-selectionists …)

Ralph says:

May 25, 2010 at 9:39 am

Opinion, as is often the case in technical discussions, seems to be polarised between manual and automatic variable selection methods. For me, one of the main issues with automatic selection methods is that they can be applied without too much thought and the answer taken as fact without due consideration of whether a good model has been identified.

Bhupendrasinh Thakre says:

October 14, 2012 at 7:38 pm

Typo in Third Approach.
You are mentioning to use “Forward” but code is for “Backward” (Edit and thanks for the spot).