When a data set has only a small number of variables, we can easily take a manual approach to identifying a good set of variables and the form they should take in our statistical model. When there is a large number of potentially important variables, manual variable selection soon becomes a time-consuming effort. In that case we may consider using automatic subset selection tools to remove some of the burden of the task.
It should be noted that there is some disagreement about whether it is desirable to use an automated method but this post will focus on the mechanics of doing it rather than the debate about whether to be doing it at all.
The R package leaps has a function regsubsets that can be used for best subsets, forward selection and backwards elimination depending on which approach is considered most appropriate for the application under consideration.
In a previous post we used data on CPU performance to illustrate the variable selection process, and the same data (the cpus data frame in the MASS package) is used here. We load the required packages:
> require(leaps)
> require(MASS)
First we consider best subsets selection, which finds the best subset of each size up to a chosen maximum, say four variables for illustrative purposes (set with the nvmax argument). The formula specifies the largest possible model, which in this example has six variables:
> reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, nvmax = 4)
A summary for the output from this function is shown here:
> summary(reg1)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, nvmax = 4)
6 Variables (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  " "  "*"  " "  " "   " "
2  ( 1 ) " "  " "  "*"  "*"  " "   " "
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"
The function regsubsets identifies the variables mmin, mmax, cach and chmax as the best four.
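The selected variables can also be pulled out of the fitted object rather than read off the printed table. The following is a minimal sketch, assuming the leaps and MASS packages are installed; the names s1, sel4 and best4 are illustrative and do not come from the original post:

```r
library(leaps)  # provides regsubsets()
library(MASS)   # provides the cpus data frame

reg1 <- regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
                   data = cpus, nvmax = 4)
s1 <- summary(reg1)

# s1$which is a logical matrix: one row per subset size,
# one column per term (including the intercept)
sel4 <- s1$which[4, ]
best4 <- setdiff(names(sel4)[sel4], "(Intercept)")
print(best4)

# Least-squares coefficients of the best four-variable model
print(coef(reg1, id = 4))
```

Working from the which matrix makes it straightforward to pass the chosen variables on to lm for a closer look at the model.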
Alternatively we could perform a backwards elimination and the function will indicate the best subset of a particular size, from one to six variables in this example:
> reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, method = "backward")
> summary(reg2)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, method = "backward")
6 Variables (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  "*"  " "  " "  " "   " "
2  ( 1 ) " "  "*"  " "  " "  " "   "*"
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"
5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"
6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"
For this example the subset of four variables is the same as the one found by the best subsets approach. The third approach is forward selection:
> reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, method = "forward")
> summary(reg3)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, method = "forward")
6 Variables (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: forward
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  " "  "*"  " "  " "   " "
2  ( 1 ) " "  " "  "*"  "*"  " "   " "
3  ( 1 ) " "  "*"  "*"  "*"  " "   " "
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"
5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"
6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"
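For this data set, which has only six variables, the three methods agree on the four-variable subset (mmin, mmax, cach and chmax). The output above does show some divergence at smaller sizes: the best three-variable subset is mmin, mmax and chmax under best subsets and backwards elimination but mmin, mmax and cach under forward selection, and backwards elimination picks mmin rather than mmax as the single best variable.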
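A quick way to check where the three search methods agree is to compare the selected variables size by size. This is only a sketch under the same assumptions as the rest of the post (leaps and MASS installed); the helper chosen and the object names best, back, fwd and agree are hypothetical and introduced here for illustration:

```r
library(leaps)
library(MASS)

full <- perf ~ syct + mmin + mmax + cach + chmin + chmax

# nvmax = 6 so that every method reports subsets of size one to six
best <- regsubsets(full, data = cpus, nvmax = 6)  # exhaustive search is the default
back <- regsubsets(full, data = cpus, nvmax = 6, method = "backward")
fwd  <- regsubsets(full, data = cpus, nvmax = 6, method = "forward")

# Hypothetical helper: sorted names of the variables selected at a given size
chosen <- function(fit, size) {
  w <- summary(fit)$which[size, ]
  sort(setdiff(names(w)[w], "(Intercept)"))
}

# TRUE where all three methods select the same variables at that size
agree <- sapply(1:6, function(k) {
  identical(chosen(best, k), chosen(back, k)) &&
    identical(chosen(best, k), chosen(fwd, k))
})
print(data.frame(size = 1:6, all_methods_agree = agree))
```

The size-six row is always TRUE because every method ends at the full model; the interesting rows are the smaller sizes, where the greedy searches can part ways with the exhaustive one.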
Second the comment about the danger of these methods, e.g.
http://www.stata.com/support/faqs/stat/stepwise.html
Can you point to any relatively modern sources that suggest that this is a *good* idea … ? (Although http://dx.doi.org/10.1111/j.1461-0248.2009.01361.x shows that for some ecological data sets it’s not too bad. If we’re just going to count votes, though, the anti-selectionists are much louder than the pro-selectionists …)
Opinion, as is often the case in technical discussions, seems to be polarised between manual and automatic variable selection methods. For me, one of the main issues with automatic selection methods is that they can be applied without too much thought and the answer taken as fact without due consideration of whether a good model has been identified.
Typo in the third approach.
The text says to use “forward” but the code is for “backward”. (Edited, and thanks for the spot.)