When fitting a multiple linear regression model to data, a natural question is whether the model can be simplified by excluding some of the variables. There are automatic procedures for undertaking these tests, but some people prefer a more manual approach to variable selection rather than pressing a button and taking whatever comes out.
When there are a large number of variables it is tedious to go through each one in turn and decide whether the model can be simplified to a more parsimonious one. In R the function dropterm in the MASS package removes some of this effort by considering the outcome of dropping each model term from the current model, one at a time.
To illustrate this, consider the cpus data set in the MASS package, which contains a relative performance measure and characteristics of 209 CPUs. We load the package first to make the data available:
library(MASS)
We first fit a linear model with six explanatory variables:
cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
The function dropterm requires a fitted model, which we saved in the last command, and optionally a test to use to compare the initial model with each of the possible alternative models that have one less variable. Here we choose to perform an F test:
> dropterm(cpu.mod1, test = "F")
Single term deletions

Model:
perf ~ syct + mmin + mmax + cach + chmin + chmax
       Df Sum of Sq    RSS    AIC F Value     Pr(F)
<none>               727002 1718.3
syct    1     27995  754997 1724.2   7.779  0.005793 **
mmin    1    252211  979213 1778.5  70.078 9.416e-15 ***
mmax    1    271147  998149 1782.5  75.339 1.326e-15 ***
cach    1     75962  802964 1737.0  21.106 7.640e-06 ***
chmin   1       358  727360 1716.4   0.100  0.752632
chmax   1    163396  890398 1758.6  45.400 1.640e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The output from the function call indicates that we could exclude the chmin variable, re-fit the model and then continue with the same checking process.
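One way to carry out the re-fit is with the update function, for example by removing chmin from the model formula and calling dropterm again on the reduced model:

# drop chmin from the model and repeat the single term deletions
cpu.mod2 = update(cpu.mod1, . ~ . - chmin)
dropterm(cpu.mod2, test = "F")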
Update:
The dropterm function takes each variable in turn and considers what the change in the residual sum of squares would be if that variable were excluded from the model. There is a link between this F test and the t test that appears as part of the model summary: when a single term is dropped, the F statistic is the square of the corresponding t statistic, because of the relationship between the two distributions. For this model we would have:
> summary(cpu.mod1)

Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax,
    data = cpus)

Residuals:
     Min       1Q   Median       3Q      Max
-195.841  -25.169    5.409   26.528  385.749

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 **
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,	Adjusted R-squared: 0.8609
F-statistic: 215.5 on 6 and 202 DF,  p-value: < 2.2e-16
Let us consider the syct variable. The t statistic in the model summary is 2.789, and if we square this value we get 7.779, which is the F statistic produced by the dropterm function.
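This can be checked directly in R. As a rough sketch, the squared t statistic for syct can be pulled out of the coefficient table, and the same F statistic can be rebuilt from the residual sums of squares of the full model and the model with syct removed:

# square of the t statistic for syct from the coefficient table
t.syct = coef(summary(cpu.mod1))["syct", "t value"]
t.syct^2   # approximately 7.779

# the same F statistic from the change in residual sum of squares
cpu.mod.nosyct = update(cpu.mod1, . ~ . - syct)
rss.full = deviance(cpu.mod1)         # residual sum of squares, full model
rss.red = deviance(cpu.mod.nosyct)    # residual sum of squares, without syct
(rss.red - rss.full) / (rss.full / df.residual(cpu.mod1))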
Related posts:
- The update function for simplifying model selection.
- Data analysis using simple linear regression models.
- Including factors in a regression model via analysis of covariance.
Hello
And thank you for this post. Could you give some details on the dropterm output? I suspect that this is not quite the same as the coefficients from summary(), and I have trouble understanding what the columns/rows actually represent.
Hi,
I’ve updated the post to hopefully address your comments. In the dropterm output we are looking at the goodness of fit of two models, whereas the model summary is testing the significance (or not) of a particular variable in the model. Unsurprisingly, this amounts to answering the same question.
Hope this helps.
Hi,
Could you give some details on how I could go about using dropterm in a glm that uses many iterations? I have a huge dataset, and the only way I can identify non-significant terms in a glm is by splitting the data into 10% random samples and then repeating the process. I am using a for loop for the iterations and would like to reduce the model using the dropterm function, but I am not sure how I can use the outputs other than creating 10000 separate outputs, which is infeasible to analyse. Any tips would be brilliant.
A quick thought – there is a biglm package in R that might do the trick. By this I mean using all of the data at the same time rather than having to fit separate models to each subset.
Could you describe whether the data are huge in the number of rows or the number of columns, or possibly both? How many subsets are you having to divide the data into?
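In case it helps, here is a rough sketch of the biglm idea; the data frame name, variable names and family below are placeholders to be replaced with your own:

library(biglm)

# fit the glm to the whole data set in chunks rather than to 10% subsamples;
# big.df, y and x1-x3 are hypothetical names for your data and variables
big.mod = bigglm(y ~ x1 + x2 + x3, data = big.df,
                 family = binomial(), chunksize = 10000)
summary(big.mod)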
Is the dropterm function possible with a linear mixed model too? I fitted my model with the nlme library, but I don't know how I can obtain my best model. One option is to follow the approach proposed in the Pinheiro paper (Model building using covariates in nonlinear mixed-effects models) using a forward stepwise approach.
What is your opinion?
Thanks in advance.
Martí Casals.
I haven't used the dropterm function with any models other than those produced by lm, so I can't comment on whether it works correctly with them. My impression is that mixed effects models require more user intervention, as the theory isn't as straightforward as for standard linear regression.
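For what it is worth, one manual route with nlme is to fit nested models by maximum likelihood and compare them with a likelihood ratio test. A small sketch using the Orthodont data that ships with nlme (this is not the dropterm approach from the post, just a generic manual comparison):

library(nlme)

# fit the full and reduced models by maximum likelihood so the fixed effects
# can be compared, then carry out a likelihood ratio test
full.mod = lme(distance ~ age + Sex, random = ~ 1 | Subject,
               data = Orthodont, method = "ML")
red.mod = lme(distance ~ age, random = ~ 1 | Subject,
              data = Orthodont, method = "ML")
anova(full.mod, red.mod)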
Hi Ralph,
Thank you so much! In the end I used the approach proposed in the Pinheiro paper (Model building using covariates in nonlinear mixed-effects models) with a forward stepwise approach.
Martí
Good to hear that you managed to do what you wanted with nlme Marti!
Did you use addterm/dropterm or go for a manual approach of considering each model change individually?
In the end I followed the strategy proposed by Pinheiro and Bates (2000) to select the final model. I used plots of the estimated random effects against the candidate covariates to identify interesting patterns; a pattern at a given step would indicate that the covariate should be included in the model. However, I also used the dropterm/addterm functions and I remember that the results were similar.
Martí
Please guide me on variable selection and variable reduction techniques in R.