There are situations in regression modelling where robust methods could be considered to handle unusual observations that do not follow the general trend of the data set. There are various packages in R that provide robust statistical methods which are summarised on the CRAN Robust Task View.
As an example of using robust statistical estimation in a linear regression framework consider the CPUs data that was used in previous posts on linear regression and variable selection. For this data we could fit a model with six variables using least squares and also with a fast MM-estimator from the robustbase package.
First step is to make the functions and data available for analysis:
require(MASS) require(robustbase) |
The linear model using least squares is fitted as follows:
> cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus) |
The summary for this model:
> summary(cpu.mod1) Call: lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus) Residuals: Min 1Q Median 3Q Max -195.841 -25.169 5.409 26.528 385.749 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.590e+01 8.045e+00 -6.948 4.99e-11 *** syct 4.886e-02 1.752e-02 2.789 0.00579 ** mmin 1.529e-02 1.827e-03 8.371 9.42e-15 *** mmax 5.571e-03 6.418e-04 8.680 1.33e-15 *** cach 6.412e-01 1.396e-01 4.594 7.64e-06 *** chmin -2.701e-01 8.557e-01 -0.316 0.75263 chmax 1.483e+00 2.201e-01 6.738 1.64e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 59.99 on 202 degrees of freedom Multiple R-squared: 0.8649, Adjusted R-squared: 0.8609 F-statistic: 215.5 on 6 and 202 DF, p-value: < 2.2e-16 |
The linear model using an MM-estimator is fitted as follows:
cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, control = lmrob.control(max.it = 100)) |
Note that we need to increase the default number of iterations (50) to allow the routine to converge to a solution. The summary for this model:
> summary(cpu.robmod1) Call: lmrob(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus, control = lmrob.control(max.it = 100)) Weighted Residuals: Min 1Q Median 3Q Max -144.0045 -9.4554 0.7691 13.3757 759.6953 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.6634297 6.1697645 -0.594 0.553329 syct 0.0063112 0.0043877 1.438 0.151868 mmin 0.0098798 0.0034726 2.845 0.004897 ** mmax 0.0024463 0.0004525 5.407 1.80e-07 *** cach 0.8702102 0.2551245 3.411 0.000782 *** chmin 2.4078436 1.3319413 1.808 0.072130 . chmax 0.1016861 0.1494902 0.680 0.497145 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Robust residual standard error: 19.27 Convergence in 65 IRWLS iterations Robustness weights: 15 observations c(1,8,9,10,31,32,96,97,98,153,154,156,169,199,200) are outliers with |weight| = 0 ( < 0.00048); 14 weights are ~= 1. The remaining 180 ones are summarized as Min. 1st Qu. Median Mean 3rd Qu. Max. 0.002472 0.831300 0.963700 0.849800 0.989500 0.998900 Algorithmic parameters: tuning.chi bb tuning.psi refine.tol rel.tol 1.5476400 0.5000000 4.6850610 0.0000001 0.0000001 nResample max.it groups n.group best.r.s k.fast.s k.max trace.lev compute.rd 500 100 5 400 2 1 200 0 0 seed : int(0) |
The two models differ in the variables that are considered important and the output from the lmrob function provides a summary of the weights that have been allocated to the data. A total of fifteen of the data points have been allocated very small weights by the fitting algorithm.
Related posts:
- The update function for simplifying model selection.
- Data analysis using simple linear regression models.
- The dropterm function for simplifying models.