11.4 - Some Automated Variable Selection Examples

Let's take a look at a few more examples to see how the best subsets and stepwise regression procedures assist us in identifying a final regression model.

Example #1

Let's return one more time to the cement data example (cement.txt). Recall that the stepwise regression procedure:

minitab output

yielded the final stepwise model with y as the response and x₁ and x₂ as predictors.

The best subsets regression procedure:

minitab output

yields various models depending on the different criteria:

Based on the R²-value criterion, the "best" model is the model with the two predictors x₁ and x₂.
Based on the adjusted R²-value and MSE criteria, the "best" model is the model with the three predictors x₁, x₂, and x₄.
Based on the C_p criterion, there are three possible "best" models: the model containing x₁ and x₂; the model containing x₁, x₂, and x₃; and the model containing x₁, x₂, and x₄.
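To make the criteria above concrete, here is a minimal Python sketch of what a best subsets search computes for every candidate model. It is not Minitab's implementation. The data are synthetic (not the cement data), and the function names `sse` and `best_subsets` are made up for illustration. Here p counts the parameters in a model, including the intercept.

```python
from itertools import combinations

import numpy as np


def sse(X, y):
    """Error sum of squares for a least-squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def best_subsets(X, y, names):
    """Return one dict of criteria for every non-empty subset of predictors."""
    n, k = X.shape
    ssto = float(np.sum((y - y.mean()) ** 2))
    mse_all = sse(X, y) / (n - k - 1)  # MSE of the model with all k predictors
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # number of parameters, counting the intercept
            sse_p = sse(X[:, list(cols)], y)
            rows.append({
                "predictors": tuple(names[c] for c in cols),
                "p": p,
                "r2": 1 - sse_p / ssto,
                "adj_r2": 1 - (sse_p / (n - p)) / (ssto / (n - 1)),
                "mse": sse_p / (n - p),
                "cp": sse_p / mse_all - (n - 2 * p),
            })
    return rows


# Synthetic data (NOT the cement data): y truly depends on x1 and x2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=40)

results = best_subsets(X, y, ["x1", "x2", "x3", "x4"])

# Show the five models with the smallest C_p.
for row in sorted(results, key=lambda r: r["cp"])[:5]:
    print(row["predictors"], round(row["r2"], 3), round(row["adj_r2"], 3),
          round(row["mse"], 3), round(row["cp"], 2))
```

Two things are worth noticing in the output of a sketch like this. R² can only go up as predictors are added, so it is read by looking for where it levels off rather than for its maximum. And the adjusted R² and MSE criteria always point to the same model, because adjusted R² is just a rescaling of MSE.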

So, which model should we "go with"? That's where the final step, the refining step, comes into play. In the refining step, we evaluate each of the models identified by the best subsets and stepwise procedures to see if there is a reason to select one of the models over the others. This step may also involve adding interaction or quadratic terms, as well as transforming the response and/or predictors. And, certainly, when selecting a final model, don't forget why you are performing the research to begin with. The reason may make the choice of the model obvious.

Well, let's evaluate the three remaining candidate models. We don't have to go very far with the model containing the predictors x₁, x₂ and x₄:

minitab output

The variance inflation factors of 18.78 and 18.94 for x₂ and x₄ indicate that the model exhibits substantial multicollinearity. You may recall that the predictors x₂ and x₄ are strongly negatively correlated—indeed, r = -0.973.
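As a rough check on those numbers: if x₂ and x₄ were the only two predictors, each would have a variance inflation factor of 1/(1 − r²) = 1/(1 − 0.973²) ≈ 18.8, already close to the reported values. The Python sketch below shows the general calculation, in which each predictor is regressed on all the others. It uses synthetic predictors (not the cement data), and the function name `vifs` is made up for illustration.

```python
import numpy as np


def vifs(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2_j = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(float(1 / (1 - r2_j)))
    return out


# Synthetic predictors (NOT the cement data): b is built to be strongly
# negatively correlated with a, while c is unrelated to both.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = -a + rng.normal(scale=0.25, size=200)
c = rng.normal(size=200)

three = vifs(np.column_stack([c, a, b]))
print([round(v, 2) for v in three])  # small VIF for c, large VIFs for a and b

# With only two predictors, both VIFs reduce to 1 / (1 - r^2).
r = np.corrcoef(a, b)[0, 1]
two = vifs(np.column_stack([a, b]))
print(round(float(1 / (1 - r ** 2)), 2), [round(v, 2) for v in two])
```

The two-predictor identity is also why a model with exactly two predictors always reports the same variance inflation factor for both of them.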

While not perfect, the variance inflation factors for the model containing the predictors x₁, x₂ and x₃:

minitab output

are much better (smaller) than the previous variance inflation factors. But, unless there is a good scientific reason to go with this larger model, it probably makes more sense to go with the smaller, simpler model containing just the two predictors x₁ and x₂:

minitab output

For this model, the variance inflation factors are quite satisfactory (both 1.06), the adjusted R²-value (97.44%) is large, and the residual analysis yields no concerns. That is, the residuals versus fits plot:

plot

suggests that the relationship is indeed linear and that the variances of the error terms are constant. Furthermore, the normal probability plot:

normal probability plot

suggests that the error terms may not be normally distributed, but the Anderson-Darling normality test p-value is not significant at a 0.05 significance level. The regression model with y as the response and x₁ and x₂ as the predictors has been evaluated fully and appears to be ready to answer the researcher's questions.
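For readers working outside Minitab, the following Python sketch shows where the points in the two residual plots come from. It uses synthetic data (not the cement data), and it does not carry out the Anderson-Darling test. The function name `residual_diagnostics` and the plotting positions (i − 3/8)/(n + 1/4) are assumptions made for illustration; other plotting-position conventions exist.

```python
from statistics import NormalDist

import numpy as np


def residual_diagnostics(X, y):
    """Return fitted values, residuals, and normal probability plot coordinates."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fits = design @ beta
    resid = y - fits
    n = len(y)
    # Assumed plotting positions (i - 3/8) / (n + 1/4); other conventions exist.
    probs = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
    normal_scores = np.array([NormalDist().inv_cdf(q) for q in probs])
    return fits, resid, np.sort(resid), normal_scores


# Synthetic data (NOT the cement data) with normal, constant-variance errors.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = 1 + 0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=1.0, size=50)

fits, resid, sorted_resid, normal_scores = residual_diagnostics(X, y)

# Residuals versus fits: plot resid against fits; look for curvature or fanning.
# Normal probability plot: plot normal_scores against sorted_resid; look for a line.
straightness = float(np.corrcoef(sorted_resid, normal_scores)[0, 1])
print(round(straightness, 3))  # values near 1 mean the points fall close to a line
```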

Example #2

Let's return to the brain size and body size study, in which the researchers were interested in determining whether or not a person's brain size and body size are predictive of his or her intelligence. The researchers (Willerman et al., 1991) collected the following data (iqsize.txt) on a sample of n = 38 college students:

Response (y): Performance IQ scores (PIQ) from the revised Wechsler Adult Intelligence Scale. This variable served as the investigators' measure of the individual's intelligence.
Potential predictor (x₁): Brain size based on the count obtained from MRI scans (given as count/10,000).
Potential predictor (x₂): Height in inches.
Potential predictor (x₃): Weight in pounds.

A matrix plot of the resulting data looks like:

matrix plot for IQ

The stepwise regression procedure:

minitab output

yielded the final stepwise model with PIQ as the response and Brain and Height as predictors. In this case, the best subsets regression procedure:

minitab output

yields the same model regardless of criterion used:

Based on the R²-value criterion, the "best" model is the model with the two predictors Brain and Height.
Based on the adjusted R²-value and MSE criteria, the "best" model is the model with the two predictors Brain and Height.
Based on the C_p criterion, the "best" model is the model with the two predictors Brain and Height.

Well, at least in this case, we have only one model to evaluate further:

For this model, the variance inflation factors are quite satisfactory (both 1.53), the adjusted R²-value (25.46%) is not great but can't get any better with these data, and the residual analysis yields no concerns. That is, the residuals versus fits plot:

plot

suggests that the relationship is indeed linear and that the variances of the error terms are constant. The researcher might want to investigate the one outlier, however. The normal probability plot:

suggests that the error terms are normally distributed. The regression model with PIQ as the response and Brain and Height as the predictors has been evaluated fully and appears to be ready to answer the researchers' questions.

Example #3

Let's return to the blood pressure study in which we observed the following data (bloodpress.txt) on 20 individuals with hypertension:

blood pressure (y = BP, in mm Hg)
age (x₁ = Age, in years)
weight (x₂ = Weight, in kg)
body surface area (x₃ = BSA, in sq m)
duration of hypertension (x₄ = Dur, in years)
basal pulse (x₅ = Pulse, in beats per minute)
stress index (x₆ = Stress)

The researchers were interested in determining if a relationship exists between blood pressure and age, weight, body surface area, duration, pulse rate and/or stress level.

The matrix plot of BP, Age, Weight, and BSA looks like:

matrix plot for Blood Pressure

and the matrix plot of BP, Dur, Pulse, and Stress looks like:

The stepwise regression procedure:

minitab output

yielded the final stepwise model with BP as the response and Age, Weight, and BSA (body surface area) as predictors. The best subsets regression procedure:

minitab output

yields various models depending on the different criteria:

Based on the R²-value criterion, the "best" model is the model with the two predictors Age and Weight.
Based on the adjusted R²-value and MSE criteria, the "best" model is the model with all six of the predictors — Age, Weight, BSA, Duration, Pulse, and Stress — in the model. However, one could easily argue that any number of sub-models are also satisfactory based on these criteria — such as the model containing Age, Weight, BSA, and Duration.
Based on the C_p criterion, a couple of models stand out — namely the model containing Age, Weight, and BSA; and the model containing Age, Weight, BSA, and Duration.

Incidentally, did you notice how large some of the C_p values are for some of the models? Those are the models that you should be concerned about exhibiting substantial bias. Don't worry too much about C_p values that are only slightly larger than p.
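As a reminder of why C_p is compared with p, here is the statistic in its usual form, where p is the number of parameters in the candidate model (including the intercept), SSE_p is that model's error sum of squares, and MSE_all is the mean squared error of the model containing all of the candidate predictors:

```latex
C_p = \frac{SSE_p}{MSE_{\text{all}}} - (n - 2p)
```

If the candidate model is unbiased, SSE_p is expected to be about (n − p)σ², and MSE_all estimates σ², so:

```latex
C_p \approx \frac{(n - p)\,\sigma^2}{\sigma^2} - (n - 2p) = (n - p) - (n - 2p) = p
```

A model that leaves out an important predictor has an inflated SSE_p, which pushes C_p well above p. The model with all of the predictors always has C_p equal to p exactly, so for that model the criterion carries no information.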

Here's a case in which I might argue for thinking practically over thinking statistically. There appears to be nothing substantially wrong with the two-predictor model containing Age and Weight:

minitab output

For this model, the variance inflation factors are quite satisfactory (both 1.20), the adjusted R²-value (99.04%) can't get much better, and the residual analysis yields no concerns. That is, the residuals versus fits plot:

is just right, suggesting that the relationship is indeed linear and that the variances of the error terms are constant. The normal probability plot:

suggests that the error terms are normally distributed.

Now, why might I prefer this model over the other legitimate contenders? It all comes down to simplicity! What's your age? What's your weight? Perhaps more than 90% of you know the answer to those two simple questions. But, now what is your body surface area? And, how long have you had hypertension? Answers to these last two questions are almost certainly less immediate for most (all?) people. Now, the researchers might have good arguments for why we should instead use the larger, more complex models. If that's the case, fine. But, if not, it is almost always best to go with the simpler model. And, certainly the model containing only Age and Weight is simpler than the other viable models.
