## Example 10-7: Peruvian Blood Pressure Data

First we will illustrate Minitab’s “Best Subsets” procedure and a “by hand” calculation of the information criteria from earlier. Recall from Lesson 6 that this dataset consists of variables possibly relating to blood pressures of *n* = 39 Peruvians who have moved from rural high altitude areas to urban lower altitude areas (Peru dataset). The variables in this dataset (where we have omitted the calf skinfold variable from the first time we used this example) are:

*Y* = systolic blood pressure

\(X_1\) = age

\(X_2\) = years in urban area

\(X_3\) = \(X_2\) /\(X_1\) = fraction of life in urban area

\(X_4\) = weight (kg)

\(X_5\) = height (mm)

\(X_6\) = chin skinfold

\(X_7\) = forearm skinfold

\(X_8\) = resting pulse rate

Again, follow *Stat > Regression > Regression > Best Subsets* in Minitab. The results from this procedure are presented below.

##### Best Subsets Regression: Systol versus Age, Years, ...

##### Response is Systol

| Vars | R-Sq | R-Sq (adj) | R-Sq (pred) | Mallows Cp | S | Age | Years | fraclife | Weight | Height | Chin | Forearm | Pulse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 27.2 | 25.2 | 20.7 | 30.5 | 11.338 | | | | X | | | | |
| 1 | 7.6 | 5.1 | 0.0 | 48.1 | 12.770 | X | | | | | | | |
| 2 | 47.3 | 44.4 | 37.6 | 14.4 | 9.7772 | | | X | X | | | | |
| 2 | 42.1 | 38.9 | 30.3 | 19.1 | 10.251 | X | | | X | | | | |
| 3 | 50.3 | 46.1 | 38.6 | 13.7 | 9.6273 | X | | X | X | | | | |
| 3 | 49.0 | 44.7 | 34.2 | 14.8 | 9.7509 | | | X | X | | X | | |
| 4 | 59.7 | 55.0 | 44.8 | 7.2 | 8.7946 | X | X | X | X | | | | |
| 4 | 52.5 | 46.9 | 31.0 | 13.7 | 9.5502 | X | | X | X | | X | | |
| 5 | 63.9 | 58.4 | 45.6 | 5.5 | 8.4571 | X | X | X | X | | X | | |
| 5 | 63.1 | 57.6 | 44.2 | 6.1 | 8.5419 | X | X | X | X | | | X | |
| 6 | 64.9 | 58.3 | 44.3 | 6.6 | 8.4663 | X | X | X | X | | X | X | |
| 6 | 64.3 | 57.6 | 44.0 | 7.1 | 8.5337 | X | X | X | X | X | X | | |
| 7 | 66.1 | 58.4 | 42.6 | 7.5 | 8.4556 | X | X | X | X | X | X | X | |
| 7 | 65.5 | 57.7 | 41.3 | 8.0 | 8.5220 | X | X | X | X | | X | X | X |
| 8 | 66.6 | 57.7 | 39.9 | 9.0 | 8.5228 | X | X | X | X | X | X | X | X |

To interpret the results, we start by noting that the lowest \(C_p\) value (= 5.5) occurs for the five-variable model that includes the variables **Age**, **Years**, **fraclife**, **Weight**, and **Chin**. The "**X**"s on the right side of the display tell us which variables are in the model (look up to the column heading to see the variable name). The value of \(R^{2}\) for this model is 63.9% and the value of \(R^{2}_{adj}\) is 58.4%. If we look at the best six-variable model, we see only minimal changes in these values, and the value of \(S = \sqrt{MSE}\) increases. A five-variable model will most likely be sufficient. We should then use multiple regression to explore the five-variable model just identified. Note that two of these *x*-variables relate to how long the person has lived at the urban lower altitude.
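As a check on the printout, Mallows' \(C_p\) is easy to compute by hand: for a subset model with \(p\) parameters (intercept plus predictors), \(C_p = SSE_p/MSE_{full} - (n - 2p)\), where \(MSE_{full}\) is the mean squared error of the full eight-predictor model. A minimal Python sketch, using \(S = 8.5228\) for the full model and the SSE values from the regression results later in this example:

```python
n = 39                   # sample size
mse_full = 8.5228 ** 2   # MSE of the full 8-predictor model (S = 8.5228)

def mallows_cp(sse_p, p):
    """Mallows' Cp for a subset model with error sum of squares sse_p
    and p parameters (intercept plus predictors)."""
    return sse_p / mse_full - (n - 2 * p)

# Best 5-variable model: SSE = 2360.2, p = 6 (5 slopes + intercept)
print(round(mallows_cp(2360.2, 6), 1))   # -> 5.5
# Best 4-variable model: SSE = 2629.7, p = 5
print(round(mallows_cp(2629.7, 5), 1))   # -> 7.2
```

Both values match the \(C_p\) column of the best subsets display above.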

Next, we turn our attention to calculating AIC and BIC. Here are the multiple regression results for the best five-variable model (which has \(C_p\) = 5.5) and the best four-variable model (which has \(C_p\) = 7.2).

**Best 5-variable model results:**

##### Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 5 | 4171.2 | 834.24 | 11.66 | 0.000 |
| Age | 1 | 782.6 | 782.65 | 10.94 | 0.002 |
| Years | 1 | 751.2 | 751.19 | 10.50 | 0.003 |
| fraclife | 1 | 1180.1 | 1180.14 | 16.50 | 0.000 |
| Weight | 1 | 970.3 | 970.26 | 13.57 | 0.001 |
| Chin | 1 | 269.5 | 269.48 | 3.77 | 0.061 |
| Error | 33 | 2360.2 | 71.52 | | |
| Total | 38 | 6531.4 | | | |

##### Model Summary

| S | R-sq | R-sq(adj) | R-sq(pred) |
|---|---|---|---|
| 8.45707 | 63.86% | 58.39% | 45.59% |

##### Coefficients

| Term | Coef | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 109.4 | 21.5 | 5.09 | 0.000 | |
| Age | -1.012 | 0.306 | -3.31 | 0.002 | 2.94 |
| Years | 2.407 | 0.743 | 3.24 | 0.003 | 29.85 |
| fraclife | -110.8 | 27.3 | -4.06 | 0.000 | 20.89 |
| Weight | 1.098 | 0.298 | 3.68 | 0.001 | 2.38 |
| Chin | -1.192 | 0.614 | -1.94 | 0.061 | 1.48 |

##### Regression Equation

Systol = 109.4 - 1.012 Age + 2.407 Years - 110.8 fraclife + 1.098 Weight - 1.192 Chin
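As a sanity check, the Model Summary values can be reproduced directly from the Analysis of Variance table. With \(SSE = 2360.2\), \(SST = 6531.4\), \(n = 39\), and \(k = 5\) predictors, \(R^{2} = 1 - SSE/SST\) and \(R^{2}_{adj} = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)}\). A short Python sketch:

```python
# Reproduce R-sq and R-sq(adj) for the 5-predictor model from its ANOVA table.
sse, sst = 2360.2, 6531.4   # Error and Total Adj SS from the output above
n, k = 39, 5                # sample size and number of predictors

r_sq = 1 - sse / sst
r_sq_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(round(100 * r_sq, 2))      # -> 63.86
print(round(100 * r_sq_adj, 2))  # -> 58.39
```

Both agree with the Model Summary table to the reported precision.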

**Best 4-variable model results:**

##### Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 4 | 3901.7 | 975.43 | 12.61 | 0.000 |
| Age | 1 | 698.1 | 698.07 | 9.03 | 0.005 |
| Years | 1 | 711.2 | 711.20 | 9.20 | 0.005 |
| fraclife | 1 | 1125.5 | 1125.55 | 14.55 | 0.001 |
| Weight | 1 | 706.5 | 706.54 | 9.14 | 0.005 |
| Error | 34 | 2629.7 | 77.34 | | |
| Total | 38 | 6531.4 | | | |

##### Model Summary

| S | R-sq | R-sq(adj) | R-sq(pred) |
|---|---|---|---|
| 8.79456 | 59.74% | 55.00% | 44.84% |

##### Coefficients

| Term | Coef | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 116.8 | 22.0 | 5.32 | 0.000 | |
| Age | -0.951 | 0.316 | -3.00 | 0.005 | 2.91 |
| Years | 2.339 | 0.771 | 3.03 | 0.005 | 29.79 |
| fraclife | -108.1 | 28.3 | -3.81 | 0.001 | 20.83 |
| Weight | 0.832 | 0.275 | 3.02 | 0.005 | 1.88 |

##### Regression Equation

Systol = 116.8 - 0.951 Age + 2.339 Years - 108.1 fraclife + 0.832 Weight

**AIC Comparison**: The five-variable model still has a slight edge (a lower AIC is better).

- For the five-variable model:

\(AIC_p\) = 39 ln(2360.23) − 39 ln(39) + 2(6) = 172.015.

- For the four-variable model:

\(AIC_p\) = 39 ln(2629.71) − 39 ln(39) + 2(5) = 174.232.

**BIC Comparison**: The values are nearly the same; the five-variable model has a slightly lower value (a lower BIC is better).

- For the five-variable model:

\(BIC_p\) = 39 ln(2360.23) − 39 ln(39) + ln(39) × 6 = 181.997.

- For the four-variable model:

\(BIC_p\) = 39 ln(2629.71) − 39 ln(39) + ln(39) × 5 = 182.549.
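These hand calculations are easy to script. Here is a minimal Python sketch of the SSE-based AIC and BIC formulas used above, with \(n = 39\) and \(p\) counting the intercept plus the predictors:

```python
import math

def aic(n, sse, p):
    """AIC based on SSE: n*ln(SSE) - n*ln(n) + 2p."""
    return n * math.log(sse) - n * math.log(n) + 2 * p

def bic(n, sse, p):
    """BIC based on SSE: n*ln(SSE) - n*ln(n) + ln(n)*p."""
    return n * math.log(sse) - n * math.log(n) + math.log(n) * p

n = 39
print(round(aic(n, 2360.23, 6), 3))  # 5-variable model (document value: 172.015)
print(round(aic(n, 2629.71, 5), 3))  # 4-variable model (document value: 174.232)
print(round(bic(n, 2360.23, 6), 3))  # 5-variable model (document value: 181.997)
print(round(bic(n, 2629.71, 5), 3))  # 4-variable model (document value: 182.549)
```

The five-variable model comes out lower on both criteria, matching the comparisons above.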

Our decision is that the five-variable model has the better values of both criteria, so it seems to be the winner. Interestingly, the **Chin** variable is not quite significant at the 0.05 level in the five-variable model, so we could consider dropping it as a predictor. But the cost would be an increase in MSE and a 4.2% drop in \(R^{2}\). Given the closeness of the **Chin** *p*-value (0.061) to the 0.05 significance level and the relatively small sample size (*n* = 39), we probably should keep the **Chin** variable in the model for prediction purposes. When a *p*-value is only slightly higher than our significance level (by slightly higher, we mean usually no more than 0.05 above the significance level we are using), we usually say the variable is **marginally significant**. It is usually a good idea to keep such variables in the model, but either way, you should state why you decided to keep or drop the variable.

## Example 10-8: College Student Measurements

Next we will illustrate stepwise procedures in Minitab. Recall from Lesson 6 that this dataset consists of *n* = 55 college students with measurements for the following seven variables (Physical dataset):

*Y* = height (in)

\(X_1\) = left forearm length (cm)

\(X_2\) = left foot length (cm)

\(X_3\) = left palm width (cm)

\(X_4\) = head circumference (cm)

\(X_5\) = nose length (cm)

\(X_6\) = gender, coded as 0 for male and 1 for female

Here is the output for Minitab’s stepwise procedure (*Stat > Regression > Regression > Fit Regression Model*, click *Stepwise*, select *Stepwise* for *Method*, select *Include details for every step* under *Display the table of model selection details*).

##### Stepwise Selection of Terms

##### Candidate terms: LeftArm, LeftFoot, LeftHand, headCirc, nose, Gender

| Term | Coef (Step 1) | P (Step 1) | Coef (Step 2) | P (Step 2) |
|---|---|---|---|---|
| Constant | 31.22 | | 21.86 | |
| LeftFoot | 1.449 | 0.000 | 1.023 | 0.000 |
| LeftArm | | | 0.796 | 0.000 |
| S | 2.55994 | | 2.14916 | |
| R-sq | 67.07% | | 77.23% | |
| R-sq(adj) | 66.45% | | 76.35% | |
| R-sq(pred) | 64.49% | | 73.65% | |
| Mallows' Cp | 20.69 | | 0.57 | |

\(\alpha\) to enter = 0.15, \(\alpha\) to remove = 0.15

All six *x*-variables were candidates for the final model. The procedure took two forward steps and then stopped. The variables in the model at that point are left foot length and left forearm length. The left foot length variable was selected first (in Step 1), and then left forearm length was added to the model. The procedure stopped because no other variables could enter at a significant level. Notice that the significance level used for entering variables was 0.15. Thus, after Step 2 there were no more x-variables for which the *p*-value would be less than 0.15.

It is also possible to have Minitab work backwards from a model with all the predictors included and only consider steps in which the least significant predictor is removed. Output for this *backward elimination* procedure is given below.

##### Backward Elimination of Terms

##### Candidate terms: LeftArm, LeftFoot, LeftHand, headCirc, nose, Gender

| Term | Coef (Step 1) | P (Step 1) | Coef (Step 2) | P (Step 2) | Coef (Step 3) | P (Step 3) | Coef (Step 4) | P (Step 4) | Coef (Step 5) | P (Step 5) |
|---|---|---|---|---|---|---|---|---|---|---|
| Constant | 2.1 | | 19.7 | | 16.60 | | 21.25 | | 21.86 | |
| LeftArm | 0.762 | 0.000 | 0.751 | 0.000 | 0.760 | 0.000 | 0.766 | 0.000 | 0.796 | 0.000 |
| LeftFoot | 0.912 | 0.000 | 0.915 | 0.000 | 0.961 | 0.000 | 1.003 | 0.000 | 1.023 | 0.000 |
| LeftHand | 0.191 | 0.510 | 0.198 | 0.490 | 0.248 | 0.332 | 0.225 | 0.370 | | |
| HeadCirc | 0.076 | 0.639 | 0.081 | 0.611 | 0.100 | 0.505 | | | | |
| nose | -0.230 | 0.654 | | | | | | | | |
| Gender | -0.55 | 0.632 | -0.43 | 0.696 | | | | | | |
| S | 2.20115 | | 2.18317 | | 2.16464 | | 2.15296 | | 2.14916 | |
| R-sq | 77.95% | | 77.86% | | 77.79% | | 77.59% | | 77.23% | |
| R-sq(adj) | 75.19% | | 75.60% | | 76.01% | | 76.27% | | 76.35% | |
| R-sq(pred) | 70.58% | | 71.34% | | 72.20% | | 73.27% | | 73.65% | |
| Mallows' Cp | 7.00 | | 5.20 | | 3.36 | | 1.79 | | 0.57 | |

\(\alpha\) to remove = 0.1

The procedure took five steps (counting Step 1 as the estimation of the model with all six variables included). At each subsequent step, the least significant variable is eliminated, until all variables remaining in the model are significant (at the default 0.10 level). At a particular step, you can see which variable was eliminated by the new blank spot in the display (compared to the previous step). For instance, from Step 1 to Step 2, the nose length variable was dropped (it had the highest *p*-value). Then, from Step 2 to Step 3, the gender variable was dropped, and so on.
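The elimination logic itself is simple to express in code. The sketch below traces the backward elimination decisions using the coefficient *p*-values from the Minitab steps above; in practice the *p*-values would come from refitting the model at each step, so the hard-coded lookup table here is purely illustrative:

```python
# p-values for each fitted model, keyed by the set of predictors in it.
# Values are taken from Steps 1-5 of the Minitab output above.
P_VALUES = {
    frozenset(["LeftArm", "LeftFoot", "LeftHand", "HeadCirc", "nose", "Gender"]):
        {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.510,
         "HeadCirc": 0.639, "nose": 0.654, "Gender": 0.632},
    frozenset(["LeftArm", "LeftFoot", "LeftHand", "HeadCirc", "Gender"]):
        {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.490,
         "HeadCirc": 0.611, "Gender": 0.696},
    frozenset(["LeftArm", "LeftFoot", "LeftHand", "HeadCirc"]):
        {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.332,
         "HeadCirc": 0.505},
    frozenset(["LeftArm", "LeftFoot", "LeftHand"]):
        {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.370},
    frozenset(["LeftArm", "LeftFoot"]):
        {"LeftArm": 0.000, "LeftFoot": 0.000},
}

def backward_eliminate(predictors, alpha=0.10):
    """Repeatedly drop the predictor with the largest p-value above alpha."""
    current = frozenset(predictors)
    removed = []
    while current:
        pvals = P_VALUES[current]            # stand-in for refitting the model
        worst = max(pvals, key=pvals.get)    # least significant predictor
        if pvals[worst] <= alpha:
            break                            # everything left is significant
        removed.append(worst)
        current = current - {worst}
    return current, removed

final, order = backward_eliminate(
    ["LeftArm", "LeftFoot", "LeftHand", "HeadCirc", "nose", "Gender"])
print(sorted(final))  # ['LeftArm', 'LeftFoot']
print(order)          # ['nose', 'Gender', 'HeadCirc', 'LeftHand']
```

The trace reproduces the elimination order shown in the output: nose, then Gender, then HeadCirc, then LeftHand, leaving LeftArm and LeftFoot in the model.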

The stopping point for the backward elimination procedure gave the same model as the stepwise procedure did, with left forearm length and left foot length as the only two *x*-variables in the model. It will not always be the case that the two methods arrive at the same model.

Finally, it is also possible to have Minitab work forward from a base model with no predictors included and only consider steps in which the most significant candidate predictor is added. We leave it as an exercise to see how this *forward selection* procedure works for this dataset (you can probably guess the result, given the output of the stepwise procedure above).