10.9 - Further Examples

Example 10-7: Peruvian Blood Pressure Data

Machu Picchu, an Incan Citadel in Peru

First, we will illustrate Minitab’s “Best Subsets” procedure and a “by hand” calculation of the information criteria from earlier. Recall from Lesson 6 that this dataset consists of variables possibly relating to blood pressures of n = 39 Peruvians who have moved from rural high-altitude areas to urban lower-altitude areas (Peru dataset). The variables in this dataset (where we have omitted the calf skinfold variable from the first time we used this example) are:

Y = systolic blood pressure
\(X_1\) = age
\(X_2\) = years in urban area
\(X_3\) = \(X_2\) /\(X_1\) = fraction of life in urban area
\(X_4\) = weight (kg)
\(X_5\) = height (mm)
\(X_6\) = chin skinfold
\(X_7\) = forearm skinfold
\(X_8\) = resting pulse rate

Again, follow Stat > Regression > Regression > Best Subsets in Minitab. The results of this procedure are presented below.

Best Subsets Regressions: Systol versus Age, Years, ...

Response is Systol

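| Vars | R-Sq | R-Sq (adj) | R-Sq (pred) | Mallows Cp | S | Age | Years | fraclife | Weight | Height | Chin | Forearm | Pulse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 27.2 | 25.2 | 20.7 | 30.5 | 11.338 | | | | X | | | | |
| 1 | 7.6 | 5.1 | 0.0 | 48.1 | 12.770 | | | X | | | | | |
| 2 | 47.3 | 44.4 | 37.6 | 14.4 | 9.7772 | | | X | X | | | | |
| 2 | 42.1 | 38.9 | 30.3 | 19.1 | 10.251 | | X | | X | | | | |
| 3 | 50.3 | 46.1 | 38.6 | 13.7 | 9.6273 | | | X | X | | X | | |
| 3 | 49.0 | 44.7 | 34.2 | 14.8 | 9.7509 | | X | X | X | | | | |
| 4 | 59.7 | 55.0 | 44.8 | 7.2 | 8.7946 | X | X | X | X | | | | |
| 4 | 52.5 | 46.9 | 31.0 | 13.7 | 9.5502 | X | X | X | | X | | | |
| 5 | 63.9 | 58.4 | 45.6 | 5.5 | 8.4571 | X | X | X | X | | X | | |
| 5 | 63.1 | 57.6 | 44.2 | 6.1 | 8.5419 | X | X | X | X | | | X | |
| 6 | 64.9 | 58.3 | 43.3 | 6.6 | 8.4663 | X | X | X | X | | X | X | |
| 6 | 64.3 | 57.6 | 44.0 | 7.1 | 8.5337 | X | X | X | X | X | X | | |
| 7 | 66.1 | 58.4 | 42.6 | 7.5 | 8.4556 | X | X | X | X | X | X | X | |
| 7 | 65.5 | 57.7 | 41.3 | 8.0 | 8.5220 | X | X | X | X | | X | X | X |
| 8 | 66.6 | 57.7 | 39.9 | 9.0 | 8.5228 | X | X | X | X | X | X | X | X |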

To interpret the results, we start by noting that the lowest \(C_p\) value (= 5.5) occurs for the five-variable model that includes the variables Age, Years, fraclife, Weight, and Chin. The “X”s on the right side of the display tell us which variables are in the model (look up to the column heading to see the variable name). The value of \(R^{2}\) for this model is 63.9% and the value of \(R^{2}_{adj}\) is 58.4%. If we look at the best six-variable model, we see only minimal changes in these values, and the value of \(S = \sqrt{MSE}\) increases. A five-variable model most likely will be sufficient. We should then use multiple regression to explore the five-variable model just identified. Note that two of these x-variables relate to how long the person has lived at the urban lower altitude.
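The reading of the table above can be checked with a few lines of code. The following is a minimal sketch, not Minitab's own procedure: it copies the best model of each size from the Best Subsets output, picks the row with the lowest \(C_p\), and recomputes \(R^{2}_{adj}\) from \(R^{2}\) using the standard formula \(1 - (1 - R^{2})(n - 1)/(n - k - 1)\). The names `rows` and `adjusted_r2` are introduced here for illustration only.

```python
n = 39  # sample size of the Peru dataset

# (number of predictors, R-Sq %, Mallows Cp, S) for the best model of each size,
# copied from the Best Subsets output
rows = [
    (1, 27.2, 30.5, 11.338),
    (2, 47.3, 14.4, 9.7772),
    (3, 50.3, 13.7, 9.6273),
    (4, 59.7, 7.2, 8.7946),
    (5, 63.9, 5.5, 8.4571),
    (6, 64.9, 6.6, 8.4663),
    (7, 66.1, 7.5, 8.4556),
    (8, 66.6, 9.0, 8.5228),
]

def adjusted_r2(r2_pct, n, k):
    """Adjusted R-squared (in percent) for a model with k predictors plus an intercept."""
    return 100 * (1 - (1 - r2_pct / 100) * (n - 1) / (n - k - 1))

# The row with the smallest Cp is the five-variable model
best = min(rows, key=lambda row: row[2])
print("lowest Cp:", best)

# Recomputed adjusted R-squared next to each row's reported statistics
for k, r2, cp, s in rows:
    print(k, r2, round(adjusted_r2(r2, n, k), 1), cp, s)
```

Because the inputs are rounded to one decimal place, the recomputed \(R^{2}_{adj}\) values can differ from Minitab's in the last digit.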

Next, we turn our attention to calculating AIC and BIC. Here are the multiple regression results for the best five-variable model (which has \(C_p\) = 5.5) and the best four-variable model (which has \(C_p\) = 7.2).

Best 5-variable model results:

Analysis of Variance

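| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 5 | 4171.2 | 834.24 | 11.66 | 0.000 |
| Age | 1 | 782.6 | 782.65 | 10.94 | 0.002 |
| Years | 1 | 751.2 | 751.19 | 10.50 | 0.003 |
| fraclife | 1 | 1180.1 | 1180.14 | 16.50 | 0.000 |
| Weight | 1 | 970.3 | 970.26 | 13.57 | 0.001 |
| Chin | 1 | 269.5 | 269.48 | 3.77 | 0.061 |
| Error | 33 | 2360.2 | 71.52 | | |
| Total | 38 | 6531.4 | | | |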

Model Summary

| S | R-sq | R-sq(adj) | R-sq(pred) |
|---|---|---|---|
| 8.45707 | 63.86% | 58.39% | 45.59% |

Coefficients

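| Term | Coef | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 109.4 | 21.5 | 5.09 | 0.000 | |
| Age | -1.012 | 0.306 | -3.31 | 0.002 | 2.94 |
| Years | 2.407 | 0.743 | 3.24 | 0.003 | 29.85 |
| fraclife | -110.8 | 27.3 | -4.06 | 0.000 | 20.89 |
| Weight | 1.098 | 0.298 | 3.68 | 0.001 | 2.38 |
| Chin | -1.192 | 0.614 | -1.94 | 0.061 | 1.48 |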

Regression Equation

\(\widehat{Systol} = 109.4 - 1.012 Age + 2.407 Years - 110.8 fraclife + 1.098 Weight - 1.192 Chin\)

Best 4-variable model results:

Analysis of Variance

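| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 4 | 3901.7 | 975.43 | 12.61 | 0.000 |
| Age | 1 | 698.1 | 698.07 | 9.03 | 0.005 |
| Years | 1 | 711.2 | 711.20 | 9.20 | 0.005 |
| fraclife | 1 | 1125.5 | 1125.55 | 14.55 | 0.001 |
| Weight | 1 | 706.5 | 706.54 | 9.14 | 0.005 |
| Error | 34 | 2629.7 | 77.34 | | |
| Total | 38 | 6531.4 | | | |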

Model Summary

| S | R-sq | R-sq(adj) | R-sq(pred) |
|---|---|---|---|
| 8.79456 | 59.74% | 55.00% | 44.84% |

Coefficients

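| Term | Coef | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 116.8 | 22.0 | 5.32 | 0.000 | |
| Age | -0.951 | 0.316 | -3.00 | 0.005 | 2.91 |
| Years | 2.339 | 0.771 | 3.03 | 0.005 | 29.79 |
| fraclife | -108.1 | 28.3 | -3.81 | 0.001 | 20.83 |
| Weight | 0.832 | 0.275 | 3.02 | 0.005 | 1.88 |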

Regression Equation

\(\widehat{Systol} = 116.8 - 0.951 Age + 2.339 Years - 108.1 fraclife + 0.832 Weight\)
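The \(C_p\) values quoted for these two models can be reproduced from their error sums of squares. The sketch below assumes the usual definition \(C_p = SSE_p / MSE_{all} - (n - 2p)\), where \(p\) counts the intercept, and takes \(MSE_{all}\) to be the square of S from the eight-variable row of the Best Subsets output (8.5228). The SSE values are the ones carried to two decimals in the AIC and BIC calculations; the function name `mallows_cp` is introduced here for illustration.

```python
n = 39                   # sample size
mse_full = 8.5228 ** 2   # S^2 for the eight-variable model (assumed source of MSE_all)

def mallows_cp(sse, p, n=n, mse_all=mse_full):
    """Mallows' Cp for a model with p parameters (predictors plus intercept)."""
    return sse / mse_all - (n - 2 * p)

cp_five = mallows_cp(2360.23, p=6)  # five predictors + intercept
cp_four = mallows_cp(2629.71, p=5)  # four predictors + intercept
print(round(cp_five, 1), round(cp_four, 1))  # prints 5.5 7.2
```

Both values agree with the Best Subsets output to the one decimal place reported there.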

AIC Comparison: The five-variable model still has a slight edge (a lower AIC is better).

  • For the five-variable model:

\(AIC_p\) = 39 ln(2360.23) − 39 ln(39) + 2(6) = 172.015.

  • For the four-variable model:

\(AIC_p\) = 39 ln(2629.71) − 39 ln(39) + 2(5) = 174.232.

BIC Comparison: The values are nearly the same; the five-variable model has a slightly lower value (a lower BIC is better).

  • For the five-variable model:

\(BIC_p\) = 39 ln(2360.23) − 39 ln(39) + ln(39) × 6 = 181.997.

  • For the four-variable model:

\(BIC_p\) = 39 ln(2629.71) − 39 ln(39) + ln(39) × 5 = 182.549.
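The four hand calculations above follow the same pattern, so they are easy to script. This is a minimal sketch that implements the two formulas exactly as written above, with \(p\) counting the intercept; the function names `aic` and `bic` and the `models` dictionary are illustrative and not part of Minitab.

```python
import math

def aic(sse, n, p):
    """AIC_p = n ln(SSE) - n ln(n) + 2p, with p parameters including the intercept."""
    return n * math.log(sse) - n * math.log(n) + 2 * p

def bic(sse, n, p):
    """BIC_p = n ln(SSE) - n ln(n) + p ln(n), with p parameters including the intercept."""
    return n * math.log(sse) - n * math.log(n) + p * math.log(n)

n = 39
# model name -> (SSE, number of parameters including the intercept)
models = {"five-variable": (2360.23, 6), "four-variable": (2629.71, 5)}

for name, (sse, p) in models.items():
    print(f"{name}: AIC = {aic(sse, n, p):.3f}, BIC = {bic(sse, n, p):.3f}")
```

The only difference between the two criteria is the penalty per parameter: 2 for AIC versus ln(39) ≈ 3.66 for BIC, which is why the gap between the models narrows under BIC.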

Our decision is that the five-variable model has better values than the four-variable model, so it seems to be the winner. Interestingly, the Chin variable does not quite reach significance at the 0.05 level in the five-variable model (p-value = 0.061), so we could consider dropping it as a predictor. The cost, however, would be an increase in MSE (from 71.52 to 77.34) and a drop in \(R^{2}\) of about 4.1 percentage points (from 63.86% to 59.74%). Given how close the Chin p-value (0.061) is to the 0.05 significance level and the relatively small sample size (n = 39), we should probably keep the Chin variable in the model for prediction purposes. When a p-value is only slightly higher than the significance level (usually no more than 0.05 above it), we usually say the variable is marginally significant. It is usually a good idea to keep such variables in the model, but either way, you should state why you decided to keep or drop the variable.
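The cost of dropping Chin can also be expressed as a partial F-test comparing the two fitted models. As a worked check, using the error sums of squares from the AIC calculations and the MSE of the five-variable model (the subscripts 4 and 5 are labels introduced here for the four- and five-variable models):

\[
F = \frac{(SSE_{4} - SSE_{5})/1}{MSE_{5}} = \frac{2629.71 - 2360.23}{71.52} = \frac{269.48}{71.52} \approx 3.77
\]

This matches the F-value reported for Chin in the five-variable analysis of variance table and, up to rounding, the square of its t-value, \((-1.94)^{2} \approx 3.76\), so the p-value of 0.061 is the same whichever way the test is framed.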

Example 10-8: College Student Measurements

College students walking to class

Next, we will illustrate stepwise procedures in Minitab. Recall from Lesson 6 that this dataset consists of n = 55 college students with measurements for the following seven variables (Physical dataset):

Y = height (in)
\(X_1\) = left forearm length (cm)
\(X_2\) = left foot length (cm)
\(X_3\) = left palm width
\(X_4\) = head circumference (cm)
\(X_5\) = nose length (cm)
\(X_6\) = gender, coded as 0 for male and 1 for female

Here is the output for Minitab’s stepwise procedure (Stat > Regression > Regression > Fit Regression Model, click Stepwise, select Stepwise for Method, select Include details for every step under Display the table of model selection details).

Stepwise Selection of Terms

Candidate terms: LeftArm, LeftFoot, LeftHand, headCirc, nose, Gender

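| Term | Step 1 Coef | Step 1 P | Step 2 Coef | Step 2 P |
|---|---|---|---|---|
| Constant | 31.22 | | 21.86 | |
| LeftFoot | 1.449 | 0.000 | 1.023 | 0.000 |
| LeftArm | | | 0.796 | 0.000 |

| Statistic | Step 1 | Step 2 |
|---|---|---|
| S | 2.55994 | 2.14916 |
| R-sq | 67.07% | 77.23% |
| R-sq(adj) | 66.45% | 76.35% |
| R-sq(pred) | 64.49% | 73.65% |
| Mallows' Cp | 20.69 | 0.57 |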

\(\alpha\) to enter = 0.15, \(\alpha\) to remove = 0.15

All six x-variables were candidates for the final model. The procedure took two forward steps and then stopped. The variables in the model at that point are left foot length and left forearm length. The left foot length variable was selected first (in Step 1), and then the left forearm length was added to the model. The procedure stopped because no other variables could enter at a significant level. Notice that the significance level used for entering variables was 0.15. Thus, after Step 2 there were no more x-variables for which the p-value would be less than 0.15.

It is also possible to have Minitab work backward from a model with all the predictors included and only consider steps in which the least significant predictor is removed. Output for this backward elimination procedure is given below.

Backward Elimination of Terms

Candidate terms: LeftArm, LeftFoot, LeftHand, headCirc, nose, Gender

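| Term | Step 1 Coef | Step 1 P | Step 2 Coef | Step 2 P | Step 3 Coef | Step 3 P | Step 4 Coef | Step 4 P | Step 5 Coef | Step 5 P |
|---|---|---|---|---|---|---|---|---|---|---|
| Constant | 2.1 | | 19.7 | | 16.60 | | 21.25 | | 21.86 | |
| LeftArm | 0.762 | 0.000 | 0.751 | 0.000 | 0.760 | 0.000 | 0.766 | 0.000 | 0.796 | 0.000 |
| LeftFoot | 0.912 | 0.000 | 0.915 | 0.000 | 0.961 | 0.000 | 1.003 | 0.000 | 1.023 | 0.000 |
| LeftHand | 0.191 | 0.510 | 0.198 | 0.490 | 0.248 | 0.332 | 0.225 | 0.370 | | |
| HeadCirc | 0.076 | 0.639 | 0.081 | 0.611 | 0.100 | 0.505 | | | | |
| nose | -0.230 | 0.654 | | | | | | | | |
| Gender | -0.55 | 0.632 | -0.43 | 0.696 | | | | | | |

| Statistic | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 |
|---|---|---|---|---|---|
| S | 2.20115 | 2.18317 | 2.16464 | 2.15296 | 2.14916 |
| R-sq | 77.95% | 77.86% | 77.79% | 77.59% | 77.23% |
| R-sq(adj) | 75.19% | 75.60% | 76.01% | 76.27% | 76.35% |
| R-sq(pred) | 70.58% | 71.34% | 72.20% | 73.27% | 73.65% |
| Mallows' Cp | 7.00 | 5.20 | 3.36 | 1.79 | 0.57 |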

\(\alpha\) to remove 0.1

The procedure took five steps (counting Step 1 as the estimation of a model with all variables included). At each subsequent step, the weakest variable (the one with the highest p-value) was eliminated, until all variables remaining in the model were significant at the default 0.10 level. At a particular step, you can see which variable was eliminated by the new blank spot in the display (compared to the previous step). For instance, from Step 1 to Step 2, the nose length variable was dropped (it had the highest p-value, 0.654). Then, from Step 2 to Step 3, the gender variable was dropped, and so on.
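The removal rule can be replayed directly from the p-values in the backward elimination output. This is a minimal sketch of the decision rule only; it does not refit any models. The p-values are copied from the output above, and the helper `term_to_remove` is a hypothetical name introduced for illustration.

```python
ALPHA_TO_REMOVE = 0.10  # Minitab's alpha-to-remove shown in the output

# p-values of the terms in the model at each step, copied from the output
steps = [
    {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.510,
     "HeadCirc": 0.639, "nose": 0.654, "Gender": 0.632},
    {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.490,
     "HeadCirc": 0.611, "Gender": 0.696},
    {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.332, "HeadCirc": 0.505},
    {"LeftArm": 0.000, "LeftFoot": 0.000, "LeftHand": 0.370},
    {"LeftArm": 0.000, "LeftFoot": 0.000},
]

def term_to_remove(p_values, alpha=ALPHA_TO_REMOVE):
    """Return the least significant term if its p-value exceeds alpha, else None."""
    term, p = max(p_values.items(), key=lambda item: item[1])
    return term if p > alpha else None

removed = [term_to_remove(step) for step in steps]
print(removed)  # ['nose', 'Gender', 'HeadCirc', 'LeftHand', None]
```

The final `None` corresponds to Step 5, where both remaining p-values are below 0.10 and the procedure stops.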

The stopping point for the backward elimination procedure gave the same model as the stepwise procedure did, with left forearm length and left foot length as the only two x-variables in the model. The two methods will not always arrive at the same model.

Finally, it is also possible to have Minitab work forwards from a base model with no predictors included and only consider steps in which the most significant predictor is added. We leave it as an exercise to see how this forward selection procedure works for this dataset (you can probably guess given the results of the Stepwise procedure above).