##
Example 6-4: Peruvian Blood Pressure Data
Section* *

This dataset consists of variables possibly relating to blood pressures of *n *= 39 Peruvians who have moved from rural high altitude areas to urban lower altitude areas (Peru data). The variables in this dataset are:

\(Y\) = systolic blood pressure

\(X_{1}\) = age

\(X_{2}\) = years in urban area

\(X_{3}\) = \(X_{2}\) /\(X_{1}\) = fraction of life in urban area

\(X_{4}\) = weight (kg)

\(X_{5}\) = height (mm)

\(X_{6}\) = chin skinfold

\(X_{7}\) = forearm skinfold

\(X_{8}\) = calf skinfold

\(X_{9}\) = resting pulse rate

First, we run a multiple regression using all nine *x*-variables as predictors. The results are given below.

##### Analysis of Variance

Source | DF | Adj SS | Adj MS | F- Value | P-Value |
---|---|---|---|---|---|

Regression | 9 | 4358.85 | 484.32 | 6.46 | 0.000 |

Error | 29 | 2172.58 | 74.92 | ||

Total |
38 | 6531.44 |

##### Model Summary

S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|

8.65544 | 66.74% | 56.41% | 34.45% |

##### Coefficients

Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|

Constant | 146.8 | 49.0 | 3.00 | 0.006 | |

Age | -1.121 | 0.327 | -.343 | 0.002 | 3.21 |

Years | 2.455 | 0.815 | 3.01 | 0.005 | 34.29 |

FracLife | -115.3 | 30.2 | -3.82 | 0.001 | 24.39 |

Weight | 1.414 | 0.431 | 3.28 | 0.003 | 4.75 |

Height | -0.0346 | 0.0369 | -0.94 | 0.355 | 1.91 |

Chin | -0.944 | 0.741 | -1.27 | 0.213 | 2.06 |

Forearm | -1.17 | 1.19 | -0.98 | 0.335 | 3.80 |

Calf | -0.159 | 0.537 | -0.30 | 0.770 | 2.41 |

Pulse | 0.115 | 0.170 | 0.67 | 0.507 | 1.33 |

When looking at tests for individual variables, we see that *p*-values for the variables **Height**, **Chin**, **Forearm**, **Calf**, and **Pulse **are not at a statistically significant level. These individual tests are affected by correlations amongst the *x*-variables, so we will use the **General Linear F*** *procedure to see whether it is reasonable to declare that all five non-significant variables can be dropped from the model.

Next, consider testing:

\(H_{0} \colon \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0\)

\(H_{A} \colon\)at least one of \(\beta_5 , \beta_6 , \beta_7, \beta_8 , \beta_9 \ne 0\)

within the nine variable model given above. If this null is not rejected, it is reasonable to say that none of the five variables **Height**, **Chin**, **Forearm**, **Calf **and **Pulse **contribute to the prediction/explanation of systolic blood pressure.

The *full model* includes all nine variables; SSE(full) = 2172.58, the full error df = 29, and MSE(full) = 74.92 (we get these from the Minitab results above). The *reduced model* includes only the variables **Age**, **Years**, **fraclife**, and **Weight **(which are the remaining variables if the five possibly non-significant variables are dropped). Regression results for the reduced model are given below.

##### Analysis of Variance

Source | DF | Adj SS | Adj MS | F- Value | P-Value |
---|---|---|---|---|---|

Regression | 4 | 3901.7 | 975.43 | 12.61 | 0.000 |

Error | 34 | 2629.7 | 77.34 | ||

Total |
38 | 6531.4 |

##### Model Summary

S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|

8.79456 | 59.74% | 55.00% | 44.84% |

##### Coefficients

Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|

Constant | 116.8 | 22.0 | 5.32 | 0.000 | |

Age | -0.951 | 0.316 | -3.00 | 0.005 | 2.91 |

Years | 2.339 | 0.771 | 3.03 | 0.005 | 29.79 |

FracLife | -108.1 | 28.3 | -3.81 | 0.001 | 20.83 |

Weight | 0.832 | 0.275 | 3.02 | 0.005 | 1.88 |

We see that SSE(reduced) = 2629.7, and the reduced error df = 34. We also see that all four individual *x*-variables are statistically significant.

The calculation for the general linear *F*-test statistic is:

\(F=\dfrac{\frac{\text{SSE(reduced) - SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\dfrac{\frac{2629.7-2172.58}{34-29}}{74.92}=1.220\)

Thus, this test statistic comes from an \(F_{5,29}\) distribution, of which the associated *p*-value is 0.325 (this can be done by using *Calc >> Probability Distribution >> F* in Minitab). This is not at a statistically significant level, so we do not reject the null hypothesis. Thus it is feasible to drop the variables \(X_{5}\), \(X_{6}\), \(X_{7}\), \(X_{8}\), and \(X_{9 }\) from the model.

#### Video: Testing a Subset of Predictors in a Multiple Linear Regression Model

##
Example 6-5: Measurements of College Students
Section* *

*For n = 55 college students, we have measurements *(Physical dataset) for the following five variables:

\(Y\) = height (in)

\(X_{1 }\)= left forearm length (cm)

\(X_{2 }\)= left foot length (cm)

\(X_{3 }\)= head circumference (cm)

\(X_{4 }\)= nose length (cm)

*The Minitab output for the full model is given below.*

##### Coefficients

Term | Coef | SE Coef | 95% CI | T-Value | P-Value | VIF |
---|---|---|---|---|---|---|

Constant | 18.50 | 7.83 | ( 2.78, 34.23) | 2.36 | 0.022 | |

LeftArm | 0.802 | 0.171 | ( 0.459, 1.145) | 4.70 | 0.000 | 1.63 |

LeftFoot | 0.997 | 0.162 | ( 0.671, 1.323) | 6.14 | 0.000 | 1.28 |

HeadCirc | 0.081 | 0.150 | (-0.220, 0.381) | 0.54 | 0.593 | 1.28 |

nose | -0.147 | 0.492 | (-1.136, 0.841) | -0.30 | 0.766 | 1.14 |

##### Regression Equation

Height = 18.50 + 0.802 LeftArm + 0.997 LeftFoot + 0.081 HeadCirc - 0.147 nose

*Notice in the output that there are also t-test results provided. The interpretations of these t-tests are as follows:*

*The sample coefficients for***LeftArm**and**LeftFoot**achieve statistical significance. This indicates that they are useful as predictors of**Height**.*The sample coefficients for***HeadCirc**and**nose**are not significant. Each t-test considers the question of whether the variable is needed, given that all other variables will remain in the model.

*Below is a plot of residuals versus the fitted values and it seems suitable.*

*There is no obvious curvature and the variance is reasonably constant. One may note two possible outliers, but nothing serious.*

*The first calculation we will perform is for the general linear F-test. The results above might lead us to test*

*\(H_{0} \colon \beta_3 = \beta_4 = 0\)
\(H_{A} \colon\) at least one of \(\left( \beta_3 , \beta_4 \right) \ne 0\)*

*in the full model. If we fail to reject the null hypothesis, we could then remove both of HeadCirc and nose as predictors.*

*Below is the ANOVA table for the full model.*

##### Analysis of Variance

Source | DF | Seq SS | Seq MS | F- Value | P-Value |
---|---|---|---|---|---|

Regression | 4 | 816.39 | 204.098 | 42.81 | 0.000 |

LeftArm | 1 | 590.21 | 590.214 | 123.81 | 0.000 |

LeftFoot | 1 | 224.35 | 224.349 | 47.06 | 0.000 |

headCirc | 1 | 1.40 | 1.402 | 0.29 | 0.590 |

nose | 1 | 0.43 | 0.427 | 0.09 | 0.766 |

Error | 50 | 238.35 | 4.767 | ||

Total |
54 | 1054.75 |

*From this output, we see that SSE(full) = 238.35, with df = 50, and MSE(full) = 4.77. The reduced model includes only the two variables LeftArm and LeftFoot as predictors. The ANOVA results for the reduced model are found below.*

##### Analysis of Variance

Source | DF | Seq SS | Seq MS | F- Value | P-Value |
---|---|---|---|---|---|

Regression | 2 | 814.56 | 407.281 | 88.18 | 0.000 |

LeftArm | 1 | 590.21 | 590214 | 127.78 | 0.000 |

LeftFoot | 1 | 224.35 | 224.349 | 48.57 | 0.000 |

Error | 52 | 240.18 | 4.619 | ||

Lack-of-Fit | 44 | 175.14 | 3.980 | 0.49 | 0.937 |

Pure Error | 8 | 65.04 | 8.130 | ||

Total |
54 | 1054.75 |

*From this output, we see that SSE(reduced) = SSE\(\left( X_{1} , X_{2}\right)\) = 240.18, with df = 52, and MSE(reduced) = MSE\(\left(X_{1}, X_{2}\right) = 4.62\).*

*With these values obtained, we can now obtain the test statistic for testing \(H_{0} \colon \beta_3 = \beta_4 = 0\):*

*\(F=\dfrac{\frac{\text{SSE}(X_1, X_2) - \text{SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\dfrac{\frac{240.18-238.35}{52-50}}{4.77}=0.192\)*

*This value comes from an \(F_{2,50}\) distribution. By using Calc >> Probability Distribution >> F in Minitab, we learn that the area to the left of F = 0.192 (with df of 2 and 50) is 0.174. The p-value is the area to the right of F, so p = 1 − 0.174 = 0.826. Thus, we do not reject the null hypothesis and it is reasonable to remove HeadCirc and nose from the model.*

*Next we consider what fraction of variation in Y = Height cannot be explained by \(X_{2}\) = LeftFoot, but can be explained by \(X_{1}\) = LeftArm? To answer this question, we calculate the partial *\(R^{2}\)

*. The formula is:*

*\(R_{Y, 1|2}^{2}=\dfrac{SSR(X_1|X_2)}{SSE(X_2)}=\dfrac{SSE(X_2)-SSE(X_1,X_2)}{SSE(X_2)}\)*

*The denominator, SSE\(\left(X_{2}\right)\), measures the unexplained variation in Y when \(X_{2 }\)is the predictor. The ANOVA table for this regression is found in below.*

##### Analysis of Variance

Source | DF | Seq SS | Seq MS | F- Value | P-Value |
---|---|---|---|---|---|

Regression | 1 | 707.4 | 707.420 | 107.95 | 0.000 |

LeftFoot | 1 | 707.4 | 707.420 | 107.95 | 0.000 |

Error | 53 | 347.3 | 6.553 | ||

Lack-of-Fit | 19 | 113.0 | 5.948 | 0.86 | 0.625 |

Pure Error | 34 | 234.3 | 6.892 | ||

Total |
54 | 1054.7 |

*These results give us SSE\(\left(X_{2}\right)\) = 347.3.*

*The numerator, SSE\(\left(X_{2}\right)\)–SSE\(\left(X_{1}, X_{2}\right)\), measures the further reduction in the SSE when \(X_{1}\) is added to the model. Results from the earlier Minitab output give us SSE\(\left(X_{1} , X_{2}\right)\) = 240.18 and now we can calculate:*

*\begin{align}R_{Y, 1|2}^{2}&=\dfrac{SSR(X_1|X_2)}{SSE(X_2)}=\dfrac{SSE(X_2)-SSE(X_1,X_2)}{SSE(X_2)}\\&=\dfrac{347.3-240.18}{347.3}=0.308\end{align}*

*Thus \(X_{1}\)= LeftArm explains 30.8% of the variation in Y = Height that could not be explained by \(X_{2}\) = LeftFoot.*