6.7 - Further Examples

Example 6-4: Peruvian Blood Pressure Data

This dataset contains variables that may be related to the blood pressure of n = 39 Peruvians who moved from rural, high-altitude areas to urban, lower-altitude areas (Peru data). The variables in this dataset are:

\(Y\) = systolic blood pressure
\(X_{1}\) = age
\(X_{2}\) = years in urban area
\(X_{3}\) = \(X_{2}/X_{1}\) = fraction of life in urban area
\(X_{4}\) = weight (kg)
\(X_{5}\) = height (mm)
\(X_{6}\) = chin skinfold
\(X_{7}\) = forearm skinfold
\(X_{8}\) = calf skinfold
\(X_{9}\) = resting pulse rate

First, we run a multiple regression using all nine x-variables as predictors. The results are given below.

Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 9 4358.85 484.32 6.46 0.000
Error 29 2172.58 74.92
Total 38 6531.44

Model Summary
S R-sq R-sq(adj) R-sq(pred)
8.65544 66.74% 56.41% 34.45%

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 146.8 49.0 3.00 0.006
Age -1.121 0.327 -3.43 0.002 3.21
Years 2.455 0.815 3.01 0.005 34.29
FracLife -115.3 30.2 -3.82 0.001 24.39
Weight 1.414 0.431 3.28 0.003 4.75
Height -0.0346 0.0369 -0.94 0.355 1.91
Chin -0.944 0.741 -1.27 0.213 2.06
Forearm -1.17 1.19 -0.98 0.335 3.80
Calf -0.159 0.537 -0.30 0.770 2.41
Pulse 0.115 0.170 0.67 0.507 1.33

Looking at the tests for individual variables, we see that the p-values for Height, Chin, Forearm, Calf, and Pulse are not statistically significant. These individual tests are affected by correlations among the x-variables, so we will use the general linear F-test to see whether it is reasonable to drop all five non-significant variables from the model at once.
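As a quick check on the individual tests, each T-Value in the Coefficients table is the coefficient divided by its standard error. The Python sketch below recomputes these ratios from the rounded values in the output and flags terms whose |t| falls below 2.045, the two-sided 5% critical value of a t distribution with 29 error degrees of freedom (taken from a standard t table). The dictionary and variable names are illustrative and not part of the Minitab output, and small discrepancies in the last digit are expected because the inputs are rounded.

```python
# (coefficient, standard error) pairs copied from the full-model output
full_model = {
    "Age": (-1.121, 0.327),
    "Years": (2.455, 0.815),
    "FracLife": (-115.3, 30.2),
    "Weight": (1.414, 0.431),
    "Height": (-0.0346, 0.0369),
    "Chin": (-0.944, 0.741),
    "Forearm": (-1.17, 1.19),
    "Calf": (-0.159, 0.537),
    "Pulse": (0.115, 0.170),
}

# Two-sided 5% critical value for t with 29 df (assumed from a t table)
CRITICAL_T = 2.045

# t statistic for each term: coefficient / standard error
t_values = {term: coef / se for term, (coef, se) in full_model.items()}

# Terms whose individual t-test is not significant at the 5% level
not_significant = [term for term, t in t_values.items() if abs(t) < CRITICAL_T]

for term, t in t_values.items():
    print(f"{term:8s} t = {t:6.2f}")
print("Not significant:", not_significant)
```

The flagged terms should be Height, Chin, Forearm, Calf, and Pulse, which are the five variables considered for removal.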

Next, consider testing:

\(H_{0} \colon \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0\)
\(H_{A} \colon\) at least one of \(\beta_5 , \beta_6 , \beta_7, \beta_8 , \beta_9 \ne 0\)

within the nine-variable model given above. If we fail to reject this null hypothesis, it is reasonable to conclude that none of the five variables Height, Chin, Forearm, Calf, and Pulse adds to the prediction or explanation of systolic blood pressure once Age, Years, FracLife, and Weight are in the model.

The full model includes all nine variables; SSE(full) = 2172.58, the full error df = 29, and MSE(full) = 74.92 (we get these from the Minitab results above). The reduced model includes only the variables Age, Years, FracLife, and Weight (the variables that remain when the five possibly non-significant variables are dropped). Regression results for the reduced model are given below.

Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 4 3901.7 975.43 12.61 0.000
Error 34 2629.7 77.34
Total 38 6531.4

Model Summary
S R-sq R-sq(adj) R-sq(pred)
8.79456 59.74% 55.00% 44.84%

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 116.8 22.0 5.32 0.000
Age -0.951 0.316 -3.00 0.005 2.91
Years 2.339 0.771 3.03 0.005 29.79
FracLife -108.1 28.3 -3.81 0.001 20.83
Weight 0.832 0.275 3.02 0.005 1.88

We see that SSE(reduced) = 2629.7, and the reduced error df = 34. We also see that all four individual x-variables are statistically significant.
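The Model Summary values for the two fits can be recovered from the ANOVA tables, which also shows what dropping the five predictors costs. The sketch below assumes the usual definitions R-sq = 1 − SSE/SSTO and R-sq(adj) = 1 − MSE/(SSTO/(n − 1)); the function names are illustrative.

```python
def r_squared(sse, ssto):
    """Fraction of total variation explained by the model."""
    return 1 - sse / ssto


def adjusted_r_squared(sse, df_error, ssto, df_total):
    """R-squared penalized for the number of predictors."""
    return 1 - (sse / df_error) / (ssto / df_total)


# Values copied from the two ANOVA tables (Error and Total rows)
full = {"sse": 2172.58, "df_error": 29, "ssto": 6531.44, "df_total": 38}
reduced = {"sse": 2629.7, "df_error": 34, "ssto": 6531.4, "df_total": 38}

for name, m in (("full", full), ("reduced", reduced)):
    r2 = r_squared(m["sse"], m["ssto"])
    r2_adj = adjusted_r_squared(m["sse"], m["df_error"], m["ssto"], m["df_total"])
    print(f"{name:8s} R-sq = {100 * r2:.2f}%  R-sq(adj) = {100 * r2_adj:.2f}%")
```

R-sq falls from about 66.7% to 59.7% when the five predictors are dropped, while R-sq(adj) moves only from about 56.4% to 55.0%.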

The calculation for the general linear F-test statistic is:

\(F=\dfrac{\frac{\text{SSE(reduced) - SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\dfrac{\frac{2629.7-2172.58}{34-29}}{74.92}=1.220\)

Under the null hypothesis, this test statistic follows an \(F_{5,29}\) distribution, and the associated p-value is 0.325 (found using Calc >> Probability Distribution >> F in Minitab). This is not statistically significant, so we do not reject the null hypothesis. Thus it is reasonable to drop the variables \(X_{5}\), \(X_{6}\), \(X_{7}\), \(X_{8}\), and \(X_{9}\) from the model.
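For readers working outside Minitab, the same calculation can be sketched in Python using only the standard library. The function names here are illustrative, and the upper-tail area is approximated by integrating the F density with Simpson's rule, which assumes a numerator df of at least 2; it is a rough stand-in for a statistics package, not Minitab's method. With SciPy installed, `scipy.stats.f.sf(f_stat, df_num, df_den)` should give the same tail area.

```python
import math


def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """Return the general linear F statistic with its numerator and denominator df."""
    df_num = df_reduced - df_full
    mse_full = sse_full / df_full
    f_stat = ((sse_reduced - sse_full) / df_num) / mse_full
    return f_stat, df_num, df_full


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_const = 0.5 * d1 * math.log(d1 / d2) - log_beta
    return (math.exp(log_const)
            * x ** (0.5 * d1 - 1)
            * (1 + d1 * x / d2) ** (-0.5 * (d1 + d2)))


def f_upper_tail(f_stat, d1, d2, n=10000):
    """Approximate P(F > f_stat) as 1 minus a Simpson's-rule integral over [0, f_stat]."""
    if d1 < 2:
        raise ValueError("this sketch assumes numerator df >= 2")
    h = f_stat / n  # n must be even for Simpson's rule
    total = f_pdf(0.0, d1, d2) + f_pdf(f_stat, d1, d2)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3


# SSE and error df for the reduced and full Peru models, from the output above
f_stat, df_num, df_den = general_linear_f(2629.7, 34, 2172.58, 29)
p_value = f_upper_tail(f_stat, df_num, df_den)
print(f"F = {f_stat:.3f} on ({df_num}, {df_den}) df, p = {p_value:.3f}")
```

For the Peru example this should reproduce F ≈ 1.22 on (5, 29) degrees of freedom, with an upper-tail area close to the 0.325 reported by Minitab.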

Example 6-5: Measurements of College Students

For n = 55 college students, we have measurements (Physical dataset) for the following five variables:

\(Y\) = height (in)
\(X_{1}\) = left forearm length (cm)
\(X_{2}\) = left foot length (cm)
\(X_{3}\) = head circumference (cm)
\(X_{4}\) = nose length (cm)

The Minitab output for the full model is given below.

Coefficients
Term Coef SE Coef 95% CI T-Value P-Value VIF
Constant 18.50 7.83 ( 2.78, 34.23) 2.36 0.022  
LeftArm 0.802 0.171 ( 0.459, 1.145) 4.70 0.000 1.63
LeftFoot 0.997 0.162 ( 0.671, 1.323) 6.14 0.000 1.28
HeadCirc 0.081 0.150 (-0.220, 0.381) 0.54 0.593 1.28
nose -0.147 0.492 (-1.136, 0.841) -0.30 0.766 1.14
Regression Equation

Height = 18.50 + 0.802 LeftArm + 0.997 LeftFoot + 0.081 HeadCirc - 0.147 nose

The output also provides t-test results, which we interpret as follows:

  • The sample coefficients for LeftArm and LeftFoot achieve statistical significance. This indicates that they are useful as predictors of Height.
  • The sample coefficients for HeadCirc and nose are not significant. Each t-test considers the question of whether the variable is needed, given that all other variables will remain in the model.

A plot of the residuals versus the fitted values is shown below, and it looks acceptable.

Residuals Versus the Fitted Values plot

There is no obvious curvature and the variance is reasonably constant. One may note two possible outliers, but nothing serious.

The first calculation we will perform is for the general linear F-test. The results above might lead us to test

\(H_{0} \colon \beta_3 = \beta_4 = 0\)
\(H_{A} \colon\) at least one of \(\left( \beta_3 , \beta_4 \right) \ne 0\)

in the full model. If we fail to reject the null hypothesis, we could then remove both HeadCirc and nose as predictors.

Below is the ANOVA table for the full model.

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 4 816.39 204.098 42.81 0.000
LeftArm 1 590.21 590.214 123.81 0.000
LeftFoot 1 224.35 224.349 47.06 0.000
HeadCirc 1 1.40 1.402 0.29 0.590
nose 1 0.43 0.427 0.09 0.766
Error 50 238.35 4.767
Total 54 1054.75

From this output, we see that SSE(full) = 238.35, with df = 50, and MSE(full) = 4.77. The reduced model includes only the two variables LeftArm and LeftFoot as predictors. The ANOVA results for the reduced model are found below.

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 2 814.56 407.281 88.18 0.000
LeftArm 1 590.21 590.214 127.78 0.000
LeftFoot 1 224.35 224.349 48.57 0.000
Error 52 240.18 4.619
Lack-of-Fit 44 175.14 3.980 0.49 0.937
Pure Error 8 65.04 8.130
Total 54 1054.75

From this output, we see that SSE(reduced) = SSE\(\left( X_{1} , X_{2}\right)\) = 240.18, with df = 52, and MSE(reduced) = MSE\(\left(X_{1}, X_{2}\right) = 4.62\).

With these values obtained, we can now obtain the test statistic for testing \(H_{0} \colon \beta_3 = \beta_4 = 0\):

\(F=\dfrac{\frac{\text{SSE}(X_1, X_2) - \text{SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\dfrac{\frac{240.18-238.35}{52-50}}{4.77}=0.192\)

This value comes from an \(F_{2,50}\) distribution. By using Calc >> Probability Distribution >> F in Minitab, we learn that the area to the left of F = 0.192 (with df of 2 and 50) is 0.174. The p-value is the area to the right of F, so p = 1 − 0.174 = 0.826. Thus, we do not reject the null hypothesis and it is reasonable to remove HeadCirc and nose from the model.
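Because HeadCirc and nose are the last two terms entered, the numerator sum of squares can also be read from the sequential sums of squares in the full-model ANOVA table, without fitting the reduced model. As a check, up to rounding:

\begin{align}SSR(X_3,X_4|X_1,X_2)&=SSR(X_3|X_1,X_2)+SSR(X_4|X_1,X_2,X_3)\\&=1.40+0.43=1.83=240.18-238.35\end{align}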

Next, we ask: what fraction of the variation in Y = Height that cannot be explained by \(X_{2}\) = LeftFoot can be explained by \(X_{1}\) = LeftArm? To answer this question, we calculate the partial \(R^{2}\). The formula is:

\(R_{Y, 1|2}^{2}=\dfrac{SSR(X_1|X_2)}{SSE(X_2)}=\dfrac{SSE(X_2)-SSE(X_1,X_2)}{SSE(X_2)}\)

The denominator, SSE\(\left(X_{2}\right)\), measures the unexplained variation in Y when \(X_{2}\) is the only predictor. The ANOVA table for this regression is found below.

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 1 707.4 707.420 107.95 0.000
LeftFoot 1 707.4 707.420 107.95 0.000
Error 53 347.3 6.553    
Lack-of-Fit 19 113.0 5.948 0.86 0.625
Pure Error 34 234.3 6.892    
Total 54 1054.7      

These results give us SSE\(\left(X_{2}\right)\) = 347.3.

The numerator, SSE\(\left(X_{2}\right)\) − SSE\(\left(X_{1}, X_{2}\right)\), measures the further reduction in the SSE when \(X_{1}\) is added to the model. Results from the earlier Minitab output give us SSE\(\left(X_{1} , X_{2}\right)\) = 240.18 and now we can calculate:

\begin{align}R_{Y, 1|2}^{2}&=\dfrac{SSR(X_1|X_2)}{SSE(X_2)}=\dfrac{SSE(X_2)-SSE(X_1,X_2)}{SSE(X_2)}\\&=\dfrac{347.3-240.18}{347.3}=0.308\end{align}

Thus \(X_{1}\) = LeftArm explains 30.8% of the variation in Y = Height that could not be explained by \(X_{2}\) = LeftFoot.
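The partial \(R^{2}\) calculation is a one-line function of two error sums of squares. The sketch below uses a hypothetical helper name, `partial_r_squared`, and applies it first to the LeftArm calculation above and then, as an extra illustration, to HeadCirc and nose given LeftArm and LeftFoot.

```python
def partial_r_squared(sse_without, sse_with):
    """Share of the variation left unexplained by the smaller model
    that is explained once the extra predictor(s) are added."""
    return (sse_without - sse_with) / sse_without


# LeftArm given LeftFoot: SSE(X2) = 347.3, SSE(X1, X2) = 240.18
r2_arm_given_foot = partial_r_squared(347.3, 240.18)
print(f"{r2_arm_given_foot:.3f}")  # 0.308

# HeadCirc and nose given LeftArm and LeftFoot:
# SSE(X1, X2) = 240.18, SSE(full) = 238.35
r2_head_nose = partial_r_squared(240.18, 238.35)
print(f"{r2_head_nose:.4f}")
```

The second value is under 1%, which is consistent with the F-test conclusion that HeadCirc and nose add little once LeftArm and LeftFoot are in the model.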