12.2 - Uncorrelated Predictors

In order to get a handle on this multicollinearity thing, let's first investigate the effects that uncorrelated predictors have on regression analyses. To do so, we'll investigate a "contrived" data set, in which the predictors are perfectly uncorrelated. Then, we'll investigate a second example of a "real" data set, in which the predictors are nearly uncorrelated. Our two investigations will allow us to summarize the effects that uncorrelated predictors have on regression analyses.

Then, on the next page, we'll investigate the effects that highly correlated predictors have on regression analyses. In doing so, we'll learn — and therefore be able to summarize — the various effects multicollinearity has on regression analyses.

What is the effect on regression analyses if the predictors are perfectly uncorrelated?

Consider the following matrix plot of the response y and two predictors \(x_{1}\) and \(x_{2}\), of a contrived data set (Uncorrelated Predictors data set), in which the predictors are perfectly uncorrelated:

[Figure: matrix plot of y, \(x_{1}\), and \(x_{2}\)]

As you can see there is no apparent relationship at all between the predictors \(x_{1}\) and \(x_{2}\). That is, the correlation between \(x_{1}\) and \(x_{2}\) is zero:

Pearson correlation of x1 and x2 = 0.000

suggesting the two predictors are perfectly uncorrelated.
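A zero correlation like this is what a balanced design produces. The sketch below is only an illustration: the levels (2, 4) and (5, 7) and the helper name `pearson_r` are assumptions made here, not the contents of the course data file.

```python
import numpy as np

# Hypothetical balanced 2x2 design with two replicates per cell.
# The levels are illustrative assumptions, not the course data file.
x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)

def pearson_r(a, b):
    """Sample Pearson correlation computed from centered cross-products."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

r = pearson_r(x1, x2)
print(f"Pearson correlation of x1 and x2 = {r:.3f}")
```

Because every level of x1 appears equally often with every level of x2, the centered cross-products cancel and the correlation is zero.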

Now, let's just proceed quickly through the output of a series of regression analyses collecting various pieces of information along the way. When we're done, we'll review what we learned by collating the various items in a summary table.

The regression of the response y on the predictor \(x_{1}\):

Regression Analysis: y versus x1

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 1 8.000 8.000 0.46 0.522
x1 1 8.000 8.000 0.46 0.522
Error 6 104.000 17.333    
Total 7 112.000      
Model Summary
S R-sq R-sq(adj) R-sq(pred)
4.16333 7.14% 0.00% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 49.50 4.65 10.63 0.000  
x1 -1.00 1.47 -0.68 0.522 1.00
Regression Equation

y = 49.50 - 1.00 x1

yields the estimated coefficient \(b_{1}\) = -1.00, the standard error se(\(b_{1}\)) = 1.47, and the regression sum of squares SSR(\(x_{1}\)) = 8.000.
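As a quick arithmetic check, the other entries in this output follow from the numbers printed above:

\[
t = \frac{b_{1}}{\text{se}(b_{1})} = \frac{-1.00}{1.47} \approx -0.68, \qquad
F = \frac{MSR}{MSE} = \frac{8.000}{17.333} \approx 0.46, \qquad
R^{2} = \frac{SSR(x_{1})}{SSTO} = \frac{8.000}{112.000} \approx 7.14\%
\]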

The regression of the response y on the predictor \(x_{2}\):

Regression Analysis: y versus x2

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 1 24.50 24.50 1.68 0.243
x2 1 24.50 24.50 1.68 0.243
Error 6 87.50 14.58    
Total 7 112.000      
Model Summary
S R-sq R-sq(adj) R-sq(pred)
3.81881 21.88% 8.85% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 57.00 8.21 6.94 0.000  
x2 -1.75 1.35 -1.30 0.243 1.00
Regression Equation

y = 57.00 - 1.75 x2

yields the estimated coefficient \(b_{2}\) = -1.75, the standard error se(\(b_{2}\)) = 1.35, and the regression sum of squares SSR(\(x_{2}\)) = 24.50.

The regression of the response y on the predictors \(x_{1 }\) and \(x_{2}\) (in that order):

Regression Analysis: y versus x1, x2

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 2 32.500 16.250 1.02 0.424
x1 1 8.000 8.000 0.50 0.510
x2 1 24.500 24.500 1.54 0.270
Error 5 79.500 15.900    
Lack-of-Fit 1 24.500 24.500 1.78 0.253
Pure Error 4 55.000 13.750    
Total 7 112.000      
Model Summary
S R-sq R-sq(adj) R-sq(pred)
3.98748 29.02% 0.62% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 60.000 9.56 6.28 0.002  
x1 -1.00 1.41 -0.71 0.510 1.00
x2 -1.75 1.41 -1.24 0.270 1.00
Regression Equation

y = 60.00 - 1.00 x1 - 1.75 x2

 

yields the estimated coefficients \(b_{1}\) = -1.00 and \(b_{2}\) = -1.75, the standard errors se(\(b_{1}\)) = 1.41 and se(\(b_{2}\)) = 1.41, and the sequential sum of squares SSR(\(x_{2}\)|\(x_{1}\)) = 24.500.

The regression of the response y on the predictors \(x_{2 }\) and \(x_{1}\) (in that order):

Regression Analysis: y versus x2, x1

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 2 32.500 16.250 1.02 0.424
x2 1 24.500 24.500 1.54 0.270
x1 1 8.000 8.000 0.50 0.510
Error 5 79.500 15.900    
Lack-of-Fit 1 24.500 24.500 1.78 0.253
Pure Error 4 55.000 13.750    
Total 7 112.000      
Model Summary
S R-sq R-sq(adj) R-sq(pred)
3.98748 29.02% 0.62% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 60.000 9.56 6.28 0.002  
x2 -1.75 1.41 -1.24 0.270 1.00
x1 -1.00 1.41 -0.71 0.510 1.00
Regression Equation

y = 60.00 - 1.75 x2 - 1.00 x1

 

yields the estimated coefficients \(b_{1}\) = -1.00 and \(b_{2}\) = -1.75, the standard errors se(\(b_{1}\)) = 1.41 and se(\(b_{2}\)) = 1.41, and the sequential sum of squares SSR(\(x_{1}\)|\(x_{2}\)) = 8.000.
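Each sequential sum of squares is the drop in the error sum of squares when the second predictor is added. Using the Error rows of the outputs above:

\[
SSR(x_{2} \mid x_{1}) = SSE(x_{1}) - SSE(x_{1}, x_{2}) = 104.000 - 79.500 = 24.500
\]

\[
SSR(x_{1} \mid x_{2}) = SSE(x_{2}) - SSE(x_{1}, x_{2}) = 87.50 - 79.500 = 8.000
\]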

Okay — as promised — compiling the results in a summary table, we obtain:

Model \(b_{1}\) se(\(b_{1}\)) \(b_{2}\) se(\(b_{2}\)) Seq SS
\(x_{1}\) only -1.00 1.47 --- --- SSR(\(x_{1}\)) = 8.000
\(x_{2}\) only --- --- -1.75 1.35 SSR(\(x_{2}\)) = 24.50
\(x_{1}\), \(x_{2}\) (in order) -1.00 1.41 -1.75 1.41 SSR(\(x_{2}\)|\(x_{1}\)) = 24.500
\(x_{2}\), \(x_{1}\) (in order) -1.00 1.41 -1.75 1.41 SSR(\(x_{1}\)|\(x_{2}\)) = 8.000

What do we observe?

  • The estimated slope coefficients \(b_{1}\) and \(b_{2}\) are the same regardless of the model used.
  • The standard errors se(\(b_{1}\)) and se(\(b_{2}\)) don't change much at all from model to model.
  • The sum of squares SSR(\(x_{1}\)) is the same as the sequential sum of squares SSR(\(x_{1}\)|\(x_{2}\)). The sum of squares SSR(\(x_{2}\)) is the same as the sequential sum of squares SSR(\(x_{2}\)|\(x_{1}\)).
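The small changes in the standard errors can be traced to the mean squared error alone. In simple regression, \(SSR(x_{1}) = b_{1}^{2} S_{x_{1}x_{1}}\) and \(\text{se}(b_{1}) = \sqrt{MSE / S_{x_{1}x_{1}}}\), where \(S_{x_{1}x_{1}}\) is the sum of squared deviations of \(x_{1}\) about its mean. With uncorrelated predictors, the same formula for the standard error holds in the two-predictor model. The value \(S_{x_{1}x_{1}} = 8\) below is inferred from the output, not read from the data file:

\[
S_{x_{1}x_{1}} = \frac{SSR(x_{1})}{b_{1}^{2}} = \frac{8.000}{(-1.00)^{2}} = 8, \qquad
\sqrt{\frac{17.333}{8}} \approx 1.47, \qquad
\sqrt{\frac{15.900}{8}} \approx 1.41
\]

\[
S_{x_{2}x_{2}} = \frac{SSR(x_{2})}{b_{2}^{2}} = \frac{24.50}{(-1.75)^{2}} = 8, \qquad
\sqrt{\frac{14.58}{8}} \approx 1.35, \qquad
\sqrt{\frac{15.900}{8}} \approx 1.41
\]

So only the MSE (17.333, 14.58, or 15.900) moves from model to model.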

These all seem to be good things! Because the slope estimates stay the same, the effect on the response ascribed to a predictor doesn't depend on the other predictors in the model. Because SSR(\(x_{1}\)) = SSR(\(x_{1}\)|\(x_{2}\)), the marginal contribution that \(x_{1}\) has in reducing the variability in the response y doesn't depend on the predictor \(x_{2}\). Similarly, because SSR(\(x_{2}\)) = SSR(\(x_{2}\)|\(x_{1}\)), the marginal contribution that \(x_{2}\) has in reducing the variability in the response y doesn't depend on the predictor \(x_{1}\).
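These identities can be checked numerically. The following is a minimal sketch, assuming NumPy is available. The design levels and the responses `y` are made up for illustration and are not the course data, and `fit` is a hypothetical helper defined here.

```python
import numpy as np

# Hypothetical balanced design (perfectly uncorrelated predictors).
x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
# Made-up responses, for illustration only.
y = np.array([51, 49, 46, 44, 47, 50, 42, 43], dtype=float)

def fit(y, *predictors):
    """Least-squares fit with an intercept; returns (coefficients, SSR)."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    ssr = float(np.sum((fitted - y.mean()) ** 2))
    return coef, ssr

coef_1, ssr_1 = fit(y, x1)        # y on x1 only
coef_2, ssr_2 = fit(y, x2)        # y on x2 only
coef_12, ssr_12 = fit(y, x1, x2)  # y on x1 and x2

print("b1 alone / with x2:", coef_1[1], coef_12[1])
print("b2 alone / with x1:", coef_2[1], coef_12[2])
print("SSR(x1) vs SSR(x1|x2):", ssr_1, ssr_12 - ssr_2)
print("SSR(x2) vs SSR(x2|x1):", ssr_2, ssr_12 - ssr_1)
```

The sequential sums of squares are computed as differences in regression sums of squares, which is equivalent to differences in error sums of squares because the total sum of squares is fixed.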

These are the things we can hope for in a regression analysis — but, then reality sets in! Recall that we obtained the above results for a contrived data set, in which the predictors are perfectly uncorrelated. Do we get similar results for real data with only nearly uncorrelated predictors? Let's see!

What is the effect on regression analyses if the predictors are nearly uncorrelated?

To investigate this question, let's go back and take a look at the blood pressure data set (Blood Pressure data set). In particular, let's focus on the relationships among the response y = BP and the predictors \(x_{3}\) = BSA and \(x_{6}\) = Stress:

[Figure: matrix plot of BP, BSA, and Stress]

As the above matrix plot and the following correlation matrix suggest:

Correlation: BP, Age, Weight, BSA, Dur, Pulse, Stress
  BP Age Weight BSA Dur Pulse
Age 0.659          
Weight 0.950 0.407        
BSA 0.866 0.378 0.875      
Dur 0.293 0.344 0.201 0.131    
Pulse 0.721 0.619 0.659 0.465 0.402  
Stress 0.164 0.368 0.034 0.018 0.312 0.506

there appears to be a strong relationship between y = BP and the predictor \(x_{3}\) = BSA (r = 0.866), a weak relationship between y = BP and \(x_{6}\) = Stress (r = 0.164), and an almost non-existent relationship between \(x_{3}\) = BSA and \(x_{6}\) = Stress (r = 0.018). That is, the two predictors are nearly perfectly uncorrelated.
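The VIF column in the Minitab output tells the same story. For a model with exactly two predictors, the variance inflation factor of either predictor depends only on the correlation between them, so with r = 0.018 for BSA and Stress:

\[
VIF = \frac{1}{1 - r^{2}} = \frac{1}{1 - 0.018^{2}} \approx 1.0003
\]

which rounds to the 1.00 that Minitab reports for the two-predictor fits.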

What effect do these nearly perfectly uncorrelated predictors have on regression analyses? Let's proceed similarly through the output of a series of regression analyses collecting various pieces of information along the way. When we're done, we'll review what we learned by collating the various items in a summary table.

The regression of the response y = BP on the predictor \(x_{6}\)= Stress:

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 1 15.04 15.04 0.50 0.490
Stress 1 15.04 15.04 0.50 0.490
Error 18 544.96 30.28    
   Lack-of-Fit 14 457.79 32.70 1.50 0.374
   Pure Error 4 87.17 21.79    
Total 19 560.00      
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 112.72 2.19 51.39 0.000  
Stress 0.0240 0.0340 0.70 0.490 1.00
Regression Equation

BP = 112.72 + 0.0240 Stress

yields the estimated coefficient \(b_{6}\) = 0.0240, the standard error se(\(b_{6}\)) = 0.0340, and the regression sum of squares SSR(\(x_{6}\)) = 15.04.

The regression of the response y = BP on the predictor \(x_{3 }\)= BSA:

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 1 419.858 419.858 53.93 0.000
BSA 1 419.858 419.858 53.93 0.000
Error 18 140.142 7.786    
   Lack-of-Fit 13 133.642 10.280 7.91 0.016
   Pure Error 5 6.500 1.300    
Total 19 560.000      
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 45.18 9.39 4.81 0.000  
BSA 34.44 4.69 7.34 0.000 1.00
Regression Equation

BP = 45.18 + 34.44 BSA

yields the estimated coefficient \(b_{3}\) = 34.44, the standard error se(\(b_{3}\)) = 4.69, and the regression sum of squares SSR(\(x_{3}\)) = 419.858.

The regression of the response y = BP on the predictors \(x_{6}\)= Stress and \(x_{3}\)= BSA (in that order):

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 2 432.12 216.058 28.72 0.000
Stress 1 15.04 15.044 2.00 0.175
BSA 1 417.07 417.073 55.44 0.000
Error 17 127.88 7.523    
Total 19 560.00      
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 44.24 9.26 4.78 0.000  
Stress 0.0217 0.0170 1.28 0.219 1.00
BSA 34.33 4.61 7.45 0.000 1.00
Regression Equation

BP = 44.24 + 0.0217 Stress + 34.33 BSA

yields the estimated coefficients \(b_{6}\) = 0.0217 and \(b_{3}\) = 34.33, the standard errors se(\(b_{6}\)) = 0.0170 and se(\(b_{3}\)) = 4.61, and the sequential sum of squares SSR(\(x_{3}\)|\(x_{6}\)) = 417.07.

Finally, the regression of the response y = BP on the predictors \(x_{3 }\)= BSA and \(x_{6}\)= Stress (in that order):

Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 2 432.12 216.058 28.72 0.000
BSA 1 419.86 419.858 55.81 0.000
Stress 1 12.26 12.259 1.63 0.219
Error 17 127.88 7.523    
Total 19 560.00      
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 44.24 9.26 4.78 0.000  
BSA 34.33 4.61 7.45 0.000 1.00
Stress 0.0217 0.0170 1.28 0.219 1.00
Regression Equation

BP = 44.24 + 34.33 BSA + 0.0217 Stress

yields the estimated coefficients \(b_{6}\) = 0.0217 and \(b_{3}\) = 34.33, the standard errors se(\(b_{6}\)) = 0.0170 and se(\(b_{3}\)) = 4.61, and the sequential sum of squares SSR(\(x_{6}\)|\(x_{3}\)) = 12.26.
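As with the contrived data, each sequential sum of squares is the drop in the error sum of squares when the second predictor is added. Using SSE = 544.96 (Stress only), SSE = 140.142 (BSA only), and SSE = 127.88 (both predictors, from the Stress, BSA output):

\[
SSR(x_{6} \mid x_{3}) = SSE(x_{3}) - SSE(x_{3}, x_{6}) = 140.142 - 127.88 \approx 12.26
\]

\[
SSR(x_{3} \mid x_{6}) = SSE(x_{6}) - SSE(x_{3}, x_{6}) = 544.96 - 127.88 = 417.08
\]

The second value agrees with the reported 417.07 up to rounding in the printed output.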

Again — as promised — compiling the results in a summary table, we obtain:

Model \(b_{6}\) se(\(b_{6}\)) \(b_{3}\) se(\(b_{3}\)) Seq SS
\(x_{6}\) only 0.0240 0.0340 --- --- SSR(\(x_{6}\)) = 15.04
\(x_{3}\) only --- --- 34.44 4.69 SSR(\(x_{3}\)) = 419.858
\(x_{6}\), \(x_{3}\) (in order) 0.0217 0.0170 34.33 4.61 SSR(\(x_{3}\)|\(x_{6}\)) = 417.07
\(x_{3}\), \(x_{6}\) (in order) 0.0217 0.0170 34.33 4.61 SSR(\(x_{6}\)|\(x_{3}\)) = 12.26

What do we observe? If the predictors are nearly perfectly uncorrelated:

  • The slope estimates \(b_{3}\) and \(b_{6}\) are not identical across models, but they are very similar: \(b_{3}\) is 34.44 alone versus 34.33 with Stress in the model, and \(b_{6}\) is 0.0240 alone versus 0.0217 with BSA in the model.
  • The sum of squares SSR(\(x_{3}\)) = 419.858 is not the same as, but is very similar to, the sequential sum of squares SSR(\(x_{3}\)|\(x_{6}\)) = 417.07.
  • The sum of squares SSR(\(x_{6}\)) = 15.04 is not the same as, but is similar to, the sequential sum of squares SSR(\(x_{6}\)|\(x_{3}\)) = 12.26.

Again, these are all good things! In short, the effect on the response ascribed to a predictor is similar regardless of the other predictors in the model. And, the marginal contribution of a predictor doesn't appear to depend much on the other predictors in the model.
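The same behavior can be reproduced on a small made-up example. In the sketch below, a hypothetical balanced design is nudged slightly so that the predictors are nearly, but not exactly, uncorrelated. All data values and the helper name `slopes` are illustrative assumptions, not the blood pressure data.

```python
import numpy as np

x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
# Last x2 value nudged from 7 to 7.2, so the design is no longer balanced.
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7.2], dtype=float)
# Made-up responses, for illustration only.
y = np.array([51, 49, 46, 44, 47, 50, 42, 43], dtype=float)

def slopes(y, *predictors):
    """Least-squares slope estimates (intercept fitted but not returned)."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

r = np.corrcoef(x1, x2)[0, 1]
b1_alone = slopes(y, x1)[0]
b2_alone = slopes(y, x2)[0]
b1_joint, b2_joint = slopes(y, x1, x2)

print(f"correlation of x1 and x2: {r:.4f}")
print(f"b1 alone {b1_alone:.4f}, b1 with x2 {b1_joint:.4f}")
print(f"b2 alone {b2_alone:.4f}, b2 with x1 {b2_joint:.4f}")
```

The slope estimates shift a little when the other predictor enters the model, but far less than they would if the predictors were strongly correlated.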

Try it!

Uncorrelated predictors

Effect of perfectly uncorrelated predictor variables.

This exercise reviews the benefits of having perfectly uncorrelated predictor variables. The results of this exercise demonstrate a strong argument for conducting "designed experiments" in which the researcher sets the levels of the predictor variables in advance, as opposed to conducting an "observational study" in which the researcher merely observes the levels of the predictor variables as they happen. Unfortunately, many regression analyses are conducted on observational data rather than experimental data, limiting the strength of the conclusions that can be drawn from the data. As this exercise demonstrates, you should conduct an experiment, whenever possible, not an observational study. Use the (contrived) data stored in the Uncorrelated Predictors data set to complete this lab exercise.

  1. Using the Stat >> Basic Statistics >> Correlation... command in Minitab, calculate the correlation coefficient between \(X_{1}\) and \(X_{2}\). Are the two variables perfectly uncorrelated?
    Correlation = 0, so yes, the two variables are perfectly uncorrelated.
  2. Fit the simple linear regression model with y as the response and \(x_{1}\) as the single predictor:

    • What is the value of the estimated slope coefficient \(b_{1}\)?
    • What is the regression sum of squares, SSR (\(X_{1}\)), when \(x_{1}\) is the only predictor in the model?

    Estimated slope coefficient \(b_1 = -5.80\)

    \(SSR(X_1) = 336.40\)

  3. Now, fit the simple linear regression model with y as the response and \(x_{2}\) as the single predictor:

    • What is the value of the estimated slope coefficient \(b_{2}\)?
    • What is the regression sum of squares, SSR (\(X_{2}\)), when \(x_{2}\) is the only predictor in the model?

    Estimated slope coefficient \(b_2 = 1.36\).

    \(SSR(X_2) = 206.2\).

  4. Now, fit the multiple linear regression model with y as the response and \(x_{1}\) as the first predictor and \(x_{2}\) as the second predictor:

    • What is the value of the estimated slope coefficient \(b_{1}\)? Is the estimate \(b_{1}\) different than that obtained when \(x_{1}\) was the only predictor in the model?
    • What is the value of the estimated slope coefficient \(b_{2}\)? Is the estimate \(b_{2}\) different than that obtained when \(x_{2}\) was the only predictor in the model?
    • What is the sequential sum of squares, SSR (\(X_{2}\)|\(X_{1}\))? Does the reduction in the error sum of squares when \(x_{2}\) is added to the model depend on whether \(x_{1}\) is already in the model?

    Estimated slope coefficient \(b_1 = -5.80\), the same as before.

    Estimated slope coefficient \(b_2 = 1.36\), the same as before.

    \(SSR(X_2|X_1) = 206.2 = SSR(X_2)\), so this doesn’t depend on whether \(X_1\) is already in the model.

  5. Now, fit the multiple linear regression model with y as the response and \(x_{2}\) as the first predictor and \(x_{1}\) as the second predictor:

    • What is the sequential sum of squares, SSR (\(X_{1}\)|\(X_{2}\))? Does the reduction in the error sum of squares when \(x_{1}\) is added to the model depend on whether \(x_{2}\) is already in the model?

    \(SSR(X_1|X_2) = 336.4 = SSR(X_1)\), so this doesn’t depend on whether \(X_2\) is already in the model.

  6. When the predictor variables are perfectly uncorrelated, is it possible to quantify the effect a predictor has on the response without regard to the other predictors?

    Yes
  7. In what way does this exercise demonstrate the benefits of conducting a designed experiment rather than an observational study?

    It is possible to quantify the effect a predictor has on the response regardless of whether other (uncorrelated) predictors have been included.
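The link between designed experiments and uncorrelated predictors can be illustrated with a full factorial layout, in which the researcher runs every combination of the chosen levels equally often. The sketch below is a minimal example under assumed, hypothetical factor names and levels.

```python
import itertools
import numpy as np

# Hypothetical factor levels, fixed by the experimenter in advance.
levels = {
    "x1": [2, 4],
    "x2": [5, 7],
    "x3": [10, 20, 30],
}
replicates = 2

# Full factorial: every combination of levels, each run `replicates` times.
runs = [combo
        for combo in itertools.product(*levels.values())
        for _ in range(replicates)]
design = np.array(runs, dtype=float)

# Pairwise Pearson correlations between the predictor columns.
corr = np.corrcoef(design, rowvar=False)
print(np.round(corr, 3))
```

Every off-diagonal correlation is zero (up to floating-point error), which is the situation the exercise explores. Observational data rarely arrive this balanced.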