In order to get a handle on this multicollinearity thing, let's first investigate the effects that uncorrelated predictors have on regression analyses. To do so, we'll start with a "contrived" data set, in which the predictors are perfectly uncorrelated. Then, we'll investigate a second example, a "real" data set in which the predictors are nearly uncorrelated. Our two investigations will allow us to summarize the effects that uncorrelated predictors have on regression analyses.
Then, on the next page, we'll investigate the effects that highly correlated predictors have on regression analyses. In doing so, we'll learn — and therefore be able to summarize — the various effects multicollinearity has on regression analyses.
What is the effect on regression analyses if the predictors are perfectly uncorrelated?
Consider the following matrix plot of the response y and two predictors \(x_{1}\) and \(x_{2}\), of a contrived data set (Uncorrelated Predictors data set), in which the predictors are perfectly uncorrelated:
As you can see, there is no apparent relationship at all between the predictors \(x_{1}\) and \(x_{2}\). That is, the correlation between \(x_{1}\) and \(x_{2}\) is zero:
Pearson correlation of x1 and x2 = 0.000
suggesting the two predictors are perfectly uncorrelated.
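If you're following along outside of Minitab, the same check takes only a few lines of Python. Here's a minimal sketch, assuming the Uncorrelated Predictors data are saved in a file named uncorrelated_predictors.csv with columns y, x1, and x2 (the file and column names are assumptions):

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical file and column names for the Uncorrelated Predictors data set
df = pd.read_csv("uncorrelated_predictors.csv")

# Pearson correlation between the two predictors; for this contrived
# data set it should come out exactly 0
r, p = pearsonr(df["x1"], df["x2"])
print(f"Pearson correlation of x1 and x2 = {r:.3f}")
```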
Now, let's just proceed quickly through the output of a series of regression analyses, collecting various pieces of information along the way. When we're done, we'll review what we learned by collating the various items in a summary table.
The regression of the response y on the predictor \(x_{1}\):
Regression Analysis: y versus x1
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 21.13 | 21.125 | 2.36 | 0.176 |
x1 | 1 | 21.13 | 21.125 | 2.36 | 0.176 |
Error | 6 | 53.75 | 8.958 | ||
Total | 7 | 74.88 |
Model Summary
S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
2.99305 | 28.21% | 16.25% | 0.00% |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 52.75 | 3.35 | 15.76 | 0.000 | |
x1 | -1.62 | 1.06 | -1.54 | 0.176 | 1.00 |
Regression Equation
\(\widehat{y} = 52.75 - 1.62 x_{1}\)
yields the estimated coefficient \(b_{1}\) = -1.62, the standard error se(\(b_{1}\)) = 1.06, and the regression sum of squares SSR(\(x_{1}\)) = 21.13.
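If you want to pull these same three quantities out of a fit programmatically, here's a minimal statsmodels sketch (same hypothetical CSV as above). One caution on naming: statsmodels stores the regression sum of squares in the attribute `ess` (explained sum of squares) and uses `ssr` for the *residual* sum of squares, nearly the reverse of the SSR notation used on this page:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("uncorrelated_predictors.csv")  # hypothetical file name

# Simple linear regression of y on x1
fit = smf.ols("y ~ x1", data=df).fit()

print(fit.params["x1"])  # estimated slope b1 (-1.62 here)
print(fit.bse["x1"])     # standard error se(b1) (1.06 here)
print(fit.ess)           # regression sum of squares SSR(x1) (21.13 here)
```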
The regression of the response y on the predictor \(x_{2}\):
Regression Analysis: y versus x2
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 45.13 | 45.125 | 9.10 | 0.023 |
x2 | 1 | 45.13 | 45.125 | 9.10 | 0.023 |
Error | 6 | 29.75 | 4.958 | ||
Total | 7 | 74.88 |
Model Summary
S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
2.22673 | 60.27% | 53.64% | 29.36% |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 62.13 | 4.79 | 12.97 | 0.000 | |
x2 | -2.375 | 0.787 | -3.02 | 0.023 | 1.00 |
Regression Equation
\(\widehat{y} = 62.13 - 2.375 x_{2}\)
yields the estimated coefficient \(b_{2}\) = -2.375, the standard error se(\(b_{2}\)) = 0.787, and the regression sum of squares SSR(\(x_{2}\)) = 45.13.
The regression of the response y on the predictors \(x_{1}\) and \(x_{2}\) (in that order):
Regression Analysis: y versus x1, x2
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 2 | 66.250 | 33.125 | 19.20 | 0.005 |
x1 | 1 | 21.125 | 21.125 | 12.25 | 0.017 |
x2 | 1 | 45.125 | 45.125 | 26.16 | 0.004 |
Error | 5 | 8.625 | 1.725 | ||
Lack-of-Fit | 1 | 1.125 | 1.125 | 0.60 | 0.482 |
Pure Error | 4 | 7.500 | 1.875 | ||
Total | 7 | 74.875 |
Model Summary
S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
1.31339 | 88.48% | 83.87% | 70.51% |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 67.00 | 3.15 | 21.27 | 0.000 | |
x1 | -1.625 | 0.464 | -3.50 | 0.017 | 1.00 |
x2 | -2.375 | 0.464 | -5.11 | 0.004 | 1.00 |
Regression Equation
\(\widehat{y} = 67.00 - 1.625 x_{1} - 2.375 x_{2}\)
yields the estimated coefficients \(b_{1}\) = -1.625 and \(b_{2}\) = -2.375, the standard errors se(\(b_{1}\)) = 0.464 and se(\(b_{2}\)) = 0.464, and the sequential sum of squares SSR(\(x_{2}\)|\(x_{1}\)) = 45.125.
The regression of the response y on the predictors \(x_{2}\) and \(x_{1}\) (in that order):
Regression Analysis: y versus x2, x1
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 2 | 66.250 | 33.125 | 19.20 | 0.005 |
x2 | 1 | 45.125 | 45.125 | 26.16 | 0.004 |
x1 | 1 | 21.125 | 21.125 | 12.25 | 0.017 |
Error | 5 | 8.625 | 1.725 | ||
Lack-of-Fit | 1 | 1.125 | 1.125 | 0.60 | 0.482 |
Pure Error | 4 | 7.500 | 1.875 | ||
Total | 7 | 74.875 |
Model Summary
S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
1.31339 | 88.48% | 83.87% | 70.51% |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 67.00 | 3.15 | 21.27 | 0.000 | |
x2 | -2.375 | 0.464 | -5.11 | 0.004 | 1.00 |
x1 | -1.625 | 0.464 | -3.50 | 0.017 | 1.00 |
Regression Equation
\(\widehat{y} = 67.00 - 2.375 x_{2} - 1.625 x_{1}\)
yields the estimated coefficients \(b_{1}\) = -1.625 and \(b_{2}\) = -2.375, the standard errors se(\(b_{1}\)) = 0.464 and se(\(b_{2}\)) = 0.464, and the sequential sum of squares SSR(\(x_{1}\)|\(x_{2}\)) = 21.125.
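Both sequential decompositions can be reproduced in one step with a Type I (sequential) ANOVA, which charges each predictor for the sum of squares it adds given the terms listed before it. A sketch, using the same hypothetical CSV as above:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("uncorrelated_predictors.csv")  # hypothetical file name

# Same model, predictors listed in the two different orders
fit12 = smf.ols("y ~ x1 + x2", data=df).fit()
fit21 = smf.ols("y ~ x2 + x1", data=df).fit()

# anova_lm defaults to Type I (sequential) sums of squares
print(anova_lm(fit12))  # rows give SSR(x1) and SSR(x2 | x1)
print(anova_lm(fit21))  # rows give SSR(x2) and SSR(x1 | x2)
```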
Okay — as promised — compiling the results in a summary table, we obtain:
Model | \(b_{1}\) | se(\(b_{1}\)) | \(b_{2}\) | se(\(b_{2}\)) | Seq SS |
---|---|---|---|---|---|
\(x_{1}\) only | -1.62 | 1.06 | --- | --- | SSR(\(x_{1}\)) = 21.13 |
\(x_{2}\) only | --- | --- | -2.375 | 0.787 | SSR(\(x_{2}\)) = 45.13 |
\(x_{1}\), \(x_{2}\) (in order) | -1.625 | 0.464 | -2.375 | 0.464 | SSR(\(x_{2}\)\|\(x_{1}\)) = 45.125 |
\(x_{2}\), \(x_{1}\) (in order) | -1.625 | 0.464 | -2.375 | 0.464 | SSR(\(x_{1}\)\|\(x_{2}\)) = 21.125 |
What do we observe?
- The estimated slope coefficients \(b_{1}\) and \(b_{2}\) are the same regardless of the model used.
- The standard errors se(\(b_{1}\)) and se(\(b_{2}\)) shrink when both predictors are in the model, but only because the estimate of the error variance (MSE) shrinks; there is no variance inflation from including the other predictor (VIF = 1.00).
- The sum of squares SSR(\(x_{1}\)) is the same as the sequential sum of squares SSR(\(x_{1}\)|\(x_{2}\)). The sum of squares SSR(\(x_{2}\)) is the same as the sequential sum of squares SSR(\(x_{2}\)|\(x_{1}\)).
These all seem to be good things! Because the slope estimates stay the same, the effect on the response ascribed to a predictor doesn't depend on the other predictors in the model. Because SSR(\(x_{1}\)) = SSR(\(x_{1}\)|\(x_{2}\)), the marginal contribution that \(x_{1}\) has in reducing the variability in the response y doesn't depend on the predictor \(x_{2}\). Similarly, because SSR(\(x_{2}\)) = SSR(\(x_{2}\)|\(x_{1}\)), the marginal contribution that \(x_{2}\) has in reducing the variability in the response y doesn't depend on the predictor \(x_{1}\).
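One way to see why the slope estimates don't budge is to look at the least squares solution directly. Writing \(S_{uv} = \sum_{i}(u_{i}-\bar{u})(v_{i}-\bar{v})\) for the centered sums of cross products, the two-predictor model yields

\[ b_{1} = \frac{S_{x_{1}y}\,S_{x_{2}x_{2}} - S_{x_{2}y}\,S_{x_{1}x_{2}}}{S_{x_{1}x_{1}}\,S_{x_{2}x_{2}} - S_{x_{1}x_{2}}^{2}} \]

When the predictors are perfectly uncorrelated, \(S_{x_{1}x_{2}} = 0\), and the formula collapses to \(b_{1} = S_{x_{1}y}/S_{x_{1}x_{1}}\), exactly the simple linear regression slope of y on \(x_{1}\) alone. The same argument applies to \(b_{2}\).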
These are the kinds of results we can only hope for in a regression analysis. But then reality sets in! Recall that we obtained the above results on a contrived data set, in which the predictors are perfectly uncorrelated. Do we get similar results for real data with only nearly uncorrelated predictors? Let's see!
What is the effect on regression analyses if the predictors are nearly uncorrelated?
To investigate this question, let's go back and take a look at the blood pressure data set (Blood Pressure data set). In particular, let's focus on the relationships among the response y = BP and the predictors \(x_{3}\) = BSA and \(x_{6}\) = Stress:
As the above matrix plot and the following correlation matrix suggest:
Correlation: BP, Age, Weight, BSA, Dur, Pulse, Stress
 | BP | Age | Weight | BSA | Dur | Pulse |
---|---|---|---|---|---|---|
Age | 0.659 | |||||
Weight | 0.950 | 0.407 | ||||
BSA | 0.866 | 0.378 | 0.875 | |||
Dur | 0.293 | 0.344 | 0.201 | 0.131 | ||
Pulse | 0.721 | 0.619 | 0.659 | 0.465 | 0.402 | |
Stress | 0.164 | 0.368 | 0.034 | 0.018 | 0.312 | 0.506 |
there appears to be a strong relationship between y = BP and the predictor \(x_{3}\) = BSA (r = 0.866), a weak relationship between y = BP and \(x_{6}\) = Stress (r = 0.164), and an almost non-existent relationship between \(x_{3}\) = BSA and \(x_{6}\) = Stress (r = 0.018). That is, the two predictors are nearly perfectly uncorrelated.
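For reference, a correlation matrix like the one above takes a single pandas call. A sketch, assuming the blood pressure data sit in a file named bloodpress.csv with the column names shown (the file name is an assumption):

```python
import pandas as pd

df = pd.read_csv("bloodpress.csv")  # hypothetical file name

# Pairwise Pearson correlations among the response and the predictors
cols = ["BP", "Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
print(df[cols].corr().round(3))
```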
What effect do these nearly perfectly uncorrelated predictors have on regression analyses? Let's proceed as before through the output of a series of regression analyses, collecting various pieces of information along the way. When we're done, we'll review what we learned by collating the various items in a summary table.
The regression of the response y = BP on the predictor \(x_{6}\) = Stress:
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 15.04 | 15.04 | 0.50 | 0.490 |
Stress | 1 | 15.04 | 15.04 | 0.50 | 0.490 |
Error | 18 | 544.96 | 30.28 | ||
Lack-of-Fit | 14 | 457.79 | 32.70 | 1.50 | 0.374 |
Pure Error | 4 | 87.17 | 21.79 | ||
Total | 19 | 560.00 |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 112.72 | 2.19 | 51.39 | 0.000 | |
Stress | 0.0240 | 0.0340 | 0.70 | 0.490 | 1.00 |
Regression Equation
\(\widehat{BP} = 112.72 + 0.0240 Stress\)
yields the estimated coefficient \(b_{6}\) = 0.0240, the standard error se(\(b_{6}\)) = 0.0340, and the regression sum of squares SSR(\(x_{6}\)) = 15.04.
The regression of the response y = BP on the predictor \(x_{3}\) = BSA:
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 419.858 | 419.858 | 53.93 | 0.000 |
BSA | 1 | 419.858 | 419.858 | 53.93 | 0.000 |
Error | 18 | 140.142 | 7.786 | ||
Lack-of-Fit | 13 | 133.642 | 10.280 | 7.91 | 0.016 |
Pure Error | 5 | 6.500 | 1.300 | ||
Total | 19 | 560.000 |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 45.18 | 9.39 | 4.81 | 0.000 | |
BSA | 34.44 | 4.69 | 7.34 | 0.000 | 1.00 |
Regression Equation
\(\widehat{BP} = 45.18 + 34.44 BSA\)
yields the estimated coefficient \(b_{3}\) = 34.44, the standard error se(\(b_{3}\)) = 4.69, and the regression sum of squares SSR(\(x_{3}\)) = 419.858.
The regression of the response y = BP on the predictors \(x_{6}\) = Stress and \(x_{3}\) = BSA (in that order):
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 2 | 432.12 | 216.058 | 28.72 | 0.000 |
Stress | 1 | 15.04 | 15.044 | 2.00 | 0.175 |
BSA | 1 | 417.07 | 417.073 | 55.44 | 0.000 |
Error | 17 | 127.88 | 7.523 | ||
Total | 19 | 560.00 |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 44.24 | 9.26 | 4.78 | 0.000 | |
Stress | 0.0217 | 0.0170 | 1.28 | 0.219 | 1.00 |
BSA | 34.33 | 4.61 | 7.45 | 0.000 | 1.00 |
Regression Equation
\(\widehat{BP} = 44.24 + 0.0217 Stress + 34.33 BSA\)
yields the estimated coefficients \(b_{6}\) = 0.0217 and \(b_{3}\) = 34.33, the standard errors se(\(b_{6}\)) = 0.0170 and se(\(b_{3}\)) = 4.61, and the sequential sum of squares SSR(\(x_{3}\)|\(x_{6}\)) = 417.07.
Finally, the regression of the response y = BP on the predictors \(x_{3}\) = BSA and \(x_{6}\) = Stress (in that order):
Analysis of Variance
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 2 | 432.12 | 216.058 | 28.72 | 0.000 |
BSA | 1 | 419.86 | 419.858 | 55.81 | 0.000 |
Stress | 1 | 12.26 | 12.259 | 1.63 | 0.219 |
Error | 17 | 127.88 | 7.523 | ||
Total | 19 | 560.00 |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 44.24 | 9.26 | 4.78 | 0.000 | |
BSA | 34.33 | 4.61 | 7.45 | 0.000 | 1.00 |
Stress | 0.0217 | 0.0170 | 1.28 | 0.219 | 1.00 |
Regression Equation
\(\widehat{BP} = 44.24 + 34.33 BSA + 0.0217 Stress\)
yields the estimated coefficients \(b_{6}\) = 0.0217 and \(b_{3}\) = 34.33, the standard errors se(\(b_{6}\)) = 0.0170 and se(\(b_{3}\)) = 4.61, and the sequential sum of squares SSR(\(x_{6}\)|\(x_{3}\)) = 12.26.
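Notice that, unlike the contrived example, the order now matters a little: SSR(\(x_{6}\)) = 15.04 while SSR(\(x_{6}\)|\(x_{3}\)) = 12.26. The two sequential decompositions can be checked with the same Type I ANOVA approach as before (a sketch, with the hypothetical bloodpress.csv file name again):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("bloodpress.csv")  # hypothetical file name

# Type I (sequential) sums of squares for both predictor orders
print(anova_lm(smf.ols("BP ~ Stress + BSA", data=df).fit()))
print(anova_lm(smf.ols("BP ~ BSA + Stress", data=df).fit()))

# With nearly (not perfectly) uncorrelated predictors, SSR(Stress) and
# SSR(Stress | BSA) are close but not identical: 15.04 versus 12.26
```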
Again — as promised — compiling the results in a summary table, we obtain:
Model | \(b_{6}\) | se(\(b_{6}\)) | \(b_{3}\) | se(\(b_{3}\)) | Seq SS |
---|---|---|---|---|---|
\(x_{6}\) only | 0.0240 | 0.0340 | --- | --- | SSR(\(x_{6}\)) = 15.04 |
\(x_{3}\) only | --- | --- | 34.44 | 4.69 | SSR(\(x_{3}\)) = 419.858 |
\(x_{6}\), \(x_{3}\) (in order) | 0.0217 | 0.0170 | 34.33 | 4.61 | SSR(\(x_{3}\)\|\(x_{6}\)) = 417.07 |
\(x_{3}\), \(x_{6}\) (in order) | 0.0217 | 0.0170 | 34.33 | 4.61 | SSR(\(x_{6}\)\|\(x_{3}\)) = 12.26 |
What do we observe? If the predictors are nearly perfectly uncorrelated:
- The slope estimates \(b_{3}\) and \(b_{6}\) are not identical, but they are very similar, regardless of which other predictor is in the model.
- The sum of squares SSR(\(x_{3}\)) is not identical to, but is very similar to, the sequential sum of squares SSR(\(x_{3}\)|\(x_{6}\)).
- The sum of squares SSR(\(x_{6}\)) is not identical to, but is very similar to, the sequential sum of squares SSR(\(x_{6}\)|\(x_{3}\)).
Again, these are all good things! In short, the effect on the response ascribed to a predictor is similar regardless of the other predictors in the model. And, the marginal contribution of a predictor doesn't appear to depend much on the other predictors in the model.
Try it!
Uncorrelated predictors
Effect of perfectly uncorrelated predictor variables.
This exercise reviews the benefits of having perfectly uncorrelated predictor variables. Its results make a strong argument for conducting "designed experiments," in which the researcher sets the levels of the predictor variables in advance, as opposed to "observational studies," in which the researcher merely observes the levels of the predictor variables as they happen. Unfortunately, many regression analyses are conducted on observational rather than experimental data, limiting the strength of the conclusions that can be drawn. As this exercise demonstrates, you should conduct an experiment, whenever possible, not an observational study. Use the (contrived) data stored in the Uncorrelated Predictors data set to complete this lab exercise.
-
Using the Stat >> Basic Statistics >> Correlation... command in Minitab, calculate the correlation coefficient between \(X_{1}\) and \(X_{2}\). Are the two variables perfectly uncorrelated?

Correlation = 0, so, yes, the two variables are perfectly uncorrelated.
-
Fit the simple linear regression model with y as the response and \(x_{1}\) as the single predictor:
- What is the value of the estimated slope coefficient \(b_{1}\)?
- What is the regression sum of squares, SSR (\(X_{1}\)), when \(x_{1}\) is the only predictor in the model?
Estimated slope coefficient \(b_1 = -5.80\)
\(SSR(X_1) = 336.40\)
-
Now, fit the simple linear regression model with y as the response and \(x_{2}\) as the single predictor:
- What is the value of the estimated slope coefficient \(b_{2}\)?
- What is the regression sum of squares, SSR (\(X_{2}\)), when \(x_{2}\) is the only predictor in the model?
Estimated slope coefficient \(b_2 = 1.36\).
\(SSR(X_2) = 206.2\).
-
Now, fit the multiple linear regression model with y as the response and \(x_{1}\) as the first predictor and \(x_{2}\) as the second predictor:
- What is the value of the estimated slope coefficient \(b_{1}\)? Is the estimate \(b_{1}\) different than that obtained when \(x_{1}\) was the only predictor in the model?
- What is the value of the estimated slope coefficient \(b_{2}\)? Is the estimate \(b_{2}\) different than that obtained when \(x_{2}\) was the only predictor in the model?
- What is the sequential sum of squares, SSR (\(X_{2}\)|\(X_{1}\))? Does the reduction in the error sum of squares when \(x_{2}\) is added to the model depend on whether \(x_{1}\) is already in the model?
Estimated slope coefficient \(b_1 = -5.80\), the same as before.
Estimated slope coefficient \(b_2 = 1.36\), the same as before.
\(SSR(X_2|X_1) = 206.2 = SSR(X_2)\), so this doesn’t depend on whether \(X_1\) is already in the model.
-
Now, fit the multiple linear regression model with y as the response and \(x_{2}\) as the first predictor, and \(x_{1}\) as the second predictor:
- What is the sequential sum of squares, SSR (\(X_{1}\)|\(X_{2}\))? Does the reduction in the error sum of squares when \(x_{1}\) is added to the model depend on whether \(x_{2}\) is already in the model?
\(SSR(X_1|X_2) = 336.4 = SSR(X_1)\), so this doesn't depend on whether \(X_2\) is already in the model.
-
When the predictor variables are perfectly uncorrelated, is it possible to quantify the effect a predictor has on the response without regard to the other predictors?
Yes
-
In what way does this exercise demonstrate the benefits of conducting a designed experiment rather than an observational study?
It is possible to quantify the effect a predictor has on the response regardless of whether other (uncorrelated) predictors have been included
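For those working through this exercise outside of Minitab, here is a minimal Python sketch covering all of the steps above (the file name uncorrelated_predictors.csv and the lower-case column names y, x1, x2 are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("uncorrelated_predictors.csv")  # hypothetical file name

# 1. Correlation between the predictors (expect exactly 0)
print(df["x1"].corr(df["x2"]))

# 2-3. Simple regressions: slope and SSR for each predictor alone
for x in ["x1", "x2"]:
    fit = smf.ols(f"y ~ {x}", data=df).fit()
    print(x, fit.params[x], fit.ess)

# 4-5. Both orderings: each sequential SS row matches the corresponding
# simple-regression SSR because the predictors are perfectly uncorrelated
print(anova_lm(smf.ols("y ~ x1 + x2", data=df).fit()))
print(anova_lm(smf.ols("y ~ x2 + x1", data=df).fit()))
```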