12.6 - Reducing Structural Multicollinearity

Recall that structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor \(x^{2}\) from the predictor \(x\). Because of this, at the same time that we learn here about reducing structural multicollinearity, we learn more about polynomial regression models.

Example 12-4

What is the impact of exercise on the human immune system? To answer this very global and general research question, one has to first quantify what "exercise" means and what "immunity" means. Of course, there are several ways of doing so. For example, we might quantify one's level of exercise by measuring his or her "maximal oxygen uptake." And, we might quantify the quality of one's immune system by measuring the amount of "immunoglobulin" in his or her blood. In doing so, the general research question is translated into a much more specific research question: "How is the amount of immunoglobulin in the blood (y) related to maximal oxygen uptake (x)?"

Because some researchers were interested in answering the above research question, they collected the following Exercise and Immunity data set on a sample of 30 individuals:

  • \(y_i\) = amount of immunoglobulin in the blood (mg) of individual i
  • \(x_i\) = maximal oxygen uptake (ml/kg) of individual i

The scatter plot of the resulting data:

scatterplot

suggests that there might be some curvature to the trend in the data. To allow for the apparent curvature — rather than formulating a linear regression function — the researchers formulated the following quadratic (polynomial) regression function:

\(y_i=\beta_0+\beta_1x_i+\beta_{11}x_{i}^{2}+\epsilon_i\)

where:

  • \(y_i\) = amount of immunoglobulin in the blood (mg) of individual i
  • \(x_i\) = maximal oxygen uptake (ml/kg) of individual i

As usual, the error terms \(\epsilon_i\) are assumed to be independent, normally distributed, and have equal variance \(\sigma^{2}\).

As the following plot of the estimated quadratic function suggests:

fitted line plot

the formulated regression function appears to describe the trend in the data well. The adjusted \(R^{2} \text{-value}\) is 93.3%.
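
If you prefer to work outside Minitab, a quadratic fit like this is easy to reproduce. Here is a minimal Python sketch using statsmodels; the simulated oxygen and igg values below are hypothetical stand-ins (generated from the fitted coefficients reported later in this section), so swap in the actual 30 observations from the Exercise and Immunity data set:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for the Exercise and Immunity data;
# substitute the real 30 (oxygen, igg) observations here.
rng = np.random.default_rng(12)
oxygen = rng.uniform(33, 70, 30)
igg = -1464 + 88.3 * oxygen - 0.536 * oxygen**2 + rng.normal(0, 106, 30)

# Quadratic design matrix: intercept, x, x^2
X = sm.add_constant(np.column_stack([oxygen, oxygen**2]))
fit = sm.OLS(igg, X).fit()
print(fit.params)        # estimates of beta_0, beta_1, beta_11
print(fit.rsquared_adj)  # adjusted R-squared
```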

But, now what do the estimated coefficients tell us? The interpretation of the regression coefficients is mostly geometric in nature. That is, the coefficients tell us a little bit about what the picture looks like:

  • If 0 is a possible x value, then \(b_0\) is the predicted response when x = 0. Otherwise, the interpretation of \(b_0\) is meaningless.
  • The estimated coefficient \(b_1\) is the estimated slope of the tangent line at x = 0.
  • The estimated coefficient \(b_{11}\) indicates the up/down direction of the curve. That is:
    • if \(b_{11}\) < 0, then the curve is concave down
    • if \(b_{11}\) > 0, then the curve is concave up

So far, we have kept our heads a little bit in the sand! If we look at the output we obtain upon regressing the response y = igg on the predictors oxygen and \(\text{oxygen}^{2}\):

Analysis of Variance

Source DF Seq SS Seq MS F-Value P-Value
Regression 2 4602211 2301105 203.16 0.000
oxygen 1 4472047 4472047 394.83 0.000
oxygensq 1 130164 130164 11.49 0.002
Error 27 305818 11327    
Total 29 4908029      

Model Summary

S R-sq R-sq(adj) R-sq(pred)
106.427 93.77% 93.31% 92.48%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant -1464 411 -3.56 0.001  
oxygen 88.3 16.5 5.36 0.000 99.94
oxygensq -0.536 0.158 -3.39 0.002 99.94

we quickly see that the variance inflation factors for both predictors — oxygen and oxygensq — are very large (99.9 in each case). Is this surprising to you? If you think about it, we've created a correlation by taking the predictor oxygen and squaring it to obtain oxygensq. That is, just by the nature of our model, we have created a "structural multicollinearity."
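
You can check these variance inflation factors yourself. Here is a minimal Python sketch using statsmodels' variance_inflation_factor; the simulated oxygen values are a hypothetical stand-in for the real data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical stand-in for the 30 oxygen values (ml/kg)
rng = np.random.default_rng(12)
oxygen = rng.uniform(33, 70, 30)

# VIFs are computed per column of the design matrix (intercept included)
X = sm.add_constant(np.column_stack([oxygen, oxygen**2]))
for idx, name in [(1, "oxygen"), (2, "oxygensq")]:
    print(name, variance_inflation_factor(X, idx))
```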

The scatter plot of oxygensq versus oxygen:

scatterplot

Pearson correlation of oxygen and oxygensq = 0.995

illustrates the strength of the correlation that we induced. After all, we just can't get much more correlated than r = 0.995!

The neat thing here is that we can reduce the multicollinearity in our data by doing what is known as "centering the predictors." Centering a predictor merely entails subtracting the mean of the predictor values in the data set from each predictor value. For example, Minitab reports that the mean of the oxygen values in our data set is 50.64:

Descriptive Statistics: oxygen

Variable N N* Mean SE Mean StDev Minimum Q1 Median Q3 Maximum
oxygen 30 0 50.64 2.19 11.99 33.00 41.58 50.00 61.63 69.90

Therefore, to center the predictor oxygen, we merely subtract 50.64 from each oxygen value in our data set. Doing so, we obtain the centered predictor, oxcent, say:

oxygen oxcent oxcentsq
34.6 -16.04 257.282
45.0 -5.64 31.810
62.3 11.66 135.956
58.9 8.26 68.228
42.5 -8.14 66.260
44.3 -6.34 40.196
67.9 17.26 297.908
58.5 7.86 61.780
35.6 -15.04 226.202
49.6 -1.04 1.082
33.0 -17.64 311.170

For example, 34.6 minus 50.64 is -16.04, and 45.0 minus 50.64 is -5.64, and so on. Now, to include the squared oxygen term in our regression model — to allow for curvature in the trend — we square the centered predictor oxcent to obtain oxcentsq. That is, \(\left(-16.04\right)^{2} = 257.282\) and \(\left(-5.64\right)^{2} = 31.810\), and so on.
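
Centering is a one-line computation in most software. The sketch below reproduces the oxcent and oxcentsq columns in Python, using the first eleven oxygen values from the table above and the full-sample mean of 50.64 reported by Minitab:

```python
import numpy as np

# The first few oxygen values from the table above
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3,
                   67.9, 58.5, 35.6, 49.6, 33.0])

# Center at the full-sample mean (50.64, from the Minitab output), then square
oxcent = oxygen - 50.64
oxcentsq = oxcent ** 2

print(np.round(oxcent, 2))    # -16.04, -5.64, 11.66, ...
print(np.round(oxcentsq, 3))  # 257.282, 31.810, 135.956, ...
```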

Wow! It really works! The scatter plot of oxcentsq versus oxcent:

scatterplot

Pearson correlation of oxcent and oxcentsq = 0.219

illustrates — by centering the predictors — just how much we've reduced the correlation between the predictor and its square. The correlation has gone from a whopping r = 0.995 to a rather low r = 0.219!

Having centered the predictor oxygen, we must reformulate our quadratic polynomial regression model accordingly. That is, we now formulate our model as:

\(y_i=\beta_{0}^{*}+\beta_{1}^{*}(x_i-\bar{x})+\beta_{11}^{*}(x_i-\bar{x})^{2}+\epsilon_i\)

or alternatively:

\(y_i=\beta_{0}^{*}+\beta_{1}^{*}x_{i}^{*}+\beta_{11}^{*}x_{i}^{*2}+\epsilon_i\)

where:

  • \(y_i\) = amount of immunoglobulin in the blood (mg), and
  • \(x_{i}^{*}=x_i-\bar{x}\) denotes the centered predictor

and the error terms \(\epsilon_i\) are independent, normally distributed, and have equal variance \(\sigma^{2}\). Note that we add asterisks to each of the parameters to make it clear that the parameters differ from the parameters in the original model we formulated.

Let's see how we did by centering the predictors and reformulating our model. Recall that — based on our original model — the variance inflation factors for oxygen and oxygensq were 99.9. Now, regressing y = igg on the centered predictors oxcent and oxcentsq:

Analysis of Variance

Source DF Seq SS Seq MS F-Value P-Value
Regression 2 4602211 2301105 203.16 0.000
oxcent 1 4472047 4472047 394.83 0.000
oxcentsq 1 130164 130164 11.49 0.002
Error 27 305818 11327    
Total 29 4908029      

Model Summary

S R-sq R-sq(adj) R-sq(pred)
106.427 93.77% 93.31% 92.48%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 1632.3 29.3 55.62 0.000  
oxcent 34.00 1.69 20.13 0.000 1.05
oxcentsq -0.536 0.158 -3.39 0.002 1.05

we see that the variance inflation factors have dropped dramatically: now they are 1.05 in each case.

Because we reformulated our model based on the centered predictors, the meaning of the parameters must be changed accordingly. Now, the estimated coefficients tell us:

  • The estimated coefficient \(b_0\) is the predicted response y when the predictor x equals the sample mean of the predictor values.
  • The estimated coefficient \(b_1\) is the estimated slope of the tangent line at the predictor mean — and, often, it is similar to the estimated slope in the simple linear regression model.
  • The estimated coefficient \(b_{11}\) indicates the up/down direction of the curve. That is:
    • if \(b_{11}\) < 0, then the curve is concave down
    • if \(b_{11}\) > 0, then the curve is concave up

So, here, in this example, the estimated coefficient \(b_0\) = 1632.3 tells us that an individual whose maximal oxygen uptake is 50.64 ml/kg is predicted to have 1632.3 mg of immunoglobulin in the blood. And, the estimated coefficient \(b_1\) = 34.00 tells us that when an individual's maximal oxygen uptake is near 50.64 ml/kg, we can expect the individual's immunoglobulin to increase by 34.00 mg for every 1 ml/kg increase in maximal oxygen uptake.

As the following plot of the estimated quadratic function suggests:

fitted line plot

the reformulated regression function appears to describe the trend in the data well. The adjusted \(R^{2} \text{-value}\) is still 93.3%.

We shouldn't be surprised to see that the estimates of the coefficients in our reformulated polynomial regression model are quite similar to the estimates of the coefficients for the simple linear regression model:

fitted line plot

As you can see, the estimated coefficient \(b_1\) = 34.00 for the polynomial regression model and \(b_1\) = 32.74 for the simple linear regression model. And, the estimated coefficient \(b_0\) = 1632 for the polynomial regression model and \(b_0\) = 1558 for the simple linear regression model. The similarities in the estimates, of course, arise from the fact that the predictors are nearly uncorrelated and therefore the estimates of the coefficients don't change all that much from model to model.

Now, you might be getting the sense that we're "mucking around with the data" to get an answer to our research question. One way to convince you that we're not is to show you that the two estimated models are algebraically equivalent. That is, if given one form of the estimated model, say the estimated model with the centered predictors:

\(\hat{y}_i=b_{0}^{*}+b_{1}^{*}x_{i}^{*}+b_{11}^{*}x_{i}^{*2}\)

then, the other form of the estimated model, say the estimated model with the original predictors:

\(\hat{y}_i=b_{0}+b_{1}x_{i}+b_{11}x_{i}^{2}\)

can be easily obtained. In fact, it can be shown algebraically that the estimated coefficients of the original model equal:

\(\begin{align}
b_{0}&=b_{0}^{*}-b_{1}^{*}\bar{x}+b_{11}^{*}\bar{x}^{2}\\
b_{1}&= b_{1}^{*}-2b_{11}^{*}\bar{x}\\
b_{11}&= b_{11}^{*}\\
\end{align}\)

For example, the estimated regression function for our reformulated model with centered predictors is:

\(\hat{y}_i=1632.3+34.00x_{i}^{*}-0.536x_{i}^{*2}\)

Then, since the mean of the oxygen values in the data set is 50.64 (see the descriptive statistics above), it can be shown algebraically that the estimated coefficients for the model with the original (uncentered) predictors are:

\(b_0 = 1632.3 - 34.00(50.64) - 0.536(50.64)^{2} = -1464\)

\(b_1 = 34.00 - 2(- 0.536)(50.64) = 88.3\)

\(b_{11} = - 0.536\)

That is, the estimated regression function for our quadratic polynomial model with the original (uncentered) predictors is:

\(\hat{y}_i=-1464+88.3x_{i}-0.536x_{i}^{2}\)
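
As a quick check, a small Python sketch (with a hypothetical helper name, uncenter) reproduces this back-transformation directly from the formulas above:

```python
def uncenter(b0_c, b1_c, b11_c, xbar):
    """Map coefficients of the centered quadratic fit back to the original scale."""
    b0 = b0_c - b1_c * xbar + b11_c * xbar ** 2
    b1 = b1_c - 2 * b11_c * xbar
    return b0, b1, b11_c

# Centered-model estimates, with xbar = 50.64
print(uncenter(1632.3, 34.00, -0.536, 50.64))  # approx (-1464, 88.3, -0.536)
```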

Given the equivalence of the two estimated models, you might ask why we bother to center the predictors. The main reason is computational: keeping multicollinearity low helps avoid numerical inaccuracies. Specifically, a near-zero determinant of \(X^{T}X\) is a potential source of serious roundoff errors when solving the normal equations, and severe multicollinearity drives this determinant toward zero. Thus, under severe multicollinearity, the estimated regression coefficients may be subject to large roundoff errors.
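
To see the numerical benefit, compare the conditioning of \(X^{T}X\) for the raw and centered quadratic design matrices. A minimal sketch, using simulated oxygen values as a hypothetical stand-in for the real data:

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(33, 70, 30)  # hypothetical stand-in for the oxygen values

def xtx_condition(x):
    """Condition number of X'X for the quadratic design matrix [1, x, x^2]."""
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    return np.linalg.cond(X.T @ X)

print(xtx_condition(x))             # enormous: X'X is nearly singular
print(xtx_condition(x - x.mean()))  # orders of magnitude smaller after centering
```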

Let's use our model to predict the immunoglobulin level in the blood of a person whose maximal oxygen uptake is 90 ml/kg. Of course, before we use our model to answer a research question, we should always evaluate it first to make sure it meets all of the necessary conditions. The residuals versus fits plot:

residual vs fitted value plot

shows a nice horizontal band around the residual = 0 line, suggesting the model fits the data well. It also suggests that the variances of the error terms are equal. And, the normal probability plot:

normal probability plot

suggests that the error terms are normally distributed. Okay, we're good to go — let's use the model to answer our research question: "What is one's predicted IgG if the maximal oxygen uptake is 90 ml/kg?"

When asking Minitab to make this prediction, you have to remember that we have centered the predictors. That is, if oxygen = 90, then oxcent = 90-50.64 = 39.36. And, if oxcent = 39.36, then oxcentsq = 1549.210. Asking Minitab to predict the igg of an individual whose oxcent = 39.4 and oxcentsq = 1549, we obtain the following output:

Regression Equation

\(\widehat{igg} = 1632.3 + 34.00\,\text{oxcent} - 0.536\,\text{oxcentsq}\)

Variable Setting
oxcent 39.4
oxcentsq 1549
Fit SE Fit 95% CI 95% PI
2141.10 219.209 (1691.32, 2590.88) (1641.12, 2641.09) XX

XX denotes an extremely unusual point relative to predictor levels used to fit the model

Why does Minitab flag this prediction with "XX"? Recall that the levels of maximal oxygen uptake in the data set ranged from 33.0 to 69.9 ml/kg. A maximal oxygen uptake of 90 ml/kg is therefore well outside the scope of the model, and Minitab provides such a warning.
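
For the record, you can reproduce the reported Fit by hand from the centered-model estimates (small rounding differences aside):

```python
# Reproduce Minitab's Fit by hand from the centered-model estimates
oxygen_new = 90.0
oxcent = oxygen_new - 50.64   # 39.36
oxcentsq = oxcent ** 2        # 1549.21

fit = 1632.3 + 34.00 * oxcent - 0.536 * oxcentsq
print(round(fit, 1))  # about 2140; Minitab reports 2141.10 from unrounded coefficients
```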

A word of warning: because a polynomial curve can change direction outside the range of the data, there is an even greater danger of extrapolation when modeling data with a polynomial function.

The hierarchical approach to model fitting

Just one closing comment, since we've been discussing polynomial regression, to remind you about the "hierarchical approach to model fitting." The widely accepted approach is to fit a higher-order model first and then explore whether a lower-order (simpler) model is adequate. For example, suppose we formulate the following cubic polynomial regression function:

\(y_i=\beta_{0}+\beta_{1}x_{i}+\beta_{11}x_{i}^{2}+\beta_{111}x_{i}^{3}+\epsilon_i\)

Then, to see if the simpler first-order model (a "line") is adequate in describing the trend in the data, we could test the null hypothesis:

\(H_0 \colon \beta_{11}=\beta_{111}=0\)

Note, however, that if a polynomial term of a given order is retained, then all related lower-order terms must also be retained. That is, if a quadratic term is deemed significant, then it is standard practice to use this regression function:

\(\mu_Y=\beta_{0}+\beta_{1}x_{i}+\beta_{11}x_{i}^{2}\)

and not this one:

\(\mu_Y=\beta_{0}+\beta_{11}x_{i}^{2}\)

That is, we always fit the terms of a polynomial model in a hierarchical manner.
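
If you want to carry out this hierarchical test outside Minitab, the null hypothesis above can be checked with a general linear F-test comparing the cubic fit to the straight-line fit. A minimal Python sketch, with simulated placeholder data (substitute the real response y and predictor x):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data; substitute the real response y and predictor x
rng = np.random.default_rng(12)
x = rng.uniform(33, 70, 30)
y = 10 + 2 * x + rng.normal(0, 5, 30)

xc = x - x.mean()  # center first to keep the powers of x well conditioned
cubic = sm.OLS(y, sm.add_constant(np.column_stack([xc, xc**2, xc**3]))).fit()
line = sm.OLS(y, sm.add_constant(xc)).fit()

# General linear F-test of H0: beta_11 = beta_111 = 0
f_value, p_value, df_diff = cubic.compare_f_test(line)
print(f_value, p_value, df_diff)
```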