Lesson #8: Overview of Multiple Linear Regression

Summary

Whew! Besides reviewing some matrix algebra and seeing how it can be used to formulate multiple regression models, we've taken a look at five different flavors of research studies involving multiple linear regression analysis. We gleaned what we could from the analyses, knowing what we know from our study of the simple linear regression model. Hopefully, you now appreciate many of the similarities between simple linear regression analyses and multiple linear regression analyses. In future lessons, we will focus on some of the differences, and on the new things we need to learn on our way to becoming experts in multiple linear regression analyses.

Comprehensive Exercises

Directions. Type up your answer to the following question in a Word file named exercises08_yourPSUid.doc. Once you have completed the exercise, upload your file to the Lesson #8 Comprehensive Exercises dropbox.


8.1 Are brain size and body size predictive of intelligence?

The iqsize.txt data set contains measurements on 38 college students: intelligence, based on performance IQ scores (y = PIQ) from the revised Wechsler Adult Intelligence Scale; brain size (x1 = brain), based on the count from MRI scans (given as count/10000); and body size, measured by height in inches (x2 = height) and weight in pounds (x3 = weight).

  1. Create a scatter plot matrix of all four variables in order to get a feel for the "marginal relationships," that is, the relationships between each pair of variables. Which, if any, of the marginal relationships appear strong? linear? non-existent?
  2. A correlation matrix summarizes the correlations between each pair of variables. Create a correlation matrix of all four of our variables. Are the correlations consistent with the scatter plots? Based on the correlations, which of the marginal relationships appear strongest? weakest?
  3. Let's formulate a model that contains all three predictors. That is, let's consider the model:

    yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi

    (with independent, normally distributed error terms and equal variances).

    1. Use the Minitab Regression command to fit (that is, to estimate) the multiple linear regression model with y = PIQ as the response and x1 = brain, x2 = height, and x3 = weight as simultaneous predictors.
    2. With minor modification, the parameters in a multiple regression model have a meaning similar to that of the parameters in a simple regression model. The parameter β0 is the mean response when all of the predictors have value 0; β0 is therefore not meaningful when 0 is not in the range of all of the predictor variables. The slope parameter β1 indicates the change in the mean response μY for each unit increase in x1 when the other predictors x2 and x3 are held constant. The other slope parameters β2 and β3 are interpreted similarly. Interpret each of the estimated regression coefficients for this data set.
    3. The R2 value for a multiple regression model is interpreted in the same way as for the simple regression model: it quantifies what proportion of the variation in the response y is reduced by (or "explained by") taking into account the predictors that are in the model. What is the R2 value here?
    4. We'll learn shortly that there are limitations to the R2 value in a multiple regression setting. Therefore, we typically rely instead on the "adjusted R2 value." What is the adjusted R2 value here?
    5. Use the P-value of the t-test to test whether, once brain and height have been taken into account, the variable weight helps to reduce any of the remaining variability in PIQ. That is, use the t-test to test the null hypothesis H0 : β3 = 0 against the alternative HA : β3 ≠ 0.
    6. Use b3 ± t(0.025, n - 4) × se(b3) to calculate a 95% confidence interval for β3. (Note that the error degrees of freedom in the simple linear regression model were n - 2. Here, they are n - 4. In general, they are n - p, where p is the number of parameters in your model.) Is your confidence interval consistent with the hypothesis test you conducted?
  5. Now, let's formulate a model that contains just two of the predictors. That is, let's consider the model: yi = β0 + β1xi1 + β2xi2 + εi (with independent, normally distributed error terms and equal variances).
    1. Use the Minitab Regression command to fit the multiple linear regression model with y = PIQ as the response and x1 = brain and x2 = height as simultaneous predictors.
    2. Now, what is the R2 value?
    3. And now, what is the adjusted R2 value?
    4. Does either the R2 value or the adjusted R2 value differ from the values obtained when all three predictors were included in the model? The problem with the R2 value as a summary measure is that adding more predictor variables can only increase R2 (or leave it unchanged): SSTO is always the same for a given set of data, while the remaining error (quantified by SSE) can only get smaller (or stay the same) when more predictor variables are considered. The adjusted R2 value, which is defined as:

       adjusted R2 = 1 - ((n - 1) / (n - p)) × (SSE / SSTO)

       is designed to correct the problem with the R2 value by taking into account the number of predictors (through the number of parameters p) in the model. The interpretation of the adjusted R2 value is the same as the interpretation of the R2 value. That is, adjusted R2 quantifies what proportion of the variation in the response y is reduced by (or "explained by") taking into account the predictors that are in the model.
    5. With respect to the adjusted R2 value, which model is better: the model with weight included or the model with weight excluded?
  6. Re-fit the model with y = PIQ as the response and x1 = brain and x2 = height as predictors, and perform standard regression diagnostics on the residuals to see if our model assumptions are met.
    1. Request a residuals vs. fits plot, a normal probability plot, and a residuals vs. weight plot.
    2. Interpret each plot.
    3. Is your interpretation of the residuals vs. weight plot consistent with previous analysis results?
  7. Re-fit the model with y = PIQ as the response and x1 = brain and x2 = height as predictors, and request a lack of fit test. Look at the raw data to try to determine why Minitab reports that there are no replicates and therefore cannot perform the test.
  8. Now that you're confident that none of the model assumptions are (severely) violated, use the multiple linear regression model with y = PIQ as the response and x1 = brain and x2 = height as the predictors to predict a future student's PIQ.
    1. As always, we have to make sure that our predictor values are in the scope of the model. In the multiple regression setting, a data point falls in the scope of the model if it falls in the "elliptical point cloud" created by the predictor values. Create a scatter plot of brain and height. Does height = 62 and brain = 110 fall in the scope of the model? Does height = 75 and brain = 80 fall in the scope of the model? Does height = 70 and brain = 95 fall in the scope of the model?
    2. Use the Options... subcommand of the Regression command to predict, with 95% confidence, the PIQ of a future student whose height = 70 and brain = 95. (Make sure you enter the 70 and the 95 in the same order in which you entered the variables in the model.)
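
The contrast between R2 and adjusted R2 raised in Exercise 8.1 can also be sketched outside Minitab. What follows is a minimal Python illustration, assuming NumPy is available. It uses synthetic data (not iqsize.txt), and the helper name fit_ols and all simulated values are hypothetical. It fits a three-predictor and a two-predictor model by least squares and computes R2 = 1 - SSE/SSTO and adjusted R2 = 1 - ((n - 1)/(n - p)) × (SSE/SSTO), where p counts the parameters including the intercept.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit; X must already contain a column of ones.

    Returns the coefficient estimates, R2, and adjusted R2.
    """
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    sse = float(residuals @ residuals)           # error sum of squares
    ssto = float(np.sum((y - y.mean()) ** 2))    # total sum of squares
    n, p = X.shape                               # p includes the intercept
    r2 = 1.0 - sse / ssto
    adj_r2 = 1.0 - (n - 1) / (n - p) * (sse / ssto)
    return b, r2, adj_r2


# Synthetic data (an assumption for illustration, NOT the iqsize.txt data).
rng = np.random.default_rng(0)
n = 38
x1 = rng.normal(90, 7, n)
x2 = rng.normal(68, 4, n)
x3 = rng.normal(150, 20, n)   # by construction, unrelated to y
y = 100 + 2.0 * (x1 - 90) - 2.5 * (x2 - 68) + rng.normal(0, 20, n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2, x3])   # three predictors, p = 4
X_red = np.column_stack([ones, x1, x2])        # two predictors, p = 3

_, r2_full, adj_full = fit_ols(X_full, y)
_, r2_red, adj_red = fit_ols(X_red, y)

print(f"full model:    R2 = {r2_full:.4f}, adjusted R2 = {adj_full:.4f}")
print(f"reduced model: R2 = {r2_red:.4f}, adjusted R2 = {adj_red:.4f}")
```

Because the reduced model is nested in the full model, the R2 of the full model is never smaller than that of the reduced model, whereas the adjusted R2 values may move in either direction.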

4 Effect of linearly dependent variables in the X matrix

This exercise is designed to illustrate problems that can occur when the variables in the X matrix are linearly dependent. It also illustrates how regression methods can be used to analyze the relationship between a continuous response variable and predictor variables that are all categorical. (This type of analysis would typically be conducted under the "analysis of variance" framework.)
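
As a rough illustration of what linear dependence in the X matrix means numerically, here is a minimal Python sketch, assuming NumPy is available. The toy design below is hypothetical and is not taken from any data set in this lesson: it has an intercept column and two mutually exclusive indicator columns a and b, so that a + b equals the intercept column.

```python
import numpy as np

# Hypothetical toy design: 6 observations, two mutually exclusive indicators.
a = np.array([1, 1, 1, 0, 0, 0], dtype=float)
b = 1.0 - a                                   # b is fully determined by a
X = np.column_stack([np.ones(6), a, b])       # columns: intercept, a, b

# The rank counts the linearly independent columns of X.
rank = np.linalg.matrix_rank(X)

# X'X must be inverted to solve the normal equations; here it is singular.
XtX = X.T @ X
det = np.linalg.det(XtX)

print("number of columns:", X.shape[1])
print("rank of X:", rank)
print("det(X'X):", det)
```

In this sketch the rank is smaller than the number of columns and the determinant of X'X is zero up to rounding error, so the least squares estimates are not uniquely determined.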

4.1 Reaction times in sleep-deprived subjects

The data set deprived.txt contains data on the reaction times (y = time, in hundredths of a second) to the onset of light of subjects in four sleep-deprived groups (Kirk, 1995). The four groups were sleep-deprived for 12 hours (the control group, Grp12), for 18 hours (Grp18), for 24 hours (Grp24), and for 30 hours (Grp30). Each group contained 8 subjects.

  1. Import the data set into Minitab. In doing so, note the "coding" that was used to distinguish between the groups. In short, the subject's value is 1 for the group variable to which he belonged and 0 for the other variables. For example, Grp12 = 1 and Grp18 = Grp24 = Grp30 = 0 for a subject who was sleep-deprived for 12 hours, whereas Grp12 = Grp18 = Grp24 = 0 and Grp30 = 1 for a subject who was sleep-deprived for 30 hours.
  2. Try fitting a linear regression model with y = time and Grp12, Grp18, Grp24, and Grp30 as predictor variables. What happens? Taking into account what the X matrix looks like, why does this happen?
  3. Using the reduced model that Minitab fits and using the coding scheme, what is the predicted reaction time for an individual who is sleep-deprived for 12 hours? for 18 hours? for 24 hours? and for 30 hours?
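
To see how a reduced model with 0/1 group coding translates into predicted values, consider the following minimal Python sketch, assuming NumPy is available. The reaction times are simulated (they are not the deprived.txt data), and the sketch assumes, purely for illustration, that Grp30 is the indicator left out of the model; the column that Minitab actually removes may differ.

```python
import numpy as np

# Synthetic reaction times (an assumption, NOT the deprived.txt data):
# four groups of 8 subjects, as described in the exercise.
rng = np.random.default_rng(1)
hours = np.repeat([12, 18, 24, 30], 8)
simulated_means = np.repeat([20.0, 22.0, 25.0, 30.0], 8)
time = rng.normal(loc=simulated_means, scale=2.0)

# Reduced design: intercept plus indicators for Grp12, Grp18, Grp24.
# Grp30 is omitted here by assumption, which makes it the baseline group.
X = np.column_stack([
    np.ones(hours.size),
    hours == 12,
    hours == 18,
    hours == 24,
]).astype(float)

b, *_ = np.linalg.lstsq(X, time, rcond=None)

# Plug the coding for each group into the fitted equation.
predicted = {
    12: b[0] + b[1],
    18: b[0] + b[2],
    24: b[0] + b[3],
    30: b[0],            # all three indicators are 0 for this group
}
group_means = {g: time[hours == g].mean() for g in (12, 18, 24, 30)}

for g in (12, 18, 24, 30):
    print(g, round(predicted[g], 3), round(group_means[g], 3))
```

Under this coding, the intercept estimates the mean of the omitted group, each slope estimates the difference between a group's mean and that baseline, and the predicted value for each group agrees with its sample mean.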

Title of this problem set here...

Some researchers were interested in studying the relationship between behavior types of individuals and cholesterol levels. In general, Type A behavior is characterized by urgency, aggression, and ambition, while Type B behavior is relaxed, non-competitive, and less hurried. The behavior.txt data set contains the researchers' data as they entered it for analysis in Minitab. When the researchers tried to fit the model with chol as the response and typeA and typeB as the predictors, they obtained the following error message:

[Figure: Minitab output showing the error message]

  1. Identify what is wrong with the way the researchers tried to analyze their data.
  2. Can the researchers use Minitab's estimated regression equation to estimate the mean cholesterol level for Type A individuals? for Type B individuals? If so, what are the estimates?

© 2004 The Pennsylvania State University. All rights reserved.
Materials developed by Dr. Laura J. Simon (Lecturer, Penn State Department of Statistics).