Lesson 31: Multiple Regression Analysis

Overview

In this lesson, we'll explore using the REG procedure for performing regression analyses with a continuous response variable and more than one predictor variable. We'll also explore using the LOGISTIC procedure for performing logistic regression analyses with a binary response variable and one or more predictor variables.

Objectives

Upon completion of this lesson, you should be able to:

use the REG procedure to analyze data arising from a designed experiment
use the REG procedure to analyze data arising from nonexperimental research
use the variable selection methods available in the REG procedure to find the "best" model (or models!) for a set of data
create and use dummy variables in a regression model
use the variance inflation factor to look for multicollinearity
use the LOGISTIC procedure to analyze regression data having a response variable with just two levels (such as yes/no or dead/alive)
capture the necessary output from the LOGISTIC procedure in order to create a receiver operating characteristic (ROC) curve

Textbook Reference

Chapter 9 of the textbook.

31.1 - Lesson Notes

B. Designed Regression

It is worth reinforcing the comment on page 287, that the adjusted r-square is the more appropriate measure of the amount of variation explained by a regression with more than one explanatory variable.

D. Stepwise and Other Variable Selection Methods

To drive home the point about how difficult it might be to select the best model when you have a number of predictor variables, let's make it concrete, and suppose we have five possible predictor variables, a, b, c, d, and e. In that case, there'd be as many as 31 different regression models we'd have to try out on our data:

5 models with just one predictor variable: a, b, c, d, and e
10 models with two predictor variables: ab, bc, bd, be, ac, ad, ae, cd, ce, and de
10 models with three predictor variables: abc, abd, abe, acd, ace, ade, bcd, bce, bde, and cde
5 models with four predictor variables: abcd, abce, abde, acde, and bcde
1 model with all five predictor variables: abcde

Page 295. Did we learn anything from this study? The best predictor of reading achievement at the end of the sixth grade is reading achievement at the end of the fifth grade. Hmmm.

G. Logistic Regression

Page 309. Sensitivity and specificity rates are typically used in quantifying the value of a diagnostic test. Sensitivity is defined as ... given that a person has a disease, what is the probability that the diagnostic test will detect the disease? Specificity is defined as ... given that a person is healthy, what is the probability that the diagnostic test will indicate that the person is healthy? Based on these definitions, it becomes clear that we desire the highest sensitivity and specificity rates that we can get. As you can see, though, on the classification table on page 307, the two values play off of each other. That is, as sensitivity increases, specificity generally decreases. The goal is to find the point at which we can live with the sensitivity and specificity (or find another diagnostic test!). In the example here, the authors are suggesting making the cutoff 0.3, so that the sensitivity is high (92%), but the specificity is not too low (45%).

31.2 - Summary

In this lesson, we learned how to use the REG procedure to analyze regression data having a continuous response variable and more than one predictor variable. We also learned how to use the LOGISTIC procedure to analyze regression data having a binary response variable and one or more predictor variables.

The homework for this lesson will give you more practice with multiple regression and logistic regression analyses.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility