Lesson 31: Multiple Regression Analysis
Lesson 31: Multiple Regression AnalysisOverview
In this lesson, we'll explore using the REG procedure for performing regression analyses with a continuous response variable and more than one predictor variable. We'll also explore using the LOGISTIC procedure for performing logistic regression analyses with a binary response variable and one or more predictor variables.
Objectives
- use the REG procedure to analyze data arising from a designed experiment
- use the REG procedure to analyze data arising from nonexperimental research
- use the variable selection methods available in the REG procedure to find the "best" model (or models!) for a set of data
- create and use dummy variables in a regression model
- use the variance inflation factor to look for multicollinearity
- use the LOGISTIC procedure to analyze regression data having a response variable with just two levels (such as yes/no or dead/alive)
- capture the necessary output from the LOGISTIC procedure in order to create a receiver operating characteristic (ROC) curve
Textbook Reference
Chapter 9 of the textbook.
31.1 - Lesson Notes
31.1 - Lesson NotesB. Designed Regression
It is worth reinforcing the comment on page 287, that the adjusted r-square is the more appropriate measure of the amount of variation explained by a regression with more than one explanatory variable.
D. Stepwise and Other Variable Selection Methods
To drive home the point about how difficult it might be to select the best model when you have a number of predictor variables, let's make it concrete, and suppose we have five possible predictor variables, a, b, c, d, and e. In that case, there'd be as many as 31 different regression models we'd have to try out on our data:
- 5 models with just one predictor variable: a, b, c, d, and e
- 10 models with two predictor variables: ab, bc, bd, be, ac, ad, ae, cd, ce, and de
- 10 models with three predictor variables: abc, abd, abe, acd, ace, ade, bcd, bce, bde, and cde
- 5 models with four predictor variables: abcd, abce, abde, acde, and bcde
- 1 model with all five predictor variables: abcde
Page 295. Did we learn anything from this study? The best predictor of reading achievement at the end of the sixth grade is reading achievement at the end of the fifth grade. Hmmm.
G. Logistic Regression
Page 309. Sensitivity and specificity rates are typically used in quantifying the value of a diagnostic test. Sensitivity is defined as ... given that a person has a disease, what is the probability that the diagnostic test will detect the disease? Specificity is defined as ... given that a person is healthy, what is the probability that the diagnostic test will indicate that the person is healthy? Based on these definitions, it becomes clear that we desire the highest sensitivity and specificity rates that we can get. As you can see, though, on the classification table on page 307, the two values play off of each other. That is, as sensitivity increases, specificity generally decreases. The goal is to find the point at which we can live with the sensitivity and specificity (or find another diagnostic test!). In the example here, the authors are suggesting making the cutoff 0.3, so that the sensitivity is high (92%), but the specificity is not too low (45%).
31.2 - Summary
31.2 - SummaryIn this lesson, we learned how to use the REG procedure to analyze regression data having a continuous response variable and more than one predictor variable. We also learned how to use the LOGISTIC procedure to analyze regression data having a binary response variable and one or more predictor variables.
The homework for this lesson will give you more practice with multiple regression and logistic regression analyses.