Lesson 10: Model Building

Overview

In all of the regression analyses that we have performed so far in this course, it has been obvious which predictors we should include in our regression model. In practice, this is typically not the case. More often than not, a researcher has a large set of candidate predictor variables and must identify the most appropriate ones to include in the regression model.

Of course, the larger the number of candidate predictor variables, the larger the number of possible regression models. For example, a researcher with (only) 10 candidate predictor variables has \(2^{10} = 1024\) possible regression models from which to choose, because each predictor is either included in the model or left out. Clearly, some assistance is needed in evaluating all of the possible regression models. That's where the two variable selection methods, stepwise regression and best subsets regression, come in handy.
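
The count of \(2^{10} = 1024\) can be checked by listing every subset of the candidate predictors. The following is a minimal sketch in Python. The predictor names x1 through x10 and the helper `candidate_models` are hypothetical, introduced only for illustration, and the count includes the model with no predictors at all (intercept only).

```python
from itertools import combinations


def candidate_models(predictors):
    """Yield every subset of the candidate predictors.

    The empty subset (an intercept-only model) is included, which is
    why k candidates give 2**k possible models.
    """
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


# Hypothetical predictor names, used only to count the models.
predictors = [f"x{i}" for i in range(1, 11)]
models = list(candidate_models(predictors))

print(len(models))  # 1024 possible models for 10 candidate predictors
```

Enumerating the models is cheap. The hard part is fitting and comparing all of them, and the count doubles with every additional candidate predictor.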

In this lesson, we'll learn about the above two variable selection methods. Our goal throughout will be to choose a small subset of predictors from the larger set of candidate predictors so that the resulting regression model is simple yet useful. That is, as always, our resulting regression model should:

  • provide a good summary of the trend in the response, and/or
  • provide good predictions of the response, and/or
  • provide good estimates of the slope coefficients.

Note! The data sets used in this lesson are not all that large. For the sake of illustration, they have to be small, so that the size of the data set does not obscure the pedagogical point being made.

Objectives

Upon completion of this lesson, you should be able to:

  • Understand the consequences of fitting each of the four types of model, classified by "correctness": correctly specified, underspecified, overspecified, and correct but with extraneous predictors.
  • Conduct stepwise regression "by hand," to ensure that you understand the general idea behind it.
  • Know the limitations of stepwise regression.
  • Know the general idea behind best subsets regression.
  • Know how to choose an optimal model based on the \(R^{2}\) value, the adjusted \(R^{2}\) value, MSE, and the \(C_p\) criterion.
  • Know the limitations of best subsets regression.
  • Know the seven steps of an excellent model-building strategy.
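
As a preview of the four criteria named in the objectives, the sketch below computes each one from a model's error sum of squares. It assumes the usual textbook definitions, with n the number of observations and p the number of regression parameters including the intercept: \(R^{2} = 1 - SSE/SSTO\), adjusted \(R^{2} = 1 - \frac{n-1}{n-p} \cdot \frac{SSE}{SSTO}\), \(MSE = SSE/(n-p)\), and \(C_p = SSE_p/MSE_{all} - (n - 2p)\), where \(MSE_{all}\) comes from the model containing all candidate predictors. The function names and all of the numbers are hypothetical and are not taken from the lesson's data sets.

```python
def r_squared(sse, ssto):
    """Proportion of the variation in the response explained by the model."""
    return 1.0 - sse / ssto


def adjusted_r_squared(sse, ssto, n, p):
    """R-squared penalized for the number of parameters p (intercept included)."""
    return 1.0 - (n - 1) / (n - p) * (sse / ssto)


def mse(sse, n, p):
    """Mean squared error: SSE divided by its n - p degrees of freedom."""
    return sse / (n - p)


def mallows_cp(sse, mse_all, n, p):
    """Mallows' C_p, using the MSE of the model with all candidate predictors."""
    return sse / mse_all - (n - 2 * p)


# Hypothetical values, for illustration only.
n = 20                          # number of observations
ssto = 500.0                    # total sum of squares
sse_all, p_all = 80.0, 5        # model with all four candidate predictors
sse_sub, p_sub = 90.0, 3        # smaller model with two predictors

mse_all = mse(sse_all, n, p_all)

for label, sse, p in [("all", sse_all, p_all), ("subset", sse_sub, p_sub)]:
    print(
        label,
        round(r_squared(sse, ssto), 4),
        round(adjusted_r_squared(sse, ssto, n, p), 4),
        round(mse(sse, n, p), 4),
        round(mallows_cp(sse, mse_all, n, p), 4),
    )
```

With these made-up numbers, the larger model has the higher \(R^{2}\), as it always will, while the smaller model has the higher adjusted \(R^{2}\) and the lower MSE. For the model containing all candidate predictors, \(C_p\) equals p by construction, so that value carries no information about that model's bias.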

Lesson 10 Code Files

Below is a zip file that contains all the data sets used in this lesson:

STAT501_Lesson10.zip

  • bloodpress.txt
  • cement.txt
  • iqsize.txt
  • martian.txt
  • peru.txt
  • Physical.txt