Lesson 11: Model Building

Overview of this Lesson

For all of the regression analyses that we have performed so far in this course, it has been obvious which predictors we should include in our regression model. Unfortunately, this is typically not the case in practice. More often than not, a researcher has a large set of candidate predictor variables and must identify which of them are the most appropriate to include in the regression model.

Of course, the larger the number of candidate predictor variables, the larger the number of possible regression models. For example, if a researcher has (only) 10 candidate predictor variables, there are 2^10 = 1024 possible regression models from which to choose. Clearly, some assistance would be needed in evaluating all of the possible regression models. That's where two variable selection methods, stepwise regression and best subsets regression, come in handy.
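
To make the count concrete, here is a minimal Python sketch that enumerates every subset of 10 candidate predictors. The predictor names x1 through x10 and the helper all_subsets are hypothetical, chosen only for illustration. Note that the total of 2^10 includes the empty subset, which corresponds to the intercept-only model.

```python
from itertools import combinations


def all_subsets(predictors):
    """Return every subset of the candidate predictors, including the empty one."""
    subsets = []
    for size in range(len(predictors) + 1):
        # combinations() yields each subset of the given size exactly once
        subsets.extend(combinations(predictors, size))
    return subsets


# Hypothetical predictor names x1, ..., x10 (not from any data set in the lesson)
candidates = [f"x{i}" for i in range(1, 11)]
models = all_subsets(candidates)
print(len(models), 2 ** len(candidates))
```

Each predictor is either in a model or out of it, so every additional candidate doubles the number of possible models. That doubling is why checking them all by hand quickly becomes impractical.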

In this lesson, we'll learn about the above two variable selection methods. Our goal throughout will be to choose a small subset of predictors from the larger set of candidate predictors so that the resulting regression model is simple yet useful. That is, as always, our resulting regression model should:

  • provide a good summary of the trend in the response,
  • provide good predictions of the response, and
  • provide good estimates of the slope coefficients.

Note. The data sets in this lesson are not especially large. For the sake of illustration, they are kept small so that the size of the data set does not obscure the pedagogical point being made.

Key Learning Goals for this Lesson:
  • Understand the impact of the four different kinds of models with respect to their "correctness" — correctly specified, underspecified, overspecified, and correct but with extraneous predictors.
  • As a way of ensuring that you understand the general idea behind stepwise regression, be able to conduct stepwise regression "by hand."
  • Know the limitations of stepwise regression.
  • Know the general idea behind best subsets regression.
  • Know how to choose an optimal model based on the R^2 value, the adjusted R^2 value, MSE, and the C_p criterion.
  • Know the limitations of best subsets regression.
  • Know the general principles behind good model building strategies.
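
As a preview of the model selection criteria named in the learning goals, the following Python sketch computes R^2, adjusted R^2, MSE, and Mallows' C_p from a model's error sum of squares. It uses the standard textbook definitions, with p counting the parameters in the candidate model including the intercept. The function name selection_criteria and all numeric inputs are hypothetical and are not taken from any data set in this lesson.

```python
def selection_criteria(sse, ssto, n, p, mse_full):
    """Compute common model selection criteria for one candidate model.

    sse      -- error sum of squares of the candidate model
    ssto     -- total sum of squares of the response
    n        -- number of observations
    p        -- number of parameters in the candidate model (intercept included)
    mse_full -- MSE of the model containing all candidate predictors
    """
    r2 = 1 - sse / ssto
    adj_r2 = 1 - (n - 1) / (n - p) * (sse / ssto)
    mse = sse / (n - p)
    cp = sse / mse_full - (n - 2 * p)
    return {"R2": r2, "adj_R2": adj_r2, "MSE": mse, "Cp": cp}


# Hypothetical numbers: 20 observations, full model with 4 parameters
n, ssto = 20, 100.0
sse_full, p_full = 20.0, 4
mse_full = sse_full / (n - p_full)

full = selection_criteria(sse_full, ssto, n, p_full, mse_full)
reduced = selection_criteria(22.0, ssto, n, 3, mse_full)  # one predictor dropped
print(full)
print(reduced)
```

For the full model, C_p equals p by construction, so C_p is informative only when comparing smaller models against the full one. Note also that R^2 can never decrease when a predictor is added, whereas adjusted R^2 and MSE penalize the extra parameter.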