11.8 - One Model Building Strategy

We've talked before about the "art" of model building. Unsurprisingly, there are many approaches to model building, but here is one strategy—consisting of seven steps—that is commonly used when building a regression model.

The first step

Decide on the type of model that is needed in order to achieve the goals of the study. In general, there are five reasons one might want to build a regression model. They are:

  • For predictive reasons — that is, the model will be used to predict the response variable from a chosen set of predictors.
  • For theoretical reasons — that is, the researcher wants to estimate a model based on a known theoretical relationship between the response and predictors.
  • For control purposes — that is, the model will be used to control a response variable by manipulating the values of the predictor variables.
  • For inferential reasons — that is, the model will be used to explore the strength of the relationships between the response and the predictors.
  • For data summary reasons — that is, the model will be used merely as a way to summarize a large set of data by a single equation.

The second step

Decide on the response variable and the predictor variables for which to collect data. Then collect the data.

The third step

Explore the data. That is:

  • On a univariate basis, check for outliers, gross data errors, and missing values.
  • Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.

I can't possibly over-emphasize the importance of this step. There's not a data analyst out there who hasn't made the mistake of skipping this step and later regretting it when a data point was found in error, thereby nullifying hours of work.
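As a rough illustration of these checks, here is a minimal Python sketch that uses only the standard library. The data set, the helper names (univariate_check, pearson), and the 1.5 × IQR rule for flagging outliers are all assumptions made for this example; they are one common convention, not a prescribed method.

```python
import math
import statistics

def univariate_check(values):
    """Count missing entries and flag values outside the 1.5 * IQR fences."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    n_missing = len(values) - len(present)
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in present if v < low or v > high]
    return {"missing": n_missing, "outliers": outliers}

def pearson(x, y):
    """Sample correlation between two complete, equal-length columns."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one response (y) and two predictors (x1, x2).
data = {
    "y":  [10.1, 12.3, 13.9, 16.2, 18.0, 95.0, 22.1, None],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],
}

# Univariate pass: missing values and suspicious points in each column.
report = {name: univariate_check(col) for name, col in data.items()}

# Bivariate pass: a correlation near +1 or -1 between two predictors
# hints at possible multicollinearity.
r_x1_x2 = pearson(data["x1"], data["x2"])
```

In this made-up data, the y column has one missing value and one flagged point (95.0), and x1 and x2 are almost perfectly correlated. A flagged point is only a prompt to go back and check the record; it is not a reason to delete it automatically.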

The fourth step

Randomly divide the data into a training set and a validation set:

  • The training set, with at least 15-20 error degrees of freedom, is used to estimate the model.
  • The validation set is used for cross-validation of the fitted model.
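A random split along these lines can be sketched in a few lines of Python. The function names, the 30% validation fraction, and the fixed seed are assumptions chosen for illustration; the notes do not prescribe a particular split ratio.

```python
import random

def train_validation_split(rows, validation_fraction=0.3, seed=0):
    """Shuffle row indices, then split the rows into training and validation sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_validation = round(len(rows) * validation_fraction)
    validation = [rows[i] for i in indices[:n_validation]]
    training = [rows[i] for i in indices[n_validation:]]
    return training, validation

def error_df(n_train, n_predictors):
    """Error degrees of freedom for a model with an intercept: n - (k + 1)."""
    return n_train - (n_predictors + 1)

# Hypothetical data set of 40 (x, y) rows.
rows = [(i, 2.0 * i + 1.0) for i in range(40)]
training, validation = train_validation_split(rows)

# With 28 training rows and 3 predictors, 24 error degrees of freedom remain.
df_error = error_df(len(training), n_predictors=3)
```

Checking the error degrees of freedom before fitting is a quick way to see whether the training set is large enough for the biggest model you plan to consider.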

The fifth step

Using the training set, identify several candidate models:

  • Use best subsets regression.
  • Use stepwise regression, which of course yields only one model unless different alpha-to-remove and alpha-to-enter values are specified.
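To make the idea of best subsets regression concrete, the sketch below fits every non-empty subset of predictors by least squares and keeps the subset with the highest adjusted R-squared for each model size. It is written in plain Python so that it is self-contained; the helper names (solve, fit_ols, best_subsets) and the training data are hypothetical, and statistical software would normally do this work for you and report several criteria at once.

```python
from itertools import combinations

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, SSE)."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    sse = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
              for row, yi in zip(design, y))
    return beta, sse

def best_subsets(predictors, y):
    """Fit every non-empty subset; keep the highest adjusted R-squared per size."""
    n = len(y)
    mean_y = sum(y) / n
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    best = {}
    for size in range(1, len(predictors) + 1):
        for names in combinations(sorted(predictors), size):
            _, sse = fit_ols([predictors[name] for name in names], y)
            adj_r2 = 1 - (sse / (n - size - 1)) / (ssto / (n - 1))
            if size not in best or adj_r2 > best[size][1]:
                best[size] = (names, adj_r2)
    return best

# Hypothetical training data: y depends on x1 and x2 but not on x3.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
    "x3": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0],
}
noise = [0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.1, -0.2]
y = [2.0 + 3.0 * a - 1.5 * b + e
     for a, b, e in zip(predictors["x1"], predictors["x2"], noise)]

candidates = best_subsets(predictors, y)
```

Each entry of candidates gives the best subset of a given size, so the result is a short list of candidate models rather than a single winner. Because every subset is fitted, this brute-force approach is only practical for a modest number of predictors.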

The sixth step

Select and evaluate a few "good" models:

  • Select the models based on the criteria we learned, as well as the number and nature of the predictors.
  • Evaluate the selected models for violation of the model conditions.
  • If none of the models provides a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.

The seventh and final step

Select the final model:

  • Compare the competing models by cross-validating them against the validation data.
  • The model with the smaller mean square prediction error (or the larger cross-validation R²) is the better predictive model.
  • Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
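The comparison on the validation data can be sketched as follows. Two assumptions are worth flagging: the mean square prediction error is taken here as the average squared difference between observed and predicted validation responses, and the cross-validation R² is taken as one minus the ratio of the squared prediction errors to the total sum of squares of the validation responses about their own mean. The validation data, the coefficients, and the function names are hypothetical.

```python
def predict(coefs, rows):
    """coefs[0] is the intercept; the rest pair up with each row's predictor values."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in rows]

def mspe(y_true, y_pred):
    """Mean square prediction error on the validation set."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def validation_r2(y_true, y_pred):
    """1 - (squared prediction errors) / (total sum of squares of y_true)."""
    mean_y = sum(y_true) / len(y_true)
    ss_pred = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_total = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_pred / ss_total

# Hypothetical validation data: each row holds (x1, x2).
x_val = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
y_val = [3.0, 6.5, 7.0, 11.0]

# Hypothetical candidate models, already fitted on the training set:
# name -> (coefficients, indices of the predictors the model uses).
models = {
    "x1 only": ([1.0, 2.5], [0]),
    "x1 and x2": ([2.0, 3.0, -1.0], [0, 1]),
}

scores = {}
for name, (coefs, cols) in models.items():
    rows = [[row[c] for c in cols] for row in x_val]
    y_hat = predict(coefs, rows)
    scores[name] = (mspe(y_val, y_hat), validation_r2(y_val, y_hat))

# The candidate with the smallest MSPE is the better predictor on these data.
best_name = min(scores, key=lambda name: scores[name][0])
```

The important detail is that the coefficients come from the training set only; the validation rows are used purely to score the predictions.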

And, most of all, don't forget that there is not necessarily only one good model for a given set of data. There might be a few equally satisfactory models.