10.7 - One Model Building Strategy

We've talked before about the "art" of model building. Unsurprisingly, there are many approaches to model building, but here is one strategy, consisting of seven steps, that is commonly used when building a regression model.

  1. The First Step

    Decide on the type of model that is needed to achieve the goals of the study. In general, there are five reasons one might want to build a regression model. They are:

    • For predictive reasons — that is, the model will be used to predict the response variable from a chosen set of predictors.
    • For theoretical reasons — that is, the researcher wants to estimate a model based on a known theoretical relationship between the response and predictors.
    • For control purposes — that is, the model will be used to control a response variable by manipulating the values of the predictor variables.
    • For inferential reasons — that is, the model will be used to explore the strength of the relationships between the response and the predictors.
    • For data summary reasons — that is, the model will be used merely as a way to summarize a large set of data by a single equation.
  2. The Second Step

    Decide which predictor variables and which response variable to collect data on. Then collect the data.

  3. The Third Step

    Explore the data. That is:

    • On a univariate basis, check for outliers, gross data errors, and missing values.
    • Study bivariate relationships to reveal other outliers, suggest possible transformations, and identify possible multicollinearities.

    I can't possibly overemphasize the importance of this step. There's not a data analyst out there who hasn't made the mistake of skipping this step and later regretting it when a data point was found in error, thereby nullifying hours of work.
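
    As a rough illustration of the two checks above, here is a minimal sketch using only the Python standard library. The 1.5 × IQR outlier fences, the helper names (`univariate_report`, `pearson`), and the toy data are all assumptions made for this example; they are not part of the strategy itself, and in practice plots are at least as informative as these numeric summaries.

```python
import math
import statistics

def univariate_report(values):
    """Count missing values and flag points outside the 1.5 * IQR fences."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return {
        "n_missing": len(values) - len(present),
        "outliers": [v for v in present if v < low or v > high],
    }

def pearson(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: x2 is nearly a multiple of x1 (possible
# multicollinearity), and y holds one gross error and one missing value.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.2]
y = [3.0, 4.1, 5.2, 5.9, 7.1, 80.0, 9.0, None]

report = univariate_report(y)   # univariate check on the response
r_x1_x2 = pearson(x1, x2)       # bivariate check between two predictors
print(report)
print(round(r_x1_x2, 3))
```

    A correlation between two predictors that is close to 1 in absolute value is only a hint of multicollinearity, and a flagged point is only a candidate for closer inspection, not proof of an error.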

  4. The Fourth Step

    Randomly divide the data into a training set and a validation set:

    • The training set, with at least 15-20 error degrees of freedom, is used to estimate the model.
    • The validation set is used for cross-validation of the fitted model.
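
    The random split can be sketched as follows. This is a minimal example under stated assumptions: the helper names (`split_rows`, `error_df`), the fixed seed, and the sizes (40 observations, 4 predictors) are hypothetical. The error degrees of freedom are computed as n − (k + 1) for a model with an intercept and k predictors.

```python
import random

def split_rows(rows, n_train, seed=0):
    """Shuffle a copy of the rows and cut it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[:n_train], shuffled[n_train:]

def error_df(n_train, n_predictors):
    """Error degrees of freedom for a model with an intercept: n - (k + 1)."""
    return n_train - (n_predictors + 1)

rows = list(range(40))  # stand-in for 40 observation records
train, valid = split_rows(rows, n_train=25)
print(len(train), len(valid), error_df(len(train), n_predictors=4))
```

    With 25 training rows and 4 predictors, 20 error degrees of freedom remain, which meets the guideline above.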
  5. The Fifth Step

    Using the training set, identify several candidate models:

    • Use best subsets regression.
    • Use stepwise regression, which of course only yields one model unless different alpha-to-remove and alpha-to-enter values are specified.
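
    To make the idea of best subsets regression concrete, here is a minimal pure-Python sketch that fits every subset of predictors by least squares and keeps the best subset of each size. The helper names, the use of adjusted R-squared as the criterion, and the toy data are assumptions for illustration only. Statistical software does the same search far more efficiently and reports additional criteria.

```python
from itertools import combinations

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_sse(columns, y):
    """Least-squares fit with an intercept via the normal equations; returns SSE."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

def best_subsets(predictors, y):
    """For each subset size k, keep the subset with the largest adjusted R^2."""
    n = len(y)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    best = {}
    for k in range(1, len(predictors) + 1):
        for names in combinations(sorted(predictors), k):
            sse = fit_sse([predictors[name] for name in names], y)
            adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
            if k not in best or adj_r2 > best[k][1]:
                best[k] = (names, adj_r2)
    return best

# Hypothetical training data: y depends on x1 and x3 but not on x2.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [5, 3, 8, 1, 9, 2, 7, 4, 10, 6],
    "x3": [2, 7, 1, 8, 3, 9, 4, 10, 5, 6],
}
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.1]
y = [1 + 2 * a + 3 * c + e for a, c, e in zip(predictors["x1"], predictors["x3"], noise)]

best = best_subsets(predictors, y)
for k, (names, adj_r2) in best.items():
    print(k, names, round(adj_r2, 4))
```

    The best model of each size becomes a candidate. Choosing among sizes is left to the next step.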
  6. The Sixth Step

    Select and evaluate a few "good" models:

    • Select the models based on the criteria we learned, as well as the number and nature of the predictors.
    • Evaluate the selected models for violation of the model conditions.
    • If none of the models provide a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.
  7. The Seventh (and Final) Step

    Select the final model:

    • Compare the competing models by cross-validating them against the validation data.
    • The model with the smaller mean squared prediction error (or the larger cross-validation \(R^{2}\)) is the better predictive model.
    • Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
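
    The comparison on the validation data can be sketched as follows. This is a minimal example under assumptions: the mean squared prediction error is taken as the average squared difference between observed and predicted validation responses, the cross-validation R-squared is taken as 1 − SSE/SST computed on the validation set (other definitions exist), and the observations and predictions for the two candidate models are made up.

```python
def mspe(y_obs, y_pred):
    """Mean squared prediction error over the validation observations."""
    return sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs)

def validation_r2(y_obs, y_pred):
    """1 - SSE/SST on the validation set (one possible definition)."""
    ybar = sum(y_obs) / len(y_obs)
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    sst = sum((o - ybar) ** 2 for o in y_obs)
    return 1 - sse / sst

# Hypothetical validation responses and the predictions of two candidate
# models that were fitted on the training set only.
y_valid = [10.0, 12.0, 15.0, 11.0, 14.0]
predictions = {
    "A": [10.5, 11.5, 14.0, 11.5, 14.5],
    "B": [12.0, 12.0, 12.0, 12.0, 13.0],
}

scores = {
    name: (mspe(y_valid, pred), validation_r2(y_valid, pred))
    for name, pred in predictions.items()
}
best_model = min(scores, key=lambda name: scores[name][0])  # smallest MSPE
print(scores)
print(best_model)
```

    Here model A predicts the validation responses more closely on both measures, but the remaining considerations in the list still apply before a final choice is made.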

    And, most of all, don't forget that there is not necessarily only one good model for a given set of data. There might be a few equally satisfactory models.