# 10.7 - One Model Building Strategy

We've talked before about the "art" of model building. Unsurprisingly, there are many approaches to model building, but here is one strategy, consisting of seven steps, that is commonly used when building a regression model.

1. ## The First Step

Decide on the type of model that is needed to achieve the goals of the study. In general, there are five reasons one might want to build a regression model. They are:

• For predictive reasons — that is, the model will be used to predict the response variable from a chosen set of predictors.
• For theoretical reasons — that is, the researcher wants to estimate a model based on a known theoretical relationship between the response and predictors.
• For control purposes — that is, the model will be used to control a response variable by manipulating the values of the predictor variables.
• For inferential reasons — that is, the model will be used to explore the strength of the relationships between the response and the predictors.
• For data summary reasons — that is, the model will be used merely as a way to summarize a large set of data by a single equation.
2. ## The Second Step

Decide on which predictor variables and response variables to collect data. Then collect the data.

3. ## The Third Step

Explore the data. That is:

• On a univariate basis, check for outliers, gross data errors, and missing values.
• Study bivariate relationships to reveal other outliers, suggest possible transformations, and identify possible multicollinearities.

I can't possibly overemphasize the importance of this step. There's not a data analyst out there who hasn't made the mistake of skipping this step and later regretting it when a data point was found in error, thereby nullifying hours of work.
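
As a rough illustration of the two kinds of checks listed above, here is a minimal sketch using only Python's standard library. The function names (`univariate_report`, `pearson_r`), the z-score cutoff of 3, and the small data sets are all assumptions made for this example, not part of the lesson. A z-score rule is only one of several ways to flag unusual values, and a plot of the data is usually the better first look.

```python
import statistics

def univariate_report(values, z_cutoff=3.0):
    """Summarize one variable: count missing values and flag possible outliers.

    Missing values are assumed to be coded as None (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    # Flag observations more than z_cutoff standard deviations from the mean.
    flagged = [v for v in present if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {"n": len(present), "missing": missing, "flagged": flagged}

def pearson_r(x, y):
    """Sample correlation between two equal-length lists with no missing values."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up heights in inches; 660 is a deliberate data-entry error (66 mistyped).
heights = [64, 66, 65, 67, 68, 66, 65, 67, 66, 64, 68, 660, None]
report = univariate_report(heights)
print(report)

# Two made-up predictors that move almost in lockstep: a correlation near 1
# between predictors is a warning sign of possible multicollinearity.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10.5]
r_x1_x2 = pearson_r(x1, x2)
print(round(r_x1_x2, 3))
```

Note that a single gross error inflates the standard deviation itself, so in small samples a z-score rule can miss the very point it is meant to catch. That is one more reason to look at the data graphically as well.
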

4. ## The Fourth Step

Randomly divide the data into a training set and a validation set:

• The training set, with at least 15-20 error degrees of freedom, is used to estimate the model.
• The validation set is used for cross-validation of the fitted model.
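
A random split like the one described in this step could be sketched as follows. The 30% validation fraction, the fixed seed, and the function names are assumptions chosen for illustration; the lesson itself only asks that the training set keep at least 15-20 error degrees of freedom.

```python
import random

def train_validation_split(rows, validation_fraction=0.3, seed=0):
    """Randomly partition rows into (training, validation) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]             # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def error_degrees_of_freedom(n_train, n_predictors):
    """Error df = n - p, where p counts the intercept plus the predictors."""
    return n_train - (n_predictors + 1)

rows = list(range(50))             # stand-in for 50 observations
train, validation = train_validation_split(rows)
print(len(train), len(validation))
print(error_degrees_of_freedom(len(train), n_predictors=4))
```

With 50 observations and 4 predictors, this split leaves 35 training rows and 30 error degrees of freedom, which comfortably meets the guideline. With a much smaller data set, the validation fraction would have to shrink, or a different validation scheme would be needed.
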
5. ## The Fifth Step

Using the training set, identify several candidate models:

• Use best subsets regression.
• Use stepwise regression, which of course only yields one model unless different alpha-to-remove and alpha-to-enter values are specified.
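
To make the idea of best subsets regression concrete, here is a minimal sketch that fits every subset of predictors by least squares and keeps the best model of each size. It uses only the standard library so that the mechanics are visible; in practice one would rely on statistical software. The helper names, the use of adjusted R-squared as the ranking criterion, and the noise-free made-up data are all assumptions of this example.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(X, y):
    """Least squares via the normal equations; an intercept column is added here."""
    design = [[1.0] + list(row) for row in X]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(coefs, X):
    return [coefs[0] + sum(b * v for b, v in zip(coefs[1:], row)) for row in X]

def adjusted_r2(X, y, coefs):
    n, p = len(y), len(coefs)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, predict(coefs, X)))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - (sse / (n - p)) / (ssto / (n - 1))

def best_subsets(X, y, names):
    """Return {size: (predictor names, adjusted R^2)} for the best model of each size."""
    best = {}
    for k in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), k):
            X_sub = [[row[c] for c in cols] for row in X]
            score = adjusted_r2(X_sub, y, fit_ols(X_sub, y))
            if k not in best or score > best[k][1]:
                best[k] = ([names[c] for c in cols], score)
    return best

# Made-up training data: y = 3 + 2*x1 - x2 exactly, and x3 plays no role.
X_train = [[1, 2, 5], [2, 1, 3], [3, 4, 8], [4, 3, 1],
           [5, 6, 7], [6, 5, 2], [7, 8, 6], [8, 7, 4]]
y_train = [3 + 2 * x1 - x2 for x1, x2, _ in X_train]
candidates = best_subsets(X_train, y_train, ["x1", "x2", "x3"])
for size, (chosen, score) in candidates.items():
    print(size, chosen, round(score, 4))
```

The number of subsets doubles with each additional predictor, so an exhaustive search like this is only practical for a modest number of predictors. The output is a short list of candidates, one per model size, rather than a single winner.
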
6. ## The Sixth Step

Select and evaluate a few "good" models:

• Select the models based on the criteria we learned, as well as the number and nature of the predictors.
• Evaluate the selected models for violation of the model conditions.
• If none of the models provide a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.
7. ## The Seventh (and Final) Step

Select the final model:

• Compare the competing models by cross-validating them against the validation data.
• Among the competing models, the one with the smallest mean square prediction error (or the largest cross-validation $$R^{2}$$) is the best predictive model.
• Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
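
The comparison on the validation data can be sketched as follows. The formulas used here are one common way to define these quantities and are an assumption of this example: the mean square prediction error is the average squared difference between observed and predicted validation responses, and the cross-validation R-squared is one minus the ratio of the prediction sum of squares to the total sum of squares about the validation mean. The observed values and both sets of predictions are made up.

```python
def mspe(y_true, y_pred):
    """Mean square prediction error over the validation set."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def validation_r2(y_true, y_pred):
    """1 - SSPE / SSTO, with SSTO taken about the validation-set mean."""
    y_bar = sum(y_true) / len(y_true)
    sspe = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ssto = sum((a - y_bar) ** 2 for a in y_true)
    return 1 - sspe / ssto

# Made-up validation responses and predictions from two hypothetical
# candidate models, each fitted on the training set only.
y_valid = [10.0, 12.0, 14.0, 16.0]
pred_a = [11.0, 12.0, 13.0, 16.0]
pred_b = [12.0, 12.0, 12.0, 18.0]

for label, pred in [("A", pred_a), ("B", pred_b)]:
    print(label, mspe(y_valid, pred), round(validation_r2(y_valid, pred), 3))
```

Here model A has the smaller prediction error and the larger validation R-squared, so it would be preferred on predictive grounds. The other considerations in the list above, such as parsimony and ease of measurement, could still tip the choice when the numbers are close.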

And, most of all, don't forget that there is not necessarily only one good model for a given set of data. There might be a few equally satisfactory models.
