10.7  One Model Building Strategy
We've talked before about the "art" of model building. Unsurprisingly, there are many approaches to model building, but here is one strategy, consisting of seven steps, that is commonly used when building a regression model.

The First Step
Decide on the type of model that is needed to achieve the goals of the study. In general, there are five reasons one might want to build a regression model. They are:
 For predictive reasons — that is, the model will be used to predict the response variable from a chosen set of predictors.
 For theoretical reasons — that is, the researcher wants to estimate a model based on a known theoretical relationship between the response and predictors.
 For control purposes — that is, the model will be used to control a response variable by manipulating the values of the predictor variables.
 For inferential reasons — that is, the model will be used to explore the strength of the relationships between the response and the predictors.
 For data summary reasons — that is, the model will be used merely as a way to summarize a large set of data by a single equation.

The Second Step
Decide on which predictor variables and response variables to collect data. Collect the data.

The Third Step
Explore the data. That is:
 On a univariate basis, check for outliers, gross data errors, and missing values.
 Study bivariate relationships to reveal other outliers, suggest possible transformations, and identify possible multicollinearities.
I can't possibly overemphasize the importance of this step. There's not a data analyst out there who hasn't made the mistake of skipping this step and later regretting it when a data point was found in error, thereby nullifying hours of work.
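As a rough illustration of these checks, here is a minimal sketch using only the Python standard library. The function names (`univariate_report`, `pearson_r`) and the 1.5 × IQR outlier rule are assumptions made for this example, not something the strategy itself prescribes; any reasonable outlier screen would do.

```python
import math
import statistics

def univariate_report(values):
    """Count missing values and flag points outside the 1.5*IQR fences.

    Missing values are assumed to be coded as None or NaN.
    """
    present = [v for v in values if v is not None and not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n_missing": len(values) - len(present),
        "outliers": [v for v in present if v < low or v > high],
    }

def pearson_r(x, y):
    """Sample correlation between two equal-length, complete lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Running `univariate_report` on each variable covers the univariate checks, and a large absolute `pearson_r` between two predictors is a hint of possible multicollinearity. Scatter plots are still worth drawing, since a single number cannot reveal a curved relationship.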

The Fourth Step
Randomly divide the data into a training set and a validation set:
 The training set, with at least 15-20 error degrees of freedom, is used to estimate the model.
 The validation set is used for cross-validation of the fitted model.
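The random split might be sketched as follows. The helper names `train_validation_split` and `error_df`, the fixed seed, and the assumption that the model includes an intercept are all choices made for this illustration.

```python
import random

def train_validation_split(rows, n_train, seed=0):
    """Shuffle row indices and return (training rows, validation rows)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train = [rows[i] for i in indices[:n_train]]
    valid = [rows[i] for i in indices[n_train:]]
    return train, valid

def error_df(n_train, n_predictors):
    """Error degrees of freedom for a model with an intercept: n - (k + 1)."""
    return n_train - (n_predictors + 1)
```

For example, a training set of 35 observations and a model with 4 predictors leaves `error_df(35, 4)`, that is 30, error degrees of freedom. Checking this number before fitting helps confirm that the training set is large enough for the biggest candidate model.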

The Fifth Step
Using the training set, identify several candidate models:
 Use best subsets regression.
 Use stepwise regression, which of course only yields one model unless different alpha-to-remove and alpha-to-enter values are specified.
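To make the idea of best subsets regression concrete, here is a minimal sketch that fits an ordinary least squares model to every non-empty subset of predictors and ranks the fits by adjusted \(R^{2}\). It assumes NumPy is available, that `X` is a two-dimensional array of training predictors, and that the number of predictors is small enough for exhaustive search. The function name `best_subsets` and the choice of adjusted \(R^{2}\) as the ranking criterion are assumptions for this example; statistical software typically reports several criteria side by side.

```python
from itertools import combinations

import numpy as np

def best_subsets(X, y, names):
    """Fit OLS on every non-empty subset of columns of X.

    Returns a list of (predictor names, R^2, adjusted R^2) tuples,
    sorted from highest to lowest adjusted R^2.
    """
    n = len(y)
    sst = float(np.sum((y - y.mean()) ** 2))
    results = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            # Design matrix: intercept column plus the chosen predictors.
            design = np.column_stack([np.ones(n), X[:, list(cols)]])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            sse = float(np.sum((y - design @ beta) ** 2))
            r2 = 1 - sse / sst
            adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
            results.append((tuple(names[c] for c in cols), r2, adj_r2))
    return sorted(results, key=lambda r: r[2], reverse=True)
```

With p predictors this fits 2^p - 1 models, so it is practical only for a modest number of predictors. The top few entries of the returned list, not just the first, are the natural candidate models to carry forward.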

The Sixth Step
Select and evaluate a few "good" models:
 Select the models based on the criteria we learned, as well as the number and nature of the predictors.
 Evaluate the selected models for violation of the model conditions.
 If none of the models provide a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.

The Seventh (and Final) Step
Select the final model:
 Compare the competing models by cross-validating them against the validation data.
 The model with a smaller mean square prediction error (or larger cross-validation \(R^{2}\)) is a better predictive model.
 Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
And, most of all, don't forget that there is not necessarily only one good model for a given set of data. There might be a few equally satisfactory models.
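The comparison on the validation data can be sketched as below, assuming NumPy is available. The helpers `fit_ols` and `validation_scores` are hypothetical names. The mean square prediction error is taken here as the average squared prediction error over the validation set, and the cross-validation \(R^{2}\) as one minus the ratio of the prediction error sum of squares to the total sum of squares of the validation responses. Both are common definitions, but they are assumptions of this sketch, so check them against the ones your software reports.

```python
import numpy as np

def fit_ols(X_train, y_train):
    """Estimate intercept and slopes by least squares on the training set."""
    design = np.column_stack([np.ones(len(y_train)), X_train])
    beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return beta

def validation_scores(beta, X_valid, y_valid):
    """Return (MSPE, cross-validation R^2) on the validation set."""
    pred = np.column_stack([np.ones(len(y_valid)), X_valid]) @ beta
    sspe = float(np.sum((y_valid - pred) ** 2))
    mspe = sspe / len(y_valid)
    r2_cv = 1 - sspe / float(np.sum((y_valid - y_valid.mean()) ** 2))
    return mspe, r2_cv
```

Each candidate model is fitted once on the training set and scored once on the validation set. The validation observations play no part in estimating the coefficients, which is what makes the comparison a fair test of predictive ability.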