We've talked before about the "art" of model building. Unsurprisingly, there are many approaches to model building, but here is one strategy — consisting of seven steps — that is commonly used when building a regression model.

**The First Step** Decide on the type of model that is needed in order to achieve the goals of the study. In general, there are five reasons one might want to build a regression model. They are:

- For **predictive** reasons — that is, the model will be used to predict the response variable from a chosen set of predictors.
- For **theoretical** reasons — that is, the researcher wants to estimate a model based on a known theoretical relationship between the response and predictors.
- For **control** purposes — that is, the model will be used to control a response variable by manipulating the values of the predictor variables.
- For **inferential** reasons — that is, the model will be used to explore the strength of the relationships between the response and the predictors.
- For **data summary** reasons — that is, the model will be used merely as a way to summarize a large set of data by a single equation.
**The Second Step** Decide which predictor and response variables to collect data on. Then collect the data.

**The Third Step** Explore the data. That is:

- On a univariate basis, check for outliers, gross data errors, and missing values.
- Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.

I can't possibly overemphasize the importance of this step. There's not a data analyst out there who hasn't made the mistake of skipping it and later regretting it when a data point was found to be in error, thereby nullifying hours of work.
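
The univariate screening in the Third Step can be sketched with pandas. Everything here is hypothetical — the data are synthetic, with one gross error and one missing value planted on purpose, and the |z| > 3 rule is just one common screening threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two predictors (x1, x2) and a response (y)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(50, 10, 100),
    "x2": rng.normal(0, 1, 100),
    "y":  rng.normal(100, 20, 100),
})
df.loc[5, "x1"] = 500.0   # plant a gross data error
df.loc[7, "x2"] = np.nan  # plant a missing value

# Univariate screening: missing-value counts per column,
# and counts of points more than 3 standard deviations from the mean
missing = df.isna().sum()
z = (df - df.mean()) / df.std()
outliers = (z.abs() > 3).sum()

print(missing)
print(outliers)
```

Flagged points are candidates for investigation, not automatic deletion — the point is to find them *before* fitting anything.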

**The Fourth Step** Randomly divide the data into a training set and a validation set:

- The **training set**, with at least 15-20 error degrees of freedom, is used to estimate the model.
- The **validation set** is used for cross-validation of the fitted model.
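
The random split in the Fourth Step can be sketched with pandas. The data frame, the column names, and the 70/30 split fraction are all hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3 predictors, 1 response
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["x1", "x2", "x3", "y"])

# Randomly assign ~70% of rows to training, the rest to validation
train = df.sample(frac=0.7, random_state=1)
valid = df.drop(train.index)

# With p = 3 predictors and an intercept, the training set leaves
# n - p - 1 = 70 - 3 - 1 = 66 error degrees of freedom,
# comfortably above the 15-20 minimum suggested above
error_df = len(train) - 3 - 1
print(len(train), len(valid), error_df)
```

If the training set would fall below 15-20 error degrees of freedom, a larger training fraction (or more data) is needed before splitting.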
**The Fifth Step** Using the training set, identify several candidate models:

- Use best subsets regression.
- Use stepwise regression, which yields only one model unless different alpha-to-remove and alpha-to-enter values are specified.
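
A minimal best subsets search can be sketched with numpy: fit ordinary least squares on every non-empty subset of the predictors and compare them on adjusted \(R^{2}\). The synthetic data (in which only x1 and x3 actually matter) and the choice of adjusted \(R^{2}\) as the criterion are assumptions for illustration — dedicated software also reports \(C_p\), MSE, and other criteria:

```python
import itertools
import numpy as np

# Hypothetical training data: y depends on x1 and x3 but not x2
rng = np.random.default_rng(2)
n = 70
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def adj_r2(X_sub, y):
    """Fit OLS with an intercept and return adjusted R-squared."""
    n, p = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Best subsets: evaluate every non-empty subset of the predictors
names = ["x1", "x2", "x3"]
results = {}
for k in range(1, 4):
    for subset in itertools.combinations(range(3), k):
        label = tuple(names[j] for j in subset)
        results[label] = adj_r2(X[:, subset], y)

best = max(results, key=results.get)
print(best)
```

With only three predictors there are \(2^{3}-1=7\) subsets; the exhaustive search grows as \(2^{p}\), which is why stepwise procedures exist for large \(p\).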

**The Sixth Step** Select and evaluate a few "good" models:

- Select the models based on the criteria we learned, as well as the number and nature of the predictors.
- Evaluate the selected models for violation of the model conditions.
- If none of the models provide a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.

**The Seventh (and Final) Step** Select the final model:

- Compare the competing models by cross-validating them against the validation data.
- The model with a smaller mean square prediction error (or larger cross-validation \(R^{2}\)) is a better predictive model.
- Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
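
Comparing competing models by mean square prediction error on the validation set can be sketched as follows. The two candidate models and the synthetic training/validation data are hypothetical; model B deliberately omits a predictor the response depends on:

```python
import numpy as np

# Hypothetical split: fit on training data, score on validation data
rng = np.random.default_rng(3)
X_train = rng.normal(size=(70, 2))
y_train = 1.5 * X_train[:, 0] - 2.0 * X_train[:, 1] \
    + rng.normal(scale=0.5, size=70)
X_valid = rng.normal(size=(30, 2))
y_valid = 1.5 * X_valid[:, 0] - 2.0 * X_valid[:, 1] \
    + rng.normal(scale=0.5, size=30)

def fit_ols(X, y):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mspe(beta, X, y):
    """Mean square prediction error on held-out data."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.mean((y - A @ beta) ** 2)

# Candidate model A uses both predictors; candidate model B uses x1 only
beta_a = fit_ols(X_train, y_train)
beta_b = fit_ols(X_train[:, :1], y_train)

mspe_a = mspe(beta_a, X_valid, y_valid)
mspe_b = mspe(beta_b, X_valid[:, :1], y_valid)
print(mspe_a < mspe_b)
```

The model with the smaller validation MSPE is the better predictive model by this criterion, though parsimony and the other considerations above can still tip the final choice.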

And, most of all, don't forget that there is not necessarily only **one** good model for a given set of data. There might be a few equally satisfactory models.