We've talked before about the "art" of model building. Unsurprisingly, there are many approaches to model building, but here is one strategy — consisting of seven steps — that is commonly used when building a regression model.

**The First Step** Decide on the type of model that is needed in order to achieve the goals of the study. In general, there are five reasons one might want to build a regression model. They are:

- For **predictive** reasons — that is, the model will be used to predict the response variable from a chosen set of predictors.
- For **theoretical** reasons — that is, the researcher wants to estimate a model based on a known theoretical relationship between the response and predictors.
- For **control** purposes — that is, the model will be used to control a response variable by manipulating the values of the predictor variables.
- For **inferential** reasons — that is, the model will be used to explore the strength of the relationships between the response and the predictors.
- For **data summary** reasons — that is, the model will be used merely as a way to summarize a large set of data by a single equation.
**The Second Step** Decide which predictor and response variables to collect data on. Then collect the data.

**The Third Step** Explore the data. That is:

- On a univariate basis, check for outliers, gross data errors, and missing values.
- Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.

I can't possibly overemphasize the importance of this step. There's not a data analyst out there who hasn't made the mistake of skipping it and later regretting it when a data point was found to be in error, thereby nullifying hours of work.
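
The univariate screening in the Third Step can be sketched with pandas. Everything here is hypothetical — the data are synthetic, with one gross error and one missing value planted on purpose, and the |z| > 3 rule is just one common screening threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two predictors (x1, x2) and a response (y)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(50, 10, 100),
    "x2": rng.normal(0, 1, 100),
    "y":  rng.normal(100, 20, 100),
})
df.loc[5, "x1"] = 500.0   # plant a gross data error
df.loc[7, "x2"] = np.nan  # plant a missing value

# Univariate screening: missing-value counts per column,
# and counts of points more than 3 standard deviations from the mean
missing = df.isna().sum()
z = (df - df.mean()) / df.std()
outliers = (z.abs() > 3).sum()

print(missing)
print(outliers)
```

Flagged points are candidates for investigation, not automatic deletion — the point is to find them *before* fitting anything.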

**The Fourth Step** Randomly divide the data into a training set and a validation set:

- The **training set**, with at least 15-20 error degrees of freedom, is used to estimate the model.
- The **validation set** is used for cross-validation of the fitted model.
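
The random split in the Fourth Step can be sketched with pandas. The data frame, the column names, and the 70/30 split fraction are all hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3 predictors, 1 response
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["x1", "x2", "x3", "y"])

# Randomly assign ~70% of rows to training, the rest to validation
train = df.sample(frac=0.7, random_state=1)
valid = df.drop(train.index)

# With p = 3 predictors and an intercept, the training set leaves
# n - p - 1 = 70 - 3 - 1 = 66 error degrees of freedom,
# comfortably above the 15-20 minimum suggested above
error_df = len(train) - 3 - 1
print(len(train), len(valid), error_df)
```

If the training set would fall below 15-20 error degrees of freedom, a larger training fraction (or more data) is needed before splitting.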
**The Fifth Step** Using the training set, identify several candidate models:

- Use best subsets regression.
- Use stepwise regression, which yields only one model unless different alpha-to-remove and alpha-to-enter values are specified.
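
A minimal best subsets search can be sketched with numpy: fit ordinary least squares on every non-empty subset of the predictors and compare them on adjusted \(R^{2}\). The synthetic data (in which only x1 and x3 actually matter) and the choice of adjusted \(R^{2}\) as the criterion are assumptions for illustration — dedicated software also reports \(C_p\), MSE, and other criteria:

```python
import itertools
import numpy as np

# Hypothetical training data: y depends on x1 and x3 but not x2
rng = np.random.default_rng(2)
n = 70
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def adj_r2(X_sub, y):
    """Fit OLS with an intercept and return adjusted R-squared."""
    n, p = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Best subsets: evaluate every non-empty subset of the predictors
names = ["x1", "x2", "x3"]
results = {}
for k in range(1, 4):
    for subset in itertools.combinations(range(3), k):
        label = tuple(names[j] for j in subset)
        results[label] = adj_r2(X[:, subset], y)

best = max(results, key=results.get)
print(best)
```

With only three predictors there are \(2^{3}-1=7\) subsets; the exhaustive search grows as \(2^{p}\), which is why stepwise procedures exist for large \(p\).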

**The Sixth Step** Select and evaluate a few "good" models:

- Select the models based on the criteria we learned, as well as the number and nature of the predictors.
- Evaluate the selected models for violation of the model conditions.
- If none of the models provide a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.

**The Seventh (and Final) Step** Select the final model:

- Compare the competing models by cross-validating them against the validation data.
- The model with a smaller mean square prediction error (or larger cross-validation \(R^{2}\)) is a better predictive model.
- Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
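
Comparing competing models by mean square prediction error on the validation set can be sketched as follows. The two candidate models and the synthetic training/validation data are hypothetical; model B deliberately omits a predictor the response depends on:

```python
import numpy as np

# Hypothetical split: fit on training data, score on validation data
rng = np.random.default_rng(3)
X_train = rng.normal(size=(70, 2))
y_train = 1.5 * X_train[:, 0] - 2.0 * X_train[:, 1] \
    + rng.normal(scale=0.5, size=70)
X_valid = rng.normal(size=(30, 2))
y_valid = 1.5 * X_valid[:, 0] - 2.0 * X_valid[:, 1] \
    + rng.normal(scale=0.5, size=30)

def fit_ols(X, y):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mspe(beta, X, y):
    """Mean square prediction error on held-out data."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.mean((y - A @ beta) ** 2)

# Candidate model A uses both predictors; candidate model B uses x1 only
beta_a = fit_ols(X_train, y_train)
beta_b = fit_ols(X_train[:, :1], y_train)

mspe_a = mspe(beta_a, X_valid, y_valid)
mspe_b = mspe(beta_b, X_valid[:, :1], y_valid)
print(mspe_a < mspe_b)
```

The model with the smaller validation MSPE is the better predictive model by this criterion, though parsimony and the other considerations above can still tip the final choice.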

And, most of all, don't forget that there is not necessarily only **one** good model for a given set of data. There might be a few equally satisfactory models.