# 12.7 - Model-Based Methods: Building a Model

Regression/ANCOVA models as described above are most useful when

- they contain a few clinically relevant and interpretable prognostic variables
- the parameters or coefficients are estimated with relatively high precision
- the prognostic factors each carry independent information about the outcome variable
- the model is consistent with other clinical and biological data

For a given situation, however, it may not be easy to construct a model that satisfies these criteria.

Modern statistical software can automate portions of the model-building process. Caution must be exercised, however, for the following reasons.

- The criteria employed by the software may be inappropriate, e.g., relying solely on *p*-values.
- Performing a large number of tests and repeatedly refitting models has poor statistical properties.
- An automated procedure cannot incorporate outside information into the model-building process.
- The software may not handle the problem of missing data very well.

The model-building process requires thought and an understanding of the clinical situation. Some statisticians include a prognostic variable in the model only if there is a plausible biological reason for its inclusion.

## Approaches

Computer software to assist in the construction and evaluation of a model follows several approaches.

One approach is called a **step-up** or **forward selection process**, in which the initial model contains no regressors and regressors enter the model one at a time. In this situation, a regressor enters the model if its *p*-value is less than a critical value, say 0.05.
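The forward selection process can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not how any particular package implements it: the data are simulated, the names `invert`, `pvalues`, and `forward_select` are hypothetical helpers, *p*-values use a large-sample normal approximation instead of the exact *t* distribution, and at each step the candidate with the smallest *p*-value is the one considered for entry.

```python
import math
import random

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]

def pvalues(cols, y):
    """OLS of y on an intercept plus cols; two-sided p-value per column.

    Assumption: normal approximation to the t distribution (large n).
    """
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    s2 = rss / (n - k)  # residual variance estimate
    return [math.erfc(abs(beta[a]) / math.sqrt(2 * s2 * inv[a][a]))
            for a in range(1, k)]

def forward_select(data, y, alpha=0.05):
    """Start with no regressors; add the most significant one while p < alpha."""
    selected = []
    while True:
        trials = {v: pvalues([data[m] for m in selected + [v]], y)[-1]
                  for v in data if v not in selected}
        if not trials:
            return selected
        best = min(trials, key=trials.get)
        if trials[best] >= alpha:
            return selected
        selected.append(best)

# Simulated data: y depends strongly on x1, moderately on x2, not on x3.
rng = random.Random(0)
n = 200
data = {v: [rng.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [3 * a + b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

selected = forward_select(data, y)
print(selected)
```

Because `x3` is pure noise, it should enter only by chance, roughly 5% of the time at this critical value.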

Another approach is called the **step-down** or **backward selection process**, in which the initial model contains all of the regressors. In this situation, a regressor is eliminated from the model if its *p*-value is not less than the critical value.
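A minimal sketch of backward selection, written in standard-library Python on simulated data, is given here for illustration. The helper names (`invert`, `pvalues`, `backward_eliminate`) are hypothetical, the *p*-values rely on a large-sample normal approximation, and removing the single least significant regressor at each step is an assumption about how the elimination is ordered.

```python
import math
import random

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]

def pvalues(cols, y):
    """OLS of y on an intercept plus cols; two-sided p-value per column.

    Assumption: normal approximation to the t distribution (large n).
    """
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    s2 = rss / (n - k)  # residual variance estimate
    return [math.erfc(abs(beta[a]) / math.sqrt(2 * s2 * inv[a][a]))
            for a in range(1, k)]

def backward_eliminate(data, y, alpha=0.05):
    """Start with all regressors; drop the least significant while p >= alpha."""
    kept = list(data)
    while kept:
        pv = pvalues([data[v] for v in kept], y)
        worst = max(range(len(kept)), key=lambda i: pv[i])
        if pv[worst] < alpha:
            break  # every remaining regressor meets the criterion
        kept.pop(worst)
    return kept

# Simulated data: y depends on x1 and x2 but not on x3.
rng = random.Random(0)
n = 200
data = {v: [rng.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [3 * a + b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

kept = backward_eliminate(data, y)
print(kept)
```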

A third approach, called **stepwise selection**, is a modification of forward selection. In this situation, after a new variable enters the model, all the variables that had entered the model previously are reexamined to see if their *p*-values have changed. If any of the revised *p*-values exceed the critical value, then the corresponding variables are eliminated from the model.
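Stepwise selection differs from forward selection only in the re-examination step, which the following standard-library Python sketch makes explicit. It is an illustration on simulated data, not a reference implementation: the helper names are hypothetical, *p*-values use a large-sample normal approximation, and the `max_steps` guard is an added assumption to rule out endless add/remove cycles.

```python
import math
import random

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]

def pvalues(cols, y):
    """OLS of y on an intercept plus cols; two-sided p-value per column.

    Assumption: normal approximation to the t distribution (large n).
    """
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    s2 = rss / (n - k)  # residual variance estimate
    return [math.erfc(abs(beta[a]) / math.sqrt(2 * s2 * inv[a][a]))
            for a in range(1, k)]

def stepwise(data, y, alpha=0.05, max_steps=50):
    """Forward selection with re-examination of earlier entrants."""
    model = []
    for _ in range(max_steps):  # guard against add/remove cycles
        # Forward step: consider the most significant remaining candidate.
        trials = {v: pvalues([data[m] for m in model + [v]], y)[-1]
                  for v in data if v not in model}
        if not trials:
            break
        best = min(trials, key=trials.get)
        if trials[best] >= alpha:
            break
        model.append(best)
        # Re-examination step: drop variables whose revised p-value
        # no longer meets the critical value.
        pv = pvalues([data[m] for m in model], y)
        model = [m for m, p in zip(model, pv) if p < alpha]
    return model

# Simulated data: y depends on x1 and x2 but not on x3.
rng = random.Random(0)
n = 200
data = {v: [rng.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [3 * a + b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

model = stepwise(data, y)
print(model)
```

With independent simulated regressors the re-examination step rarely removes anything; it matters mainly when candidate regressors are correlated.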

A fourth approach involves finding the best one-variable model, the best two-variable model, etc., with the help of software, and then **using judgment as to which is the best overall model**. For example, if the (c+1)-variable model is only slightly better than the c-variable model, the latter is selected. It is prudent to attempt a variety of models and approaches to determine if the results are consistent.
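The search for the best model of each size can be sketched as follows in standard-library Python. The choice of residual sum of squares to rank models of the same size, and of adjusted \(R^2\) to compare sizes, is an assumption made for this illustration; the data are simulated and the helper names are hypothetical. The final judgment about which size to prefer is left to the analyst.

```python
import random
from itertools import combinations

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]

def rss(cols, y):
    """Residual sum of squares from OLS of y on an intercept plus cols."""
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

# Simulated data: y depends on x1 and x2 but not on x3.
rng = random.Random(0)
n = 200
data = {v: [rng.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [3 * a + b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

ybar = sum(y) / n
tss = sum((v - ybar) ** 2 for v in y)  # total sum of squares

best = {}
for size in range(1, len(data) + 1):
    # Best model of this size: the subset with the smallest RSS.
    subset = min(combinations(data, size),
                 key=lambda s: rss([data[v] for v in s], y))
    fit_rss = rss([data[v] for v in subset], y)
    adj_r2 = 1 - (fit_rss / (n - size - 1)) / (tss / (n - 1))
    best[size] = (subset, adj_r2)
    print(size, subset, round(adj_r2, 3))
```

If the adjusted \(R^2\) barely moves between the two-variable and three-variable models, the smaller model would be preferred under the rule described above.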

Some statisticians favor the backward selection or step-down process, although there is no universal agreement among statisticians. It is not unusual to find, for a particular data set, that step-up and step-down selection algorithms lead to different models. The main reason for this is that the regressors/covariates are not completely independent of each other.

When a variable is entered into or removed from a model, the *p*-values of the other variables will change. Consider a linear model with two potential regressors, \(X_1\) and \(X_2\), and suppose that they are strongly correlated (this is why the term “independent variables” is a misnomer for regressors). Suppose that in a model with \(X_1\) only, \(X_1\) is significant, and in a model with \(X_2\) only, \(X_2\) is significant. When a model is constructed with both \(X_1\) and \(X_2\), however, the contribution of \(X_2\) to the model is no longer statistically significant. Because \(X_1\) and \(X_2\) are strongly correlated, \(X_2\) has very little predictive power when \(X_1\) is already in the model.
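The two-regressor situation can be reproduced with a small simulation in standard-library Python. Everything here is an assumption made for illustration: the outcome is generated from `x1` alone, `x2` is a noisy copy of `x1` (correlation about 0.9), the helper names are hypothetical, and *p*-values use a large-sample normal approximation. In most simulated draws `x2` is significant on its own but not once `x1` is in the model; the exact values depend on the seed.

```python
import math
import random

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[k:] for row in M]

def pvalues(cols, y):
    """OLS of y on an intercept plus cols; two-sided p-value per column.

    Assumption: normal approximation to the t distribution (large n).
    """
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    s2 = rss / (n - k)  # residual variance estimate
    return [math.erfc(abs(beta[a]) / math.sqrt(2 * s2 * inv[a][a]))
            for a in range(1, k)]

rng = random.Random(1)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.5 * rng.gauss(0, 1) for a in x1]  # strongly correlated with x1
y = [a + rng.gauss(0, 1) for a in x1]         # outcome driven by x1 only

p_x1_only = pvalues([x1], y)[0]
p_x2_only = pvalues([x2], y)[0]
p_x1_joint, p_x2_joint = pvalues([x1, x2], y)

print("x1 alone:", p_x1_only)
print("x2 alone:", p_x2_only)
print("both:    ", p_x1_joint, p_x2_joint)
```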

Initial screening of the entire set of candidate regressors/covariates is advised. Many statisticians recommend that each potential regressor be examined individually in a simple model. This can help identify potentially useful regressors even when there is not a strong biological justification for them.

Usually, the critical significance level in this first-stage approach is more lenient, say 0.10 or 0.15. Then all of the regressors that meet this first-stage criterion and/or that have biological/clinical justification comprise the set of regressors that are subjected to the model-building process. Clinical input always should augment this first-stage process.
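The first-stage screen described above can be sketched in standard-library Python as one simple regression per candidate. The data are simulated, the set `clinically_required` is a hypothetical stand-in for clinical input, and the *p*-value uses a large-sample normal approximation; none of these details come from a specific study.

```python
import math
import random

def simple_pvalue(x, y):
    """Two-sided p-value for the slope in a one-regressor linear model.

    Assumption: normal approximation to the t distribution (large n).
    """
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - ybar - slope * (a - xbar)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return math.erfc(abs(slope) / se / math.sqrt(2))

# Simulated candidates: x1 and x2 affect y; x3 and x4 are noise.
rng = random.Random(0)
n = 200
data = {v: [rng.gauss(0, 1) for _ in range(n)]
        for v in ("x1", "x2", "x3", "x4")}
y = [2 * a + 1.5 * b + rng.gauss(0, 1)
     for a, b in zip(data["x1"], data["x2"])]

screen_alpha = 0.15            # lenient first-stage critical value
clinically_required = {"x3"}   # hypothetical: kept on clinical grounds

screen_p = {v: simple_pvalue(col, y) for v, col in data.items()}
shortlist = [v for v in data
             if screen_p[v] < screen_alpha or v in clinically_required]

for v in data:
    print(v, round(screen_p[v], 4), "keep" if v in shortlist else "drop")
```

The resulting `shortlist` is the set of regressors that would then be passed to one of the selection approaches.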