Missing values can cause problems during model building when subjects display different patterns of missingness across the regressors/covariates. For example, consider the aforementioned linear model with two potential regressors, \(X_1\) and \(X_2\). Suppose that there are 100 subjects, but 50 are missing \(X_1\) and the remaining 50 are missing \(X_2\). No subject has \(X_1\) and \(X_2\) observed simultaneously, so a model that includes both regressors cannot be fit to the observed data. This is an extreme case, but most model-building endeavors encounter some form of missingness.
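The extreme case above can be checked by counting complete cases. The following is a minimal sketch with made-up values; the variable names and the use of NaN to mark a missing value are assumptions for illustration only.

```python
import math

# Hypothetical data: 100 subjects, two regressors, NaN marks a missing value.
n = 100
x1 = [math.nan if i < 50 else float(i) for i in range(n)]  # first 50 subjects lack X1
x2 = [float(i) if i < 50 else math.nan for i in range(n)]  # remaining 50 subjects lack X2

def is_observed(value):
    """Return True when the value is not missing."""
    return not math.isnan(value)

n_x1 = sum(is_observed(v) for v in x1)    # subjects usable in a model with X1 only
n_x2 = sum(is_observed(v) for v in x2)    # subjects usable in a model with X2 only
n_both = sum(is_observed(a) and is_observed(b) for a, b in zip(x1, x2))  # complete cases

print(n_x1, n_x2, n_both)  # prints: 50 50 0
```

Each single-regressor model has 50 usable subjects, but a model with both regressors has none, which is why the per-analysis sample size is worth tabulating before any model is fit.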
Ideally, missingness among the regressors/covariates is unrelated to the outcome. If missingness is related to the outcome, then it may not be possible to develop an unbiased model. For example, if the patients with the most severe form of the disease are the ones with missing values for the regressors/covariates, then a model fit without these patients will be biased.
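A small simulation may help make this bias concrete. The sketch below is illustrative only and rests on several assumptions not stated in the text: a single regressor, a true slope of 2, normally distributed noise, and a missingness mechanism in which the regressor is unrecorded for the 30% of subjects with the largest outcome values (standing in for the most severe patients).

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def ols_slope(xs, ys):
    """Least-squares slope for a simple linear regression of ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical population: outcome = 1 + 2 * regressor + noise.
n = 5000
true_slope = 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_slope * xi + random.gauss(0, 1) for xi in x]

# Assumed mechanism: the regressor is missing for the 30% with the largest outcomes,
# so only subjects below the cutoff remain in a complete-case analysis.
cutoff = sorted(y)[int(0.7 * n)]
kept = [(xi, yi) for xi, yi in zip(x, y) if yi < cutoff]

slope_full = ols_slope(x, y)                                      # all subjects
slope_cc = ols_slope([k[0] for k in kept], [k[1] for k in kept])  # complete cases only

print(round(slope_full, 2), round(slope_cc, 2))
```

Under these assumptions the complete-case slope falls below the slope estimated from all subjects, because excluding subjects on the basis of the outcome distorts the fitted relationship.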
As discussed earlier, data imputation is one way to handle missing values. Data imputation involves estimating the missing values in a consistent manner and then substituting (“imputing”) the estimates for the missing values. Every subject then has a complete set of regressors/covariates, and the statistical analysis can proceed without eliminating any subjects.
The values to be imputed can be estimated by averaging over the observed values or by fitting regression models in which the regressors with missing values become the outcome variables. Imputation can itself introduce large biases, so it must be performed carefully on a case-by-case basis.
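The two estimation approaches just described can be sketched as follows. This is a simplified illustration, not a recommended procedure: the function names `mean_impute` and `regression_impute` are hypothetical, the data are made up, NaN is assumed to mark a missing value, and the regression version uses a single fully observed predictor.

```python
import math
from statistics import mean

def mean_impute(values):
    """Replace each NaN with the average of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]

def regression_impute(target, predictor):
    """Replace each NaN in target with a prediction from a least-squares line.

    The regressor with missing values (target) plays the role of the outcome,
    and the line is fit only to subjects with both variables observed.
    A subject whose predictor is also missing would remain NaN.
    """
    pairs = [(p, t) for p, t in zip(predictor, target)
             if not math.isnan(p) and not math.isnan(t)]
    p_bar = mean([p for p, _ in pairs])
    t_bar = mean([t for _, t in pairs])
    sxx = sum((p - p_bar) ** 2 for p, _ in pairs)
    sxy = sum((p - p_bar) * (t - t_bar) for p, t in pairs)
    slope = sxy / sxx
    intercept = t_bar - slope * p_bar
    return [intercept + slope * p if math.isnan(t) else t
            for p, t in zip(predictor, target)]

# Hypothetical data: X2 is missing for the last subject.
x1 = [1.0, 2.0, 3.0, 4.0, 6.0]
x2 = [2.0, 4.0, 6.0, 8.0, math.nan]

x2_mean = mean_impute(x2)             # last value becomes 5.0
x2_reg = regression_impute(x2, x1)    # last value becomes 12.0
print(x2_mean[-1], x2_reg[-1])
```

In this toy example the two methods give quite different imputed values (5.0 versus 12.0), which illustrates why the choice of method matters and should be reported.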
Make every effort to collect complete data to avoid such problems. When data are missing, be certain to report the numbers of patients used in each analysis and any methods used to impute missing values.