12.9 - Other Regression Pitfalls

Nonconstant Variance

Excessive nonconstant variance can create technical difficulties with a multiple linear regression model. For example, if the residual variance increases with the fitted values, then prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values. Some remedies for refining a model exhibiting excessive nonconstant variance include the following:

Apply a variance-stabilizing transformation to the response variable, for example, a logarithmic transformation (or a square root transformation if a logarithmic transformation is "too strong" or a reciprocal transformation if a logarithmic transformation is "too weak"). We explored this in more detail in Lesson 9.
Allow the error variance to differ for each set of predictor values. This leads to weighted least squares, in which the observations are given different weights (typically inversely proportional to their error variances) when estimating the model. We'll cover this in Lesson 13.
A generalization of weighted least squares is to allow the regression errors to be correlated with one another in addition to having different variances. This leads to generalized least squares, in which various forms of nonconstant variance can be modeled.
For some applications, we can explicitly model the variance as a function of the mean, E(Y). This approach uses the framework of generalized linear models, which we discuss in the optional content.
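As a rough illustration of the weighted least squares remedy listed above, the following sketch fits ordinary and weighted least squares to synthetic data. Everything in it is an assumption made for illustration: the data are simulated, the error standard deviation is taken to be proportional to x, and the weights 1/x^2 are chosen because that variance structure is known by construction. With real data the appropriate weights usually have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): error SD grows in proportion to x.
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x)

X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column

# Ordinary least squares: every observation counts equally.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_ols = y - X @ beta_ols

# The residual spread is visibly larger in the upper half of x.
low = x < np.median(x)
sd_low = resid_ols[low].std()
sd_high = resid_ols[~low].std()

# Weighted least squares: weights inversely proportional to the error variance.
# Var(error_i) is proportional to x_i^2 here, so w_i = 1 / x_i^2.
# Scaling each row by sqrt(w_i) turns the weighted problem into an ordinary one.
w = 1.0 / x**2
sqrt_w = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)

print("residual SD, low x vs high x:", sd_low, sd_high)
print("OLS estimates (intercept, slope):", beta_ols)
print("WLS estimates (intercept, slope):", beta_wls)
```

Both fits estimate the same coefficients. The difference is that weighted least squares downweights the noisier observations at large x, which is what makes its standard errors and prediction intervals more trustworthy when the assumed variance structure is roughly right.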

Autocorrelation

One common way for the "independence" condition in a multiple linear regression model to fail is when the sample data have been collected over time and the regression model fails to effectively capture any time trends. In such a circumstance, the random errors in the model are often positively correlated over time, so each random error is more likely to be similar to the previous random error than it would be if the random errors were independent of one another. This phenomenon is known as autocorrelation (or serial correlation) and can sometimes be detected by plotting the model residuals versus time. We'll explore this further in the optional content.
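To make the idea concrete, here is a minimal sketch that simulates positively autocorrelated errors, fits a straight-line time trend, and then measures how similar each residual is to the previous one. The data-generating process (first-order autoregressive errors with coefficient 0.8) is an assumption chosen for illustration. The Durbin-Watson statistic computed at the end is a standard summary that is not discussed in this section; values near 2 are consistent with uncorrelated errors, and values well below 2 suggest positive autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic time series (assumed): linear trend plus AR(1) errors with rho = 0.8.
n = 300
t = np.arange(n, dtype=float)
rho = 0.8
errors = np.zeros(n)
for i in range(1, n):
    errors[i] = rho * errors[i - 1] + rng.normal()
y = 5.0 + 0.1 * t + errors

# Fit the regression of y on time by ordinary least squares.
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Lag-1 autocorrelation: correlation between each residual and the previous one.
lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares.
durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print("lag-1 residual autocorrelation:", lag1_corr)
print("Durbin-Watson statistic:", durbin_watson)
```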

Overfitting

When building a regression model, we don't want to include unimportant or irrelevant predictors whose presence can overcomplicate the model and increase our uncertainty about the magnitudes of the effects for the important predictors (particularly if some of those predictors are highly collinear). Such "overfitting" becomes more likely as a model grows more complicated, that is, as more predictor variables, transformations, and interactions are added to it. It is always prudent to apply a sanity check to any model being used to make decisions. Models should always make sense, preferably grounded in some kind of background theory or sensible expectation about the types of associations allowed between variables. Predictions from the model should also be reasonable (overcomplicated models can give quirky results that may not reflect reality).
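One way to see overfitting in action is to add pure-noise predictors to a model and compare how well it fits the data used to build it with how well it predicts new data. The sketch below does this on simulated data; the sample sizes, the number of noise predictors (20), and the helper names `fit` and `r_squared` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, n_noise = 30, 1000, 20

def make_data(n):
    """One relevant predictor plus n_noise irrelevant ones (synthetic)."""
    x = rng.normal(size=n)
    noise_preds = rng.normal(size=(n, n_noise))
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    return x, noise_preds, y

def fit(X, y):
    """Least squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

x_tr, z_tr, y_tr = make_data(n_train)
x_te, z_te, y_te = make_data(n_test)

# Small model: intercept and the one relevant predictor.
Xs_tr = np.column_stack([np.ones(n_train), x_tr])
Xs_te = np.column_stack([np.ones(n_test), x_te])
# Big model: the same terms plus 20 irrelevant predictors.
Xb_tr = np.column_stack([Xs_tr, z_tr])
Xb_te = np.column_stack([Xs_te, z_te])

beta_small = fit(Xs_tr, y_tr)
beta_big = fit(Xb_tr, y_tr)

r2_small = r_squared(Xs_tr, y_tr, beta_small)
r2_big = r_squared(Xb_tr, y_tr, beta_big)
mse_small = np.mean((y_te - Xs_te @ beta_small) ** 2)
mse_big = np.mean((y_te - Xb_te @ beta_big) ** 2)

print("training R^2   small vs big:", r2_small, r2_big)
print("test-set MSE   small vs big:", mse_small, mse_big)
```

The training R^2 can only go up when predictors are added, so it cannot by itself reveal overfitting. The prediction error on data not used in fitting is the more honest comparison.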

Excluding Important Predictor Variables

There is, however, a potentially greater risk from excluding important predictors than from including unimportant ones. The linear association between two variables when other relevant variables are ignored can differ in both magnitude and direction from the association that controls for those variables. Whereas the potential cost of including unimportant predictors might be increased difficulty with interpretation and reduced prediction accuracy, the potential cost of excluding important predictors can be a completely meaningless model containing misleading associations. Results can vary considerably depending on whether such predictors are (inappropriately) excluded or (appropriately) included. These omitted predictors are sometimes called confounding or lurking variables, and their absence from a model can lead to incorrect conclusions and poor decision-making.

Example 12-6: Simpson's Paradox

This example illustrates how a response variable can be positively associated with one predictor variable when a second predictor variable is ignored, but negatively associated with the first predictor when the second predictor is controlled for. The dataset used in the example is available in this file: paradox.txt.
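The same sign reversal can be reproduced with simulated data. The sketch below does not use paradox.txt; it generates its own synthetic variables, and the coefficients in the data-generating model (a true effect of -1 for x1 and +3 for x2, with x1 and x2 positively correlated) are assumptions chosen so that the reversal is easy to see.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumed): x1 and x2 are positively correlated, the true
# effect of x1 on y is negative, and the true effect of x2 is positive.
n = 500
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)
y = -1.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Model ignoring x2: y regressed on x1 only.
X_marginal = np.column_stack([np.ones(n), x1])
beta_marginal, *_ = np.linalg.lstsq(X_marginal, y, rcond=None)
slope_marginal = beta_marginal[1]

# Model controlling for x2: y regressed on both x1 and x2.
X_controlled = np.column_stack([np.ones(n), x1, x2])
beta_controlled, *_ = np.linalg.lstsq(X_controlled, y, rcond=None)
slope_controlled = beta_controlled[1]

print("slope for x1 ignoring x2:       ", slope_marginal)
print("slope for x1 controlling for x2:", slope_controlled)
```

When x2 is left out, x1 acts partly as a stand-in for it, so the estimated x1 slope absorbs some of the positive x2 effect and comes out positive even though the effect of x1 at a fixed value of x2 is negative.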

Missing Data

Real-world datasets frequently contain missing values, so we do not know the values of particular variables for some of the sample observations. For example, such values may be missing because they were impossible to obtain during data collection. Dealing with missing data is a challenging task. Missing data can adversely affect a regression analysis by reducing the total usable sample size. The best solution to this problem is to try extremely hard to avoid missing data in the first place. When there are missing values that are impossible or too costly to avoid, one approach is to replace the missing values with plausible estimates, a process known as imputation. Another (easier) approach is to consider only models that contain predictors with no (or few) missing values. This may be unsatisfactory, however, because even a predictor variable with a large number of missing values can contain useful information.
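The loss of usable sample size is easy to demonstrate. In the sketch below, about 15% of the values of each of three predictors are deleted at random from simulated data, and the model is then fit in two ways: using only the complete rows, and after filling each gap with the column mean. The missingness rate, the data, and the helper `fit` are hypothetical. Mean imputation is used only because it is the simplest possible imputation; it is not presented here as a recommended method, since it understates the variability of the imputed predictor.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data (assumed): three predictors and a linear response.
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Delete roughly 15% of the predictor values completely at random.
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.15] = np.nan

def fit(predictors, response):
    """Least squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(response)), predictors])
    beta, *_ = np.linalg.lstsq(A, response, rcond=None)
    return beta

# Approach 1: keep only the rows with no missing predictor values.
complete = ~np.isnan(X_miss).any(axis=1)
n_complete = int(complete.sum())
beta_complete = fit(X_miss[complete], y[complete])

# Approach 2: simple mean imputation, replacing each gap with its column mean.
col_means = np.nanmean(X_miss, axis=0)
X_imputed = np.where(np.isnan(X_miss), col_means, X_miss)
beta_imputed = fit(X_imputed, y)

print("rows available:", n, "complete rows:", n_complete)
print("complete-case estimates: ", beta_complete)
print("mean-imputed estimates:  ", beta_imputed)
```

Even a modest missingness rate per predictor removes a large share of the rows once several predictors are involved, because a row is dropped if any one of its values is missing.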

Power and Sample Size

In small datasets, a lack of observations can lead to poorly estimated models with large standard errors. Such models are said to lack statistical power because there is insufficient data to detect significant associations between the response and predictors. So, how much data do we need to conduct a successful regression analysis? A common rule of thumb is that 10 observations per predictor variable is a pragmatic lower bound for sample size. However, what determines whether a regression model is going to be useful is not so much the number of observations as whether the resulting model satisfies the LINE conditions. In some circumstances, a model applied to fewer than 10 observations per predictor variable might be perfectly fine (if, say, the model fits the data really well and the LINE conditions seem fine), while in other circumstances a model applied to a few hundred observations per predictor variable might be pretty poor (if, say, the model fits the data badly and one or more conditions are seriously violated).

As another example, we would generally need more data to model an interaction than to fit a similar model without the interaction. However, it is difficult to say exactly how much data would be needed. It is possible that we could adequately model an interaction with a relatively small number of observations if the interaction effect was pronounced and there was little statistical error. Conversely, in datasets with only weak interaction effects and relatively large statistical errors, it might take a much larger number of observations to obtain a satisfactory model. In practice, we have methods for assessing the LINE conditions, so it is possible to consider whether an interaction model approximately satisfies the assumptions on a case-by-case basis.

In conclusion, there is not really a good standard for determining sample size given the number of predictors, since the only truthful answer is, “It depends.” In many cases, it soon becomes pretty clear when working on a particular dataset if we are trying to fit a model with too many predictor terms for the number of sample observations (results can start to get a little odd, and standard errors greatly increase). From a different perspective, if we are designing a study and need to know how much data to collect, then we need to get into sample size and power calculations, which rapidly become quite complex. Some statistical software packages will do sample size and power calculations, and there is even some software specifically designed to do just that. When designing a large, expensive study, it is recommended to use such software or to get advice from a statistician with sample size expertise.
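For simple settings, power can also be approximated by simulation: generate many datasets of a given size from an assumed model, fit the regression to each, and record how often the slope is declared significant. The sketch below is one such approximation, not a substitute for dedicated power software. The function name `estimated_power`, the assumed true slope of 0.3, the error standard deviation of 1, and the cutoff |t| > 2 (a rough stand-in for a two-sided 5% test) are all assumptions. Because the exact critical value of the t distribution is larger than 2 for small samples, the estimates for the smallest n are somewhat optimistic.

```python
import numpy as np

def estimated_power(n, slope, sigma=1.0, n_sims=2000, seed=0):
    """Fraction of simulated datasets in which |t| for the slope exceeds 2."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulate one dataset from the assumed simple linear regression model.
        x = rng.normal(size=n)
        y = 1.0 + slope * x + rng.normal(scale=sigma, size=n)
        X = np.column_stack([np.ones(n), x])

        # Least squares fit and the usual standard error of the slope.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (n - 2)  # residual variance estimate
        se_slope = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

        # Rough significance rule (assumed): reject when |t| > 2.
        if abs(beta[1] / se_slope) > 2.0:
            rejections += 1
    return rejections / n_sims

for n in (10, 20, 50, 100):
    print("n =", n, "estimated power:", estimated_power(n, slope=0.3))
```

Rerunning the loop with a different assumed slope or error standard deviation shows how strongly the required sample size depends on the size of the effect relative to the noise, which is the point of the “It depends” answer above.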