4.1 - Background

In this lesson, we learn how to check the appropriateness of a simple linear regression model. Recall that the four conditions ("LINE") that comprise the simple linear regression model are:

  • Linear Function: The mean of the response, \(\mbox{E}(Y_i)\), at each value of the predictor, \(x_i\), is a Linear function of the \(x_i\).
  • Independent: The errors, \( \epsilon_{i}\), are Independent.
  • Normally Distributed: The errors, \( \epsilon_{i}\), at each value of the predictor, \(x_i\), are Normally distributed.
  • Equal variances: The errors, \( \epsilon_{i}\), at each value of the predictor, \(x_i\), have Equal variances (denoted \(\sigma^{2}\)).

An equivalent way to think of the first (linearity) condition is that the mean of the error, \(\mbox{E}(\epsilon_i)\), at each value of the predictor, \(x_i\), is zero. An alternative way to describe all four assumptions is that the errors, \(\epsilon_i\), are independent normal random variables with mean zero and constant variance, \(\sigma^2\).
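
Putting the four conditions together, an equivalent and compact way to write the simple linear regression model is:

\(Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,\)

where the errors \(\epsilon_i\) are independent \(N(0, \sigma^{2})\) random variables. Each of the LINE conditions corresponds to one part of this statement.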

The four conditions of the model pretty much tell us what can go wrong with our model, namely:

  • The population regression function is not linear. That is, the response \(Y_{i}\) is not a linear trend \((\beta_{0} + \beta_{1} x_i)\) plus some error \(\epsilon_i\).
  • The error terms are not independent.
  • The error terms are not normally distributed.
  • The error terms do not have equal variance.

In this lesson, we learn ways to detect the above four situations, as well as learn how to identify the following two problems:

  • The model fits all but one or a few unusual observations. That is, are there any "outliers"?
  • An important predictor variable has been left out of the model. That is, could we do better by adding a second or third predictor to the model, i.e., by using a multiple regression model instead to answer our research questions?

Before jumping in, let's make sure it's clear why we have to evaluate any regression model that we formulate and subsequently estimate. In short, it's because:

  • All of the estimates, intervals, and hypothesis tests arising in a regression analysis have been developed assuming that the model is correct. That is, all the formulas depend on the model being correct!
  • If the model is incorrect, then the formulas and methods we use are at risk of being incorrect.

The good news is that some of the model conditions are more forgiving than others. So, we really need to learn when we should worry the most and when it's okay to be more carefree about model violations. Here's a pretty good summary of the situation:

  • All tests and intervals are very sensitive to even minor departures from independence.
  • All tests and intervals are sensitive to moderate departures from equal variance.
  • The hypothesis tests and confidence intervals for \( \beta_{0}\) and \( \beta_{1}\) are fairly "robust" (that is, forgiving) against departures from normality.
  • Prediction intervals are quite sensitive to departures from normality.

The important thing to remember is that the severity of the consequences is always related to the severity of the violation. And how much you should worry about a model violation depends on how you plan to use your regression model. For example, if all you want to do with your model is test for a relationship between \(x\) and \(y\), i.e., test the null hypothesis that the slope \(\beta_{1}\) is 0, you should be okay even if it appears that the normality condition is violated. On the other hand, if you want to use your model to predict a future response \(y_{\text{new}}\), then you are likely to get inaccurate results if the error terms are not normally distributed.

In short, you'll need to learn how to worry just the right amount. Worry when you should, and don't ever worry when you shouldn't! And when you are worried, there are remedies available, which we'll learn more about later in the course. For example, one thing to try is transforming the response variable, the predictor variable, or both; there is an example of this in Section 4.8, and we'll see more examples in Lesson 9.

This is definitely a lesson in which you are exposed to the idea that data analysis is an art (subjective decisions!) based on science (objective tools!). We might, therefore, call data analysis "an artful science!" Let's get to it!

The basic idea of residual analysis

Recall that not all of the data points in a sample will fall right on the least squares regression line. The vertical deviation of any one data point \(y_i\) from its estimated value \(\hat{y}_i\) is its observed "residual":

\(e_i = y_i-\hat{y}_i\)

Each observed residual can be thought of as an estimate of the actual unknown "true error" term:

\(\epsilon_i = Y_i-E(Y_i)\)
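
For a concrete sense of the first definition, here is a minimal Python sketch (the data values are made up purely for illustration) that fits a least squares line with numpy and computes the observed residuals \(e_i = y_i - \hat{y}_i\):

```python
import numpy as np

# Hypothetical data, invented purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit the least squares line: y-hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)   # returns slope first, then intercept

# Fitted values and observed residuals e_i = y_i - y-hat_i
y_hat = b0 + b1 * x
residuals = y - y_hat

print("b0 =", round(b0, 3), " b1 =", round(b1, 3))
print("residuals:", np.round(residuals, 3))
# For least squares with an intercept, the residuals always sum to (essentially) zero
print("sum of residuals:", round(residuals.sum(), 10))
```

The residuals above are observable; the true errors \(\epsilon_i = Y_i - E(Y_i)\) are not, because the true regression line \(E(Y_i)\) is unknown.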

Let's look at an illustration of the distinction between a residual \(e_{i}\) and an unknown true error term \(\epsilon_{i}\). The solid line on the plot describes the true (unknown) linear relationship in the population. In practice, we can't know this line; if we could, the true error would be the vertical distance from a data point to the solid line.

On the other hand, the dashed line on the plot represents the estimated linear relationship for a random sample. The residual is the vertical distance from a data point to this dashed line.

The observed residuals should reflect the properties assumed for the unknown true error terms. The basic idea of residual analysis, therefore, is to investigate the observed residuals to see if they behave “properly.” That is, we analyze the residuals to see if they support the assumptions of linearity, independence, normality, and equal variances.
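
To make the idea concrete, here is a minimal Python sketch of the two most common residual displays: a plot of residuals versus fitted values (for judging linearity and equal variances) and a histogram of the residuals (for judging normality). The simulated data and plotting choices are assumptions for illustration only, not part of the lesson's examples.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulate data from a model that satisfies all four LINE conditions
n = 60
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)   # true line: E(Y) = 3 + 2x

# Fit the least squares line, then compute fitted values and residuals
b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
resid = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: look for a random horizontal band around zero
# (curvature suggests nonlinearity; a funnel shape suggests unequal variances)
ax1.scatter(fitted, resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted value")
ax1.set_ylabel("Residual")
ax1.set_title("Residuals vs. fitted values")

# Histogram of residuals: look for a roughly symmetric, bell-shaped pattern
ax2.hist(resid, bins=12)
ax2.set_xlabel("Residual")
ax2.set_title("Histogram of residuals")

plt.tight_layout()
plt.show()
```

Because the data here are simulated from a model that satisfies all four conditions, neither display should show any systematic pattern.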