1.7 - Random Errors and Prediction

So far, we have focused on estimating a univariate population mean, E(Y), and quantifying our uncertainty about the estimate via confidence intervals or hypothesis tests. In this section, we consider a different problem, that of "prediction." In particular, rather than estimating the mean of a population of Y-values based on a sample, Y1,...,Yn, consider predicting an individual Y-value picked at random from the population.

Intuitively, this sounds like a more difficult problem. Imagine that rather than just estimating the mean sale price of single-family homes in the housing market based on our sample of 30 homes, we have to predict the sale price of an individual single-family home that has just come onto the market. Presumably, we’ll be less certain about our prediction than we were about our estimate of the population mean (since it seems likely that we could be farther from the truth with our prediction than when we estimated the mean—for example, there is a chance that the new home could be a real bargain or totally overpriced). Statistically speaking, there is "extra uncertainty" that arises with prediction—the population distribution of data values, Y (more relevant to prediction problems), is much more variable than the sampling distribution of sample means, MY (more relevant to mean estimation problems).

We can tackle prediction problems with a similar process to that of using a confidence interval to tackle estimating a population mean. In particular, we can calculate a prediction interval of the form "point estimate ± uncertainty" or "(point estimate − uncertainty, point estimate + uncertainty)." The point estimate is the same one that we used for estimating the population mean, that is, the observed sample mean, mY. This is because mY is an unbiased estimate of the population mean, E(Y), and we assume that the individual Y-value we are predicting is a member of this population. As discussed in the preceding paragraph, however, the "uncertainty" is larger for prediction intervals than for confidence intervals. To see how much larger, we need to return to the notion of a model that we introduced in Section 1.2.

We can express the model we’ve been using to estimate the population mean, E(Y), as Y-value = deterministic part + random error or Yi = E(Y) + ei (i = 1,...,n). In other words, each sample Yi-value (the index i keeps track of the sample observations) can be decomposed into two pieces, a deterministic part that is the same for all values, and a random error part that varies from observation to observation. A convenient choice for the deterministic part is the population mean, E(Y), since then the random errors have a (population) mean of zero. Since E(Y) is the same for all Y-values, the random errors, e, have the same standard deviation as the Y-values themselves, that is, SD(Y). We can use this decomposition to derive the confidence interval and hypothesis test results of Sections 1.5 and 1.6 (although it would take more mathematics than we really need for our purposes in this course). Moreover, we can also use this decomposition to motivate the precise form of the uncertainty needed for prediction intervals (without having to get into too much mathematical detail).

In particular, write the Y-value to be predicted as Y*, and decompose this into two pieces as above: Y* = E(Y) + e*. Then subtract MY, which represents potential values of repeated sample means, from both sides of this equation: Y* − MY = (E(Y) − MY) + e*, which defines prediction error = estimation error + random error. Thus, whereas in estimating the population mean the only error we have to worry about is estimation error, in predicting an individual Y-value we have to worry about both estimation error and random error.
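A quick simulation (illustrative only, not from the text) can confirm this decomposition numerically: the variance of the prediction error Y* − MY is the estimation-error variance, SD(Y)²/n, plus the random-error variance, SD(Y)². The population parameters below are made up for the demonstration.

```python
import numpy as np

# Illustrative simulation: repeatedly draw a sample of size n, compute the
# sample mean, and draw one new Y-value to predict. The variance of the
# prediction errors should be about sigma^2 * (1 + 1/n), i.e. the
# estimation-error variance sigma^2/n plus the random-error variance sigma^2.
rng = np.random.default_rng(1)
n, mu, sigma, reps = 30, 280.0, 54.0, 100_000  # made-up population values

samples = rng.normal(mu, sigma, size=(reps, n))
sample_means = samples.mean(axis=1)             # mY for each repeated sample
new_values = rng.normal(mu, sigma, size=reps)   # one new Y* per sample

pred_errors = new_values - sample_means
print(pred_errors.var())         # simulated prediction-error variance
print(sigma**2 * (1 + 1/n))      # theoretical value, sigma^2 (1 + 1/n)
```

The two printed values agree closely, illustrating why prediction carries "extra uncertainty" beyond estimation.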

Prediction interval for an individual Y-value

Recall from Section 1.5 that the form of a confidence interval for the population mean is mY ± t-percentile(sY/√n). The term sY/√n in this formula is an estimate of the standard deviation of the sampling distribution of sample means, MY, and is called the standard error of estimation. The square of this quantity, sY²/n, is the estimated variance of the sampling distribution of sample means, MY. Then, thinking of E(Y) as some fixed, unknown constant, sY²/n is also the estimated variance of the estimation error, E(Y) − MY.

The estimated variance of the random error, e*, is sY². It can then be shown that the estimated variance of the prediction error, Y* − MY, is sY²/n + sY² = sY²(1 + 1/n). Then sY√(1+1/n) is called the standard error of prediction and leads to the formula for a prediction interval for an individual Y-value: mY ± t-percentile(sY√(1+1/n)).
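As a minimal sketch (using hypothetical sample statistics; any n > 1 and sY > 0 would do), the two standard errors can be computed directly, and the identity behind the standard error of prediction checked:

```python
import math

# Hypothetical sample statistics for illustration.
n, s_y = 30, 53.8656

se_estimation = s_y / math.sqrt(n)            # standard error of estimation: sY / sqrt(n)
se_prediction = s_y * math.sqrt(1 + 1 / n)    # standard error of prediction: sY * sqrt(1 + 1/n)

# Prediction-error variance = estimation-error variance + random-error variance:
# sY^2/n + sY^2 = sY^2 (1 + 1/n).
print(se_estimation)
print(se_prediction)
print(math.isclose(se_prediction**2, se_estimation**2 + s_y**2))
```

Note that the standard error of prediction is always larger than the standard error of estimation, and it does not shrink toward zero as n grows: the random-error piece, sY², remains no matter how large the sample.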

As with confidence intervals for the mean, the t-percentile used in the calculation comes from a t-distribution with n−1 degrees of freedom. For example, for a 95% interval (i.e., with 2.5% in each tail), the 97.5th percentile would be needed, whereas for a 90% interval (i.e., with 5% in each tail), the 95th percentile would be needed. For example, the 95% prediction interval for an individual value of Price picked at random from the population of single-family home sale prices is calculated as

mY ± t-percentile(sY√(1+1/n)) = 278.6033 ± 2.045(53.8656√(1+1/30))

= 278.6033 ± 111.976 = (166.627, 390.579).

To interpret the prediction interval, loosely speaking we can say that "we're 95% confident that the sale price for an individual home picked at random from all single-family homes in this housing market will be between $167,000 and $391,000." More precisely, if we were to take a large number of random samples of size 30 from our population of sale prices and calculate a 95% prediction interval for each, then 95% of those prediction intervals would contain the (unknown) sale price for an individual home picked at random from the population.

As discussed at the beginning of this section, this interval is much wider than the 95% confidence interval for the population mean single-family home sale price, which was calculated as

mY ± t-percentile(sY/√n) = 278.6033 ± 2.045(53.8656/√30)

= 278.6033 ± 20.111 = (258.492, 298.715).
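Both intervals above can be reproduced in a few lines (a sketch; the sample statistics are taken from the housing example in the text, and scipy supplies the t-percentile):

```python
import math
from scipy import stats

# Sample statistics from the housing example.
n, m_y, s_y = 30, 278.6033, 53.8656
t_crit = stats.t.ppf(0.975, df=n - 1)   # 97.5th percentile, 29 df (about 2.045)

ci_half = t_crit * s_y / math.sqrt(n)         # confidence-interval half-width
pi_half = t_crit * s_y * math.sqrt(1 + 1 / n)  # prediction-interval half-width

print((m_y - ci_half, m_y + ci_half))  # about (258.5, 298.7)
print((m_y - pi_half, m_y + pi_half))  # about (166.6, 390.6)
```

The prediction interval is more than five times as wide as the confidence interval, reflecting the extra random-error term in the standard error of prediction.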

Unlike for confidence intervals for the population mean, statistical software does not generally provide an automated method to calculate prediction intervals for an individual Y-value. Thus they have to be calculated by hand using the sample statistics, mY and sY. However, there is a trick that can get around this (although it makes use of simple linear regression, which we cover in Lesson 2). First, create a variable that consists only of the value 1 for all observations. Then, fit a simple linear regression model using this variable as the predictor variable and Y as the response variable, and restrict the model to fit "without an intercept." The estimated regression line for this model will be a horizontal line at a value equal to the sample mean of the response variable. Prediction intervals for this model will be the same for each value of the predictor variable, and will be the same as a prediction interval for an individual Y-value.
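The trick can be sketched by carrying out the no-intercept regression by hand (any regression routine that allows suppressing the intercept would behave the same way; the data below are made up for illustration):

```python
import numpy as np

# Made-up sample of Y-values; the "predictor" is a column of ones, and the
# model is fit without an intercept, so the single coefficient plays the
# role the intercept normally would.
rng = np.random.default_rng(7)
y = rng.normal(280.0, 54.0, size=30)
n = y.size
X = np.ones((n, 1))  # the constant predictor variable

# No-intercept least squares: beta = (X'X)^{-1} X'y, which here is just mY.
beta = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Residual standard error with n - 1 degrees of freedom (one parameter fit);
# for this model it equals the sample standard deviation sY.
s = np.sqrt(np.sum((y - beta) ** 2) / (n - 1))

# Regression prediction standard error at a new point x0 = 1:
# s * sqrt(1 + x0 (X'X)^{-1} x0) = s * sqrt(1 + 1/n).
se_pred = s * np.sqrt(1 + 1 / n)

print(np.isclose(beta, y.mean()))       # fitted line sits at the sample mean
print(np.isclose(s, y.std(ddof=1)))     # residual standard error equals sY
```

This shows why the trick works: the fitted "line" is the horizontal line at mY, the model's residual standard error is sY, and the regression prediction interval formula collapses to mY ± t-percentile(sY√(1+1/n)).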

We derived the formula for a confidence interval for a univariate population mean from the t-version of the central limit theorem, which does not require the data Y-values to be normally distributed. By contrast, a prediction interval must cover an individual Y-value drawn from the population itself, and the central limit theorem offers no such protection for individual values. Thus, the formula for a prediction interval for an individual univariate Y-value tends to work better for datasets in which the Y-values are at least approximately normally distributed.