1.5 - Interval Estimation

We have already seen that the sample mean, mY, is a good point estimate of the population mean, E(Y) (in the sense that it is unbiased). It is also helpful to know how reliable this estimate is, that is, how much sampling uncertainty is associated with it.

A useful way to express this uncertainty is to calculate an interval estimate or confidence interval for the population mean, E(Y). The interval should be centered at the point estimate (in this case, mY), since we are equally uncertain whether the population mean lies below or above this estimate (i.e., the interval should allow the same amount of uncertainty on either side of the point estimate). In other words, the confidence interval is of the form "point estimate ± uncertainty" or "(point estimate − uncertainty, point estimate + uncertainty)."

We can obtain the exact form of the confidence interval from the t-version of the central limit theorem, where t = (MY − E(Y)) / (SY/√n) has an approximate t-distribution with n−1 degrees of freedom. In particular, suppose that we want to calculate a 95% confidence interval for the population mean, E(Y), for the home prices example (n = 30). In other words, we want an interval such that there will be an area of 0.95 between the two endpoints of the interval (with an area of 0.025 to the left of the interval in the lower tail, and an area of 0.025 to the right of the interval in the upper tail). Let's consider just one side of the interval first. Since 2.045 is the 97.5th percentile of the t-distribution with 29 degrees of freedom (see the table in Section 1.4), we have

Pr(t29 < 2.045) = Pr((MY − E(Y)) / (SY/√n) < 2.045) = Pr(MY − 2.045(SY/√n) < E(Y)).

This probability statement must be true for all potential values of MY and SY. In particular, it must be true for our observed sample statistics, mY = 278.6033 and sY = 53.8656. Thus, to find the values of E(Y) that satisfy the probability statement, we plug in our sample statistics to find

mY − 2.045(sY/√n) = 278.6033 − 2.045(53.8656/√30) = 258.492.

This shows that a population mean greater than \(\$\)258,492 would satisfy the expression Pr(t29 < 2.045) = 0.975. In other words, we have found that the lower bound of our confidence interval is \(\$\)258,492, or approximately \(\$\)258,000.

To find the upper bound we perform a similar calculation, this time starting from the lower tail, to find that a population mean less than \(\$\)298,715 would satisfy the expression Pr(t29 > −2.045) = 0.975. In other words, we have found that the upper bound of our confidence interval is \(\$\)298,715, or approximately \(\$\)299,000.

We can combine these two calculations as

Pr(−2.045 < t29 < 2.045) = Pr(−2.045 < (MY − E(Y)) / (SY/√n) < 2.045)

= Pr(MY − 2.045(SY/√n) < E(Y) < MY + 2.045(SY/√n)).

As before, we plug in our sample statistics to find the values of E(Y) that satisfy this expression:

Pr(278.6033 − 2.045(53.8656/√30) < E(Y) < 278.6033 + 2.045(53.8656/√30))

= Pr(258.492 < E(Y) < 298.715).

This shows that a population mean between \(\$\)258,492 and \(\$\)298,715 would satisfy the expression Pr(−2.045 < t29 < 2.045) = 0.95. In other words, we have found that a 95% confidence interval for E(Y) for this example is (\(\$\)258,492, \(\$\)298,715), or approximately (\(\$\)258,000, \(\$\)299,000).

More generally, using symbols, a 95% confidence interval for a univariate population mean, E(Y), results from the following:

Pr(−97.5th percentile < tn−1 < 97.5th percentile)

= Pr(−97.5th percentile < (MY − E(Y)) / (SY/√n) < 97.5th percentile)

= Pr(MY − 97.5th percentile(SY/√n) < E(Y) < MY + 97.5th percentile(SY/√n))

where the 97.5th percentile comes from the t-distribution with n−1 degrees of freedom. In other words, plugging in our observed sample statistics, mY and sY, we can write the 95% confidence interval as mY ± 97.5th percentile (sY /√n).

For a lower or higher level of confidence than 95%, the percentile used in the calculation must be changed as appropriate. For example, for a 90% interval (i.e., with 5% in each tail), the 95th percentile would be needed, whereas for a 99% interval (i.e., with 0.5% in each tail), the 99.5th percentile would be needed. These percentiles are easily obtained using statistical software.
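As a rough illustration of how software obtains these percentiles, the sketch below computes t-percentiles using only the Python standard library: it integrates the t-density numerically and then searches for the percentile by bisection. The function names (t_pdf, t_cdf, t_percentile) are hypothetical and chosen for this example; in practice a library routine such as scipy.stats.t.ppf would normally be used instead.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int) -> float:
    """Pr(t_df < x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    steps = 2 * max(50, math.ceil(a / 0.02))  # even count, step width <= 0.01
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # approximate area between 0 and |x|
    return 0.5 + area if x > 0 else 0.5 - area


def t_percentile(p: float, df: int) -> float:
    """Value q such that Pr(t_df < q) = p, found by bisection."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if p < 0.5:
        return -t_percentile(1.0 - p, df)  # the distribution is symmetric
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains q
        lo, hi = hi, 2.0 * hi
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Percentiles needed for 90%, 95%, and 99% intervals with 29 degrees of freedom.
for level, p in [(90, 0.95), (95, 0.975), (99, 0.995)]:
    print(f"{level}% interval: use the {100 * p:g}th percentile = "
          f"{t_percentile(p, 29):.3f}")
```

For the 95% case with 29 degrees of freedom, this reproduces the value 2.045 used in the home prices example.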

Confidence interval for a univariate mean, E(Y)

Thus, in general, we can write a confidence interval for a univariate mean, E(Y), as mY ± t-percentile (sY /√n), where the t-percentile comes from a t-distribution with n−1 degrees of freedom. The example above thus becomes

mY ± t-percentile (sY /√n) = 278.6033 ± 2.045 (53.8656 /√30) = 278.6033 ± 20.111 = (258.492, 298.715).
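The same arithmetic can be sketched in a few lines of Python. The function name mean_confidence_interval is hypothetical, and the t-percentile is passed in as a known value (here 2.045, taken from the text) rather than computed.

```python
import math


def mean_confidence_interval(sample_mean, sample_sd, n, t_percentile):
    """Return (lower, upper) for sample_mean ± t_percentile * sample_sd / sqrt(n)."""
    margin = t_percentile * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Home prices example: n = 30, so the t-percentile has 29 degrees of freedom.
lower, upper = mean_confidence_interval(278.6033, 53.8656, 30, 2.045)
print(f"({lower:.3f}, {upper:.3f})")  # prints (258.492, 298.715)
```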

To interpret the confidence interval, loosely speaking we can say that "we're 95% confident that the mean single-family home sale price in this housing market is between \(\$\)258,000 and \(\$\)299,000." To provide a more precise interpretation we have to revisit the notion of hypothetical repeated samples. If we were to take a large number of random samples of size 30 from our population of sale prices and calculate a 95% confidence interval for each, then about 95% of those confidence intervals would contain the (unknown) population mean. We do not know (nor will we ever know) whether the 95% confidence interval for our particular sample contains the population mean. Thus, strictly speaking, we cannot say "the probability that the population mean is in our interval is 0.95." All we know is that the procedure that we have used to calculate the 95% confidence interval tends to produce intervals that, under repeated sampling, contain the population mean 95% of the time.
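The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: it assumes a hypothetical normal population with mean 280 and standard deviation 54 (made-up values, not the actual home prices population), draws many samples of size 30, and counts how often the 95% interval contains the population mean.

```python
import math
import random
import statistics

N = 30              # sample size, as in the home prices example
T_975_DF29 = 2.045  # 97.5th percentile of t with 29 df, from the text


def coverage_rate(pop_mean, pop_sd, reps=5000, seed=1):
    """Share of simulated 95% intervals that contain pop_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Hypothetical population: normal with the given mean and sd.
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(N)]
        m = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample sd with divisor n - 1
        margin = T_975_DF29 * s / math.sqrt(N)
        if m - margin < pop_mean < m + margin:
            hits += 1
    return hits / reps


rate = coverage_rate(pop_mean=280.0, pop_sd=54.0)
print(f"Share of intervals containing the population mean: {rate:.3f}")
```

The printed share should be close to 0.95, although it varies slightly with the seed and the number of repetitions. Any single interval either contains the population mean or it does not; the 95% describes the procedure, not an individual interval.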

Before moving on to Section 1.6, which describes another way to make statistical inferences about population means—hypothesis testing—let us consider whether we can now forget the normal distribution. The calculations in this section are based on the central limit theorem, which does not require the population to be normal. We have also seen that t-distributions are more useful than normal distributions for calculating confidence intervals. For large samples, it doesn't make much difference (the percentiles for t-distributions get closer to the percentiles for the standard normal distribution as the degrees of freedom get larger), but for smaller samples it can make a large difference. So for this type of calculation we always use a t-distribution from now on. However, we can't completely forget about the normal distribution yet; it will come into play again in a different context in later lessons.
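To give a sense of the size of this difference for the home prices example, the following sketch compares the margin of error based on the t-percentile from the text (2.045, with 29 degrees of freedom) with the margin that would result from using the standard normal percentile instead. The variable names are illustrative.

```python
import math
from statistics import NormalDist

z_975 = NormalDist().inv_cdf(0.975)  # 97.5th percentile of the standard normal
t_975 = 2.045                        # 97.5th percentile of t with 29 df (from the text)

standard_error = 53.8656 / math.sqrt(30)  # sY / sqrt(n) for the home prices data
z_margin = z_975 * standard_error
t_margin = t_975 * standard_error

print(f"normal percentile: {z_975:.3f}, margin: {z_margin:.3f}")
print(f"t percentile:      {t_975:.3f}, margin: {t_margin:.3f}")
print(f"t margin / normal margin: {t_margin / z_margin:.3f}")
```

With n = 30 the t-based interval is roughly 4% wider than the normal-based one; the gap grows as the sample size shrinks.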

Degrees of freedom

When using a t-distribution, how do we know how many degrees of freedom to use? One way to think about degrees of freedom is in terms of the information provided by the data we are analyzing. Roughly speaking, each data observation provides one degree of freedom (this is where the n in the degrees of freedom formula comes in), but we lose a degree of freedom for each population parameter that we have to estimate. So, in this lesson, where we are estimating one parameter (the population mean), the degrees of freedom formula is n−1. In Lesson 2, where we will be estimating two population parameters (the intercept and the slope of a regression line), the degrees of freedom formula will be n−2. More generally, in later lessons the degrees of freedom for a multiple linear regression model will be n−(k+1), or equivalently n−k−1, where k is the number of predictor variables in the model. Note that this general formula also works for Lesson 2 (where k = 1) and even for this lesson (where k = 0, since a linear regression model with zero predictors is equivalent to estimating the population mean for a univariate dataset).
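The general formula is simple enough to write as a one-line function. In this sketch the function name degrees_of_freedom is hypothetical, and raising an error when no degrees of freedom remain is a design choice of the example, not something stated in the text.

```python
def degrees_of_freedom(n: int, k: int = 0) -> int:
    """Degrees of freedom n - (k + 1) for n observations and k predictors."""
    if k < 0 or n <= k + 1:
        raise ValueError("need k >= 0 and more observations than parameters")
    return n - (k + 1)


print(degrees_of_freedom(30))       # univariate mean, k = 0: prints 29
print(degrees_of_freedom(30, k=1))  # intercept and slope, k = 1: prints 28
```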