In the preceding section we had to make some pretty restrictive assumptions (normality, known mean, known variance) in order to make statistical inferences. We now explore the connection between samples and populations a little more closely so that we can draw conclusions using fewer assumptions.

Recall that the population is the entire collection of objects under consideration, while the sample is a (random) subset of the population. We are particularly interested in making statistical inferences not only about values in the population, denoted Y, but also about numerical summary measures such as the population mean, denoted E(Y )—these population summary measures are called parameters. While population parameters are unknown (in the sense that we do not have all the individual population values and so cannot calculate them), we can calculate similar quantities in the sample, such as the sample mean—these sample summary measures are called statistics.

Next we’ll see how statistical inference essentially involves estimating population parameters (and assessing the precision of those estimates) using sample statistics. When our sample data is a subset of the population that has been selected randomly, statistics calculated from the sample can tell us a great deal about corresponding population parameters. For example, a sample mean tends to be a good estimate of the population mean, in the following sense. If we were to take random samples over and over again, each time calculating a sample mean, then the mean of all these sample means would be equal to the population mean. Such an estimate is called unbiased since on average it estimates the correct value. It is not actually necessary to take random samples over and over again to show this—probability theory (beyond the scope of this book) allows us to prove such theorems.

However, it is not enough to just have sample statistics (such as the sample mean) that average out (over a large number of hypothetical samples) to the correct target (i.e., the population mean). We would also like sample statistics that would have "low" variability from one hypothetical sample to another. At the very least we need to be able to quantify this variability, known as sampling uncertainty. One way to do this is to consider the sampling distribution of a statistic, that is, the distribution of values of a statistic under repeated (hypothetical) samples. Again, we can use results from probability theory to tell us what these sampling distributions are. So, all we need to do is take a single random sample, calculate a statistic, and we’ll know the theoretical sampling distribution of that statistic (i.e., we’ll know what the statistic should average out to over repeated samples, and how much the statistic should vary over repeated samples).
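To make this concrete, here is a minimal simulation sketch in Python (using NumPy, which the text itself does not use): it draws many hypothetical random samples from an invented, purely illustrative population and shows that the sample means average out to roughly the population mean while still varying from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 100,000 values (the shape is arbitrary; it is only
# here to illustrate repeated sampling, and is not from the text).
population = rng.exponential(scale=280, size=100_000)

n = 30        # sample size for each hypothetical sample
reps = 2_000  # number of hypothetical repeated samples

# Draw many random samples and record the mean of each one.
sample_means = np.array([rng.choice(population, size=n, replace=False).mean()
                         for _ in range(reps)])

print("population mean:     ", population.mean())
print("mean of sample means:", sample_means.mean())  # close to the population mean (unbiased)
print("SD of sample means:  ", sample_means.std())   # quantifies the sampling uncertainty
```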

Central limit theorem—normal version

Suppose that a random sample of n data values, represented by Y1, Y2, ..., Yn, comes from a population that has a mean of E(Y) and a standard deviation of SD(Y). The sample mean, mY, is a pretty good estimate of the population mean, E(Y). The sampling distribution of this statistic derives from the central limit theorem. This theorem states that under very general conditions, the sample mean has an approximate normal distribution with mean E(Y) and standard deviation SD(Y)/ √n (under repeated sampling). In other words, if we were to take a large number of random samples of n data values and calculate the mean for each sample, the distribution of these sample means would be a normal distribution with mean E(Y) and standard deviation SD(Y)/ √n. Since the mean of this sampling distribution is E(Y), mY is an unbiased estimate of E(Y).

An amazing fact about the central limit theorem is that there is no need for the population itself to be normal (remember that we had to assume this for the calculations in Section 1.3). However, the more symmetric the distribution of the population, the better is the normal approximation for the sampling distribution of the sample mean. Also, the approximation tends to be better the larger the sample size n.
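As an illustration of the theorem, the following sketch (assuming Python with NumPy and SciPy; the right-skewed gamma population with mean 280 and standard deviation 50 is an invented example, not from the text) simulates repeated samples of size n = 30 from a clearly non-normal population and compares the simulated sampling distribution of the sample mean with the normal approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

pop_mean, pop_sd, n, reps = 280, 50, 30, 20_000

# A right-skewed gamma population whose mean and SD match the assumed values
# (an invented population; the central limit theorem does not require normality).
population = stats.gamma(a=(pop_mean / pop_sd) ** 2, scale=pop_sd ** 2 / pop_mean)

# Repeated samples of size n, and the mean of each sample.
means = population.rvs(size=(reps, n), random_state=rng).mean(axis=1)

print("theoretical SD of the sample mean:", pop_sd / np.sqrt(n))  # 50 / sqrt(30) = 9.129
print("simulated SD of the sample mean:  ", means.std())
print("simulated 90th percentile:        ", np.percentile(means, 90))
print("normal-approximation percentile:  ",
      stats.norm.ppf(0.90, loc=pop_mean, scale=pop_sd / np.sqrt(n)))
```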

The central limit theorem by itself does not free us from making some restrictive assumptions before we can draw statistical inferences about the population. However, it is certainly a step in the right direction. Consider the home prices example again. As in Section 1.3, we’ll assume that E(Price)=280 and SD(Price)=50, but now we no longer need to assume that the population is normal. Imagine taking a large number of random samples of size 30 from this population and calculating the mean sale price for each sample. To get a better handle on the sampling distribution of these mean sale prices, we’ll find the 90th percentile of this sampling distribution. Let’s do the calculation first, and then see why this might be a useful number to know.

First, we need to get some notation straight. In this section we're not thinking about the specific sample mean we got for our actual sample of 30 sale prices, mY = 278.6033. Rather we’re imagining a list of potential sample means from a population distribution with mean 280 and standard deviation 50—we'll call a potential sample mean in this list MY. From the central limit theorem, the sampling distribution of MY is normal with mean 280 and standard deviation 50 / √30 = 9.129. Then the standardized Z-value from MY,

Z = (MY − E(Y)) / (SD(Y)/√n) = (MY − 280) / 9.129,

is standard normal with mean 0 and standard deviation 1. From the table in Section 1.2, the 90th percentile of a standard normal random variable is 1.282 (since the horizontal axis value of 1.282 corresponds to an upper-tail area of 0.1). Then

Pr(Z < 1.282) = Pr((MY − 280) / 9.129 < 1.282) = Pr(MY < 1.282(9.129) + 280) = Pr(MY < 291.703).

Thus, the 90th percentile of the sampling distribution of MY is approximately $292,000. In other words, under repeated sampling, MY has a distribution with an area of 0.90 to the left of $292,000 (and an area of 0.10 to the right of $292,000). This illustrates a crucial distinction between the distribution of population Y-values and the sampling distribution of MY—the latter is much less spread out. For example, suppose for the sake of argument that the population distribution of Y is normal (although this is not actually required for the central limit theorem to work). Then we can do a similar calculation to the one above to find the 90th percentile of this distribution (normal with mean 280 and standard deviation 50). In particular,

Pr(Z < 1.282) = Pr((Y − 280) / 50 < 1.282) = Pr(Y < 1.282(50) + 280) = Pr(Y < 344.100).

Thus, the 90th percentile of the population distribution of Y is approximately $344,000. This is much larger than the value we got above for the 90th percentile of the sampling distribution of MY (approximately $292,000). This is because the sampling distribution of MY is less spread out than the population distribution of Y—the standard deviations for our example are 9.129 for the former and 50 for the latter.
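Both percentile calculations can be reproduced numerically; here is a sketch assuming Python with SciPy (a tooling choice of ours, not something the text uses):

```python
from math import sqrt

from scipy import stats

se = 50 / sqrt(30)                              # standard deviation of MY, approx 9.129
print(stats.norm.ppf(0.90, loc=280, scale=se))  # 90th percentile of MY: approx 291.7
print(stats.norm.ppf(0.90, loc=280, scale=50))  # 90th percentile of Y:  approx 344.1
```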

We can again turn these calculations around. For example, what is the probability that MY is greater than 291.703? To answer this, consider the following calculation:

Pr(MY > 291.703) = Pr((MY − 280) / 9.129 > (291.703 − 280) / 9.129) = Pr(Z > 1.282) = 0.10.

So, the probability that MY is greater than 291.703 is 0.10.
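The same reverse calculation can be sketched with SciPy's survival function (again an assumption about tooling, not part of the text):

```python
from math import sqrt

from scipy import stats

se = 50 / sqrt(30)
# Upper-tail area of the sampling distribution of MY beyond 291.703.
print(stats.norm.sf(291.703, loc=280, scale=se))  # approx 0.10
```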

Central limit theorem—t-version

One major drawback to the normal version of the central limit theorem is that to use it we have to assume that we know the value of the population standard deviation, SD(Y). A generalization of the standard normal distribution called Student’s t-distribution solves this problem. The density curve for a t-distribution looks very similar to a normal density curve, but the tails tend to be a little "thicker," so t-distributions are a little more spread out than the normal distribution. This "extra variability" is controlled by an integer called the degrees of freedom. The smaller this number, the more spread out the t-distribution density curve (conversely, the higher the degrees of freedom, the more closely it resembles a normal density curve).

For example, the following table shows tail areas for a t-distribution with 29 degrees of freedom:

Upper-tail area 0.1 0.05 0.025 0.01 0.005 0.001
Critical value of t29 1.311 1.699 2.045 2.462 2.756 3.396
Two-tail area 0.2 0.1 0.05 0.02 0.01 0.002

Compared with the corresponding table for the normal distribution in Section 1.2, the critical values (i.e., horizontal axis values or percentiles) are slightly larger in this table.
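The critical values in the table can be reproduced with software; a sketch assuming Python with SciPy, which also prints the corresponding standard normal critical values for comparison:

```python
from scipy import stats

# Reproduce the table: critical values of t with 29 degrees of freedom,
# alongside the corresponding standard normal critical values.
for area in [0.1, 0.05, 0.025, 0.01, 0.005, 0.001]:
    t_crit = stats.t.ppf(1 - area, df=29)
    z_crit = stats.norm.ppf(1 - area)
    print(f"upper-tail area {area}: t29 = {t_crit:.3f}, z = {z_crit:.3f}")
```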

We will use the t-distribution from this point on because it will allow us to use an estimate of the population standard deviation (rather than having to assume this value). A reasonable estimate to use is the sample standard deviation, sY. Since we will be using an estimate of the population standard deviation, we will be a little less certain about our probability calculations—this is why the t-distribution needs to be a little more spread out than the normal distribution, to adjust for this extra uncertainty. This extra uncertainty will be of particular concern when we're not too sure if our sample standard deviation is a good estimate of the population standard deviation (i.e., in small samples). So, it makes sense that the degrees of freedom is lower for smaller sample sizes. In this particular application, we will use the t-distribution with n−1 degrees of freedom in place of a standard normal distribution in the following t-version of the central limit theorem.

Suppose that a random sample of n data values, represented by Y1, Y2, ... , Yn, comes from a population that has a mean of E(Y). Imagine taking a large number of random samples of n data values and calculating the mean and standard deviation for each sample. As before, we’ll let MY represent the imagined list of repeated sample means, and similarly, we’ll let SY represent the imagined list of repeated sample standard deviations. Define t = (MY − E(Y)) / (SY/√n).

Under very general conditions, t has an approximate t-distribution with n−1 degrees of freedom. The two differences from the normal version of the central limit theorem that we used before are that the repeated sample standard deviations, SY, replace an assumed population standard deviation, SD(Y), and that the resulting sampling distribution is a t-distribution (not a normal distribution).
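As a rough check of the t-version, the following sketch (assuming Python with NumPy and SciPy; the skewed population with mean 280 and standard deviation 50 is an invented illustration) simulates repeated samples, forms t = (MY − E(Y)) / (SY/√n) for each, and compares a tail probability with the t-distribution with n − 1 = 29 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

pop_mean, pop_sd, n, reps = 280, 50, 30, 20_000

# Invented right-skewed population with the assumed mean and SD (illustration only).
population = stats.gamma(a=(pop_mean / pop_sd) ** 2, scale=pop_sd ** 2 / pop_mean)

samples = population.rvs(size=(reps, n), random_state=rng)
m = samples.mean(axis=1)               # repeated sample means, MY
s = samples.std(axis=1, ddof=1)        # repeated sample standard deviations, SY
t = (m - pop_mean) / (s / np.sqrt(n))  # t-statistic from the t-version of the theorem

# Its tail behavior should be close to a t-distribution with n - 1 = 29 degrees of freedom.
print("simulated Pr(t > 1.699):  ", (t > 1.699).mean())
print("theoretical Pr(t > 1.699):", stats.t.sf(1.699, df=29))  # approx 0.05
```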

So far, we have focused on the sampling distribution of sample means, MY, but what we would really like to do is infer what the observed sample mean, mY, tells us about the population mean, E(Y). Thus, while the calculations in this section have been useful for building up intuition about sampling distributions and manipulating probability statements, their main purpose has been to prepare the ground for the next two sections, which cover how to make statistical inferences about the population mean, E(Y).