5.6.1 - Inference for Independent Means

As with comparing two population proportions, when we compare two population means from independent populations, the interest is in the difference between the two means. In other words, if \(\mu_1\) is the population mean from population 1 and \(\mu_2\) is the population mean from population 2, then the difference is \(\mu_1-\mu_2\). If \(\mu_1-\mu_2=0\) then there is no difference between the two population parameters.

If each population is normal, then the sampling distribution of \(\bar{x}_i\) is normal with mean \(\mu_i\), standard error \(\dfrac{\sigma_i}{\sqrt{n_i}}\), and the estimated standard error \(\dfrac{s_i}{\sqrt{n_i}}\), for \(i=1, 2\).

By the Central Limit Theorem, if a population is not normal, then with a large sample the sampling distribution of \(\bar{x}_i\) is still approximately normal.

The theorem presented in this Lesson says that if either of the above conditions holds, then \(\bar{x}_1-\bar{x}_2\) is approximately normal with mean \(\mu_1-\mu_2\) and standard error \(\sqrt{\dfrac{\sigma^2_1}{n_1}+\dfrac{\sigma^2_2}{n_2}}\).
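As a quick illustration of the standard error formula above, here is a minimal Python sketch. The function name `se_diff_means` and the numeric values are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes the population standard deviations are known.

```python
import math


def se_diff_means(sigma1: float, n1: int, sigma2: float, n2: int) -> float:
    """Standard error of x1_bar - x2_bar for two independent samples,
    assuming the population standard deviations sigma1 and sigma2 are known."""
    return math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)


# Hypothetical values: sigma1 = 3 with n1 = 9, and sigma2 = 4 with n2 = 16.
# The variance terms are 9/9 = 1 and 16/16 = 1, so the result is sqrt(2).
print(se_diff_means(3, 9, 4, 16))
```

Note that the variances of the two sample means add, even though the quantity of interest is a difference of means.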

That all sounds great; however, in most cases \(\sigma_1\) and \(\sigma_2\) are unknown and have to be estimated. It seems natural to estimate \(\sigma_1\) by \(s_1\) and \(\sigma_2\) by \(s_2\). When the sample sizes are small, these estimates may not be very accurate. If the standard deviations of the two populations are not that different, one may get a better estimate of the common standard deviation by pooling the data from both samples. If the standard deviations are different, however, then we want to include that difference in our test.

Given this, there are two options for estimating the variances for the independent samples:

Using pooled variances
Using unpooled (or unequal) variances

When to use which? First, the nice thing is that many software packages calculate the variances "behind the curtain" and will show you the most appropriate output. However, if you are NOT sure, you can always use the unpooled method. The consequence of using the unpooled method is that the test is more conservative, making it marginally more difficult to reject the null hypothesis. The consequence of using pooled variances when the population variances actually differ, however, is an incorrect model.
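The two options above can be compared side by side with a small Python sketch. The pooled formula used here, \(s_p^2=\dfrac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}\), is the standard textbook estimator and is an assumption in the sense that this section does not state it. The function names and the data are hypothetical.

```python
import math
from statistics import variance  # sample variance (divides by n - 1)


def unpooled_se(sample1, sample2):
    """Estimated standard error of the difference in means,
    using each sample's own variance."""
    n1, n2 = len(sample1), len(sample2)
    return math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)


def pooled_se(sample1, sample2):
    """Estimated standard error of the difference in means,
    assuming both populations share a common variance."""
    n1, n2 = len(sample1), len(sample2)
    # Weighted average of the two sample variances (textbook pooled estimator).
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))


# Hypothetical data, for illustration only.
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6]
print("unpooled:", unpooled_se(group1, group2))
print("pooled:  ", pooled_se(group1, group2))
```

When the two sample sizes are equal, the two standard errors coincide algebraically. They differ only when the sample sizes are unequal, as in the hypothetical data above.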