2  Confidence Intervals and Sample Size

Overview

Building on the confidence intervals from the last lesson, this lesson starts with how to select the sample size for estimating the population mean and the population total by a confidence interval with a specified margin of error and a specified level of confidence. The second section discusses the confidence interval for estimating a population proportion. The last section discusses the sample sizes needed for estimating a population proportion, introducing both the educated guess method and the conservative method.

Lesson 2: Ch. 4.1-4.2 and 5.1-5.3 of Sampling by Steven Thompson, 3rd Edition.

Objectives

Upon completion of this lesson, you should be able to:

  1. Calculate the sample size needed for estimating population mean and population total,
  2. Compute the confidence interval for population proportion,
  3. Given a desired level of confidence for estimating a population proportion, determine the sample size required using both the educated guess method and conservative method, and critically evaluate the advantages and disadvantages of both approaches,
  4. Determine the necessary sample size to estimate the population proportion, and
  5. Choose between using the educated guess method and the conservative method.

2.1 Sample Size for Estimating Population Mean and Total

How large a sample is large enough for estimating the population mean?

If \(\hat{\theta}\) is an unbiased, normally distributed estimator of \(\theta\):

\[\dfrac{\hat{\theta}-\theta}{\sqrt{\operatorname{Var}(\hat{\theta})}} \sim N(0,1)\]

Then:

\[ P\left(\dfrac{|\hat{\theta}-\theta|}{\sqrt{\operatorname{Var}(\hat{\theta})}} > z_{\alpha/2} \right)=\alpha\]

\[P\left(|\hat{\theta}-\theta|>z_{\alpha/2} \cdot \sqrt{\operatorname{Var}(\hat{\theta})} \right)= \alpha\]

Note! Because \(\hat{\theta}\) is normally distributed, we can use the \(z\) distribution.

If we specify \(\alpha\), we can then find the sample size that is large enough to achieve the goal of the study.

So, we need to ask, “What is the goal of your experiment?” This is perhaps the most important question to ask when planning an experiment.

Example: What if we were interested in estimating the average weight of Penn State male students? How many samples should we plan on taking? We want to estimate this mean. What do we need to consider?

  1. The variability of the data and the measure that you are estimating is your first concern. This directly affects how many samples you will need.
  2. The second thing that you need to think about is the type of conclusion that you would like to report. That is, you need to specify the \(1 - \alpha\) value that you are happy with.
  3. How precise do you want this estimate to be? You thus need to specify the margin of error.

Now, if we specify \(1-\alpha\) and the margin of error \(d\) (which can also be viewed as the half-width of the \((1 - \alpha)100\%\) CI), we can solve for the sample size such that the CI has the specified margin of error.

For estimating the population mean, the equation becomes:

\[P\left(|\bar{x}-\mu|>z_{\alpha/2} \cdot \sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}\right)=\alpha\]

\[z_{\alpha/2}\sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}=d\]

\[n=\dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\]
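For readers who want the intermediate algebra, the step from the margin-of-error equation to the sample size formula can be sketched as follows: square both sides, divide by \(z^2_{\alpha/2}\cdot\sigma^2\), and use \(\dfrac{N-n}{N\cdot n}=\dfrac{1}{n}-\dfrac{1}{N}\).

\[\begin{align} z^2_{\alpha/2}\cdot \dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n} &= d^2 \\ \dfrac{1}{n}-\dfrac{1}{N} &= \dfrac{d^2}{z^2_{\alpha/2}\cdot \sigma^2} \\ n &= \dfrac{1}{\dfrac{d^2}{z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}} \end{align}\]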

Can we now use this formula to estimate the sample size?

What is the weak point of this formula? It requires the population variance \(\sigma^2\), which we do not know! In practice, \(\sigma^2\) has to be estimated or guessed.

Similarly, for estimating the population total \(\tau\), here is the formula:

\[P\left(|\hat{\tau}-\tau|>z_{\alpha/2} \cdot \sqrt{N(N-n)\dfrac{\sigma^2}{n}} \right)=\alpha\]

\[z_{\alpha/2}\sqrt{N(N-n)\dfrac{\sigma^2}{n}}=d\]

\[n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\]
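The two sample size formulas can be written as small Python helpers. This is only a sketch: the function names are ours, the numbers in the usage lines are made up for illustration, and the critical value comes from the standard library's `statistics.NormalDist`. It also shows that asking for a margin of error \(d\) on the total is the same as asking for \(d/N\) on the mean.

```python
import math
from statistics import NormalDist


def z_critical(conf):
    """Two-sided standard normal critical value z_{alpha/2} for a confidence level."""
    alpha = 1.0 - conf
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def n_for_mean(N, d, sigma2, conf=0.95):
    """Unrounded sample size for estimating the population mean to within d."""
    z = z_critical(conf)
    return 1.0 / (d**2 / (z**2 * sigma2) + 1.0 / N)


def n_for_total(N, d, sigma2, conf=0.95):
    """Unrounded sample size for estimating the population total to within d."""
    z = z_critical(conf)
    return 1.0 / (d**2 / (N**2 * z**2 * sigma2) + 1.0 / N)


# Illustrative (made-up) inputs: N = 500 units, variance guess of 25.
N, sigma2 = 500, 25.0
raw_mean = n_for_mean(N, d=1.0, sigma2=sigma2)
raw_total = n_for_total(N, d=1.0 * N, sigma2=sigma2)  # same precision, total scale

# Always round up to the next whole unit.
print(math.ceil(raw_mean), math.ceil(raw_total))
```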

Example 2.1 (Beetles: Sample Size) What sample size is needed to estimate the population total, \(\tau\), to within \(d = 1000\) with a 95% CI?

Now, let’s begin plugging what we know into the formula. We know \(N = 100\), \(\alpha = 0.05\). Do we know \(\sigma^2\)? No, but we can estimate \(\sigma^2\) by \(s^2 = 1932.657\).

How many should we sample? Let’s calculate this out:

\[n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\]

\[n=\dfrac{1}{\dfrac{ (1000)^2}{ (100)^2 \cdot (1.96)^2 \cdot 1932.657}+\dfrac{1}{100}}=42.610\]

We always round up, so we will sample 43 of the 100 plots.

Note! If we ignore the finite population correction adjustment then,

\[\begin{align} n &= \dfrac{N^2 \cdot z^2_{\alpha/2} \cdot \sigma^2}{d^2} \\ &= \dfrac{(100)^2 \cdot (1.96)^2 \cdot 1932.657}{(1000)^2} \\ &= 74.245 \end{align}\]

which rounds up to 75. This value is much larger than 43.
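The arithmetic of this example can be checked with a few lines of Python. The sketch below uses the numbers from the example (with \(z = 1.96\) and \(s^2\) in place of \(\sigma^2\)); the variable names are ours. It also uses the fact that the corrected sample size equals \(1/(1/n_0 + 1/N)\), where \(n_0\) is the size obtained when the finite population correction is ignored.

```python
import math

N = 100          # number of plots in the population
d = 1000         # desired margin of error for the total
z = 1.96         # z_{alpha/2} for 95% confidence
s2 = 1932.657    # sample variance used in place of sigma^2

# Sample size ignoring the finite population correction.
n0 = N**2 * z**2 * s2 / d**2

# Sample size with the finite population correction.
n = 1.0 / (1.0 / n0 + 1.0 / N)

print(round(n0, 3), math.ceil(n0))
print(round(n, 3), math.ceil(n))
```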

Try It!

What is the major point that was just illustrated in the previous example?

In this first example, \(N = 100\) is not very large compared to \(n\), so one should not ignore the finite population adjustment!

But wait a minute, should you have any cause for concern about the answer, \(n = 43\), that we obtained?

What about the value that we used for \(\sigma^2\), (1932.657)?

Note! \(\sigma^2\) is not 1932.657; that value is the sample variance \(s^2\). Because \(\sigma^2\) is estimated, using \(z\) in the formula may understate the required sample size. Sometimes people use \(t\) instead and solve for \(n\) iteratively.

Let’s take a look at this more iterative method:

\[n=\dfrac{1}{\dfrac{d^2}{N^2\cdot t^2 \cdot s^2}+\dfrac{1}{N}}\]

Complication: \(t\) values depend on \(n\). First we will use \(n = 43\), and the \(t\) for \(df = 42\) is 2.0181.

\[n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0181)^2 \cdot 1932.657}+\dfrac{1}{100}}=44.044\]

Round up to 45, \(t\) for 44 \(df\) is 2.0154.

\[n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0154)^2 \cdot 1932.657}+\dfrac{1}{100}}=43.978\]

Rounding up gives \(n = 44\), one less than the 45 used to look up \(t\). Since the iteration moves between 44 and 45, the conservative answer is to take \(n = 45\).

Consequently, our final answer will be to take 45 samples.
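The two iteration steps above can be retraced in a short Python sketch. The \(t\) critical values are hard-coded from the text (2.0181 for 42 df and 2.0154 for 44 df) rather than computed, and the function and table names are ours, so this is an illustration of the procedure and not a general-purpose routine.

```python
import math


def n_for_total(N, d, crit, variance):
    """Unrounded sample size for the total, given a critical value (z or t)."""
    return 1.0 / (d**2 / (N**2 * crit**2 * variance) + 1.0 / N)


# t_{0.025} critical values quoted in the lesson, keyed by degrees of freedom.
T_TABLE = {42: 2.0181, 44: 2.0154}

N, d, s2 = 100, 1000, 1932.657
n = 43            # starting value from the z-based calculation
history = []      # (df, t, unrounded n, rounded n) for each step

for _ in range(2):
    df = n - 1
    t = T_TABLE[df]
    raw = n_for_total(N, d, t, s2)
    n = math.ceil(raw)
    history.append((df, t, raw, n))

for step in history:
    print(step)

# Conservative choice: the largest rounded n seen during the iteration.
n_final = max(rounded for _, _, _, rounded in history)
print(n_final)
```

A table or a statistical library would be needed to continue the iteration with other degrees of freedom.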

In the beetle example, there are data to estimate \(\sigma^2\). What can one do if there is no pilot data? How can we get a rough idea of what \(\sigma\) is?

Example 2.2 (Average Weight Gain of Pigs) A farm has 1000 young pigs with an initial weight of about 50 lbs. They put them on a new diet for 3 weeks and want to know how many pigs to sample so that they can estimate the average weight gain. They want the answer to be within 2 lbs with 90% confidence.

There is no pilot data here. We don’t have the time to select some pigs in order to get an estimate for \(\sigma\), the standard deviation of the weight gain.

Question: How do we get a rough estimate of \(\sigma\)?

What reasonable measure could give the farmer some guidance on estimating the standard deviation of the weight gain?

One thing we can do is rely on the information that we already have, i.e., find some historical data that exists on this topic. But what if this historical data does not exist?

Try It!

Could we find a rough estimate for \(\sigma\)?

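For certain variables, we can make a reasonable guess for \(\sigma\). For roughly bell-shaped data, about 95% of the values fall within two standard deviations of the mean, so the range spans roughly four standard deviations. This gives the rough estimate: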

\[\sigma \approx \frac{\text{Range}}{4}\]

It is relatively easy to have some idea of the range. This is an important point. Even though perhaps none of us has raised pigs, we can still come up with a sensible guess. So, for this case, we guess that the weight gain within this 3-week period ranges from a minimum of 10 lbs to a maximum of 50 lbs.

\(\sigma\) can now be roughly estimated to be:

\[\dfrac{\text{Range}}{4}=\dfrac{50-10}{4}=10 \text{ lbs}\]

Now we can use the formula for estimating the mean, \(\mu\). Then,

\[\begin{align} n &= \dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}} \\ &= \dfrac{1}{\dfrac{ 2^2}{ (1.645)^2 \cdot (10)^2}+\dfrac{1}{1000}} \\ &= 63.36 \end{align}\]

Round up to 64.

Answer

We will need to sample 64 pigs in order to estimate the average weight gain in 3 weeks to within 2 lbs with a 90% confidence interval.
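The pig calculation can be reproduced in Python as a quick check. This sketch takes the guessed range (10 to 50 lbs) from the example; the variable names are ours, and the critical value is computed with the standard library instead of being rounded to 1.645, so the unrounded result differs slightly from 63.36 while the rounded answer is unchanged.

```python
import math
from statistics import NormalDist

N = 1000                 # pigs on the farm
d = 2.0                  # desired margin of error in lbs
conf = 0.90              # confidence level
low, high = 10.0, 50.0   # guessed minimum and maximum weight gain (lbs)

# Rough standard deviation from the range.
sigma = (high - low) / 4.0

# z_{alpha/2} for 90% confidence (about 1.645).
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)

# Sample size for the mean with the finite population correction.
raw_n = 1.0 / (d**2 / (z**2 * sigma**2) + 1.0 / N)
n = math.ceil(raw_n)

print(sigma, round(z, 3), round(raw_n, 2), n)
```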

2.2 Confidence Intervals for Population Proportion

Estimating Proportions

Estimating a population proportion can be seen as a particular case of estimating the population mean. Many of the results for the mean can therefore be borrowed and used when working with proportions…

We want to estimate the proportion of units in the population having some attribute. For example, a question might be, “What would be the proportion of Penn State students who are smokers?” Another example is, “What would be the proportion of people preferring a type of presentation?”

The Gallup Poll: Most polls are based on telephone interviews, with a significant portion based on interviews conducted in person during home visits. Usually, the sample size is at least 1000, sometimes even 1500.

Let’s see in what ways the proportion problem is related to the mean problem…

Example

Do you approve of President Bush’s job performance?

Answer

\[y_{i} = \begin{cases} 0 & \text{no} \\ 1 & \text{yes} \end{cases}\]

The population unit: \(1, 2, \dots, N\).

The variable of interest: \(y_1,y_2, \dots,y_{N}\).

Population proportion: \(p=\dfrac{1}{N} \sum\limits_{i=1}^N y_i\), which is the population mean, \(\mu\).

If we take a simple random sample of size \(n\), then

\[\hat{p}= \sum\limits_{i=1}^n \dfrac{y_i}{n}=\bar{y}\]

Because \(y_i\) takes only the values 0 and 1, its variance is determined by its mean.

To find the finite population variance for \(y_1,y_2, \dots,y_{N}\), we know that the population mean is:

\[\mu=\dfrac{1}{N} \sum\limits_{i=1}^N y_i =p\]

By definition the variance is then:

\[\begin{align} \sigma^2 &= \dfrac{\sum\limits_{i=1}^{N}(y_i-p)^2}{N-1} \\ &= \dfrac{\sum\limits_{i=1}^{N}(y_i^2-2py_i+p^2)}{N-1} \\ &= \dfrac{\sum\limits_{i=1}^{N}y_i^2-2p\sum\limits_{i=1}^N y_i+Np^2}{N-1} \end{align}\]

Then, since \(y^2_i = y_i\):

\[\begin{align} &= \dfrac{\sum\limits_{i=1}^{N}y_i-2p\sum\limits_{i=1}^N y_i+Np^2}{N-1} \\ &= \dfrac{Np-2p(Np)+Np^2}{N-1} \\ \sigma^2 &= \dfrac{Np-Np^2}{N-1}=\dfrac{Np(1-p)}{N-1} \end{align}\]

Theoretically, this is the variance.
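As a quick numerical check of this result, the sketch below builds a small made-up 0/1 population (3 ones and 7 zeros, chosen only for illustration) and compares the variance computed directly with divisor \(N-1\) against the closed form \(\dfrac{Np(1-p)}{N-1}\).

```python
import math
from statistics import variance

# Hypothetical population: 3 units with the attribute, 7 without.
y = [1] * 3 + [0] * 7
N = len(y)
p = sum(y) / N

# statistics.variance divides by (number of values - 1), matching the
# finite population variance definition used in the lesson.
direct = variance(y)

# Closed form derived above.
closed_form = N * p * (1 - p) / (N - 1)

print(direct, closed_form)
```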

How will we estimate this? We can estimate this by:

\[\hat{\sigma}^2=s^2=\dfrac{n}{n-1}\hat{p}\cdot (1-\hat{p})\]

What we want is to see how \(\hat{p}\) behaves, therefore, we want to know its distribution. First, we find its mean, then its variance.

Since \(\hat{p}\) is \(\bar{y}\), we can get \(E(\hat{p})=\mu=p\). Then, we proceed to find its variance.

\[\begin{align} \operatorname{Var}(\hat{p}) &= \left(1-\dfrac{n}{N}\right)\cdot \dfrac{\sigma^2}{n} \\ &= \left(\dfrac{N-n}{N}\right)\cdot \dfrac{N \cdot p \cdot (1-p)}{(N-1)\cdot n} \\ &= \left(\dfrac{N-n}{N-1}\right)\cdot \dfrac{p \cdot (1-p)}{n} \\ \end{align}\]

How will we estimate the variance of \(\hat{p}\)? There are many answers for how to do this. One method would be to use maximum likelihood, another would be to find the unbiased estimator.

An unbiased estimator of the variance is:

\[\hat{\operatorname{Var}}(\hat{p})=\left(\dfrac{N-n}{N}\right) \cdot \dfrac{\hat{p} \cdot (1-\hat{p})}{n-1}\]

This is one reasonable answer for determining an estimate of the variance. The answer will not be very different from what one would get using other methods.
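Unbiasedness can be checked by brute force on a tiny population. The following sketch enumerates every simple random sample of size 3 from a made-up population of 8 units (the population and all names are ours, for illustration only) and uses exact fractions, so the comparisons involve no rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 0/1 population with N = 8 and p = 3/8.
y = [1, 1, 1, 0, 0, 0, 0, 0]
N, n = len(y), 3
p = Fraction(sum(y), N)

p_hats = []
var_hats = []
# Every subset of n positions is equally likely under simple random sampling.
for sample in combinations(y, n):
    p_hat = Fraction(sum(sample), n)
    p_hats.append(p_hat)
    # Estimator from the lesson: (N-n)/N * p_hat(1-p_hat)/(n-1).
    var_hats.append(Fraction(N - n, N) * p_hat * (1 - p_hat) / (n - 1))

num_samples = len(p_hats)
mean_p_hat = sum(p_hats) / num_samples
# True sampling variance of p_hat over all possible samples.
true_var = sum((ph - mean_p_hat) ** 2 for ph in p_hats) / num_samples
# Theoretical variance: (N-n)/(N-1) * p(1-p)/n.
formula_var = Fraction(N - n, N - 1) * p * (1 - p) / n
# Average of the variance estimator over all possible samples.
mean_var_hat = sum(var_hats) / num_samples

print(mean_p_hat, true_var, formula_var, mean_var_hat)
```

If the estimator is unbiased, the last three printed values agree, and the first equals \(p\).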

What about confidence intervals? For this, we need to know the distribution of \(\hat{p}\). When the sample size is large, \(\hat{p}\) is approximately normally distributed by the Central Limit Theorem. Therefore, we can use the \(t\) interval:

\[\text{Answer: } \hat{p} \pm t_{\alpha/2} \sqrt{\hat{\operatorname{Var}}(\hat{p})}\]

How large is large enough?

\[\text{Answer: if } n \cdot \hat{p}\geq 5, n \cdot (1-\hat{p})\geq 5.\]

We have fairly precise criteria here for whether or not to use \(t\) when constructing the confidence interval.

Example 2.3 (Presidential Approval Rating) Let’s revisit the previous example about President Bush’s final approval rating.

Source: CBS News (Jan 21, 2009), “President Bush’s Final Approval Rating”.

President Bush’s final approval rating is 22%!

If you read the website you can learn a lot about the specifics of this poll. The poll was conducted by telephone interview with 1,112 adults nationwide.

After looking at this statistic, provide a 95% CI for the true proportion. The 22% is a sample proportion - what is the true population proportion?

Answer

\[\hat{\operatorname{Var}}(\hat{p})=\left(\dfrac{N-n}{N}\right) \cdot \dfrac{\hat{p}\cdot (1-\hat{p})}{n-1}=1\cdot \dfrac{0.22 \times 0.78}{1112-1}=0.0001545\]

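Here the finite population correction is taken to be 1 because \(N\) is very large compared to \(n\). With \(n = 1112\), the \(t\) critical value is essentially the \(z\) value 1.96, so a 95% confidence interval for \(p\) is: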

\[0.22 \pm 1.96 \sqrt{0.0001545}\]

\[=0.22 \pm 0.0244\]
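The interval can be reproduced with a small Python function. This is a sketch under stated assumptions: the function name and signature are ours, the finite population correction is dropped when no \(N\) is supplied, and the normal critical value is used in place of \(t\), which is reasonable only because \(n\) is large here.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, N=None, conf=0.95):
    """Large-sample confidence interval for a population proportion.

    Uses Var_hat = (N-n)/N * p_hat(1-p_hat)/(n-1); the factor (N-n)/N is
    taken as 1 when N is not given (population much larger than sample).
    """
    fpc = 1.0 if N is None else (N - n) / N
    var_hat = fpc * p_hat * (1 - p_hat) / (n - 1)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * math.sqrt(var_hat)
    return p_hat - half_width, p_hat + half_width


n, p_hat = 1112, 0.22
# Rule of thumb from the lesson for "large enough".
large_enough = n * p_hat >= 5 and n * (1 - p_hat) >= 5

lower, upper = proportion_ci(p_hat, n)
print(large_enough, round(lower, 4), round(upper, 4))
```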

2.3 Sample Size Needed for Estimating Proportion

Using the formula to find the sample size for estimating the mean, we have:

\[n=\dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\]

Substituting \(\sigma^2=\dfrac{N}{N-1}\cdot p \cdot (1-p)\), we get:

\[n=\dfrac{N \cdot p \cdot (1-p)}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+p\cdot(1-p)}\]
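The substitution can be spelled out in one intermediate step: replace \(\sigma^2\) in the formula for the mean, then multiply the numerator and denominator by \(N \cdot p \cdot (1-p)\).

\[\begin{align} n &= \dfrac{1}{\dfrac{d^2}{z^2_{\alpha/2}}\cdot\dfrac{N-1}{N \cdot p \cdot (1-p)}+\dfrac{1}{N}} \\ &= \dfrac{N \cdot p \cdot (1-p)}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+p\cdot(1-p)} \end{align}\]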

When the finite population correction can be ignored, the formula is:

\[n\approx \dfrac{z^2_{\alpha/2}\cdot p \cdot (1-p)}{d^2}\]

Now, when finding the sample size for a proportion, in addition to using an educated guess to estimate \(p\), we can also find a conservative sample size that guarantees the margin of error is small enough at a specified \(\alpha\).

  1. Educated guess (estimate \(p\) by \(\hat{p}\)):

    \[n=\dfrac{N\cdot\hat{p}\cdot(1-\hat{p})}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+\hat{p}\cdot(1-\hat{p})}\]

    Note that \(\hat{p}\) may differ from the true proportion, in which case the sample size may not be large enough (i.e., the margin of error is not as small as specified).

  2. Conservative sample size:

    Since \(p(1 - p)\) attains maximum at \(p = 1/2\), a conservative estimate for sample size is:

    \[n=\dfrac{N\cdot 1/4}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+1/4}\]

Example 2.4 (Presidential Approval Rating: Sample Size) To estimate the next president’s final approval rating, how many people should be sampled so that the margin of error is 3%, (a popular choice), with 95% confidence?

  1. Use an educated guess: Bush’s final approval rating, \(\hat{p} = 0.22\).

    Since \(N\) is very large compared to \(n\), finite population correction is not needed.

    \[\begin{align} n &=\dfrac{\hat{p}\cdot(1-\hat{p})\cdot z^2_{\alpha/2}}{d^2}\\ &=\dfrac{0.22\cdot0.78\cdot1.96^2}{0.03^2}\\ &=732.47\\ \end{align}\]

    round up to 733.

  2. Use a conservative approach.

    \[\begin{align} n &=\dfrac{0.5\cdot0.5\cdot1.96^2}{0.03^2}\\ &=1067.11\\ \end{align}\]

    round up to 1068.
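Both calculations in this example can be checked with one Python helper. This is a sketch: the function name and its defaults are ours, the critical value is computed by the standard library instead of being rounded to 1.96, and the population size of 10,000 in the last line is a made-up value used only to show the finite population version.

```python
import math
from statistics import NormalDist


def n_for_proportion(d, p=0.5, N=None, conf=0.95):
    """Unrounded sample size for estimating a proportion to within d.

    p is the educated guess; the default p = 0.5 gives the conservative size.
    If N is given, the finite population formula is used.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    pq = p * (1 - p)
    if N is None:
        return z**2 * pq / d**2
    return N * pq / ((N - 1) * d**2 / z**2 + pq)


educated = math.ceil(n_for_proportion(0.03, p=0.22))
conservative = math.ceil(n_for_proportion(0.03))
print(educated, conservative)

# Hypothetical finite population of 10,000: the required size is smaller.
print(math.ceil(n_for_proportion(0.03, N=10_000)))
```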

Try It!

How do we choose between the educated guess or the conservative approach?

One should compare the cost of sampling extra units with the cost of setting up the sampling process a second time. A second round of sampling may be needed when the educated guess turns out to give too small a sample. If setting up the sampling procedure again is expensive compared to the cost of sampling extra units, then the conservative approach is preferred.

  1. Find the proportion of CD players in this shipment that have a lifetime longer than 2000 hours. The proportion from the last shipment was 0.9. It is not costly to set up the testing procedure again if needed whereas the sampling cost of each unit is expensive. We want to estimate the proportion to be within 0.01 with 95% confidence. Would you use the educated guess or the conservative approach?

    We should use an educated guess because it is not costly to set up the testing procedure again. On the other hand, the cost of the sampling of extra units is high due to the nature of the test.

  2. Get a ship out to the Bering Sea to sample the proportion of fish that have mercury levels within a specified level. Last year the proportion was 0.9. We want to estimate the proportion to be within 0.01 with 95% confidence. Would you use the educated guess or the conservative approach?

    We should use a conservative approach because it is too expensive to send a ship out again if needed.
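To see what is at stake in these two scenarios, the sketch below computes both sample sizes for a guess of 0.9 and a margin of error of 0.01, ignoring the finite population correction. The names are ours, and the "true proportion of 0.8" at the end is a hypothetical value chosen only to show how the educated guess can fall short.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # z_{alpha/2} for 95% confidence
d = 0.01                          # desired margin of error


def size(p):
    """Rounded-up sample size, finite population correction ignored."""
    return math.ceil(z**2 * p * (1 - p) / d**2)


n_guess = size(0.9)          # educated guess from the previous proportion
n_conservative = size(0.5)   # worst case p = 1/2
extra_units = n_conservative - n_guess
print(n_guess, n_conservative, extra_units)

# Hypothetical: if the true proportion were 0.8 instead of 0.9, the margin
# of error achieved with n_guess units would exceed the target d.
achieved = z * math.sqrt(0.8 * 0.2 / n_guess)
print(round(achieved, 4), achieved > d)
```

The conservative size is far larger, which is why it pays off only when a second sampling trip would cost more than the extra units.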

Exact intervals for population proportions

Since \(Y_i\) is defined as 1 or 0 depending on whether the unit has the attribute or not and the sampling is without replacement, one can see that to be exact, \(\sum Y_i\) has a hypergeometric distribution.

Using this property, one can obtain an exact confidence interval for \(p\). When the total number of successes and the total number of failures are both large (larger than 5), we can use the \(t\)-interval. (We can use the \(z\)-interval if \(n > 50\).)
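The hypergeometric claim can be tied back to the variance formula for \(\hat{p}\) with a short check. The sketch below computes the hypergeometric probabilities directly from binomial coefficients for a made-up population (\(N = 20\) units, 6 with the attribute, samples of size 5; all values and names are ours) and compares the resulting variance of \(\hat{p}\) with \(\left(\dfrac{N-n}{N-1}\right)\cdot\dfrac{p(1-p)}{n}\). It does not construct the exact confidence interval itself.

```python
import math


def hypergeom_pmf(k, N, K, n):
    """P(sum of Y_i = k) when drawing n units without replacement
    from N units of which K have the attribute."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)


# Hypothetical population and sample size.
N, K, n = 20, 6, 5
p = K / N

# Possible values of the number of sampled units with the attribute.
support = range(max(0, n - (N - K)), min(n, K) + 1)
probs = {k: hypergeom_pmf(k, N, K, n) for k in support}

# Mean and variance of p_hat = k / n under the exact distribution.
mean_p_hat = sum(k / n * pr for k, pr in probs.items())
var_p_hat = sum((k / n - mean_p_hat) ** 2 * pr for k, pr in probs.items())

# Variance formula from Section 2.2.
formula = (N - n) / (N - 1) * p * (1 - p) / n

print(mean_p_hat, var_p_hat, formula)
```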

Sample size for estimating several proportions simultaneously

It is good to know that there is a solution to the following scenario:

There are a few (maybe unknown) classes, and one wants to collect enough samples so that the proportion in each class can be estimated within a certain prescribed precision. (The details are not needed here; if interested, read Ch. 5.4 and the reference cited there.)