Lesson 2: Confidence Intervals and Sample Size

Building on the confidence intervals from the last lesson, this lesson starts with selecting the sample size needed to estimate the population mean or the population total by a confidence interval with a specified margin of error and a specified level of confidence. The second section discusses the confidence interval for a population proportion. The last section discusses the sample size needed to estimate a population proportion, introducing both the educated guess method and the conservative method.

Reading for Lesson 2: Ch. 4.1-4.2 and 5.1-5.3 of Sampling by Steven Thompson, 3rd edition.


Upon completion of this lesson, you should be able to:

  • find the sample size needed for estimating the population mean and the population total
  • compute the confidence interval for a population proportion
  • find the sample size needed for estimating a population proportion by both the educated guess method and the conservative method, and
  • know when to use the educated guess method and when to use the conservative method

2.1 - Sample Size for Estimating Population Mean and Total

How large must a sample be in order to estimate the population mean?

If \(\hat{\theta}\)  is an unbiased, normally distributed estimator of \(\theta\), then

\(\dfrac{\hat{\theta}-\theta}{\sqrt{Var(\hat{\theta})}} \sim N(0,1)\)

Then \(P\left(\dfrac{|\hat{\theta}-\theta|}{\sqrt{Var(\hat{\theta})}} > z_{\alpha/2}  \right)=\alpha \)

\( P\left(|\hat{\theta}-\theta|>z_{\alpha/2} \cdot \sqrt{Var(\hat{\theta})} \right)= \alpha \)

Once \(\alpha\) is specified, we can look for a sample size large enough to achieve the goal of the experiment.

So, we need to ask, "What is the goal of your experiment?" This is perhaps the most important question to ask when planning it.

Example: Suppose we want to estimate the average weight of male Penn State students. How many students should we plan to sample? What do we need to consider?

  1. The variability of the quantity being measured is the first concern. It directly affects how many units you will need to sample.
  2. The second consideration is the type of conclusion that you would like to report. That is, you need to specify the confidence level \(1 - \alpha\) that you are happy with.
  3. How precise do you want the estimate to be? You thus need to specify the margin of error.

Now, if we specify \(1-\alpha\) and the margin of error d (which can also be viewed as the half-width of the \((1 - \alpha)\)100% CI), we can solve for the sample size such that the CI has the specified margin of error.

For estimating population mean, the equation becomes:

\(P\left(|\bar{x}-\mu|>z_{\alpha/2} \cdot \sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}\right)=\alpha\)

\(z_{\alpha/2}\sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}=d\)

\(n=\dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)
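The algebra between the margin-of-error equation and the formula for n is short. A sketch of the intermediate steps, using only the symbols already defined, is:

\begin{align}
z_{\alpha/2}\sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}} = d
& \;\Longrightarrow\; z^2_{\alpha/2}\cdot \sigma^2 \cdot \dfrac{N-n}{N\cdot n} = d^2 \\
& \;\Longrightarrow\; \dfrac{1}{n}-\dfrac{1}{N} = \dfrac{d^2}{z^2_{\alpha/2}\cdot \sigma^2} \\
& \;\Longrightarrow\; n = \dfrac{1}{\dfrac{d^2}{z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}
\end{align}

The second line uses \(\dfrac{N-n}{N\cdot n}=\dfrac{1}{n}-\dfrac{1}{N}\).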

Can we now use this formula to estimate the sample size?

What is the weak point of this formula? It requires the population variance \(\sigma^2\), which we do not know! It has to be estimated or guessed.

Similarly, for estimating the population total \(\tau\), we have:

\(P\left(|\hat{\tau}-\tau|>z_{\alpha/2} \cdot \sqrt{N(N-n)\dfrac{\sigma^2}{n}} \right)=\alpha\)

\(z_{\alpha/2}\sqrt{N(N-n)\dfrac{\sigma^2}{n}}=d\)

\(n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)

Example 2-1: Beetles - Sample size

What sample size is needed to estimate the population total, \(\tau\), to within d = 1000 with a 95% CI?

Now, let's begin plugging what we know into the formula. We know N = 100, \(\alpha\) = 0.05. Do we know \(\sigma^2\)? No, but we can estimate \(\sigma^2\) by \(s^2\) = 1932.657.

How many plots should we sample? Let's calculate:

\(n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)

\(n=\dfrac{1}{\dfrac{ (1000)^2}{ (100)^2 \cdot (1.96)^2 \cdot 1932.657}+\dfrac{1}{100}}=42.610\)

We always round up, so we will sample 43 of the 100 plots.
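The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `sample_size_total` is made up for illustration, and the z value comes from the standard library's `NormalDist` (about 1.95996) rather than the rounded 1.96, so the unrounded result differs slightly from 42.610 in the last digits.

```python
import math
from statistics import NormalDist


def sample_size_total(d, N, sigma2, conf=0.95):
    """Sample size so that the CI for the population total has half-width d.

    Implements n = 1 / (d^2 / (N^2 * z^2 * sigma^2) + 1/N).
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    return 1.0 / (d**2 / (N**2 * z**2 * sigma2) + 1.0 / N)


# Beetle example: N = 100 plots, d = 1000, sigma^2 estimated by s^2 = 1932.657
n_raw = sample_size_total(d=1000, N=100, sigma2=1932.657)
n = math.ceil(n_raw)  # always round up
print(round(n_raw, 2), n)
```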

Try it!

What is the major point that was just illustrated in the previous example?

In this first example, N = 100 is not very large compared to n, so one should not ignore the finite population adjustment!

But wait a minute, should you have any cause for concern about the answer, n = 43, that we obtained?

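What about the value that we used for \(\sigma^2\), (1932.657)? It is the sample estimate \(s^2\), not the true variance, so a t value should replace \(z_{\alpha/2}\).

Let's take a look at this more iterative method: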

\(n=\dfrac{1}{\dfrac{d^2}{N^2\cdot t^2 \cdot s^2}+\dfrac{1}{N}}\)

Complication: the t value depends on n through its degrees of freedom, n - 1. First we will use n = 43, for which the t value with 42 df is 2.0181.

\(n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0181)^2 \cdot 1932.657}+\dfrac{1}{100}}=44.044\)

Round up to 45, t for 44 df is 2.0154.

\(n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0154)^2 \cdot 1932.657}+\dfrac{1}{100}}=43.978\)

Here, rounding up gives n = 44, one less than the 45 from the previous step. The conservative answer is to take the larger value, n = 45.

Consequently, our final answer is to sample 45 plots.
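The two iterations above can be sketched in Python as follows. The t values are the two quoted in the lesson (hard-coded in a small lookup table, since the standard library has no t quantile function); the helper name `sample_size_total_t` and the `history` list are illustrative only.

```python
import math

# t_{0.025, df} values quoted in the lesson; any other df would need a t table.
T_VALUES = {42: 2.0181, 44: 2.0154}


def sample_size_total_t(d, N, s2, t):
    """n = 1 / (d^2 / (N^2 * t^2 * s^2) + 1/N), with t in place of z."""
    return 1.0 / (d**2 / (N**2 * t**2 * s2) + 1.0 / N)


n = 43  # starting value from the z-based formula
history = []
for _ in range(2):  # the two refinement steps carried out in the text
    t = T_VALUES[n - 1]  # degrees of freedom = n - 1
    n_raw = sample_size_total_t(d=1000, N=100, s2=1932.657, t=t)
    n_next = math.ceil(n_raw)
    history.append((n, t, round(n_raw, 3), n_next))
    n = n_next

# Conservative choice: the largest sample size seen during the iteration.
n_final = max(step[3] for step in history)
print(history)
print(n_final)
```

Taking the maximum over the iterations mirrors the lesson's conservative choice of 45 over 44.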

In the beetle example, there are data to estimate \(\sigma^2\). What can one do if there are no pilot data? How can we get a rough idea of what \(\sigma\) is?

Example 2-2: Average Weight Gain of Pigs

A farm has 1000 young pigs with an initial weight of about 50 lbs. They put them on a new diet for 3 weeks and want to know how many pigs to sample so that they can estimate the average weight gain. They want the answer to be within 2 lbs. with 90% confidence.

There are no pilot data here, and we don't have the time to select some pigs in order to get an estimate for \(\sigma\), the standard deviation of the weight gain.

Question: How do we get a rough estimate of \(\sigma\)?

What reasonable measure could guide this farmer in estimating the standard deviation of the weight gain?

One thing we can do is to rely on the information that we already have, i.e., find some historical data that exists on this topic. But what if this historical data does not exist?

Try it!

Could we find a rough estimate for \(\sigma\)?

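For certain variables we can make reasonable guesses for an estimate of \(\sigma\). For roughly bell-shaped data, about 95% of the values fall within two standard deviations of the mean, so the range spans about four standard deviations. This gives a formula for the rough estimate: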

\(\sigma \approx \frac{Range}{4}\)

The range is relatively easy to guess. This is an important point: even though perhaps none of us has raised pigs, we can still come up with a sensible guess. For this case, we guess that the weight gain within the 3-week period ranges from a minimum of 10 lbs to a maximum of 50 lbs.

\(\sigma\) can now be roughly estimated to be:

\(\dfrac{Range}{4}=\dfrac{50-10}{4}=10 \text{ lbs}\)

Now we can use the formula for estimating the mean, \(\mu\). Then,

\begin{align}
n & = \dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}} \\
& = \dfrac{1}{\dfrac{ 2^2}{ (1.645)^2 \cdot (10)^2}+\dfrac{1}{1000}} \\
& = 63.36
\end{align}

Round up to 64.


We will need to sample 64 pigs in order to estimate the average weight gain in 3 weeks to within 2 lbs. with a 90% confidence interval.
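As a check on the arithmetic, here is a minimal Python sketch of the same calculation. The function name `sample_size_mean` is hypothetical, and the z value is taken from `NormalDist` (about 1.64485) rather than the rounded 1.645, so the unrounded result is close to, but not exactly, 63.36.

```python
import math
from statistics import NormalDist


def sample_size_mean(d, N, sigma, conf):
    """n = 1 / (d^2 / (z^2 * sigma^2) + 1/N) for estimating the population mean."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    return 1.0 / (d**2 / (z**2 * sigma**2) + 1.0 / N)


sigma_guess = (50 - 10) / 4  # rough estimate: Range / 4
n_raw = sample_size_mean(d=2, N=1000, sigma=sigma_guess, conf=0.90)
n = math.ceil(n_raw)  # always round up
print(sigma_guess, round(n_raw, 2), n)
```

Because the answer depends on \(\sigma^2\), doubling the guessed range would roughly quadruple the required sample size when N is large, so the guess deserves some care.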

2.2 - Confidence Intervals for Population Proportion

Estimating Proportions

Estimating a population proportion can be seen as a particular case of estimating the population mean. Many of the results for the mean carry over to proportions.

We want to estimate the proportion of units in the population having some attribute. For example, a question might be, "What is the proportion of Penn State students who are smokers?" Another example is, "What is the proportion of people preferring a type of presentation?"

The Gallup Poll: Most polls are based on telephone interviews, with a significant portion based on interviews conducted in person during home visits. Usually the sample size is at least 1000, sometimes even 1500.

Let's see in what ways the proportion problem is related to the mean problem...

Example: Do you approve of President Bush's job performance?

\( y_i = \begin{cases}
0 & \text{no} \\
1 & \text{yes}
\end{cases} \)

The population units are: 1, 2, ... , N.

The variable of interest: \(y_1,y_2,...,y_N\).

Population proportion: \(p=\dfrac{1}{N} \sum\limits_{i=1}^N y_i\), which is the population mean, \(\mu\).

If we take a simple random sample of size n, then

\(\hat{p}= \sum\limits_{i=1}^n \dfrac{y_i}{n}=\bar{y}\)

Because each \(y_i\) is either 0 or 1, the variance of the \(y_i\) is determined by their mean.

To find the finite population variance for \(y_1,y_2,...,y_N\), we know that the population mean is:

\(\mu=\dfrac{1}{N} \sum\limits_{i=1}^N y_i =p\)

By definition the variance is then:

\begin{align}
\sigma^2 & = \dfrac{\sum\limits_{i=1}^{N}(y_i-p)^2}{N-1} \\
& = \dfrac{\sum\limits_{i=1}^{N}(y_i^2-2py_i+p^2)}{N-1} \\
& = \dfrac{\sum\limits_{i=1}^{N}y_i^2-2p\sum\limits_{i=1}^N y_i+Np^2}{N-1}
\end{align}

Then, since \(y^2_i = y_i\):

\begin{align}
\sigma^2 & = \dfrac{\sum\limits_{i=1}^{N}y_i-2p\sum\limits_{i=1}^N y_i+Np^2}{N-1} \\
& = \dfrac{Np-2p(Np)+Np^2}{N-1} \\
& = \dfrac{Np-Np^2}{N-1}=\dfrac{Np(1-p)}{N-1}
\end{align}

This is the theoretical (population) variance. It depends on the unknown p.

How will we estimate it? We can estimate it by:

\(\hat{\sigma}^2=s^2=\dfrac{n}{n-1}\hat{p}\cdot (1-\hat{p})\)
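Both identities are easy to check numerically. The sketch below uses a made-up 0/1 population of 10 units and an arbitrary 5-unit sample (hypothetical data, chosen only for illustration) and compares the directly computed variance with the closed-form expression.

```python
import math
from statistics import variance

# Hypothetical 0/1 population: 3 "yes" answers among N = 10 units.
population = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
N = len(population)
p = sum(population) / N

# statistics.variance divides by (N - 1), matching the definition in the lesson.
sigma2_direct = variance(population)
sigma2_formula = N * p * (1 - p) / (N - 1)

# The same identity for a sample of n = 5 units (an arbitrary illustration).
sample = [1, 0, 0, 1, 0]
n = len(sample)
p_hat = sum(sample) / n
s2_direct = variance(sample)
s2_formula = n / (n - 1) * p_hat * (1 - p_hat)

print(sigma2_direct, sigma2_formula)
print(s2_direct, s2_formula)
```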

What we want is to see how \(\hat{p}\) behaves, therefore, we want to know its distribution. First, we find its mean, then its variance.

Since \(\hat{p}\) is \(\bar{y}\), we can get \(E(\hat{p})=\mu=p\). Then, we proceed to find its variance.

\begin{align}
Var(\hat{p}) & = \left(1-\dfrac{n}{N}\right)\cdot \dfrac{\sigma^2}{n} \\
& = \left(\dfrac{N-n}{N}\right)\cdot \dfrac{N \cdot p \cdot (1-p)}{(N-1)\cdot n} \\
& = \left(\dfrac{N-n}{N-1}\right)\cdot \dfrac{p \cdot (1-p)}{n}
\end{align}

How will we estimate the variance of \(\hat{p}\)? There are many answers for how to do this. One method would be to use maximum likelihood, another would be to find the unbiased estimator.

An unbiased estimator of the variance is:

\(\hat{V}ar(\hat{p})=\left(\dfrac{N-n}{N}\right) \cdot \dfrac{\hat{p} \cdot (1-\hat{p})}{n-1}\)

This is one reasonable answer for determining an estimate of the variance. The answer will not be very different from what one would get using other methods.
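For a very small population, unbiasedness can be verified exactly by listing every possible simple random sample. The sketch below uses a hypothetical population of N = 6 units, 2 of which have the attribute, with n = 3, and exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 1, 0, 0, 0, 0]  # hypothetical: N = 6 units, 2 with the attribute
N, n = len(population), 3
p = Fraction(sum(population), N)

# Under simple random sampling, every subset of size n is equally likely.
samples = list(combinations(population, n))
p_hats = [Fraction(sum(s), n) for s in samples]

# Exact mean and variance of p-hat over all possible samples.
mean_p_hat = sum(p_hats) / len(p_hats)
var_p_hat = sum((ph - mean_p_hat) ** 2 for ph in p_hats) / len(p_hats)

# Theoretical variance: (N - n)/(N - 1) * p(1 - p)/n.
var_formula = Fraction(N - n, N - 1) * p * (1 - p) / n

# Average of the variance estimator (N - n)/N * p-hat(1 - p-hat)/(n - 1).
var_estimates = [Fraction(N - n, N) * ph * (1 - ph) / (n - 1) for ph in p_hats]
mean_var_estimate = sum(var_estimates) / len(var_estimates)

print(mean_p_hat, var_p_hat, var_formula, mean_var_estimate)
# prints: 1/3 2/45 2/45 2/45
```

All three variance quantities agree, and the mean of \(\hat{p}\) equals p, as the formulas predict.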

What about confidence intervals? For these we need to know the distribution of \(\hat{p}\). When the sample size is large, \(\hat{p}\) is approximately normally distributed by the Central Limit Theorem. Therefore, we can use the t interval:

\(\text{Answer:}\quad \hat{p} \pm t_{\alpha/2} \sqrt{\hat{V}ar(\hat{p})}\)

How large is large enough?

\(\text{Answer: if } n \cdot \hat{p}\geq 5,\quad n \cdot (1-\hat{p})\geq 5.\)

We have fairly precise criteria here for whether or not to use t when constructing the confidence interval.

Example 2-3: Presidential Approval Rating

Let's revisit the previous example about President Bush's final approval rating.

Source: CBS News (Jan 21, 2009), web page: President Bush's Final Approval Rating.

President Bush's final approval rating is 22%!

If you read the web site you can learn a lot about the specifics of this poll. The poll was conducted by telephone interviews with 1,112 adults nationwide.

After looking at this statistic, provide a 95% CI for the true proportion. The 22% is a sample proportion - what is the true population proportion?


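Here N, the number of adults nationwide, is very large compared to n = 1112, so the finite population correction \(\dfrac{N-n}{N}\) is approximately 1:

\(\hat{V}ar(\hat{p})=\left(\dfrac{N-n}{N}\right) \cdot \dfrac{\hat{p}\cdot (1-\hat{p})}{n-1}\approx 1\cdot \dfrac{0.22 \times 0.78}{1112-1}=0.0001545\)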

And a 95% confidence interval for p is:

\(0.22 \pm 1.96 \sqrt{0.0001545}\)

\(=0.22 \pm 0.0244\)
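The interval can be reproduced with a short Python sketch. The function name `proportion_ci` is hypothetical; passing `N=None` is this sketch's way of treating the finite population correction as 1, as the example does, and z = 1.96 is used to match the calculation above.

```python
import math


def proportion_ci(p_hat, n, z=1.96, N=None):
    """CI for a proportion using the unbiased variance estimator.

    N=None means the finite population correction (N - n)/N is taken as 1.
    """
    fpc = 1.0 if N is None else (N - n) / N
    var_hat = fpc * p_hat * (1 - p_hat) / (n - 1)
    half_width = z * math.sqrt(var_hat)
    return p_hat - half_width, p_hat + half_width, var_hat


low, high, var_hat = proportion_ci(p_hat=0.22, n=1112)

# Rule of thumb for the normal approximation: n*p-hat >= 5 and n*(1 - p-hat) >= 5.
large_enough = 1112 * 0.22 >= 5 and 1112 * (1 - 0.22) >= 5

print(round(var_hat, 7), round(low, 4), round(high, 4), large_enough)
```

In other words, the 95% confidence interval runs from about 19.6% to about 24.4%.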

2.3 - Sample Size Needed for Estimating Proportion

Using the formula to find sample size for estimating the mean we have:

\(n=\dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)

Now, substituting \(\sigma^2=\dfrac{N}{N-1}\cdot p \cdot (1-p)\), we get:

\(n=\dfrac{N \cdot p \cdot (1-p)}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+p\cdot(1-p)}\)

When the finite population correction can be ignored, the formula is:

\(n\approx \dfrac{z^2_{\alpha/2}\cdot p \cdot (1-p)}{d^2}\)
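The substitution and the large-N approximation can be spelled out as follows (a sketch of the intermediate algebra, using only the symbols already defined):

\begin{align}
\dfrac{d^2}{z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}
& = \dfrac{(N-1)\cdot d^2}{z^2_{\alpha/2}\cdot N \cdot p \cdot (1-p)}+\dfrac{1}{N} \\
& = \dfrac{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+p\cdot(1-p)}{N \cdot p \cdot (1-p)}
\end{align}

Taking the reciprocal gives the finite-population formula for n. Dividing its numerator and denominator by N gives

\(n=\dfrac{p \cdot (1-p)}{\dfrac{N-1}{N}\cdot\dfrac{d^2}{z^2_{\alpha/2}}+\dfrac{p\cdot(1-p)}{N}}\)

and when N is very large, \(\dfrac{N-1}{N}\approx 1\) and \(\dfrac{p\cdot(1-p)}{N}\approx 0\), which leaves \(n\approx \dfrac{z^2_{\alpha/2}\cdot p \cdot (1-p)}{d^2}\).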

Now, to find the sample size for a proportion, we can use an educated guess to estimate p. We can also find a conservative sample size, which guarantees that the margin of error is small enough at the specified \(\alpha\).

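  1. Educated guess (estimate p by \(\hat{p}\) ):

    \(n=\dfrac{N \cdot \hat{p} \cdot (1-\hat{p})}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+\hat{p}\cdot(1-\hat{p})}\)

    Note that \(\hat{p}\) may differ from the true proportion, so in some cases the sample size may not be large enough (i.e., the margin of error may not be as small as specified).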

  2. Conservative sample size:

    Since p(1 - p) attains its maximum at p = 1/2, a conservative estimate for the sample size is:

    \(n=\dfrac{N\cdot 1/4}{(N-1)\dfrac{d^2}{z^2_{\alpha/2}}+1/4}\)
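To see why p = 1/2 is the worst case, complete the square:

\(p\cdot(1-p)=\dfrac{1}{4}-\left(p-\dfrac{1}{2}\right)^2\leq \dfrac{1}{4}\)

with equality only at p = 1/2. Writing \(a=p\cdot(1-p)\) and \(c=(N-1)\dfrac{d^2}{z^2_{\alpha/2}}\), the sample size formula is \(n=\dfrac{N\cdot a}{c+a}\), which increases as a increases. The required n is therefore largest when \(a=1/4\), so the conservative sample size is large enough whatever the true p is.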

Example 2-4: Presidential Approval Rating - Sample size

To estimate the next president's final approval rating, how many people should be sampled so that the margin of error is 3%, (a popular choice), with 95% confidence?

  1. Use an educated guess: Bush's final approval rating, \(\hat{p}\) = 0.22.

    Since N is very large compared to n, the finite population correction is not needed.

    \(n=\dfrac{\hat{p}\cdot(1-\hat{p})\cdot z^2_{\alpha/2}}{d^2}=\dfrac{0.22\cdot 0.78\cdot 1.96^2}{0.03^2}=732.47\)

    Round up to 733.

  2. Use the conservative approach.

    \(n=\dfrac{0.5\cdot0.5\cdot1.96^2}{0.03^2}=1067.11\)

    Round up to 1068.
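Both sample sizes can be reproduced with a short Python sketch. The function name `sample_size_proportion` is hypothetical; passing `N=None` is this sketch's way of ignoring the finite population correction, and z = 1.96 matches the calculation in the example.

```python
import math


def sample_size_proportion(p, d, z=1.96, N=None):
    """Sample size for estimating a proportion to within d.

    With N given, uses n = N*p*(1-p) / ((N-1)*d^2/z^2 + p*(1-p));
    with N=None, uses the approximation n = z^2 * p*(1-p) / d^2.
    """
    pq = p * (1 - p)
    if N is None:
        return z**2 * pq / d**2
    return N * pq / ((N - 1) * d**2 / z**2 + pq)


n_guess = math.ceil(sample_size_proportion(p=0.22, d=0.03))       # educated guess
n_conservative = math.ceil(sample_size_proportion(p=0.5, d=0.03))  # worst case p = 1/2
print(n_guess, n_conservative)
```

When a finite N is supplied, the required sample size is smaller than the large-population value, which is why the approximation is safe to use when N is unknown but large.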

Try it!

How do we choose between the educated guess and the conservative approach?

One should weigh the cost of sampling extra units against the cost of setting up the sampling process a second time. With an educated guess, the sample may turn out to be too small, in which case the sampling procedure has to be set up again to collect more units. If that set-up cost is high compared to the cost of sampling extra units, then one will prefer the conservative approach.

  1. Find the proportion of CD players in this shipment that have a lifetime longer than 2000 hours. The proportion from the last shipment was 0.9. It is not costly to set up the testing procedure again if needed, whereas the sampling cost of each unit is expensive. We want to estimate the proportion to within 0.01 with 95% confidence.  Would you use the educated guess or the conservative approach?

    We should use an educated guess because it is not costly to set up the testing procedure again. On the other hand, the cost of sampling extra units is high due to the nature of the test.

  2. Get a ship out to the Bering Sea to sample the proportion of fish that have a mercury level within a specified level. Last year the proportion was 0.9. We want to estimate the proportion to within 0.01 with 95% confidence.  Would you use the educated guess or the conservative approach?

    We should use a conservative approach because it is too expensive to send a ship out again if needed.
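To get a feel for what is at stake in these two scenarios, the sketch below compares the two sample sizes for p = 0.9, d = 0.01, and 95% confidence. The population sizes are not given, so this calculation assumes the finite population correction can be ignored; the resulting numbers are an illustration and do not appear in the lesson.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # z_{alpha/2} for 95% confidence
d = 0.01                         # required margin of error


def n_required(p):
    """n = z^2 * p * (1 - p) / d^2, rounded up (no finite population correction)."""
    return math.ceil(z**2 * p * (1 - p) / d**2)


n_guess = n_required(0.9)         # educated guess from last shipment / last year
n_conservative = n_required(0.5)  # worst case
print(n_guess, n_conservative, round(n_conservative / n_guess, 2))
```

Under these assumptions the conservative approach needs nearly three times as many units. That is a heavy price when each unit is expensive to test, but a modest one compared with sending a ship out a second time.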

Exact intervals for population proportions

Since the \(Y_i\) are defined as 1 or 0 depending on whether the unit has the attribute or not, and the sampling is without replacement, one can see that, to be exact, \(\sum Y_i\) has a hypergeometric distribution.

Using this property, one can obtain an exact confidence interval for p. When the total number of successes and the total number of failures are both large (larger than 5), we can use the t-interval (or the z-interval if n > 50).
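One common way to build such an exact interval is to keep every possible number of successes M in the population that is not rejected by either one-sided hypergeometric test at level \(\alpha/2\). The sketch below implements that construction; it is an illustration under that assumption, not necessarily the method the lesson has in mind, and the function names are made up.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(X = k): k successes in n draws without replacement, M successes among N units."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def exact_proportion_interval(k, N, n, alpha=0.05):
    """Exact interval for p = M/N given k successes in a sample of size n.

    Keeps every M for which both tail probabilities exceed alpha/2.
    """
    plausible = []
    for M in range(N + 1):
        lower_tail = sum(hypergeom_pmf(j, N, M, n) for j in range(0, k + 1))  # P(X <= k)
        upper_tail = sum(hypergeom_pmf(j, N, M, n) for j in range(k, n + 1))  # P(X >= k)
        if lower_tail > alpha / 2 and upper_tail > alpha / 2:
            plausible.append(M)
    return min(plausible) / N, max(plausible) / N


# Hypothetical data: 5 successes in a sample of 20 from a population of 100.
low, high = exact_proportion_interval(k=5, N=100, n=20)
print(low, high)
```

Because M can only take the values 0, 1, ..., N, the endpoints are always multiples of 1/N.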

Sample size for estimating several proportions simultaneously

It is good to know that there is a solution for the following scenario:

There are a few (maybe unknown) classes, and one wants to collect a sample large enough that the proportion in each class can be estimated to within a prescribed precision. (The details are not needed here. If interested, read Ch. 5.4 and the reference cited there.)
