2.1 - Sample Size for Estimating Population Mean and Total

How large must a sample be in order to estimate the population mean?

If \(\hat{\theta}\)  is an unbiased, normally distributed estimator of \(\theta\), then

\(\dfrac{\hat{\theta}-\theta}{\sqrt{Var(\hat{\theta})}} \sim N(0,1)\)

Then \(P\left(\dfrac{|\hat{\theta}-\theta|}{\sqrt{Var(\hat{\theta})}} > z_{\alpha/2}  \right)=\alpha \)

\( P\left(|\hat{\theta}-\theta|>z_{\alpha/2} \cdot \sqrt{Var(\hat{\theta})} \right)= \alpha \)

Once we specify \(\alpha\), we can look for a sample size that is large enough to achieve the goal of the experiment.

So, we need to ask, "What is the goal of your experiment?" This is perhaps the most important question asked as a part of your experiment.

Example: Suppose we are interested in estimating the average weight of Penn State male students. How many students should we plan to sample? What do we need to consider?

  1. The variability of the data, and of the measure that you are estimating, is your first concern. This directly affects how many samples you will need.
  2. The second thing that you need to think about is the confidence level of the conclusion that you would like to report. That is, you need to specify the \(1 - \alpha\) value that you are happy with.
  3. How precise do you want this estimate to be? You need to specify the margin of error.

Now, if we specify \(1-\alpha\) and the margin of error d (which can also be viewed as the half-width of the \((1 - \alpha)\)100% CI), we can solve for the sample size such that the CI has the specified margin of error.

For estimating the population mean, the equation becomes:

\(P\left(|\bar{x}-\mu|>z_{\alpha/2} \cdot \sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}\right)=\alpha\)

\(z_{\alpha/2}\sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}=d\)

\(n=\dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)
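
The algebra between the last two lines can be filled in as follows. This is a short derivation sketch that uses only the symbols already defined above. Squaring both sides of the margin-of-error equation gives

\(z^2_{\alpha/2}\cdot \dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}=d^2\)

Since \(\dfrac{N-n}{Nn}=\dfrac{1}{n}-\dfrac{1}{N}\), dividing both sides by \(z^2_{\alpha/2}\cdot \sigma^2\) yields

\(\dfrac{1}{n}=\dfrac{d^2}{z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}\)

and taking reciprocals gives the formula for n shown above.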

Can we now use this formula to estimate the sample size?

What is the weak point of this formula? It requires the population variance \(\sigma^2\), which we do not know! In practice, \(\sigma^2\) has to be replaced by an estimate or a reasonable guess.

Similarly, for estimating the population total \(\tau\), here is the formula:

\(P\left(|\hat{\tau}-\tau|>z_{\alpha/2} \cdot \sqrt{N(N-n)\dfrac{\sigma^2}{n}} \right)=\alpha\)

\(z_{\alpha/2}\sqrt{N(N-n)\dfrac{\sigma^2}{n}}=d\)

\(n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)
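
The two formulas can be wrapped in small helper functions. The sketch below is one possible way to do this in Python. The function names (`z_value`, `sample_size_mean`, `sample_size_total`) are made up for illustration, and the code assumes that \(\sigma^2\) (or a stand-in for it) is supplied by the user.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Return z_{alpha/2} for a two-sided interval with the given confidence."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def sample_size_mean(d, sigma2, N, confidence=0.95):
    """Sample size for estimating the population mean to within d."""
    z = z_value(confidence)
    n = 1.0 / (d**2 / (z**2 * sigma2) + 1.0 / N)
    return math.ceil(n)  # always round up


def sample_size_total(d, sigma2, N, confidence=0.95):
    """Sample size for estimating the population total to within d."""
    z = z_value(confidence)
    n = 1.0 / (d**2 / (N**2 * z**2 * sigma2) + 1.0 / N)
    return math.ceil(n)  # always round up
```

Because \(\tau = N\mu\), estimating the total to within d is the same problem as estimating the mean to within d/N, so the two functions agree when called with those matching margins of error.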

Example 2-1: Beetles - Sample Size

What sample size is needed to estimate the population total, \(\tau\), to within d = 1000 with a 95% CI?

Now, let's begin plugging what we know into the formula. We know N = 100, \(\alpha\) = 0.05. Do we know \(\sigma^2\)? No, but we can estimate \(\sigma^2\) by \(s^2\) = 1932.657.

How many should we sample? Let's calculate this out:

\(n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)

\(n=\dfrac{1}{\dfrac{ (1000)^2}{ (100)^2 \cdot (1.96)^2 \cdot 1932.657}+\dfrac{1}{100}}=42.610\)

We always round this up, so we will sample 43 of the 100 plots.
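
As a quick check of the arithmetic, the calculation can be reproduced in a few lines of Python. This is only a sketch: the variable names are arbitrary, and z = 1.96 is the rounded value used in the text.

```python
import math

N = 100        # number of plots in the population
d = 1000       # desired margin of error for the total
z = 1.96       # z_{alpha/2} for 95% confidence, rounded as in the text
s2 = 1932.657  # sample variance used in place of sigma^2

# Sample size formula for the population total
n_raw = 1.0 / (d**2 / (N**2 * z**2 * s2) + 1.0 / N)
n = math.ceil(n_raw)  # round up to the next whole plot

print(n_raw, n)
```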

What is the major point that was just illustrated in the previous example?

In this first example, N = 100 is not very large compared to n, so one should not ignore the finite population adjustment!
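
To see how much the finite population adjustment matters here, one can compare the sample size with and without the 1/N term. The sketch below uses the numbers from Example 2-1. The names `n0` and `n_fpc` are illustrative only.

```python
import math

N, d, z, s2 = 100, 1000, 1.96, 1932.657  # values from Example 2-1

# Sample size if the 1/N term is dropped (population treated as infinite)
n0 = N**2 * z**2 * s2 / d**2

# Sample size with the finite population adjustment (same formula as above)
n_fpc = 1.0 / (1.0 / n0 + 1.0 / N)

print(math.ceil(n0), math.ceil(n_fpc))
```

Ignoring the adjustment would call for 75 of the 100 plots rather than 43, which is a substantial amount of extra sampling effort.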

But wait a minute, should you have any cause for concern about the answer, n = 43, that we obtained?

What about the value that we used for \(\sigma^2\), (1932.657)?

Because \(s^2\) is only an estimate of \(\sigma^2\), a more careful approach replaces \(z_{\alpha/2}\) with a t value and iterates:

\(n=\dfrac{1}{\dfrac{d^2}{N^2\cdot t^2 \cdot s^2}+\dfrac{1}{N}}\)

Complication: the t value depends on n. First we use n = 43; the t value for df = 42 is 2.0181.

\(n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0181)^2 \cdot 1932.657}+\dfrac{1}{100}}=44.044\)

Round up to 45. The t value for 44 df is 2.0154.

\(n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0154)^2 \cdot 1932.657}+\dfrac{1}{100}}=43.978\)

Here, we get n = 44 after rounding up. The two passes give 45 and 44, so the conservative choice is n = 45. Consequently, our final answer is to sample 45 plots.
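
The two passes above can be reproduced with the short sketch below. To stay with the numbers in the text, the t values are entered by hand rather than computed, and the helper name `n_with_t` is hypothetical.

```python
import math

N, d, s2 = 100, 1000, 1932.657  # values from Example 2-1


def n_with_t(t):
    """Sample size formula for the total, with t in place of z."""
    return 1.0 / (d**2 / (N**2 * t**2 * s2) + 1.0 / N)


# t values quoted in the text
pass1 = math.ceil(n_with_t(2.0181))  # start from n = 43, df = 42
pass2 = math.ceil(n_with_t(2.0154))  # continue from n = 45, df = 44

# Take the larger of the two candidates as the conservative choice
n_final = max(pass1, pass2)
print(pass1, pass2, n_final)
```

A fully automated version would look up the t quantile for each new df (for example with a statistics library) and repeat until the sample size stops changing.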

In the beetle example, there are data to estimate \(\sigma^2\). What can one do if there is no pilot data? How can we get a rough idea of what \(\sigma\) is?

Example 2-2: Average Weight Gain of Pigs

A farm has 1000 young pigs with an initial weight of about 50 lbs. They put them on a new diet for 3 weeks and want to know how many pigs to sample so that they can estimate the average weight gain. They want the answer to be within 2 lbs. with 90% confidence.

There is no pilot data here. We don't have the time to select some pigs in order to get an estimate for \(\sigma\), the standard deviation of the weight gain.

Question: How do we get a rough estimate of \(\sigma\)?

What would be a reasonable measure to give this farmer some guidance on how to estimate the standard deviation of the weight gain?

One thing we can do is rely on the information that we already have, i.e., find some historical data that exists on this topic. But what if this historical data does not exist?

Could we find a rough estimate for \(\sigma\)?

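For certain variables, we can make a reasonable guess for \(\sigma\). For roughly normal data, about 95% of the values fall within two standard deviations of the mean, so the range covers approximately \(4\sigma\). This gives the rough estimate: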

\(\sigma \approx \frac{Range}{4}\)

It is relatively easy to form some idea of the range. This is an important point. Even though perhaps none of us has raised pigs, we can still come up with a sensible guess. For this case, we guess that the weight gain within this 3-week period ranges from a minimum of 10 lbs to a maximum of 50 lbs.

\(\sigma\) can now be roughly estimated to be:

\(\dfrac{Range}{4}=\dfrac{50-10}{4}=10 \text{ lbs}\)

Now we can use the formula for estimating the mean, \(\mu\). Then,

\(\begin{aligned}
n & = \dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}} \\
 & = \dfrac{1}{\dfrac{ 2^2}{ (1.645)^2 \cdot (10)^2}+\dfrac{1}{1000}} \\
 & = 63.36
\end{aligned}\)

Round up to 64.


We will need to sample 64 pigs in order to estimate the average weight gain in 3 weeks to within 2 lbs. with a 90% confidence interval.
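
The whole calculation for this example, including the range-based guess for \(\sigma\), can be sketched in Python as follows. The variable names are arbitrary, and the minimum and maximum weight gains are the guessed values from the text, not measured data.

```python
import math
from statistics import NormalDist

N = 1000                # pigs on the farm
d = 2.0                 # desired margin of error, in lbs
confidence = 0.90       # desired confidence level
low, high = 10.0, 50.0  # guessed minimum and maximum weight gain, in lbs

# Rough estimate of sigma from the guessed range
sigma = (high - low) / 4.0

# z_{alpha/2} for a two-sided 90% interval (about 1.645)
z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

# Sample size formula for the population mean
n_raw = 1.0 / (d**2 / (z**2 * sigma**2) + 1.0 / N)
n = math.ceil(n_raw)  # round up to the next whole pig

print(sigma, n)
```

Since the result depends on the guessed range, it can be worth rerunning the calculation with a wider or narrower range to see how sensitive the required sample size is to that guess.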