How large a sample is large enough for estimating the population mean?

If \(\hat{\theta}\) is an unbiased, normally distributed estimator of \(\theta\), then

\(\dfrac{\hat{\theta}-\theta}{\sqrt{Var(\hat{\theta})}} \sim N(0,1)\)

Then \(P\left(\dfrac{|\hat{\theta}-\theta|}{\sqrt{Var(\hat{\theta})}} > z_{\alpha/2} \right)=\alpha \)

\( P\left(|\hat{\theta}-\theta|>z_{\alpha/2} \cdot \sqrt{Var(\hat{\theta})} \right)= \alpha \)

**Note!** Because we know that \(\hat{\theta}\) is normal, we can use the *z* distribution.

If we specify this \(\alpha\), we can then find the sample size that is large enough to achieve the goal of the experiment.
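Specifying \(\alpha\) fixes the critical value \(z_{\alpha/2}\) that appears in every formula on this page. A minimal sketch of looking it up with Python's standard library follows; the helper name `z_critical` is an illustrative choice, not something defined in this lesson.

```python
from statistics import NormalDist

def z_critical(alpha: float) -> float:
    """Upper alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# Common choices of alpha (90%, 95%, and 99% confidence).
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  z_(alpha/2) = {z_critical(alpha):.3f}")
```

For \(\alpha\) = 0.05 and \(\alpha\) = 0.10 this should agree with the values 1.96 and 1.645 used in the examples below.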

So, we need to ask, "What is the goal of your experiment?" This is perhaps the most important question asked as a part of your experiment.

**Example**: What if we were interested in estimating the average weight of Penn State male students? How many samples should we plan on taking? We want to estimate this mean. What do we need to consider?

- The variability of the data and the measure that you are estimating is your first concern. This directly affects how many samples you will need.
- The second thing that you need to think about is the type of conclusion that you would like to report. That is, you need to specify the \(1 - \alpha\) value that you are happy with.
- How precise do you want this estimate to be? You need to specify the margin of error.

Now, if we specify \(1-\alpha\) and the margin of error *d* (which can also be viewed as the half-width of the \(100(1 - \alpha)\%\) CI), we can solve for the sample size such that the CI has the specified margin of error.

For **estimating the population mean**, the equation becomes:

\(P\left(|\bar{x}-\mu|>z_{\alpha/2} \cdot \sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}\right)=\alpha\)

\(z_{\alpha/2}\sqrt{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}}=d\)

\(n=\dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)

Can we now use this formula to estimate the sample size?

What is the weak point of this formula? It requires the population variance \(\sigma^2\), which we do not know, so we must plug in an estimate.
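To see how much the answer depends on the value used for \(\sigma\), the formula for the mean can be sketched as a small function. The function name and the planning values (`N = 5000`, `d = 1.0`, and the three candidate values of `sigma`) are hypothetical and chosen only for illustration.

```python
import math

def sample_size_mean(d, sigma, N, z):
    """Unrounded n from n = 1 / (d^2 / (z^2 * sigma^2) + 1/N)."""
    return 1.0 / (d**2 / (z**2 * sigma**2) + 1.0 / N)

# Hypothetical planning values, not taken from the lesson.
N, d, z = 5000, 1.0, 1.96
for sigma in (5.0, 10.0, 20.0):
    n = sample_size_mean(d, sigma, N, z)
    # Always round up so the margin of error is not exceeded.
    print(f"sigma = {sigma:5.1f}  required n = {math.ceil(n)}")
```

The required sample size grows quickly with \(\sigma\), which is why a poor guess for the variance is the main risk when using this formula.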

Similarly, for **estimating the population total** \(\tau\), here is the formula:

\(P\left(|\hat{\tau}-\tau|>z_{\alpha/2} \cdot \sqrt{N(N-n)\dfrac{\sigma^2}{n}} \right)=\alpha\)

\(z_{\alpha/2}\sqrt{N(N-n)\dfrac{\sigma^2}{n}}=d\)

\(n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)
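The formula for the total is the formula for the mean with the margin of error rescaled from *d* to *d*/*N*, since \(\hat{\tau} = N\bar{x}\). The sketch below checks this numerically; both function names are illustrative, and the inputs are hypothetical planning values.

```python
import math

def sample_size_mean(d, sigma, N, z):
    """Unrounded n for estimating the mean to within d."""
    return 1.0 / (d**2 / (z**2 * sigma**2) + 1.0 / N)

def sample_size_total(d, sigma, N, z):
    """Unrounded n for estimating the total to within d."""
    return 1.0 / (d**2 / (N**2 * z**2 * sigma**2) + 1.0 / N)

# Hypothetical inputs: a margin of 500 on the total of N = 200 units
# corresponds to a margin of 500 / 200 = 2.5 on the mean.
N, d, sigma, z = 200, 500.0, 12.0, 1.96
n_total = sample_size_total(d, sigma, N, z)
n_mean = sample_size_mean(d / N, sigma, N, z)
print(math.isclose(n_total, n_mean))
```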

## Example 2-1: Beetles - Sample size

What sample size is needed to estimate the population total, \(\tau\), to within *d* = 1000 with a 95% CI?

Now, let's begin plugging what we know into the formula. We know *N* = 100, \(\alpha\) = 0.05. Do we know \(\sigma^2\)? No, but we can estimate \(\sigma^2\) by \(s^2\) = 1932.657.

How many should we sample? Let's calculate this out:

\(n=\dfrac{1}{\dfrac{ d^2}{ N^2 \cdot z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}}\)

\(n=\dfrac{1}{\dfrac{ (1000)^2}{ (100)^2 \cdot (1.96)^2 \cdot 1932.657}+\dfrac{1}{100}}=42.610\)

We always round this up; therefore, we will sample 43 of the 100 plots.

**Note!** If we ignore the finite population correction adjustment then,

\begin{align}
n & = \dfrac{N^2 \cdot z^2_{\alpha/2} \cdot \sigma^2}{d^2} \\
& = \dfrac{(100)^2 \cdot (1.96)^2 \cdot 1932.657}{(1000)^2} \\
& = 74.245
\end{align}

which rounds up to 75. This value is much larger than 43.
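The two beetle calculations can be checked with a few lines of arithmetic. This is only a sketch using the values quoted in this example; the variable names are illustrative. It also uses the fact that the corrected formula can be written as \(n = 1/(1/n_0 + 1/N)\), where \(n_0\) is the sample size without the finite population correction.

```python
import math

N = 100        # number of plots in the population
d = 1000.0     # margin of error for the total
z = 1.96       # z_(alpha/2) for a 95% CI
s2 = 1932.657  # sample variance, standing in for sigma^2

n0 = N**2 * z**2 * s2 / d**2     # ignoring the finite population correction
n = 1.0 / (1.0 / n0 + 1.0 / N)   # with the finite population correction

print(f"without fpc: {n0:.3f}, rounded up to {math.ceil(n0)}")
print(f"with fpc:    {n:.3f}, rounded up to {math.ceil(n)}")
```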

### Try it!

In this first example, *N* = 100 is not very large compared to *n*, so one should not ignore the finite population adjustment!

But wait a minute, should you have any cause for concern about the answer, *n* = 43, that we obtained?

What about the value that we used for \(\sigma^2\), (1932.657)?

**Note!** \(\sigma^2\) is not 1932.657; that value is the sample variance \(s^2\). Using *z* in the formula may be too aggressive, so people sometimes use *t* iteratively.

Let's take a look at this iterative method:

\(n=\dfrac{1}{\dfrac{d^2}{N^2\cdot t^2 \cdot s^2}+\dfrac{1}{N}}\)

Complication: *t* values depend on *n*. First we use *n* = 43, for which *df* = 42 and *t* = 2.0181.

\(n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0181)^2 \cdot 1932.657}+\dfrac{1}{100}}=44.044\)

Round up to 45. The *t* value for *df* = 44 is 2.0154.

\(n=\dfrac{1}{\dfrac{(1000)^2}{(100)^2\cdot (2.0154)^2 \cdot 1932.657}+\dfrac{1}{100}}=43.978\)

Here, we get *n* = 44. The two iterations give 45 and 44, so the conservative answer is to take the larger value, *n* = 45.

Consequently, our final answer is to take 45 samples.
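The two iterations above can be traced in code. Python's standard library has no *t* quantile function, so this sketch simply looks up the two *t* values quoted in the text; the names `T_TABLE` and `sample_size_total_t` are illustrative, and a statistics package would be needed to continue the iteration for other degrees of freedom.

```python
import math

def sample_size_total_t(d, s2, N, t):
    """Unrounded n from n = 1 / (d^2 / (N^2 * t^2 * s^2) + 1/N)."""
    return 1.0 / (d**2 / (N**2 * t**2 * s2) + 1.0 / N)

# t values quoted in the text, keyed by degrees of freedom (n - 1).
T_TABLE = {42: 2.0181, 44: 2.0154}

N, d, s2 = 100, 1000.0, 1932.657
n = 43            # starting value from the z-based calculation
candidates = []
for _ in range(2):  # only two steps: the table holds just two t values
    t = T_TABLE[n - 1]
    n_raw = sample_size_total_t(d, s2, N, t)
    n = math.ceil(n_raw)
    candidates.append(n)
    print(f"t = {t}: n = {n_raw:.3f}, rounded up to {n}")

# Take the larger candidate as the conservative choice.
print("conservative choice:", max(candidates))
```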

In the beetle example, there are data to estimate \(\sigma^2\). What can one do if there is no pilot data? How can we get a rough idea of what \(\sigma\) is?

## Example 2-2: Average Weight Gain of Pigs

A farm has 1000 young pigs with an initial weight of about 50 lbs. They put them on a new diet for 3 weeks and want to know how many pigs to sample so that they can estimate the average weight gain. They want the answer to be within 2 lbs. with 90% confidence.

There is no pilot data here. We don't have the time to select some pigs in order to get an estimate for \(\sigma\), the standard deviation of the weight gain.

**Question**: How do we get a rough estimate of \(\sigma\)?

What reasonable measure could give this farmer some guidance on how to estimate the standard deviation of the weight gain?

One thing we can do is rely on the information that we already have, i.e., find some historical data that exists on this topic. But what if this historical data does not exist?

### Try it!

For certain variables, we can make reasonable guesses for an estimate of \(\sigma\). Here is a formula for this rough estimate:

\(\sigma \approx \frac{Range}{4}\)

It is relatively easy to form some idea of the range. This is an important point. Even though perhaps none of us has raised pigs, we can still come up with a sensible guess. For this case, we guess that the weight gain within this 3-week period ranges from a minimum of 10 lbs to a maximum of 50 lbs.

\(\sigma\) can now be roughly estimated to be:

\(\dfrac{Range}{4}=\dfrac{50-10}{4}=10 \text{ lbs}\)

Now we can use the formula for estimating the mean, \(\mu\). Then,

\begin{align}
n & = \dfrac{1}{\dfrac{ d^2}{ z^2_{\alpha/2}\cdot \sigma^2}+\dfrac{1}{N}} \\
& = \dfrac{1}{\dfrac{ 2^2}{ (1.645)^2 \cdot (10)^2}+\dfrac{1}{1000}} \\
& = 63.36
\end{align}

Round up to 64.

#### Answer

We will need to sample 64 pigs in order to estimate the average weight gain in 3 weeks to within 2 lbs. with a 90% confidence interval.
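Because \(\sigma\) here comes from a guessed range, it can help to see how the answer moves when the guess changes. The sketch below reproduces the calculation for the 10 to 50 lbs guess and then tries two other range widths; the function name and the alternative widths (30 and 50 lbs) are hypothetical, added only for illustration.

```python
import math

N = 1000    # pigs on the farm
d = 2.0     # margin of error in lbs
z = 1.645   # z_(alpha/2) for 90% confidence, as used above

def pig_sample_size(range_width):
    """Unrounded n for the mean, with sigma guessed as range / 4."""
    sigma = range_width / 4.0
    return 1.0 / (d**2 / (z**2 * sigma**2) + 1.0 / N)

# The guess used in this example: weight gain between 10 and 50 lbs.
n = pig_sample_size(50 - 10)
print(f"range 40 lbs: n = {n:.2f}, rounded up to {math.ceil(n)}")

# Hypothetical alternative guesses, to show the sensitivity to the range.
for width in (30, 50):
    alt = pig_sample_size(width)
    print(f"range {width} lbs: n = {alt:.2f}, rounded up to {math.ceil(alt)}")
```

If the plausible range is uncertain, planning with the wider guess is the more conservative choice.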