Lesson 6: Sample Size
Overview
So far, in this section, we have focused on using a random sample of size \(n\) to find an interval estimate for a variety of population parameters, including a mean \(\mu\), a proportion \(p\), and a standard deviation \(\sigma\). In none of our discussions did we talk about how large a sample should be in order to ensure that the interval estimate we obtain is narrow enough to be worthwhile. That's what we'll do in this lesson!
Objectives
- derive a formula for the sample size, \(n\), necessary for estimating the population mean \(\mu\)
- derive a formula for the sample size, \(n\), necessary for estimating a proportion \(p\) for a large population
- derive a formula for the sample size, \(n\), necessary for estimating a proportion \(p\) for a small, finite population
The methods that we use here in deriving the formulas could be easily applied to the estimation of other population parameters as well.
6.1 - Estimating a Mean
Example 6-1
A researcher wants to estimate \(\mu\), the mean systolic blood pressure of adult Americans, with 95% confidence and error \(\epsilon\) no larger than 3 mm Hg. How many adult Americans, \(n\), should the researcher randomly sample to achieve her estimation goal?
Answer
The researcher's goal is to estimate \(\mu\) so that the error is no larger than 3 mm Hg. (By the way, \(\epsilon\) is typically called the maximum error of the estimate.) That is, her goal is to calculate a 95% confidence interval such that:
\(\bar{x}\pm \epsilon=\bar{x}\pm 3\)
Now, we know the formula for a \((1-\alpha)100\%\) confidence interval for a population mean \(\mu\) is:
\(\bar{x}\pm t_{\alpha/2,n-1}\left(\dfrac{s}{\sqrt{n}}\right)\)
So, it seems that a reasonable way to proceed would be to equate the terms appearing after each of the above \(\pm\) signs, and solve for \(n\). That is, equate:
\(\epsilon=t_{\alpha/2,n-1}\left(\dfrac{s}{\sqrt{n}}\right)\)
and solve for \(n\). Multiplying through by the square root of \(n\), we get:
\(\epsilon \sqrt{n}=t_{\alpha/2,n-1}(s)\)
And, dividing through by \(\epsilon\) and squaring both sides, we get:
\(n=\dfrac{(t_{\alpha/2,n-1})^2 s^2}{\epsilon^2}\)
Now, what's wrong with the formula we derived? Well... the \(t\)-value on the right side of the equation depends on \(n\).
That's not particularly helpful given that we are trying to find \(n\)! We can solve that problem by simply replacing the \(t\)-value that depends on \(n\) with a \(Z\)-value that doesn't. After all, you might recall that as \(n\) increases, the \(t\)-distribution approaches the standard normal distribution. Doing so, we get:
\(n \approx \dfrac{(z^2_{\alpha/2})s^2}{\epsilon^2}\)
Before we make the calculation for our particular example, let's take a step back and summarize what we have just learned.
- Estimating a population mean \(\mu\)
-
The sample size necessary for estimating a population mean \(\mu\) with \((1-\alpha)100\%\) confidence and error no larger than \(\epsilon\) is:
\(n = \dfrac{(z^2_{\alpha/2})s^2}{\epsilon^2}\)
Typically, the hardest part of determining the necessary sample size is finding \(s^2\), that is, a decent estimate of the population variance. There are a few ways of obtaining \(s^2\).
Ways to Determine \(s^2\)
-
You can often get \(s^2\), an estimate of the population variance, from the scientific literature. After all, scientific research is typically not done in a vacuum. That is, what one researcher is studying and reporting in scientific journals is typically also studied and reported by several other researchers in various locations around the world. If you're in need of an estimate of the variance of the front leg length of red-eyed tree frogs, you'll probably be able to find it in a research paper reported in some scientific journal.
-
You can often get \(s^2\), an estimate of the population variance, by conducting a small pilot study on 5-10 people (or trees or snakes or... whatever you're measuring).
-
You can often get \(s^2\), an estimate of the population variance, by using what we know about the Empirical Rule, which states that we can expect 95% of the observations to fall in the interval:
\(\bar{x}\pm 2s\)
Here's a picture that illustrates how this part of the Empirical Rule can help us determine a reasonable value of \(s\):
That is, we could define the range of values as that which captures 95% of the measurements. If we do that, then we can work backwards to see that \(s\) can be determined by dividing the range by 4. That is:
\(s=\dfrac{Range}{4}=\dfrac{Max-Min}{4}\)
When statisticians use the Empirical Rule to help a researcher arrive at a reasonable value of \(s\), they almost always use the above formula. That said, there may be occasion in which it is worthwhile using another part of the Empirical Rule, namely that we can expect 99.7% of the observations to fall in the interval:
\(\bar{x}\pm 3s\)
Here's a picture that illustrates how this part of the Empirical Rule can help us determine a reasonable value of \(s\):
In this case, we could define the range of values as that which captures 99.7% of the measurements. If we do that, then we can work backwards to see that \(s\) can be determined by dividing the range by 6. That is:
\(s=\dfrac{Range}{6}=\dfrac{Max-Min}{6}\)
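As a minimal sketch of the two rules of thumb above, the following Python function divides a plausible range of measurements by 4 or by 6. The function name `sd_from_range` and the minimum and maximum of 90 and 130 mm Hg are hypothetical, chosen only for illustration; they do not come from the lesson.

```python
def sd_from_range(minimum, maximum, divisor=4):
    """Rough standard deviation from a plausible range of measurements.

    divisor=4 treats the range as covering about 95% of values (mean +/- 2s);
    divisor=6 treats it as covering about 99.7% of values (mean +/- 3s).
    """
    if divisor not in (4, 6):
        raise ValueError("divisor should be 4 or 6")
    return (maximum - minimum) / divisor


# Hypothetical range, for illustration only.
s_95 = sd_from_range(90, 130)               # range / 4
s_997 = sd_from_range(90, 130, divisor=6)   # range / 6
print(s_95, round(s_997, 2))
```

Dividing by 6 instead of 4 gives a smaller \(s\), and therefore a smaller necessary sample size, so the choice of rule matters.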
Example 6-1 (Continued)
A researcher wants to estimate \(\mu\), the mean systolic blood pressure of adult Americans, with 95% confidence and error \(\epsilon\) no larger than 3 mm Hg. How many adult Americans, \(n\), should the researcher randomly sample to achieve her estimation goal?
Answer
If the maximum error \(\epsilon\) is 3, and the sample variance is \(s^2=10^2\), we need:
\(n=\dfrac{(1.96)^2(10)^2}{3^2}=42.7\)
or 43 people to estimate \(\mu\) with 95% confidence. In general, when making sample size calculations such as this one, it is a good idea to change all of the factors to see what the "cost" in sample size is for achieving certain errors \(\epsilon\) and confidence levels \((1-\alpha)\). Doing that here, we get:
\(s^2 = 10^2\) | \( \epsilon \)= 1 | \( \epsilon \)= 3 | \( \epsilon \)= 5 |
---|---|---|---|
90% \((z_{0.05} = 1.645)\) | 271 | 31 | 11 |
95% \((z_{0.025} = 1.96)\) | 385 | 43 | 16 |
99% \((z_{0.005} = 2.576)\) | 664 | 74 | 27 |
We can also change the estimate of the variance. For example, if we change the sample variance to \(s^2=8^2\), then the necessary sample sizes for various errors \(\epsilon\) and confidence levels \((1-\alpha)\) become:
\(s^2 = 8^2\) | \( \epsilon \)= 1 | \( \epsilon \)= 3 | \( \epsilon \)= 5 |
---|---|---|---|
90% \((z_{0.05} = 1.645)\) | 174 | 20 | 7 |
95% \((z_{0.025} = 1.96)\) | 246 | 28 | 10 |
99% \((z_{0.005} = 2.576)\) | 425 | 48 | 17 |
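The two tables above can be reproduced with a few lines of Python. This is only a sketch: the names `Z`, `sample_size_mean`, and `mean_table` are made up for illustration, and the \(z\)-values are the rounded ones shown in the table rows.

```python
import math

# z-values as rounded in the tables above.
Z = {"90%": 1.645, "95%": 1.96, "99%": 2.576}


def sample_size_mean(z, s, eps):
    """Smallest whole n that is at least z^2 * s^2 / eps^2."""
    raw = (z ** 2) * (s ** 2) / (eps ** 2)
    # Round away floating-point noise before rounding up to a whole subject.
    return math.ceil(round(raw, 6))


def mean_table(s, errors=(1, 3, 5)):
    """Return {confidence label: [n for each error]} for a given s."""
    return {
        label: [sample_size_mean(z, s, eps) for eps in errors]
        for label, z in Z.items()
    }


for s in (10, 8):
    print(f"s = {s}")
    for label, row in mean_table(s).items():
        print(f"  {label}: {row}")
```

The result is always rounded up rather than to the nearest integer, because rounding down would leave the error slightly larger than \(\epsilon\).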
Factors Affecting the Sample Size
If we take a look back at the formula for the sample size:
\(n =\dfrac{(z^2_{\alpha/2})s^2}{\epsilon^2}\)
we can make some generalizations about how each of three factors, namely the standard deviation \(s\), the confidence level \((1-\alpha)100\%\), and the error \(\epsilon\), affects the necessary sample size.
-
As the error \(\epsilon\) decreases, the necessary sample size \(n\) increases. That's because the error \(\epsilon\) appears in the denominator of the formula. You can see an example of this generalization in the numbers generated in the last example: with \(s^2=10^2\) and 95% confidence, the necessary sample size grows from 16 to 43 to 385 as \(\epsilon\) shrinks from 5 to 3 to 1.
-
As the confidence level \((1-\alpha)100\%\) increases, the necessary sample size increases. That's because as the confidence level increases, the \(Z\)-value, which appears in the numerator of the formula, increases. Again, you can see an example of this generalization in the numbers generated in the last example: with \(s^2=8^2\) and \(\epsilon=3\), the necessary sample size grows from 20 to 28 to 48 as the confidence level rises from 90% to 95% to 99%.
-
As the sample standard deviation \(s\) increases, the necessary sample size increases. That's because the standard deviation \(s\) appears in the numerator of the formula. Again, you can see an example of this generalization in the numbers generated in the last example: with 95% confidence and \(\epsilon=3\), the necessary sample size grows from 28 when \(s^2=8^2\) to 43 when \(s^2=10^2\).
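To quantify the generalization about the error \(\epsilon\), here is a short worked step that uses only the sample size formula for a mean. Shrinking the error by a factor of \(k\) (the symbol \(k\) is introduced here just for illustration) multiplies the necessary sample size by \(k^2\):

\(\dfrac{z^2_{\alpha/2}s^2}{(\epsilon/k)^2}=k^2\cdot\dfrac{z^2_{\alpha/2}s^2}{\epsilon^2}\)

For instance, with \(s^2=10^2\) and 95% confidence, moving from \(\epsilon=3\) to \(\epsilon=1\) (so \(k=3\)) multiplies the unrounded value 42.7 by 9, giving about 384.2, which rounds up to 385.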
6.2 - Estimating a Proportion for a Large Population
Example 6-2
A pollster wants to estimate \(p\), the true proportion of all Americans favoring the Democratic candidate with 95% confidence and error \(\epsilon\) no larger than 0.03.
How many people should he randomly sample to achieve his goals?
Answer
We'll tackle this problem just as we did for finding the sample size necessary to estimate a population mean. First, note that the pollster's goal is to estimate the population proportion \(p\) so that the error is no larger than 0.03. That is, the goal is to calculate a 95% confidence interval such that:
\(\hat{p}\pm \epsilon=\hat{p}\pm 0.03\)
But, we know the formula for a \((1-\alpha)100\%\) confidence interval for a population proportion is:
\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)
So, just as we did on the previous page, we'll proceed by equating the terms appearing after each of the above \(\pm\) signs, and solve for \(n\). That is, equate:
\(\epsilon=z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)
and solve for \(n\). Multiplying through by the square root of \(n\), we get:
\(\epsilon \sqrt{n}=z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})}\)
And, dividing through by \(\epsilon\) and squaring both sides, we get:
\(n=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)
Again, before we make the calculation for our particular example, let's take a step back and summarize the formula that we have just derived.
- Estimating a population proportion \(p\)
-
The sample size necessary for estimating a population proportion \(p\) of a large population with \((1-\alpha)100\%\) confidence and error no larger than \(\epsilon\) is:
\(n=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)
Just as we needed to have a decent estimate, \(s^2\), of the population variance when calculating the sample size necessary for estimating a population mean \(\mu\), we need to have a good estimate, \(\hat{p}\), of the population proportion when calculating the sample size necessary for estimating a population proportion \(p\). Strange, I know... but there are at least two ways out of this conundrum.
Ways to Determine \(\hat{p}(1-\hat{p})\)
-
You can use your prior knowledge (previous polls, perhaps?) about \(\hat{p}\).
-
You can set \(\hat{p}(1-\hat{p})=\dfrac{1}{4}\), its maximum value, which occurs when \(\hat{p}=\dfrac{1}{2}\).
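To see why \(\dfrac{1}{4}\) is the largest value that \(\hat{p}(1-\hat{p})\) can take, one short way (a sketch by completing the square) is to write:

\(\hat{p}(1-\hat{p})=\dfrac{1}{4}-\left(\hat{p}-\dfrac{1}{2}\right)^2\le \dfrac{1}{4}\)

with equality only when \(\hat{p}=\dfrac{1}{2}\). Using \(\dfrac{1}{4}\) therefore gives the most conservative (largest) sample size for any given confidence level and error \(\epsilon\).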
Example 6-2 (Continued)
A pollster wants to estimate \(p\), the true proportion of all Americans favoring the Democratic candidate with 95% confidence and error \(\epsilon\) no larger than 0.03.
How many people should he randomly sample to achieve his goals?
Answer
If the maximum error \(\epsilon\) is 0.03, and the sample proportion is 0.8, we need to survey:
\(n=\dfrac{(1.96)^2(0.8)(0.2)}{0.03^2}=682.95\)
or 683 people to estimate \(p\) with 95% confidence. Again, when making sample size calculations such as this one, it is a good idea to change all of the factors to see what the "cost" is in sample size for achieving certain errors \(\epsilon\) and confidence levels \((1-\alpha)\). Doing that here, we get:
\( \hat{p} = 0.8\) | \( \epsilon \)= 0.01 | \( \epsilon \)= 0.03 | \( \epsilon \)= 0.05 |
---|---|---|---|
90% \((z_{0.05} = 1.645)\) | 4330 | 482 | 174 |
95% \((z_{0.025} = 1.96)\) | 6147 | 683 | 246 |
99% \((z_{0.005} = 2.576)\) | 10618 | 1180 | 425 |
We, of course, can also change the sample proportion. For example, if we change the sample proportion to 0.5, then we need to survey:
\(n=\dfrac{(1.96)^2(0.5)(0.5)}{0.03^2}=1067.1\)
or 1068 people to estimate \(p\) with 95% confidence. The two calculations in this example illustrate how useful it is to have some idea of the magnitude of the sample proportion. In one case, if the proportion is close to 0.80, then we'd need as few as 680 people. On the other hand, if the proportion is close to 0.50, then we'd need as many as 1070 people. That difference in necessary sample size sure argues for a small pilot study in advance of the larger survey.
By the way, just as we did for the case in which the sample proportion was 0.8, we can change the factors to see what the "cost" is in sample size for achieving certain errors \(\epsilon\) and confidence levels \((1-\alpha)\). Doing that here, we get:
\( \hat{p} = 0.5\) | \( \epsilon \)= 0.01 | \( \epsilon \)= 0.03 | \( \epsilon \)= 0.05 |
---|---|---|---|
90% \((z_{0.05} = 1.645)\) | 6766 | 752 | 271 |
95% \((z_{0.025} = 1.96)\) | 9604 | 1068 | 385 |
99% \((z_{0.005} = 2.576)\) | 16590 | 1844 | 664 |
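The proportion tables above can be recomputed in the same spirit. The sketch below assumes the errors in the table columns are 0.01, 0.03, and 0.05 and uses the rounded \(z\)-values from the table rows; the names `Z`, `sample_size_proportion`, and `proportion_table` are hypothetical.

```python
import math

# z-values as rounded in the tables above.
Z = {"90%": 1.645, "95%": 1.96, "99%": 2.576}


def sample_size_proportion(z, p_hat, eps):
    """Smallest whole n that is at least z^2 * p_hat * (1 - p_hat) / eps^2."""
    raw = (z ** 2) * p_hat * (1 - p_hat) / (eps ** 2)
    # Round away floating-point noise before rounding up to a whole person.
    return math.ceil(round(raw, 6))


def proportion_table(p_hat, errors=(0.01, 0.03, 0.05)):
    """Return {confidence label: [n for each error]} for a given p_hat."""
    return {
        label: [sample_size_proportion(z, p_hat, eps) for eps in errors]
        for label, z in Z.items()
    }


for p_hat in (0.8, 0.5):
    print(f"p_hat = {p_hat}")
    for label, row in proportion_table(p_hat).items():
        print(f"  {label}: {row}")
```

Every entry for \(\hat{p}=0.5\) is at least as large as the matching entry for \(\hat{p}=0.8\), which is the numerical face of choosing \(\hat{p}=0.5\) as the conservative default.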
6.3 - Estimating a Proportion for a Small, Finite Population
The methods of the last page, in which we derived a formula for the sample size necessary for estimating a population proportion \(p\), work just fine when the population in question is very large. When we have smaller, finite populations, however, such as the students in a high school or the residents of a small town, the formula we derived previously requires a slight modification. Let's start, as usual, by taking a look at an example.
Example 6-3
A researcher is studying the population of a small town in India of \(N=2000\) people. She's interested in estimating \(p\) for several yes/no questions on a survey.
How many people \(n\) does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within \(\epsilon=0.04\) of the true proportions \(p\)?
Answer
We can't even begin to address the answer to this question until we derive a confidence interval for a proportion for a small, finite population!
An approximate \((1-\alpha)100\%\) confidence interval for a proportion \(p\) of a small population is:
\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)
Proof
We'll use the example above, where possible, to make the proof more concrete. Suppose we take a random sample, \(X_1, X_2, \ldots, X_n\), without replacement, of size \(n\) from a population of size \(N\). In the case of the example, \(N=2000\). Suppose also, unknown to us, that for a particular survey question there are \(N_1\) respondents who would respond "yes" to the question, and therefore \(N-N_1\) respondents who would respond "no." That is, our small finite population looks like this:
If that's the case, the true proportion (but unknown to us) of yes respondents is:
\(p=P(Yes)=\dfrac{N_1}{N}\)
while the true proportion (but unknown to us) of no respondents is:
\(1-p=P(No)=1-\dfrac{N_1}{N}=\dfrac{N-N_1}{N}\)
Now, let \(X\) denote the number of respondents in the sample who say yes, so that:
\(X=\sum\limits_{i=1}^n X_i\)
where \(X_i=1\) if respondent \(i\) answers yes, and \(X_i=0\) if respondent \(i\) answers no. Then, the proportion in the sample who say yes is:
\(\hat{p}=\dfrac{\sum\limits_{i=1}^n X_i}{n}\)
Then, \(X=\sum\limits_{i=1}^n X_i\) is a hypergeometric random variable with mean:
\(E(X)=n\dfrac{N_1}{N}=np\)
and variance:
\(Var(X)=n\dfrac{N_1}{N}\left(1-\dfrac{N_1}{N}\right)\left(\dfrac{N-n}{N-1}\right)=np(1-p)\left(\dfrac{N-n}{N-1}\right)\)
It follows that \(\hat{p}=X/n\) has mean \(E(\hat{p})=p\) and variance:
\(Var(\hat{p})=\dfrac{p(1-p)}{n}\left(\dfrac{N-n}{N-1}\right)\)
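The step from \(X\) to \(\hat{p}=X/n\) uses only the usual rules for the mean and variance of a constant multiple of a random variable. Spelled out:

\(E(\hat{p})=\dfrac{E(X)}{n}=\dfrac{np}{n}=p\)

\(Var(\hat{p})=\dfrac{Var(X)}{n^2}=\dfrac{np(1-p)}{n^2}\left(\dfrac{N-n}{N-1}\right)=\dfrac{p(1-p)}{n}\left(\dfrac{N-n}{N-1}\right)\)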
Then, the Central Limit Theorem tells us that:
\(\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n} \left(\dfrac{N-n}{N-1}\right) }}\)
follows an approximate standard normal distribution. Now, it's just a matter of doing the typical confidence interval derivation, in which we start with a probability statement, manipulate the quantity inside the parentheses, and substitute sample estimates where necessary. We've done that a number of times now, so skipping all of the details here, we get that an approximate \((1-\alpha)100\%\) confidence interval for \(p\) of a small population is:
\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)
By the way, it is worthwhile noting that if the sample size \(n\) is much smaller than the population size \(N\), that is, if \(n \ll N\), then:
\(\dfrac{N-n}{N-1}\approx 1\)
and the confidence interval for \(p\) of a small population becomes quite similar to the confidence interval for \(p\) of a large population:
\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)
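The interval with the finite population correction is easy to compute directly. The following is a minimal sketch, not a reference implementation: the function name `finite_population_ci` and the survey result of 120 yes answers out of 200 sampled are hypothetical, used only to show how the correction narrows the interval.

```python
import math


def finite_population_ci(p_hat, n, N, z=1.96):
    """Approximate CI for p when n of N units are sampled without replacement."""
    if N < 2 or not 0 < n <= N:
        raise ValueError("need N >= 2 and 0 < n <= N")
    fpc = (N - n) / (N - 1)  # finite population correction factor
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n * fpc)
    return p_hat - margin, p_hat + margin


# Hypothetical survey result: 120 of 200 sampled residents say yes.
p_hat = 120 / 200
print(finite_population_ci(p_hat, n=200, N=2000))       # small town
print(finite_population_ci(p_hat, n=200, N=2_000_000))  # n << N, factor near 1
```

When \(n=N\) the correction factor is 0 and the interval collapses to the single point \(\hat{p}\), as it should for a complete census.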
Example 6-3 (continued)
A researcher is studying the population of a small town in India of \(N=2000\) people. She's interested in estimating \(p\) for several yes/no questions on a survey.
How many people \(n\) does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within \(\epsilon=0.04\) of the true proportion \(p\)?
Answer
Now that we know the correct formula for the confidence interval for \(p\) of a small population, we can follow the same procedure we did for determining the sample size for estimating a proportion \(p\) of a large population. The researcher's goal is to estimate \(p\) so that the error is no larger than 0.04. That is, the goal is to calculate a 95% confidence interval such that:
\(\hat{p}\pm \epsilon=\hat{p}\pm 0.04\)
Now, we know the formula for an approximate \((1-\alpha)100\%\) confidence interval for a proportion \(p\) of a small population is:
\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)
So, again, we should proceed by equating the terms appearing after each of the above \(\pm\) signs, and solving for \(n\). That is, equate:
\(\epsilon=z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}\cdot \dfrac{N-n}{N-1}}\)
and solve for \(n\). Doing the algebra yields:
\(n=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})/\epsilon^2}{\dfrac{N-1}{N}+\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{N\epsilon^2}}\)
That looks simply dreadful! Let's make it look a little more friendly to the eyes:
\(n=\dfrac{m}{1+\dfrac{m-1}{N}}\)
where \(m\) is defined as the sample size necessary for estimating the proportion \(p\) for a large population, that is, when a correction for the population being small and finite is not made. That is:
\(m=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)
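For readers who want to see the algebra, here is one way to carry it out using the shorthand \(m\) just defined. Squaring both sides of the equation for \(\epsilon\) and dividing by \(\epsilon^2\) gives:

\(1=\dfrac{m}{n}\cdot \dfrac{N-n}{N-1}\)

Multiplying through by \(n(N-1)\) and collecting the terms in \(n\):

\(n(N-1)=m(N-n) \quad\Rightarrow\quad n(N-1+m)=mN\)

Finally, solving for \(n\) and dividing the numerator and denominator by \(N\):

\(n=\dfrac{mN}{N+m-1}=\dfrac{m}{\dfrac{N-1}{N}+\dfrac{m}{N}}=\dfrac{m}{1+\dfrac{m-1}{N}}\)

The middle expression is the "dreadful" form, and the last is the friendlier one.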
Now, before we make the calculation for our particular example, let's take a step back and summarize what we have just learned.
- Estimating a population proportion \(p\) of a small finite population
-
The sample size necessary for estimating a population proportion \(p\) of a small finite population with \((1-\alpha)100\%\) confidence and error no larger than \(\epsilon\) is:
\(n=\dfrac{m}{1+\dfrac{m-1}{N}}\)
where:
\(m=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)
is the sample size necessary for estimating the proportion \(p\) for a large population.
Example 6-3 (continued)
A researcher is studying the population of a small town in India of \(N=2000\) people. She's interested in estimating \(p\) for several yes/no questions on a survey.
How many people \(n\) does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within \(\epsilon=0.04\) of the true proportion \(p\)?
Answer
Okay, once and for all, let's calculate this very patient researcher's sample size! Because the researcher has many different questions on the survey, it would behoove her to use a sample proportion of 0.50 in her calculations. If the maximum error \(\epsilon\) is 0.04, the sample proportion is 0.5, and the researcher doesn't make the finite population correction, then she needs:
\(m=\dfrac{(1.96^2)(\frac{1}{4})}{0.04^2}=600.25\)
or 601 people to estimate \(p\) with 95% confidence. But, upon making the correction for the small, finite population, we see that the researcher really only needs:
\(n=\dfrac{m}{1+\dfrac{m-1}{N}}=\dfrac{601}{1+\dfrac{601-1}{2000}}=462.3\)
or 463 people to estimate \(p\) with 95% confidence.
Effect of Population Size \(N\)
The following table illustrates how the sample size \(n\) that is necessary for estimating a population proportion \(p\) (with 95% confidence) is affected by the size of the population \(N\). If \(\hat{p}=0.5\), then the sample size \(n\) is:
\( \hat{p} = 0.5\) | \( \epsilon \)= 0.01 | \( \epsilon \)= 0.03 | \( \epsilon \)= 0.05 |
---|---|---|---|
N very large | 9604 | 1068 | 385 |
N = 10,000,000 | 9595 | 1068 | 385 |
N = 1,000,000 | 9513 | 1067 | 385 |
N = 100,000 | 8763 | 1057 | 384 |
N = 10,000 | 4900 | 966 | 371 |
N = 1,000 | 906 | 517 | 279 |
This table suggests, perhaps not surprisingly, that as the size of the population \(N\) decreases, so does the necessary size \(n\) of the sample.
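The finite population calculation, including Example 6-3 and the table above, can be sketched in Python as follows. The function names are hypothetical. The sketch follows the arithmetic used in this lesson, in which \(m\) is first rounded up to a whole number and then corrected; correcting the unrounded \(m\) instead can give a value that is smaller by one.

```python
import math


def large_population_size(z, p_hat, eps):
    """m: sample size for a proportion with no finite population correction."""
    raw = (z ** 2) * p_hat * (1 - p_hat) / (eps ** 2)
    # Round away floating-point noise before rounding up.
    return math.ceil(round(raw, 6))


def finite_population_size(m, N):
    """Apply n = m / (1 + (m - 1) / N) and round up to a whole person."""
    raw = m / (1 + (m - 1) / N)
    return math.ceil(round(raw, 6))


# Example 6-3: N = 2000, error 0.04, p_hat = 0.5, 95% confidence (z = 1.96).
m = large_population_size(1.96, 0.5, 0.04)
n = finite_population_size(m, 2000)
print(m, n)

# Effect of the population size N for p_hat = 0.5 and 95% confidence.
for N in (10_000_000, 1_000_000, 100_000, 10_000, 1_000):
    row = [
        finite_population_size(large_population_size(1.96, 0.5, eps), N)
        for eps in (0.01, 0.03, 0.05)
    ]
    print(f"N = {N:>10,}: {row}")
```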