6.3 - Estimating a Proportion for a Small, Finite Population

The methods of the last page, in which we derived a formula for the sample size necessary for estimating a population proportion \(p\), work just fine when the population in question is very large. When we have smaller, finite populations, however, such as the students in a high school or the residents of a small town, the formula we derived previously requires a slight modification. Let's start, as usual, by taking a look at an example.

Example 6-3

A researcher is studying the population of a small town in India of \(N=2000\) people. She's interested in estimating \(p\) for several yes/no questions on a survey.

How many people \(n\) does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within \(\epsilon=0.04\) of the true proportions \(p\)?

Answer

We can't even begin to address the answer to this question until we derive a confidence interval for a proportion for a small, finite population!

Theorem

An approximate \((1-\alpha)100\%\) confidence interval for a proportion \(p\) of a small population is:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)

Proof

We'll use the example above, where possible, to make the proof more concrete. Suppose we take a random sample, \(X_1, X_2, \ldots, X_n\), without replacement, of size \(n\) from a population of size \(N\). In the case of the example, \(N=2000\). Suppose also, unknown to us, that for a particular survey question there are \(N_1\) respondents who would respond "yes" to the question, and therefore \(N-N_1\) respondents who would respond "no." That is, our small finite population consists of \(N_1\) "yes" respondents and \(N-N_1\) "no" respondents.

If that's the case, the true proportion (but unknown to us) of yes respondents is:

\(p=P(Yes)=\dfrac{N_1}{N}\)

while the true proportion (but unknown to us) of no respondents is:

\(1-p=P(No)=1-\dfrac{N_1}{N}=\dfrac{N-N_1}{N}\)

Now, let \(X\) denote the number of respondents in the sample who say yes, so that:

\(X=\sum\limits_{i=1}^n X_i\)

where \(X_i=1\) if respondent \(i\) answers yes, and \(X_i=0\) if respondent \(i\) answers no. Then, the proportion in the sample who say yes is:

\(\hat{p}=\dfrac{\sum\limits_{i=1}^n X_i}{n}\)

Then, \(X=\sum\limits_{i=1}^n X_i\) is a hypergeometric random variable with mean:

\(E(X)=n\dfrac{N_1}{N}=np\)

and variance:

\(Var(X)=n\dfrac{N_1}{N}\left(1-\dfrac{N_1}{N}\right)\left(\dfrac{N-n}{N-1}\right)=np(1-p)\left(\dfrac{N-n}{N-1}\right)\)

It follows that \(\hat{p}=X/n\) has mean \(E(\hat{p})=p\) and variance:

\(Var(\hat{p})=\dfrac{p(1-p)}{n}\left(\dfrac{N-n}{N-1}\right)\)
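As a quick sanity check on the hypergeometric mean and variance formulas, here is a minimal Python sketch that computes both moments directly from the probability mass function and compares them with the closed forms. The population values (\(N=50\), \(N_1=20\), \(n=10\)) and the function names are made up for illustration; they are not part of the example.

```python
from math import comb


def hypergeom_pmf(x, N, N1, n):
    """P(X = x) when n items are drawn without replacement from N items,
    N1 of which are 'yes' items."""
    return comb(N1, x) * comb(N - N1, n - x) / comb(N, n)


def exact_moments(N, N1, n):
    """Mean and variance of X computed term by term from the pmf."""
    support = range(max(0, n - (N - N1)), min(n, N1) + 1)
    mean = sum(x * hypergeom_pmf(x, N, N1, n) for x in support)
    var = sum((x - mean) ** 2 * hypergeom_pmf(x, N, N1, n) for x in support)
    return mean, var


# Hypothetical small population (illustration only).
N, N1, n = 50, 20, 10
p = N1 / N

mean, var = exact_moments(N, N1, n)
formula_mean = n * p
formula_var = n * p * (1 - p) * (N - n) / (N - 1)

print("mean:    ", mean, "vs formula", formula_mean)
print("variance:", var, "vs formula", formula_var)
```

The two columns should agree up to floating-point rounding. Dividing the variance by \(n^2\) gives the variance of \(\hat{p}=X/n\) stated above.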

Then, the Central Limit Theorem tells us that:

\(\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n} \left(\dfrac{N-n}{N-1}\right) }}\)

follows an approximate standard normal distribution. Now, it's just a matter of doing the typical confidence interval derivation, in which we start with a probability statement, manipulate the quantity inside the parentheses, and substitute sample estimates where necessary. We've done that a number of times now, so skipping all of the details here, we get that an approximate \((1-\alpha)100\%\) confidence interval for \(p\) of a small population is:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)

By the way, it is worthwhile noting that if the sample size \(n\) is much smaller than the population size \(N\), that is, if \(n \ll N\), then:

\(\dfrac{N-n}{N-1}\approx 1\)

and the confidence interval for \(p\) of a small population becomes quite similar to the confidence interval for \(p\) of a large population:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)
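To see how much the finite population correction matters in practice, here is a small Python sketch of the interval in the theorem. The function name `finite_population_ci` and the inputs (\(\hat{p}=0.5\), \(n=400\)) are assumptions chosen for illustration only; the critical value \(z_{\alpha/2}\) comes from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def finite_population_ci(p_hat, n, N, confidence=0.95):
    """Approximate confidence interval for p when sampling n of N
    without replacement (normal approximation with the finite
    population correction (N - n) / (N - 1))."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    fpc = (N - n) / (N - 1)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n * fpc)
    return p_hat - half_width, p_hat + half_width


# Hypothetical inputs: half the sample says yes, n = 400.
small = finite_population_ci(0.5, 400, 2000)    # small town, N = 2000
large = finite_population_ci(0.5, 400, 10**9)   # n << N, correction ~ 1

print("N = 2000:", small)
print("N = 10^9:", large)
```

With the same \(\hat{p}\) and \(n\), the interval for \(N=2000\) comes out narrower than the one for the very large population, which behaves like the uncorrected large-population interval.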

Example 6-3 (continued)

A researcher is studying the population of a small town in India of \(N=2000\) people. She's interested in estimating \(p\) for several yes/no questions on a survey.

How many people \(n\) does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within \(\epsilon=0.04\) of the true proportion \(p\)?

Answer

Now that we know the correct formula for the confidence interval for \(p\) of a small population, we can follow the same procedure we did for determining the sample size for estimating a proportion \(p\) of a large population. The researcher's goal is to estimate \(p\) so that the error is no larger than 0.04. That is, the goal is to calculate a 95% confidence interval such that:

\(\hat{p}\pm \epsilon=\hat{p}\pm 0.04\)

Now, we know the formula for an approximate \((1-\alpha)100\%\) confidence interval for a proportion \(p\) of a small population is:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)

So, again, we should proceed by equating the terms appearing after each of the above \(\pm\) signs, and solving for \(n\). That is, equate:

\(\epsilon=z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}\cdot \dfrac{N-n}{N-1}}\)

and solve for \(n\). Doing the algebra yields:

\(n=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})/\epsilon^2}{\dfrac{N-1}{N}+\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{N\epsilon^2}}\)

That looks simply dreadful! Let's make it look a little more friendly to the eyes:

\(n=\dfrac{m}{1+\dfrac{m-1}{N}}\)

where \(m\) is defined as the sample size necessary for estimating the proportion \(p\) for a large population, that is, when a correction for the population being small and finite is not made. That is:

\(m=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)
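For readers who would like to see the skipped algebra, one way to get from the equated margins of error to the friendlier form is sketched below. It squares both sides, substitutes \(m\) as just defined, and collects the terms in \(n\):

```latex
\begin{aligned}
\epsilon^2 &= z^2_{\alpha/2}\,\dfrac{\hat{p}(1-\hat{p})}{n}\cdot\dfrac{N-n}{N-1}
  && \text{(square both sides)} \\
n(N-1) &= m(N-n)
  && \text{(multiply by } n(N-1)/\epsilon^2 \text{ and substitute } m\text{)} \\
n(N-1+m) &= mN
  && \text{(collect the terms in } n\text{)} \\
n &= \dfrac{mN}{N-1+m}
  = \dfrac{m}{\dfrac{N-1}{N}+\dfrac{m}{N}}
  = \dfrac{m}{1+\dfrac{m-1}{N}}
  && \text{(divide numerator and denominator by } N\text{)}
\end{aligned}
```

The middle expression in the last line is the "dreadful" form, and the final expression is the compact one.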

Now, before we make the calculation for our particular example, let's take a step back and summarize what we have just learned.

Estimating a population proportion \(p\) of a small finite population

The sample size necessary for estimating a population proportion \(p\) of a small finite population with \((1-\alpha)100\%\) confidence and error no larger than \(\epsilon\) is:

\(n=\dfrac{m}{1+\dfrac{m-1}{N}}\)

where:

\(m=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)

is the sample size necessary for estimating the proportion \(p\) for a large population.

Example 6-3 (continued)

A researcher is studying the population of a small town in India of \(N=2000\) people. She's interested in estimating \(p\) for several yes/no questions on a survey.

How many people \(n\) does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within \(\epsilon=0.04\) of the true proportion \(p\)?

Answer

Okay, once and for all, let's calculate this very patient researcher's sample size! Because the researcher has many different questions on the survey, it would behoove her to use a sample proportion of 0.50 in her calculations: the product \(\hat{p}(1-\hat{p})\) is largest when \(\hat{p}=0.5\), so the resulting sample size is large enough for every question. If the maximum error \(\epsilon\) is 0.04, the sample proportion is 0.5, and the researcher doesn't make the finite population correction, then she needs:

\(m=\dfrac{(1.96^2)(\frac{1}{4})}{0.04^2}=600.25\)

or 601 people to estimate \(p\) with 95% confidence. But, upon making the correction for the small, finite population, we see that the researcher really only needs:

\(n=\dfrac{m}{1+\dfrac{m-1}{N}}=\dfrac{601}{1+\dfrac{601-1}{2000}}=462.3\)

or 463 people to estimate \(p\) with 95% confidence.
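The calculation above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch follows the text in using \(z_{0.025}=1.96\) and in rounding \(m\) up to a whole number before applying the finite population correction.

```python
from math import ceil


def large_population_size(epsilon, p_hat=0.5, z=1.96):
    """Unrounded sample size m when no finite population correction is made."""
    return z**2 * p_hat * (1 - p_hat) / epsilon**2


def finite_population_size(m, N):
    """Unrounded sample size n after correcting m for a population of size N."""
    return m / (1 + (m - 1) / N)


# Values from Example 6-3: epsilon = 0.04, p_hat = 0.5, N = 2000.
m = ceil(large_population_size(0.04))       # round up to whole people
n = ceil(finite_population_size(m, 2000))   # round up again

print("without correction:", m)
print("with correction:   ", n)
```

Rounding up at each step is the conservative choice, since rounding down would leave the margin of error slightly larger than \(\epsilon\).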

Effect of Population Size \(N\)

The following table illustrates how the sample size \(n\) that is necessary for estimating a population proportion \(p\) (with 95% confidence) is affected by the size of the population \(N\). If \(\hat{p}=0.5\), then the sample size \(n\) is:

| \(\hat{p} = 0.5\) | \(\epsilon = 0.01\) | \(\epsilon = 0.03\) | \(\epsilon = 0.05\) |
|---|---|---|---|
| \(N\) very large | 9604 | 1068 | 385 |
| \(N\) = 10,000,000 | 9595 | 1068 | 385 |
| \(N\) = 1,000,000 | 9513 | 1067 | 385 |
| \(N\) = 100,000 | 8763 | 1057 | 384 |
| \(N\) = 10,000 | 4900 | 966 | 371 |
| \(N\) = 1,000 | 906 | 517 | 279 |

This table suggests, perhaps not surprisingly, that as the size of the population \(N\) decreases, so does the necessary size \(n\) of the sample.
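The entries in the table can be reproduced with a short Python sketch. It assumes, as the table's values suggest, that \(z_{0.025}=1.96\) is used and that \(m\) is rounded up before the correction is applied. Exact fractions are used instead of floating-point numbers so that a value such as \(m=9604\) is not nudged over an integer boundary by rounding error; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

Z = Fraction("1.96")  # z_{0.025}, as used in the text


def large_population_size(epsilon, p_hat=Fraction(1, 2)):
    """Sample size m for a very large population, rounded up."""
    return ceil(Z**2 * p_hat * (1 - p_hat) / epsilon**2)


def finite_population_size(m, N):
    """Sample size n after the finite population correction, rounded up."""
    return ceil(Fraction(m) / (1 + Fraction(m - 1, N)))


epsilons = [Fraction("0.01"), Fraction("0.03"), Fraction("0.05")]
populations = [10_000_000, 1_000_000, 100_000, 10_000, 1_000]

# First row: N very large, so no correction is applied.
very_large = [large_population_size(e) for e in epsilons]
print(f"{'very large':>12}", very_large)

# Remaining rows: one per population size.
table = {}
for N in populations:
    table[N] = [finite_population_size(large_population_size(e), N)
                for e in epsilons]
    print(f"{N:>12,}", table[N])
```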

