4.2 - Sampling Distribution of the Sample Proportion

Before we begin, let’s make sure we review the terms and notation associated with proportions:

$p$ is the population proportion. It is a fixed value.
$n$ is the size of the random sample.
$\hat{p}$ is the sample proportion. It varies based on the sample.

The following example illustrates how to find the sampling distribution of the sample proportion when the population is small.

Sample Proportions with a Small Population: Favorite Color


In a particular family, there are five children. Their names are Alex (A), Betina (B), Carly (C), Debbie (D), and Edward (E). The table below shows each child’s name and favorite color.

Name	Alex (A)	Betina (B)	Carly (C)	Debbie (D)	Edward (E)
Color	Green	Blue	Yellow	Purple	Blue

We are interested in the proportion of children in the family who prefer the color blue, and from the table, we can see that $p = .40$ of the children prefer blue.

Similar to the pumpkin example earlier in the lesson, let's say we didn't know the proportion of children whose favorite color is blue. We'll use resampling methods to estimate the proportion. Let’s take repeated samples of size $n=2$, drawn without replacement. Here are all the possible samples of size $n=2$, the proportion of children in each sample who prefer blue, and the probability of drawing each sample.

Sample	P(Blue)	Probability
AB	1/2	1/10
AC	0	1/10
AD	0	1/10
AE	1/2	1/10
BC	1/2	1/10
BD	1/2	1/10
BE	1	1/10
CD	0	1/10
CE	1/2	1/10
DE	1/2	1/10

The probability mass function (PMF) is:

P(Blue)	0	1/2	1
Probability	3/10	6/10	1/10
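The enumeration above can be reproduced with a short script. The following is a minimal sketch in Python and is not part of the original lesson; the names `favorite_color` and `sampling_pmf` are illustrative assumptions. It lists every sample of a given size drawn without replacement, treats each sample as equally likely, and tallies the sample proportion of children who prefer blue.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Favorite colors from the table above (keys are the children's initials).
favorite_color = {"A": "Green", "B": "Blue", "C": "Yellow", "D": "Purple", "E": "Blue"}


def sampling_pmf(colors, n, target="Blue"):
    """Return {sample proportion: probability} over all samples of size n."""
    samples = list(combinations(sorted(colors), n))  # sampling without replacement
    counts = Counter(
        Fraction(sum(colors[child] == target for child in sample), n)
        for sample in samples
    )
    # Every sample is equally likely, so each contributes 1/len(samples).
    return {prop: Fraction(k, len(samples)) for prop, k in sorted(counts.items())}


for proportion, probability in sampling_pmf(favorite_color, 2).items():
    print(proportion, probability)
```

Note that `Fraction` reduces to lowest terms, so the probability 6/10 in the table is printed as 3/5. Calling `sampling_pmf(favorite_color, 4)` enumerates the samples of size 4 in the same way.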

The graph of the PMF:

Sampling Distribution of P(Blue)

The true proportion is $p=P(Blue)=\frac{2}{5}$. When the sample size is $n=2$, the PMF shows that the only possible sample proportions are 0, 1/2, and 1, so it is not possible to get a sample proportion that is equal to the true proportion.

Although not presented in detail here, we could find the sampling distribution for a larger sample size, say $n=4$. The PMF for $n=4$ is:

P(Blue)	1/4	1/2
Probability	2/5	3/5

As with the sampling distribution of the sample mean, the sampling distribution of the sample proportion will have sampling error. It is also the case that the larger the sample size, the smaller the spread of the distribution.
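To see the shrinking spread in numbers, the sketch below computes the mean and standard deviation of the two PMFs given above. It is an illustration only; the names `pmf_n2`, `pmf_n4`, and `mean_and_sd` are assumptions introduced here, and the probabilities are copied from the tables for $n=2$ and $n=4$.

```python
from fractions import Fraction
from math import sqrt

# PMFs of the sample proportion, copied from the tables above.
pmf_n2 = {Fraction(0): Fraction(3, 10), Fraction(1, 2): Fraction(6, 10), Fraction(1): Fraction(1, 10)}
pmf_n4 = {Fraction(1, 4): Fraction(2, 5), Fraction(1, 2): Fraction(3, 5)}


def mean_and_sd(pmf):
    """Mean and standard deviation of a discrete distribution given as a dict."""
    mean = sum(value * prob for value, prob in pmf.items())
    variance = sum((value - mean) ** 2 * prob for value, prob in pmf.items())
    return float(mean), sqrt(float(variance))


for label, pmf in (("n=2", pmf_n2), ("n=4", pmf_n4)):
    mean, sd = mean_and_sd(pmf)
    print(f"{label}: mean = {mean:.2f}, SD = {sd:.4f}")
```

Both distributions are centered at the true proportion 0.4, while the standard deviation falls from 0.3 for $n=2$ to about 0.12 for $n=4$.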

Example 4-3 Resampling with StatKey

Using StatKey, we resample 1000 times from each of three populations with probabilities of success 0.1, 0.9, and 0.5, using a sample size of $n=25$. The video shows the resulting distributions.
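For readers without access to StatKey, a similar experiment can be sketched in Python. This is not StatKey's implementation; it is a rough stand-in that assumes each observation is an independent success with probability $p$, and the function name `simulate_sample_proportions` and the fixed seed are assumptions made for the illustration.

```python
import random
from statistics import mean, pstdev


def simulate_sample_proportions(p, n=25, reps=1000, seed=0):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


for p in (0.1, 0.9, 0.5):
    p_hats = simulate_sample_proportions(p)
    print(f"p={p}: mean of p-hat = {mean(p_hats):.3f}, SD of p-hat = {pstdev(p_hats):.3f}")
```

In each case the average of the simulated sample proportions should land near $p$. A histogram of `p_hats` would be skewed for $p=0.1$ and $p=0.9$ and roughly symmetric for $p=0.5$.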

4.2.1 - Normal Approximation to the Binomial

For the sampling distribution of the sample mean, we learned how to apply the Central Limit Theorem when the underlying distribution is not normal. In this section, we will present how we can apply the Central Limit Theorem to find the sampling distribution of the sample proportion. Let’s start by defining a Bernoulli random variable, $Y$.

Bernoulli Random Variable $\boldsymbol{Y}$

For an experiment that results in a success or a failure, let the random variable $Y$ equal 1, if there is a success, and 0 if there is a failure. Therefore,

$Y=\begin{cases} 1 & \text{success}\\ 0 & \text{failure}\end{cases}$

and let $p$ be the probability of a success.

The Bernoulli random variable is a special case of the Binomial random variable, where the number of trials is equal to one.

Suppose we have, say, $n$ independent trials of this same experiment. Then we would have $n$ values of $Y$, namely $Y_1, Y_2, \ldots, Y_n$.

If we define $X$ to be the sum of those values, we get...

$X=\sum_{i=1}^n Y_i$

$X$ is then a Binomial random variable with parameters $n$ and $p$.

You are probably wondering what this has to do with the sampling distribution of the sample proportion. Well, suppose we have a random sample of size $n$ from a population and are interested in a particular "success". Let the probability of success be $p$. We can label the successes as 1 and the failures as 0. The sample proportion, $\hat{p}$, would be the number of successes divided by the sample size. Therefore,

$\hat{p}=\dfrac{\sum_{i=1}^n Y_i}{n}=\dfrac{X}{n}$
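The identity can be checked on simulated data. The sketch below is illustrative only; the values of `p` and `n` are hypothetical and the seed is arbitrary. It codes each trial as 0 or 1, sums the values to get $X$, and compares $X/n$ with the ordinary mean of the $Y_i$.

```python
import random
from statistics import mean

rng = random.Random(1)   # seeded for repeatability
p, n = 0.3, 20           # hypothetical values chosen only for illustration

# Y_i = 1 for a success, 0 for a failure.
y = [1 if rng.random() < p else 0 for _ in range(n)]

x = sum(y)          # binomial count of successes, X
p_hat = x / n       # sample proportion, X / n

print(y)
print(x, p_hat, mean(y))  # p_hat and mean(y) agree
```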

In other words, $\hat{p}$ could be thought of as a mean! If this is the case, we can apply the Central Limit Theorem for large samples!

Therefore, for large samples, the shape of the sampling distribution for $\hat{p}$ will be approximately normal. What about the mean and the standard deviation?

Mean and Standard Deviation [Standard Error] of $\hat{p}$

Given X is binomial...

The mean of $\hat{p}$
The mean of $\hat{p}$ would just be $p$ since the mean of $X$ is $\mu=np$ and $\hat{p}=\dfrac{X}{n}$.
The standard deviation [standard error] of $\hat{p}$
The standard error of $\hat{p}$ is $\sqrt{\dfrac{p(1-p)}{n}}$ since the standard deviation of $X$ is $\sqrt{np(1-p)}$.
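Both results follow from rescaling $X$ by the constant $1/n$. A short derivation, assuming the standard facts that $E(cX)=cE(X)$ and $SD(cX)=|c|\,SD(X)$ for a constant $c$:

\begin{align} E(\hat{p}) &=E\left(\frac{X}{n}\right)=\frac{E(X)}{n}=\frac{np}{n}=p\\ SD(\hat{p}) &=SD\left(\frac{X}{n}\right)=\frac{SD(X)}{n}=\frac{\sqrt{np(1-p)}}{n}=\sqrt{\frac{p(1-p)}{n}}\end{align}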

4.2.2 - Sampling Distribution of the Sample Proportion

The distribution of the sample proportion is approximately normal when the following two conditions hold.

The cutoff value used in these conditions has changed over the years. The examples in the remaining lessons use the first set of conditions, with a cutoff of 5. Other books or software may use 10 or 15 instead.

Book (Minitab)

$np \geq 5$
$n(1-p) \geq 5$

1990-2000s

$np \geq 10$
$n(1-p) \geq 10$

Current

$np \geq 15 $
$n(1-p) \geq 15 $
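Because the three sets differ only in the cutoff, a single check with an adjustable threshold covers all of them. The helper below is a hypothetical sketch (the name `normal_approximation_ok` and its `threshold` parameter are assumptions, not part of any package mentioned in this lesson).

```python
def normal_approximation_ok(n, p, threshold=5):
    """Check np >= threshold and n(1 - p) >= threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


# Hypothetical inputs: n = 40 and p = 0.2 give np = 8 and n(1 - p) = 32.
for threshold in (5, 10, 15):
    print(threshold, normal_approximation_ok(40, 0.2, threshold))
```

With these inputs the check passes at a cutoff of 5 but fails at 10 and 15, which shows how the choice of cutoff can change the conclusion for the same sample.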

Sampling Distribution of the Sample Proportion

If both conditions in any one of the sets listed above are satisfied, the sampling distribution of the sample proportion is...

approximately normal
with mean, $\mu=p$
standard deviation [standard error], $\sigma=\sqrt{\dfrac{p(1-p)}{n}}$

If the sampling distribution of $\hat{p}$ is approximately normal, we can convert a sample proportion to a z-score using the following formula:

$z=\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$
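The formula translates directly into code. The following is a minimal sketch; the function names `standard_error` and `z_score` and the input values are assumptions chosen for illustration.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of the sample proportion, sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def z_score(p_hat, p, n):
    """Standardize a sample proportion under the normal approximation."""
    return (p_hat - p) / standard_error(p, n)


# Hypothetical values: a sample proportion of 0.56 when p = 0.5 and n = 100.
print(round(z_score(0.56, 0.5, 100), 2))  # standard error is 0.05, so z = 1.2
```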

We can apply this theory to find probabilities involving sample proportions.

Example 4-4: iPhone Users

Suppose it is known that 43% of Americans own an iPhone. If a random sample of 50 Americans were surveyed, what is the probability that the proportion of the sample who owned an iPhone is between 45% and 50%?

Answer

For this problem, we know $p=0.43$ and $n=50$. First, we should check our conditions for the sampling distribution of the sample proportion.

$np=50(0.43)=21.5$ and $n(1-p)=50(1-0.43)=28.5$ - both are greater than 5.

Since the conditions are satisfied, $\hat{p}$ will have a sampling distribution that is approximately normal with mean $\mu=0.43$ and standard deviation [standard error] $\sqrt{\dfrac{0.43(1-0.43)}{50}}\approx 0.07$.

\begin{align} P(0.45<\hat{p}<0.5) &=P\left(\frac{0.45-0.43}{0.07}< \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}<\frac{0.5-0.43}{0.07}\right)\\ &\approx P\left(0.286<Z<1\right)\\ &=P(Z<1)-P(Z<0.286)\\ &=0.8413-0.6126\\ &=0.2287\end{align}

Therefore, if the true proportion of Americans who own an iPhone is 43%, then there would be a 22.87% chance that we would see a sample proportion between 45% and 50% when the sample size is 50.
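The same probability can be computed without a z-table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `prob_between` is an assumption introduced for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_between(lower, upper, p, n):
    """P(lower < p-hat < upper) under the normal approximation."""
    sampling_dist = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    return sampling_dist.cdf(upper) - sampling_dist.cdf(lower)


print(round(prob_between(0.45, 0.50, p=0.43, n=50), 4))
```

The result is close to the 0.2287 found by hand. Any small difference arises because the script keeps the unrounded standard error rather than 0.07.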

Try it!

If a random sample of 75 Americans were surveyed, what is the probability that more than 50% of the sample own an iPhone?

First, check our conditions: $np=75(0.43)=32.25$ and $n(1-p)=75(1-0.43)=42.75$ are both greater than 5. The sampling distribution of the sample proportion is approximately normal with mean $\mu=0.43$ and standard deviation [standard error] $\sqrt{\frac{p(1-p)}{n}}=\sqrt{\frac{0.43(1-0.43)}{75}}\approx 0.05717$.

\begin{align}P\left(\hat{p}>0.5\right) &=P\left(\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}>\frac{0.5-0.43}{\sqrt{\frac{0.43(1-0.43)}{75}}}\right)\\ &\approx P\left(Z>1.22\right)\\&=1-P(Z<1.22)\\&=1-0.8888\\&=0.1112 \end{align}

Therefore, there is an 11.1% chance of getting a sample proportion greater than 50% in a sample of size 75.
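As a sanity check on the approximation, the sketch below computes the same upper-tail probability two ways: with the normal approximation and with the exact binomial distribution of $X$. It is an illustration only; the variable names are assumptions, and the exact calculation assumes the 75 responses are independent with the same $p=0.43$.

```python
from math import comb, sqrt
from statistics import NormalDist

p, n, cutoff = 0.43, 75, 0.5

# Normal approximation: P(p-hat > cutoff).
approx = 1 - NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n)).cdf(cutoff)

# Exact binomial: p-hat > 0.5 means X > 37.5, i.e. X = 38, 39, ..., 75.
min_successes = int(cutoff * n) + 1
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min_successes, n + 1))

print(round(approx, 4), round(exact, 4))
```

The normal value differs slightly from 0.1112 because the z-score is not rounded to 1.22 before the lookup.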
