4.2 - Sampling Distribution of the Sample Proportion

Before we begin, let’s make sure we review the terms and notation associated with proportions:

$p$ is the population proportion. It is a fixed value.
$n$ is the size of the random sample.

$\hat{p}$ is the sample proportion. It varies based on the sample.

Let's look at some of the runners in Ellie's sample to illustrate how to find the sampling distribution for an example where the population is small.

The five runners are Alex (A), Betina (B), Carly (C), Debbie (D), and Edward (E). The table below shows each runner's name and favorite color of running shoe.

Name	Alex (A)	Betina (B)	Carly (C)	Debbie (D)	Edward (E)
Color	Green	Blue	Yellow	Purple	Blue

We are interested in the proportion of runners who prefer blue shoes. From the table, we can see that $p = 0.40$ of the runners prefer blue shoes.

Similar to the runners' mileage example earlier in the lesson, let's say we didn't know the proportion of runners who like blue as their favorite shoe color. We'll use resampling methods to estimate the proportion. Let’s take repeated samples of size $n=2$, without replacement. Here are all the possible samples of size $n=2$, the proportion of runners in each sample who like blue running shoes (labeled P(Blue)), and the probability of drawing each sample.

Sample	P(Blue)	Probability
AB	1/2	1/10
AC	0	1/10
AD	0	1/10
AE	1/2	1/10
BC	1/2	1/10
BD	1/2	1/10
BE	1	1/10
CD	0	1/10
CE	1/2	1/10
DE	1/2	1/10

The probability mass function (PMF) is:

P(Blue)	0	1/2	1
Probability	3/10	6/10	1/10
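The enumeration above can be reproduced with a short script. This is a minimal sketch, assuming Python's standard library only; the function name `sampling_pmf` and the dictionary `runners` are illustrative names, not part of the lesson.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Favorite shoe colors from the table above.
runners = {"A": "Green", "B": "Blue", "C": "Yellow", "D": "Purple", "E": "Blue"}

def sampling_pmf(colors, n, target="Blue"):
    """Exact PMF of the sample proportion over all equally likely samples of size n."""
    samples = list(combinations(colors, n))  # every sample drawn without replacement
    counts = Counter(
        Fraction(sum(colors[name] == target for name in sample), n)
        for sample in samples
    )
    return {value: Fraction(count, len(samples)) for value, count in sorted(counts.items())}

pmf_n2 = sampling_pmf(runners, 2)
for value, prob in pmf_n2.items():
    print(value, prob)
```

Because `Fraction` reduces to lowest terms, the probability 6/10 is displayed as 3/5. The same function can be called with other sample sizes, for example `sampling_pmf(runners, 4)`.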

The graph of the PMF:

Sampling Distribution of P(Blue)

The true proportion is $p=P(\text{Blue})=\frac{2}{5}$. When the sample size is $n=2$, you can see from the PMF that it is not possible to get a sample proportion equal to the true proportion.

Although not presented in detail here, we could find the sampling distribution for a larger sample size, say $n=4$. The PMF for $n=4$ is:

P(Blue)	1/4	1/2
Probability	2/5	3/5

As with the sampling distribution of the sample mean, the sampling distribution of the sample proportion will have sampling error. It is also the case that the larger the sample size, the smaller the spread of the distribution.
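To see the shrinking spread numerically, the sketch below computes the mean and standard deviation of the sample proportion over every possible sample of size 2 and of size 4 from the five runners. The list `blue` and the helper `mean_and_sd` are illustrative names introduced here, assuming each sample is equally likely.

```python
from itertools import combinations
from math import sqrt

# 1 = prefers blue, 0 = otherwise, for runners A, B, C, D, E (from the table above).
blue = [0, 1, 0, 0, 1]

def mean_and_sd(indicators, n):
    """Mean and standard deviation of p-hat over all equally likely samples of size n."""
    p_hats = [sum(sample) / n for sample in combinations(indicators, n)]
    mean = sum(p_hats) / len(p_hats)
    variance = sum((x - mean) ** 2 for x in p_hats) / len(p_hats)
    return mean, sqrt(variance)

for n in (2, 4):
    mean, sd = mean_and_sd(blue, n)
    print(f"n={n}: mean={mean:.3f}, sd={sd:.3f}")
```

For both sample sizes the mean of the sampling distribution is 0.40, the true proportion, while the standard deviation drops from about 0.30 at $n=2$ to about 0.12 at $n=4$.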

4.2.1 - Normal Approximation to the Binomial

For the sampling distribution of the sample mean, we learned how to apply the Central Limit Theorem when the underlying distribution is not normal. In this section, we will present how we can apply the Central Limit Theorem to find the sampling distribution of the sample proportion. Remember when we introduced quantitative and categorical data? In this example, we are working with a special type of categorical variable called a Bernoulli random variable, $Y$.

A side note for those who are curious: A Bernoulli random variable is a very simple kind of variable. It has only two possible values, 0 and 1, and there is only one trial. This is different from a binomial random variable, which involves repeated independent trials. We will not focus too much on these differences in this course, but if you are curious, this might be useful information to have!

Bernoulli Random Variable $\boldsymbol{Y}$

For an experiment that results in a success or a failure, let the random variable $Y$ equal 1 if there is a success and 0 if there is a failure. Therefore,

$Y=\begin{cases} 1 & \text{success}\\ 0 & \text{failure}\end{cases}$

and let $p$ be the probability of a success.

The Bernoulli random variable is a special case of the Binomial random variable, where the number of trials is equal to one.

Suppose we have $n$ independent trials of this same experiment. Then we would have $n$ values of $Y$, namely $Y_1, Y_2, \ldots, Y_n$.

If we define $X$ to be the sum of those values, we get...

$X=\sum_{i=1}^n Y_i$

$X$ is then a Binomial random variable with parameters $n$ and $p$.

You are probably wondering what this has to do with the sampling distribution of the sample proportion. Well, suppose we have a random sample of size $n$ from a population and are interested in a particular “success”. Let the probability of success be $p$. We can label the successes as 1 and the failures as 0. The sample proportion, $\hat{p}$, would be the number of successes (the sum of the 1s) divided by the sample size. Therefore,

$\hat{p}=\dfrac{\sum_{i=1}^n Y_i}{n}=\dfrac{X}{n}$

In other words, $\hat{p}$ could be thought of as a mean! If this is the case, we can apply the Central Limit Theorem for large samples!

Therefore, for large samples, the shape of the sampling distribution for $\hat{p}$ will be approximately normal. What about the mean and the standard deviation?

Mean and Standard Deviation [Standard Error] of the Sample Proportion, $\hat{p}$

Given that $X$ is binomial with parameters $n$ and $p$:

The mean of $\hat{p}$
- The mean of $\hat{p}$ is $p$, since the mean of $X$ is $\mu=np$ and $\hat{p}=\dfrac{X}{n}$, so dividing by $n$ gives $\dfrac{np}{n}=p$.
The standard deviation [standard error] of $\hat{p}$
- The standard error of $\hat{p}$ is $\sqrt{\dfrac{p(1-p)}{n}}$, since the standard deviation of $X$ is $\sqrt{np(1-p)}$ and dividing by $n$ gives $\dfrac{\sqrt{np(1-p)}}{n}=\sqrt{\dfrac{p(1-p)}{n}}$.
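A quick simulation can make these two results concrete. The sketch below is only an illustration: the values $p=0.4$, $n=50$, the number of repetitions, and the helper name `simulate_p_hats` are all assumptions chosen for the example, not values from the lesson.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def simulate_p_hats(p, n, reps, seed=0):
    """Draw `reps` samples of n Bernoulli(p) trials and return each sample's proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

# Hypothetical values chosen for illustration only.
p, n, reps = 0.4, 50, 10_000
p_hats = simulate_p_hats(p, n, reps)

print("simulated mean of p-hat:", round(mean(p_hats), 4))
print("theoretical mean p:     ", p)
print("simulated SD of p-hat:  ", round(pstdev(p_hats), 4))
print("theoretical SE:         ", round(sqrt(p * (1 - p) / n), 4))
```

With many repetitions, the simulated mean should land close to $p$ and the simulated standard deviation close to $\sqrt{p(1-p)/n}$, though the exact digits depend on the random seed.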

4.2.2 - Sampling Distribution of the Sample Proportion

The distribution of the sample proportion is approximately normal when the following two conditions hold.

Over the years, the threshold used in these conditions has changed. The examples in the remaining lessons use the first set, with a threshold of 5; however, you may come across other books or software that use 10 or 15 for this value.

Book (Minitab)

$np \geq 5$
$n(1-p) \geq 5$

1990-2000s

$np \geq 10$
$n(1-p) \geq 10$

Current

$np \geq 15$
$n(1-p) \geq 15$
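The check itself is simple arithmetic. Here is a minimal sketch, assuming a hypothetical helper named `normal_approx_ok` whose `threshold` argument can be set to 5, 10, or 15 to match whichever set of conditions is in use; the values $n=30$ and $p=0.4$ are made up for illustration.

```python
def normal_approx_ok(n, p, threshold=5):
    """Return True when both np and n(1-p) meet the chosen threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# Hypothetical example: n = 30, p = 0.4 gives np = 12 and n(1-p) = 18.
for threshold in (5, 10, 15):
    print(threshold, normal_approx_ok(30, 0.4, threshold))
```

In this example the conditions hold at thresholds 5 and 10 but not at 15, because $np = 12$ falls short of 15. The same sample can therefore pass one rule of thumb and fail a stricter one.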

Sampling Distribution of the Sample Proportion

If both conditions in one of the sets listed above are satisfied, the sampling distribution of the sample proportion is...

approximately normal
with mean, $\mu=p$
standard deviation [standard error], $\sigma=\sqrt{\dfrac{p(1-p)}{n}}$

Why is this important? This is similar to the notes in the section on the CLT. If the sampling distribution of $\hat{p}$ is approximately normal, we can convert a sample proportion to a z-score using the following formula:

$z=\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$

We can apply this theory to find probabilities involving sample proportions.
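As an illustration of such a calculation, the sketch below standardizes a sample proportion and looks up an upper-tail probability from the standard normal distribution. The numbers ($p=0.4$, $n=50$, $\hat{p}=0.5$) and the helper name `z_score` are assumptions made up for this example; they satisfy $np \geq 15$ and $n(1-p) \geq 15$, so all three sets of conditions hold.

```python
from math import sqrt
from statistics import NormalDist

def z_score(p_hat, p, n):
    """Standardize a sample proportion using the standard error sqrt(p(1-p)/n)."""
    return (p_hat - p) / sqrt(p * (1 - p) / n)

# Hypothetical values: p = 0.4, n = 50, observed p-hat = 0.5.
p, n, p_hat = 0.4, 50, 0.5
z = z_score(p_hat, p, n)
upper_tail = 1 - NormalDist().cdf(z)  # approximate P(p-hat > 0.5)

print(f"z = {z:.2f}")
print(f"P(p-hat > {p_hat}) is approximately {upper_tail:.4f}")
```

Here the z-score is about 1.44, so a sample proportion of 0.5 or more would be fairly unusual, but not extremely so, if the true proportion were 0.4.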

Now we have a basic understanding of the relationship between samples and populations. Ellie will need to use the properties of the sampling distribution to relate the mean from her sample of runners to the distribution of means from all possible samples of runners, but this does not directly answer her question about the average number of miles all runners run. To do this, she needs to use another related technique called a confidence interval. Calculating a confidence interval will allow Ellie to estimate an interval that is likely to contain the true average number of miles run per week, based on her sample information. Let’s take a closer look at confidence intervals.
