# 4: Estimating with Confidence

## Overview

#### Case-Study: Marathon Runners

Imagine the start of the Boston Marathon: the swell of runners, all dressed to begin the 26.2-mile trek. Let’s help Ellie estimate the average number of miles each runner runs per week. How will she know?

When we want to know something about a population, like the population of runners running the marathon, we are tasked with a monumental challenge, asking everyone. Ellie cannot practically ask every marathon runner, so instead, as most researchers do, she uses a sample of marathon runners, as we discussed in the Lesson 3 content. Assuming good sampling techniques are used, Ellie can ask each person in the sample how many miles per week they run. A simple question. Now given that she has the information, how does she turn this into information about her population of all marathon runners?

The distinction between the sample information we have and the population we seek is very important to keep track of. We KNOW the information about the sample. Ellie can calculate the average number of miles run per week for her sample. She can also calculate the standard deviation of the sample and make the appropriate graphs from her sample data. But what does this tell her about the population of runners?

Directly, nothing. Ellie will need to infer information about a population of runners from her sample. But first we need to point out the relationship of sample to populations.

If Ellie were to take many, many samples from her population (without replacement, meaning each runner could only be in one sample), eventually she would include every person in the population. While each sample would have its own mean and standard deviation, the mean of all the means would equal the population mean (remember, at the end of this hypothetical exercise, Ellie has the information on every runner, so the mean of all the runners is the population mean).

This hypothetical exercise produces something referred to as the sampling distribution of the means. Remember, this is a hypothetical exercise. There is no reason a researcher would actually take many, many samples, eventually arriving at the total population, unless, of course, the researcher sets out to take a census of the entire population (in which case inferential statistics are not needed at all because the researcher already knows the population values!).

Ellie wants to know on average how far marathon runners run in any given week. She knows she can sample a portion of the larger population of runners that represent all runners in her area. She also knows that she needs to be careful in obtaining the sample to ensure it is randomly selected and represents the population that she wants. She conducts the observational study and calculates the number of miles each person runs per week. Next, she uses Minitab to calculate the average number of miles for her sample, as well as the standard deviation. But she isn’t quite sure how to use this sample information to answer her original question about the population of runners in her area. Many questions arise for Ellie, including...

• Did she sample the right people?
• How close is her sample mean and standard deviation to the actual population mean and standard deviation?
• What can she ‘safely’ conclude about the population based on this sample?

So this gets confusing, right? We are working with POPULATIONS, SAMPLES, and now SAMPLING DISTRIBUTIONS. We have already defined populations and samples. This lesson will take a deeper look at sampling distributions.

Sampling Distribution
The sampling distribution of a statistic is a probability distribution based on a large number of samples of size $$n$$ from a given population.

## Objectives

Upon completion of this lesson, you should be able to:

• Identify the possibility of many samples within a sampling distribution.
• Equate the sum of all samples with the population in a sampling distribution.
• Identify the standard deviation of the sampling distribution as the standard error of the sample
• Compute and interpret a confidence interval for means (quantitative data).
• Compute and interpret a confidence interval for proportions (categorical data).

# 4.1 - Sampling Distribution of the Sample Mean

Let’s put some numbers into Ellie’s example.

Note! The sampling method is done without replacement.

## Sample Means with a Small Population: Runner’s Mileage

In this example, the population is the mileage of six runners. Ellie is going to try to guess the true average mileage of the six runners by taking a random sample without replacement from the population.

| Runner | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Mileage | 19 | 14 | 15 | 9 | 10 | 17 |

Since we know the miles from the population, we can find the population mean.

$$\mu=\dfrac{19+14+15+9+10+17}{6}=14$$ miles

To demonstrate the sampling distribution, let’s start by obtaining all of the possible samples of size $$n=2$$ from the population, sampling without replacement. The table below shows all the possible samples, the mileage for the chosen runners, the sample mean, and the probability of obtaining each sample. Since we are drawing at random, each sample has the same probability of being chosen.

| Sample | Mileage | $$\boldsymbol{\bar{x}}$$ | Probability |
|---|---|---|---|
| A, B | 19, 14 | 16.5 | $$\frac{1}{15}$$ |
| A, C | 19, 15 | 17.0 | $$\frac{1}{15}$$ |
| A, D | 19, 9 | 14.0 | $$\frac{1}{15}$$ |
| A, E | 19, 10 | 14.5 | $$\frac{1}{15}$$ |
| A, F | 19, 17 | 18.0 | $$\frac{1}{15}$$ |
| B, C | 14, 15 | 14.5 | $$\frac{1}{15}$$ |
| B, D | 14, 9 | 11.5 | $$\frac{1}{15}$$ |
| B, E | 14, 10 | 12.0 | $$\frac{1}{15}$$ |
| B, F | 14, 17 | 15.5 | $$\frac{1}{15}$$ |
| C, D | 15, 9 | 12.0 | $$\frac{1}{15}$$ |
| C, E | 15, 10 | 12.5 | $$\frac{1}{15}$$ |
| C, F | 15, 17 | 16.0 | $$\frac{1}{15}$$ |
| D, E | 9, 10 | 9.5 | $$\frac{1}{15}$$ |
| D, F | 9, 17 | 13.0 | $$\frac{1}{15}$$ |
| E, F | 10, 17 | 13.5 | $$\frac{1}{15}$$ |

We can combine all of the values and create a table of the possible values and their respective probabilities.

| $$\boldsymbol{\bar{x}}$$ | 9.5 | 11.5 | 12.0 | 12.5 | 13.0 | 13.5 | 14.0 | 14.5 | 15.5 | 16.0 | 16.5 | 17.0 | 18.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Probability | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{2}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{2}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ | $$\frac{1}{15}$$ |

The table is the probability table for the sample mean and it is the sampling distribution of the sample mean mileage of the runners when the sample size is 2. It is also worth noting that the sum of all the probabilities equals 1. It might be helpful to graph these values.

One can see that the chance that the sample mean is exactly the population mean is only 1 in 15, very small. (In some other examples, it may happen that the sample mean can never be the same value as the population mean.) When using the sample mean to estimate the population mean, some possible error will be involved since the sample mean is random.

Now that we have the sampling distribution of the sample mean, we can calculate the mean of all the sample means. In other words, we can find the mean (or expected value) of all the possible $$\bar{x}$$’s.

The mean of the sample means is

$$\mu_{\bar{x}}=\sum \bar{x}_{i}f(\bar{x}_i)=9.5\left(\frac{1}{15}\right)+11.5\left(\frac{1}{15}\right)+12\left(\frac{2}{15}\right)\\+12.5\left(\frac{1}{15}\right)+13\left(\frac{1}{15}\right)+13.5\left(\frac{1}{15}\right)+14\left(\frac{1}{15}\right)\\+14.5\left(\frac{2}{15}\right)+15.5\left(\frac{1}{15}\right)+16\left(\frac{1}{15}\right)+16.5\left(\frac{1}{15}\right)\\+17\left(\frac{1}{15}\right)+18\left(\frac{1}{15}\right)=14$$

Even though each sample may give you an answer involving some error, the expected value is right at the target: exactly the population mean. In other words, if one does the experiment over and over again, the overall average of the sample mean is exactly the population mean.
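
The enumeration above can also be checked by brute force. The following is a minimal Python sketch, not part of the original lesson: the function name `sampling_distribution` is a hypothetical helper, and exact fractions are used so the probabilities match the table rather than being rounded.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Population from the example: weekly mileage of the six runners.
mileage = {"A": 19, "B": 14, "C": 15, "D": 9, "E": 10, "F": 17}

def sampling_distribution(population, n):
    """Return {sample mean: probability} over all samples of size n,
    drawn without replacement, with every sample equally likely."""
    samples = list(combinations(population.values(), n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {mean: Fraction(count, len(samples)) for mean, count in counts.items()}

dist = sampling_distribution(mileage, 2)

# Print the probability table, smallest sample mean first.
for mean, prob in sorted(dist.items()):
    print(f"{float(mean):5.1f}  {prob}")

# Expected value of the sample mean: sum of (value * probability).
expected = sum(mean * prob for mean, prob in dist.items())
print(expected)  # 14, the population mean
```

Changing the second argument of `sampling_distribution` to another sample size gives the corresponding table for that size.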

Now, let's do the same thing as above but with sample size $$n=5$$

| Sample | Mileage | $$\boldsymbol{\bar{x}}$$ | Probability |
|---|---|---|---|
| A, B, C, D, E | 19, 14, 15, 9, 10 | 13.4 | 1/6 |
| A, B, C, D, F | 19, 14, 15, 9, 17 | 14.8 | 1/6 |
| A, B, C, E, F | 19, 14, 15, 10, 17 | 15.0 | 1/6 |
| A, B, D, E, F | 19, 14, 9, 10, 17 | 13.8 | 1/6 |
| A, C, D, E, F | 19, 15, 9, 10, 17 | 14.0 | 1/6 |
| B, C, D, E, F | 14, 15, 9, 10, 17 | 13.0 | 1/6 |

The sampling distribution is:

| $$\boldsymbol{\bar{x}}$$ | 13.0 | 13.4 | 13.8 | 14.0 | 14.8 | 15.0 |
|---|---|---|---|---|---|---|
| Probability | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |

The mean of the sample means is...

$$\mu_{\bar{x}}=\left(\dfrac{1}{6}\right)(13+13.4+13.8+14.0+14.8+15.0)=14$$ miles

The following dot plots show the distribution of the sample means corresponding to sample sizes of $$n=2$$ and of $$n=5$$.

Again, we see that using the sample mean to estimate the population mean involves sampling error. However, the error with a sample of size $$n=5$$ is, on average, smaller than with a sample of size $$n=2$$.

## Sampling Error and Size

Sampling Error
The error resulting from using a sample characteristic to estimate a population characteristic.

Sample size and sampling error: As the dot plots above show, the possible sample means cluster more closely around the population mean as the sample size increases. Thus, the possible sampling error decreases as sample size increases.
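
To put a number on "cluster more closely", one option is to compute the standard deviation of all possible sample means for each sample size. The sketch below is an illustration only; the helper name `sd_of_sample_means` is an assumption, and it reuses the six mileage values from the small-population example.

```python
from itertools import combinations
from math import sqrt

# The six-runner population from the small-population example.
mileage = [19, 14, 15, 9, 10, 17]

def sd_of_sample_means(population, n):
    """Standard deviation of the sample mean over every possible sample
    of size n drawn without replacement (all samples equally likely)."""
    means = [sum(s) / n for s in combinations(population, n)]
    center = sum(means) / len(means)
    return sqrt(sum((m - center) ** 2 for m in means) / len(means))

# Spread of the sampling distribution for every possible sample size.
spread = {n: sd_of_sample_means(mileage, n) for n in range(1, len(mileage) + 1)}
for n, sd in spread.items():
    print(n, round(sd, 3))
```

The printed spread shrinks as $$n$$ grows and reaches zero at $$n=6$$, where the only possible sample is the whole population.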

What happens when the population is not small?

## Sample Means with Large Samples

An instructor of an introduction to statistics course has 200 students. The scores out of 100 points are shown in the histogram. The population mean is $$\mu=69.77$$ and the population standard deviation is $$\sigma=10.9$$.

Let's demonstrate the sampling distribution of the sample means using the StatKey website. The first video will demonstrate the sampling distribution of the sample mean when n = 10 for the exam scores data. The second video will show the same data but with samples of n = 30.

You should start to see some patterns. The mean of the sampling distribution is very close to the population mean. The standard deviation of the sampling distribution is smaller than the standard deviation of the population.

In the examples so far, we were given the population and sampled from that population.

What happens when we do not have the population to sample from? What happens when all that we are given is the sample? Fortunately, we can use some theory to help us. The mathematical details of the theory are beyond the scope of this course but the results are presented in this lesson.

In the next two sections, we will discuss the sampling distribution of the sample mean when the population is Normally distributed and when it is not.

# 4.1.2 - Population is Not Normal

What happens when the sample comes from a population that is not normally distributed? This is where the Central Limit Theorem (CLT) comes in.

#### Central Limit Theorem

For a large sample size (we will explain this later), $$\bar{x}$$ is approximately normally distributed, regardless of the distribution of the population one samples from. If the population has mean $$\mu$$ and standard deviation $$\sigma$$, then the distribution of $$\bar{x}$$ has mean $$\mu$$ and standard deviation $$\dfrac{\sigma}{\sqrt{n}}$$.

We should stop here to break down what this theorem is saying because the Central Limit Theorem is very powerful!

The Central Limit Theorem applies to a sample mean from any distribution. We could have a left-skewed or a right-skewed distribution. As long as the sample size is large, the distribution of the sample means will follow an approximate Normal distribution.

For the purposes of this course, a sample size of $$n>30$$ is considered a large sample.
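
A small simulation can make the theorem concrete. The sketch below is an assumption-laden illustration, not part of the lesson: it uses a right-skewed exponential population (chosen only because its mean and standard deviation are both 1), a sample size of 40, and a fixed random seed so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 40                   # "large" sample by the n > 30 rule of thumb
reps = 5000              # number of simulated samples

# Assumed population: Exponential(rate=1), which is right-skewed
# with mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0

sample_means = []
for _ in range(reps):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)

print(mean_of_means)               # should land close to mu
print(sd_of_means, sigma / sqrt(n))  # should land close to each other
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.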

For many people just learning statistics, there is a "so what" thought about the CLT. Why is this important, and why do I care? If you recall, when we introduced the idea of z-scores we did so with the caveat that the distribution was normal. We took observed data that were normally distributed and converted them to z-scores, creating a standard normal distribution. We then leveraged this distribution to find percentiles (and in future units will leverage it to find probabilities).

The CLT allows us to treat the sampling distribution of the sample mean as approximately normal as long as the sample size is greater than 30 observations, even when the population itself is not normal. With this, we can apply most of our inferential statistics without having to compensate for non-normal distributions. This will take on greater relevance as we move through the course.

## Sampling Distribution of the Sample Mean

With the Central Limit Theorem, we can finally define the sampling distribution of the sample mean.

Sampling Distribution of the Sample Mean

The sampling distribution of the sample mean will have:

• the same mean as the population mean, $$\mu$$
• Standard deviation [standard error] of $$\dfrac{\sigma}{\sqrt{n}}$$

It will be Normal (or approximately Normal) if either of these conditions is satisfied

• The population distribution is Normal
• The sample size is large (greater than 30).

# 4.1.1 - Population is Normal

If the population is normally distributed with mean $$\mu$$ and standard deviation $$\sigma$$, then the sampling distribution of the sample mean is also normally distributed, no matter what the sample size is. When the sampling is done with replacement, or if the population size is large compared to the sample size, $$\bar{x}$$ has mean $$\mu$$ and standard deviation $$\dfrac{\sigma}{\sqrt{n}}$$. We use the term standard error for the standard deviation of a statistic, and since the sample average, $$\bar{x}$$, is a statistic, the standard deviation of $$\bar{x}$$ is also called the standard error of $$\bar{x}$$. However, in some books you may find the term standard error used for the estimated standard deviation of $$\bar{x}$$. In this class we use the former definition; that is, the standard error of $$\bar{x}$$ is the same as the standard deviation of $$\bar{x}$$.

Standard Deviation of $$\boldsymbol{\bar{x}}$$ [Standard Error]

$$SD(\bar{X})=SE(\bar{X})=\dfrac{\sigma}{\sqrt{n}}$$
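
As a quick worked illustration, take the exam-score population described earlier in the lesson ($$\sigma=10.9$$) and the two sample sizes used there, $$n=10$$ and $$n=30$$. Ignoring any adjustment for the population being finite, the formula gives roughly:

$$SE(\bar{X})=\dfrac{10.9}{\sqrt{10}}\approx 3.45 \qquad \text{and} \qquad SE(\bar{X})=\dfrac{10.9}{\sqrt{30}}\approx 1.99$$

Tripling the sample size does not cut the standard error to a third; it shrinks by a factor of $$\sqrt{3}$$.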

# 4.2 - Sampling Distribution of the Sample Proportion

Before we begin, let’s make sure we review the terms and notation associated with proportions:

• $$p$$ is the population proportion. It is a fixed value.

• $$n$$ is the size of the random sample.

• $$\hat{p}$$ is the sample proportion. It varies based on the sample.

Let's look at some of the runners in Ellie's sample to illustrate how to find the sampling distribution for an example where the population is small.

The five runners are Alex (A), Betina (B), Carly (C), Debbie (D), and Edward (E). The table below shows each runner's name and their favorite color running shoe.

| Name | Alex (A) | Betina (B) | Carly (C) | Debbie (D) | Edward (E) |
|---|---|---|---|---|---|
| Color | Green | Blue | Yellow | Purple | Blue |

We are interested in the proportion of runners who prefer blue shoes, and from the table, we can see that $$p = .40$$ of the runners prefer blue shoes.

Similar to the runners' mileage example earlier in the lesson, let's say we didn't know the proportion of runners who like blue as their favorite shoe color. We'll use repeated samples to estimate the proportion. Let’s take repeated samples of size $$n=2$$, without replacement. Here are all the possible samples of size $$n=2$$, the proportion of runners in each sample who like blue running shoes, and the probability of each sample.

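| Sample | P(Blue) | Probability |
|---|---|---|
| AB | 1/2 | 1/10 |
| AC | 0 | 1/10 |
| AD | 0 | 1/10 |
| AE | 1/2 | 1/10 |
| BC | 1/2 | 1/10 |
| BD | 1/2 | 1/10 |
| BE | 1 | 1/10 |
| CD | 0 | 1/10 |
| CE | 1/2 | 1/10 |
| DE | 1/2 | 1/10 |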

The probability mass function (PMF) is:

| P(Blue) | 0 | 1/2 | 1 |
|---|---|---|---|
| Probability | 3/10 | 6/10 | 1/10 |

The graph of the PMF:

#### Sampling Distribution of P(Blue)

The true proportion is $$p=P(Blue)=\frac{2}{5}$$. When the sample size is $$n=2$$, you can see from the PMF that it is not possible to get a sample proportion that is equal to the true proportion.

Although not presented in detail here, we could find the sampling distribution for a larger sample size, say $$n=4$$. The PMF for n=4 is...

| P(Blue) | 1/4 | 1/2 |
|---|---|---|
| Probability | 2/5 | 3/5 |

As with the sampling distribution of the sample mean, the sampling distribution of the sample proportion will have sampling error. It is also the case that the larger the sample size, the smaller the spread of the distribution.
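
Both PMFs in this section can be reproduced by listing every possible sample. The sketch below is illustrative only; `proportion_distribution` is a hypothetical helper name, and exact fractions are used so the results can be compared directly with the tables.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Favorite shoe color for the five runners in the example.
shoe_color = {"A": "Green", "B": "Blue", "C": "Yellow", "D": "Purple", "E": "Blue"}

def proportion_distribution(colors, n, target="Blue"):
    """Return {sample proportion: probability} over all samples of size n
    drawn without replacement, with every sample equally likely."""
    samples = list(combinations(colors.values(), n))
    counts = Counter(Fraction(s.count(target), n) for s in samples)
    return {p_hat: Fraction(c, len(samples)) for p_hat, c in sorted(counts.items())}

dist_n2 = proportion_distribution(shoe_color, 2)
dist_n4 = proportion_distribution(shoe_color, 4)

for label, dist in (("n=2", dist_n2), ("n=4", dist_n4)):
    print(label, {str(p_hat): str(prob) for p_hat, prob in dist.items()})
```

Fractions are printed in lowest terms, so the 6/10 in the $$n=2$$ table appears as 3/5.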

# 4.2.1 - Normal Approximation to the Binomial

For the sampling distribution of the sample mean, we learned how to apply the Central Limit Theorem when the underlying distribution is not normal. In this section, we will present how we can apply the Central Limit Theorem to find the sampling distribution of the sample proportion. Remember when we introduced quantitative and categorical data? In this example, we are working with a special type of categorical variable called a Bernoulli random variable, $$Y$$.

A side note for those who are curious: A Bernoulli random variable is a very simple kind of variable. It has only two possible values, 0 and 1, and there is only one trial. This is different from a binomial random variable, which involves repeated independent trials. We will not focus too much on these differences in this course, but if you are curious this might be useful information to have!

Bernoulli Random Variable $$\boldsymbol{Y}$$

For an experiment that results in a success or a failure, let the random variable equal 1 if there is a success and 0 if there is a failure. Therefore,

$$Y=\begin{cases} 1 & \text{success}\\ 0 & \text{failure}\end{cases}$$

and let $$p$$ be the probability of a success.

The Bernoulli random variable is a special case of the Binomial random variable, where the number of trials is equal to one.

Suppose we have, say, $$n$$ independent trials of this same experiment. Then we would have $$n$$ values of $$Y$$, namely $$Y_1, Y_2, \ldots, Y_n$$.

If we define $$X$$ to be the sum of those values, we get...

$$X=\sum_{i=1}^n Y_i$$

$$X$$ is then a Binomial random variable with parameters $$n$$ and $$p$$.

You are probably wondering what this has to do with the sampling distribution of the sample proportion. Well, suppose we have a random sample of size $$n$$ from a population and are interested in a particular “success”. Let the probability of success be $$p$$. We can label the successes as 1 and the failures as 0. The sample proportion, $$\hat{p}$$ would be the sum of all the successes divided by the number in our sample. Therefore,

$$\hat{p}=\dfrac{\sum_{i=1}^n Y_i}{n}=\dfrac{X}{n}$$

In other words, $$\hat{p}$$ could be thought of as a mean! If this is the case, we can apply the Central Limit Theorem for large samples!

Therefore, for large samples, the shape of the sampling distribution for $$\hat{p}$$ will be approximately normal. What about the mean and the standard deviation?

Mean and Standard Deviation [Standard Error] of the Sample Proportion, $$\hat{p}$$

Given $$X$$ is binomial...

• The mean of $$\hat{p}$$ is just $$p$$, since the mean of $$X$$ is $$\mu=np$$ and $$\hat{p}=\dfrac{X}{n}$$.
• The standard deviation [standard error] of $$\hat{p}$$ is $$\sqrt{\dfrac{p(1-p)}{n}}$$, since the standard deviation of $$X$$ is $$\sqrt{np(1-p)}$$.
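
One way to see where both results come from is to note that dividing a random variable by the constant $$n$$ divides its mean and its standard deviation by $$n$$. Written out, under the same binomial assumption:

$$E(\hat{p})=\dfrac{E(X)}{n}=\dfrac{np}{n}=p \qquad \text{and} \qquad SD(\hat{p})=\dfrac{SD(X)}{n}=\dfrac{\sqrt{np(1-p)}}{n}=\sqrt{\dfrac{p(1-p)}{n}}$$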

# 4.2.2 - Sampling Distribution of the Sample Proportion

The distribution of the sample proportion approximates a normal distribution when the following two conditions are met.

The cutoff used in these conditions has changed over the years. The examples in the remaining lessons use the first set of conditions, with a cutoff of 5; however, you may come across other books or software that use 10 or 15 for this value.

| Source | Condition 1 | Condition 2 |
|---|---|---|
| Book (Minitab) | $$np \geq 5$$ | $$n(1-p) \geq 5$$ |
| 1990-2000s | $$np \geq 10$$ | $$n(1-p) \geq 10$$ |
| Current | $$np \geq 15$$ | $$n(1-p) \geq 15$$ |

## Sampling Distribution of the Sample Proportion

If the two conditions listed above are satisfied (using whichever cutoff applies), the sampling distribution of the sample proportion is...

• approximately normal
• with mean, $$\mu=p$$
• standard deviation [standard error], $$\sigma=\sqrt{\dfrac{p(1-p)}{n}}$$

Why is this important? This is similar to the notes in the section on the CLT. If the sampling distribution of $$\hat{p}$$ is approximately normal, we can convert a sample proportion to a z-score using the following formula:

$$z=\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$$

We can apply this theory to find probabilities involving sample proportions.
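
As an illustration of such a probability, the sketch below converts a sample proportion to a z-score and looks up the upper-tail area. The numbers ($$p=0.40$$, $$n=100$$, $$\hat{p}=0.45$$) and the function name are hypothetical, chosen only for the example, and the conditions use the cutoff of 5 adopted in this lesson.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_proportion_at_least(p_hat, p, n):
    """Approximate P(sample proportion >= p_hat) with the normal approximation.
    Returns the z-score and the upper-tail probability."""
    # Check the two conditions before trusting the approximation.
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("need n*p >= 5 and n*(1 - p) >= 5")
    standard_error = sqrt(p * (1 - p) / n)
    z = (p_hat - p) / standard_error
    return z, 1 - NormalDist().cdf(z)

# Hypothetical values for illustration only.
z, prob = prob_sample_proportion_at_least(p_hat=0.45, p=0.40, n=100)
print(round(z, 2), round(prob, 3))
```

Here the z-score is about 1.02, so a sample proportion of 0.45 or more would be about one standard error above the true proportion.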

Now we have a basic understanding of the relationship between samples and populations. The properties of the sampling distribution tell Ellie how the mean from her sample of runners relates to the distribution of means from all possible samples, but this does not directly answer her question about the average number of miles all runners run. To do this, she needs to use another related technique called a confidence interval. Calculating a confidence interval will allow Ellie to estimate an interval that is likely to contain the true average number of miles run per week, based on her sample information. Let’s take a closer look at confidence intervals.

# 4.3 - Introduction to Inferences

What Ellie is trying to do is called statistical inference: making probability statements about the population of interest based on sample data.

### Types of Statistical Inference

There are two types of statistical inferences: Estimation and Statistical Tests.

Estimation

Use information from the sample to estimate (or predict) the parameter of interest.

For instance, Ellie uses the results from her sample of runners to estimate (or predict) the true average number of miles run per week by marathon runners.

Statistical Tests

Use information from the sample to determine whether a certain statement about the parameter of interest is true. Statistical tests are also referred to as hypothesis tests.

For instance, Ellie might want to test a claim that marathon runners run 50 miles per week. She wants to determine whether that statement is supported by her sample data.

# 4.4 - Estimation and Confidence Intervals

Two common estimation methods are point and interval estimates.

Point Estimates
An estimate for a parameter that is one numerical value. An example of a point estimate is the sample mean or the sample proportion.
Interval Estimates
An estimate for a parameter that is an interval of values rather than a single value.

This is a new concept that is the focus of this lesson. Such intervals are built around point estimates which is why understanding point estimates is important to understanding interval estimates.

In this course, the interval estimates we find are referred to as confidence intervals.

Confidence Interval
Interval of values computed from sample data that is likely to cover the true parameter of interest.

There are many estimators for population parameters. For example, if we want to know the "center" of a distribution, why use the mean? Could we use the median? How about using the middle value, i.e. (max+min)/2? We choose particular estimators for various reasons with information based on their sampling distributions. Here are some properties of "good" estimators.

# 4.4.1 - Properties of 'Good' Estimators

In determining what makes a good estimator, there are two key features:

1. The center of the sampling distribution for the estimate is the same as that of the population. When this property is true, the estimate is said to be unbiased. The most often-used measure of the center is the mean.
2. The estimate has the smallest standard error when compared to other estimators. For example, in the normal distribution, the mean and median are essentially the same. However, the standard error of the median is about 1.25 times that of the standard error of the mean. We know the standard error of the mean is $$\frac{\sigma}{\sqrt{n}}$$. Therefore in a normal distribution, the SE(median) is about 1.25 times $$\frac{\sigma}{\sqrt{n}}$$. This is why the mean is a better estimator than the median when the data is normal (or approximately normal).
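
The 1.25 figure can be checked with a rough simulation. The sketch below is only an illustration under stated assumptions: standard normal data, samples of size 101, 4000 repetitions, and a fixed seed. The exact ratio it prints will vary with those choices.

```python
import random
from statistics import median, pstdev

rng = random.Random(2)   # fixed seed so the run is repeatable
n, reps = 101, 4000      # assumed sample size and number of repetitions

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(0, 1) for _ in range(n)]  # normal population
    means.append(sum(sample) / n)
    medians.append(median(sample))

# Spread of each estimator across the simulated samples.
se_mean = pstdev(means)
se_median = pstdev(medians)
ratio = se_median / se_mean
print(round(se_mean, 4), round(se_median, 4), round(ratio, 2))
```

Both estimators center on the same value, but the medians are more spread out, which is the sense in which the mean is the better estimator for normal data.
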
Note!

We should stop here and explain why we use the estimated standard error and not the standard error itself when constructing a confidence interval. Remember, we are using the known values from our sample to estimate the unknown population values. Therefore we cannot use the actual population values! This is actually easier to see by presenting the formulas. If we used the following as the standard error, we would not have the value of $$p$$ (because this is the population parameter):

$$\sqrt{\dfrac{p(1-p)}{n}}$$

Instead, we have to use the estimated standard error by using $$\hat{p}$$. In this case, the estimated standard error is...

$$\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$$

When estimating the population mean, the population standard deviation, $$\sigma$$, may also be unknown. When it is unknown, we can estimate it with the sample standard deviation, $$s$$. Then the estimated standard error of the sample mean is...

$$\dfrac{s}{\sqrt{n}}$$
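
Both estimated standard errors are simple to compute from sample information alone. In the sketch below, the function names and the sample values (44 successes out of 100, and eight weekly mileages) are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from statistics import stdev

def estimated_se_proportion(successes, n):
    """Estimated standard error of a sample proportion: sqrt(p_hat*(1-p_hat)/n)."""
    p_hat = successes / n
    return sqrt(p_hat * (1 - p_hat) / n)

def estimated_se_mean(sample):
    """Estimated standard error of a sample mean: s / sqrt(n).
    statistics.stdev is the sample standard deviation (n - 1 divisor)."""
    return stdev(sample) / sqrt(len(sample))

# Hypothetical sample results for illustration only.
se_p = estimated_se_proportion(44, 100)
se_m = estimated_se_mean([22, 30, 18, 25, 35, 27, 20, 31])
print(round(se_p, 4), round(se_m, 4))
```

Neither function needs $$p$$ or $$\sigma$$, which is the point of using the estimated standard error.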

# 4.4.2 - General Format of a Confidence Interval

In putting the two properties above together, the center of our interval should be the point estimate for the parameter of interest. With the estimated standard error of the point estimate, we can include a measure of confidence to our estimate by forming a margin of error.

You may have seen this whenever you have heard or read a sample survey result (e.g., a survey of the President's current approval rating, or of citizens' attitudes toward some new policy). In such surveys, you may hear that "44% of those surveyed approved of the President's reaction" (this is the sample proportion), and that "the survey had a 3.5% margin of error, or ± 3.5%." This latter number is the margin of error.

With the point estimate and the margin of error, we have an interval within which the group conducting the survey is confident the parameter value falls (i.e., the proportion of U.S. citizens who approve of the President's reaction). In this example, that interval would be from 40.5% to 47.5%.

This example provides the general construction of a confidence interval:

General form of a confidence interval
$$sample\ statistic \pm margin\ of\ error$$

The margin of error will consist of two pieces. One is the standard error of the sample statistic. The other is some multiplier, $$M$$, of this standard error, based on how confident we want to be in our estimate. This multiplier will come from the same distribution as the sampling distribution of the point estimate; for example, as we will see with the sample proportion this multiplier will come from the standard normal distribution. The general form of the margin of error is shown below.

General form of the margin of error
$$\text{Margin of error}=M\times \hat{SE}(\text{estimate})$$

*the multiplier, $$M$$, depends on our level of confidence

# 4.4.3 - Interpretation of a Confidence Interval

The interpretation of a confidence interval has the basic template: "We are 'some level of percent confident' that the 'population parameter of interest' is from 'lower bound' to 'upper bound'." After Ellie calculates a 95% confidence interval, she could say she is 95% confident that the true population average number of miles run per week by marathon runners is between the lower and upper values of the confidence interval. The phrases in single quotes are replaced with the specific language of the problem. We will discuss more about the interpretation of a confidence interval after we provide a few more examples.

Some might say, "Why not just be 100% confident?", but that does not make practical sense. For instance, what value comes from me saying I am 100% confident that the approval rating for the President is from 0% to 100%? That is the only interval that one can be truly certain will capture the actual proportion. Similarly, if you were to ask your professor what they think your score will be on an exam and they reply, "zero to one hundred", what would you think of that answer?

However, one does want to be as confident as reasonably possible. Most confidence levels range from 90% to 99%, with 95% being the most widely used. In fact, when you read a report that includes a margin of error, you can usually assume this has a 95% confidence level attached to it unless otherwise stated.

### Moving forward...

We're going to begin exploring confidence intervals for one population proportion. The important issue of determining the required sample size to estimate a population proportion will also be discussed in detail in this lesson.

# 4.5 - Inference for the Population Proportion

Earlier in the lesson, we talked about two types of estimation: point and interval. Let's now apply them to estimate a population proportion from sample data.

Point Estimate for the Population Proportion

The point estimate of the population proportion, $$p$$, is:

$$\hat{p}=\dfrac{\text{number of successes in the sample}}{n}$$

## Confidence Interval for the Population Proportion

Recall that:

If $$np$$ and $$n(1-p)$$ are both at least five, then $$\hat{p}$$ is approximately normal with mean $$p$$ and standard error $$\sqrt{\frac{p(1-p)}{n}}$$.

Under these conditions, the sampling distribution of the sample proportion, $$\hat{p}$$, is approximately Normal. The multiplier used in the confidence interval will come from the Standard Normal distribution.

# 4.5.1 - Construct and Interpret the CI

To construct a confidence interval we're going to use the following 3 steps:

1. Step 1: Check Conditions

Check all conditions before using the sampling distribution of the sample proportion.

We previously used $$np$$ and $$n(1-p)$$. But $$p$$ is not known. Therefore, for the confidence interval, we will use:

• $$n\hat{p} \geq 5$$ and
• $$n(1-\hat{p}) \geq 5$$

What can one do if the conditions are NOT satisfied?

For a confidence interval for a proportion, there is a technique called exact methods. These methods can be used if the software offers them. They are more complicated and are based on the relationship between the binomial and another distribution we will learn about later, the F-distribution. The Z-method is much simpler and fairly easy to compute. In fact, if you ever come across a published random survey (e.g. a Gallup poll) you can use the methods in this lesson to construct a reliable confidence interval for a proportion rather quickly.

2. Step 2: Construct the General Form

The general form of the confidence interval is '$$\text{point estimate }\pm M\times \hat{SE}(\text{estimate})$$.' The point estimate is the sample proportion, $$\hat{p}$$, and the estimated standard error is $$\hat{SE}(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$. If the conditions are satisfied, then the sampling distribution is approximately normal. Therefore, the multiplier comes from the normal distribution. This interval is also known as the one-sample z-interval for $$p$$, or the Normal Approximation confidence interval for $$p$$.

$$\boldsymbol{\left(1-\alpha \right) 100\%}$$ confidence interval for the population proportion, $$\boldsymbol{p}$$

$$\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$$

where $$z_{\alpha/2}$$ represents a z-value with $$\alpha/2$$ area to the right of it.

General notes about the confidence interval...
• The $$\pm$$ in the formula above means "plus or minus". It is a shorthand way of writing $$\left(\hat{p}-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p}+z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$.
• It is centered at the point estimate, $$\hat{p}$$.
• The width of the interval is determined by the margin of error.
• You must determine the multiplier.

3. Step 3: Interpret the Confidence Interval

Applying the template from earlier in the lesson, we can say we are $$(1-\alpha)100\%$$ confident that the population proportion is between $$\hat{p}-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$ and $$\hat{p}+z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$. The examples will go into more detail regarding the interpretation of the confidence interval.

What terms in the margin of error would change the width of the confidence interval? Do the changes make it narrower or wider?

## Construct a CI using Minitab

To construct a 1-proportion confidence interval...

1. In Minitab choose Stat > Basic Statistics > 1 Proportion.
2. From the drop down box select the Summarized data option button. (If you have the raw data you would use the default drop down of One or more samples, each in a column.)
3. Enter the number of successes in the Number of Events text box, and the sample size in the Number of Trials text box.
4. Choose the Options button. The default confidence level is 95. If you desire another confidence level, edit it appropriately.
5. To use the z-interval method, choose Normal Approximation from the Method text box. The exact interval is always appropriate and is the default. When $$n \hat{p} \ge 5$$ and $$n(1-\hat{p}) \ge 5$$, one can also use the z-interval to approximate the answer. The exact interval and the z-interval should be very similar when the conditions are satisfied.
6. Choose OK and OK again.

# 4.5.2 - Derivation of the Confidence Interval

To calculate the confidence interval, we need to know how to find the z-multiplier. So where does this $$z_{\alpha/2}$$ come from?

The confidence interval can be derived from the following fact:

\begin{align} P\left(\left|\frac{\hat{p}-p}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}}\right|\le z_{\alpha/2}\right)=1-\alpha \\ P\left(-z_{\alpha/2}\le \dfrac{\hat{p}-p}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}}\le z_{\alpha/2}\right)=1-\alpha \\ P\left(\hat{p}-z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\le p \le \hat{p}+z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right)=1-\alpha  \end{align}

The figure shows the general confidence interval on the normal curve.

How to find the multiplier using the Standard Normal Distribution

$$z_a$$ is the z-value having a tail area of $$a$$ to its right. With some calculation, one can use the Standard Normal Cumulative Probability Table to find the value.

## Commonly Used Alpha Levels

The table lists frequently used alpha levels and their $$z_{\alpha/2}$$ multipliers.

Confidence level and corresponding multiplier

| Confidence Level | $$\boldsymbol{\alpha}$$ | $$\boldsymbol{z_{\alpha/2}}$$ | Multiplier |
|---|---|---|---|
| 90% | .10 | $$z_{0.05}$$ | 1.645 |
| 95% | .05 | $$z_{0.025}$$ | 1.960 |
| 98% | .02 | $$z_{0.01}$$ | 2.326 |
| 99% | .01 | $$z_{0.005}$$ | 2.576 |

The value of the multiplier increases as the confidence level increases. This leads to wider intervals for higher confidence levels. We are more confident of catching the population value when we use a wider interval.
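As an alternative to reading the Standard Normal table, the multipliers can be computed directly. The sketch below uses Python's standard-library `statistics.NormalDist`. The helper name `z_multiplier` is our own and not part of any library.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """z_{alpha/2}: the z-value with alpha/2 area to its right."""
    alpha = 1 - confidence
    # Area alpha/2 to the right means area 1 - alpha/2 to the left
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: multiplier = {z_multiplier(level):.3f}")
```

The printed values agree with the table to three decimal places, and they grow as the confidence level grows.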

# 4.5.3 - Interpreting the CI

In the graph below, we show 10 replications (for each replication, we sample 30 students and ask them whether they are Democrats) and compute an 80% confidence interval each time. We are lucky in this set of 10 replications and get exactly 8 out of 10 intervals that contain the parameter. Because the number of replications is small (only 10), it is quite possible to get 9 out of 10 or 7 out of 10 intervals that contain the true parameter. On the other hand, if we repeat the process a large number of times, say 10,000, the percentage of intervals that contain the true proportion will be very close to 80%.

If we repeatedly draw random samples of size n from the population where the proportion of success in the population is $$p$$ and calculate the confidence interval each time, we would expect that $$100(1 - \alpha)\%$$ of the intervals would contain the true parameter, $$p$$.
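This repeated-sampling interpretation can be checked with a small simulation. The sketch below draws many samples of size 30, builds an 80% z-interval from each, and counts how often the interval contains the true proportion. The true proportion of 0.5, the seed, and the function name `coverage_rate` are assumptions made for illustration, since the lesson does not state the true value.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage_rate(p_true, n, confidence, replications, seed=0):
    """Fraction of simulated z-intervals that contain p_true."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(replications):
        # One replication: sample n individuals, count the "successes"
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = successes / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p_true <= p_hat + margin:
            hits += 1
    return hits / replications


rate = coverage_rate(p_true=0.5, n=30, confidence=0.80, replications=10_000)
print(f"Share of intervals containing p: {rate:.3f}")
```

With 10,000 replications the printed share should land close to 0.80, whereas rerunning with `replications=10` gives much more variable results.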

# 4.5.4 - Sample Size Computation

## Sample Size Computation for the Population Proportion Confidence Interval

An important part of obtaining desired results is to get a large enough sample size. We can use what we know about the margin of error and the desired level of confidence to determine an appropriate sample size.

Recall that the margin of error, E, is half of the width of the confidence interval. Therefore for a one-sample proportion,

$$E=z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$$

Precision
The wider the interval, the poorer the precision. Note that the higher the confidence level, the wider the width (or equivalently, half-width) of the interval and thus the poorer the precision.

Since the confidence level reflects the success rate of the method we use to get the confidence interval, we would like a narrow interval while keeping the confidence level reasonably high.

For most newspaper and magazine polls, it is understood that the margin of error is calculated for a 95% confidence interval (if not stated otherwise). A 3% margin of error is also a popular choice. For instance, you might see a television poll state that the "approval rating of the president is 72%; the margin of error of the poll is plus or minus 3%."

If we want the margin of error smaller (i.e., narrower intervals), we can increase the sample size. Or, if you calculate a 90% confidence interval instead of a 95% confidence interval, the margin of error will also be smaller. However, when one reports it, remember to state that the confidence interval is only 90% because otherwise, people will assume 95% confidence.

If the desired margin of error E is specified and the desired confidence level is specified, the required sample size to meet the requirements can be calculated by two methods:

Educated Guess

$$n=\dfrac{z^2_{\alpha/2}\hat{p}_g(1-\hat{p}_g)}{E^2}$$

Where $$\hat{p}_g$$ is an educated guess for the parameter $$p$$.

*The educated guess method is used if it is relatively inexpensive to sample more elements when needed.

Conservative Method

$$n=\dfrac{z^2_{\alpha/2}(\frac{1}{2})^2}{E^2}$$

This formula can be obtained from the educated guess formula using the fact that:

For $$0 \le p \le 1, p (1 - p)$$ achieves its largest value at $$p=\frac{1}{2}$$.

*The conservative method is used if the start-up cost of sampling is expensive and thus it is not economical to sample more elements later.

The sample size obtained from using the educated guess is usually smaller than the one obtained using the conservative method. This smaller sample size means there is some risk that the resulting confidence interval may be wider than desired. Using the sample size by the conservative method has no such risk.
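Both formulas can be sketched with one small function. The conservative method is simply the educated guess formula with the guess set to 0.5. The function name `sample_size_for_proportion` is hypothetical. The 3% margin and the 72% guess are borrowed from the poll illustration above purely as example inputs.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`.

    p_guess = 0.5 gives the conservative method; any other value
    gives the educated guess method.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p_guess * (1 - p_guess) / margin ** 2
    return ceil(n)  # always round up to a whole subject


conservative = sample_size_for_proportion(margin=0.03)
educated = sample_size_for_proportion(margin=0.03, p_guess=0.72)
print("Conservative method:", conservative)
print("Educated guess (p = 0.72):", educated)
```

The educated guess result is never larger than the conservative one, because $$p(1-p)$$ is largest at $$p=\frac{1}{2}$$.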

### Cautions About Sample Size Calculations

1. Why do we need to round up?

Because we are estimating the smallest sample size needed to produce the desired error. Since we cannot sample a portion of a subject (e.g. we cannot take 0.66 of a subject) we need to round up to guarantee a large enough sample.

2. Remember that this is the minimum sample size needed for our study.

If the response rate is less than 100% and we contact only the calculated number of people, we will end up with a smaller sample than desired. To counter this, we can adjust the calculated sample size by dividing by the anticipated response rate. For instance, using the above example, if we expected about 40% of those contacted to actually participate in our survey (i.e., a 40% response rate), then we would need to contact 7745/0.4 = 19,362.5, which rounds up to 19,363 people. In other words, our actual sample size would need to be 19,363 given the 40% response rate.
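The response-rate adjustment is a one-line calculation. The sketch below reproduces the numbers in the paragraph above. The function name `contacts_needed` is our own.

```python
from math import ceil


def contacts_needed(required_n, response_rate):
    """Number of people to contact so that about required_n respond."""
    return ceil(required_n / response_rate)  # round up, as before


print(contacts_needed(7745, 0.40))  # 7745 / 0.4 = 19362.5, rounded up
```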

# 4.6 - Inference for the Population Mean

In this section, we discuss how to find confidence intervals for the population mean. The idea and interpretation of the confidence interval will be similar to that of the population proportion only applied to the population mean, $$\mu$$.

We start with the case where the population standard deviation, $$\sigma$$, is known. We continue to the more realistic case where $$\sigma$$ is not known. For the latter case, we need to recall the $$t$$-distribution. We end this section by presenting how to determine a sample size for a desired margin of error and confidence.

Point Estimates for a Population Mean

The point estimate of the population mean, $$\mu$$ is:

$$\bar{x}=$$ sample mean

To know how accurately the sample mean estimates the population mean, we need a probability statement. For that, we need the sampling distribution of $$\bar{x}$$. From this distribution, we can get a confidence interval. Such an interval provides a range of values within which the parameter value is believed to fall. An interval is more likely to be "correct" than a point estimate.

# 4.6.1 - Construct and Interpret the CI

## Constructing a Confidence Interval for the Population Mean

To construct a confidence interval for a population mean, we're going to apply the same three steps as with the population proportion, but first, let's look at the two possible cases.

## Case 1: $$\sigma$$ is known

In the previous lesson, we learned that if the population is normal with mean $$\mu$$ and standard deviation, $$\sigma$$, then the distribution of the sample mean will be Normal with mean $$\mu$$ and standard error $$\frac{\sigma}{\sqrt{n}}$$.

Following the same idea used to develop the confidence interval for $$p$$, the $$(1-\alpha)$$100% confidence interval for the population mean $$\mu$$ starts from the probability statement...

$$P\left(\left|\dfrac{\bar{x}-\mu}{\dfrac{\sigma}{\sqrt{n}}}\right|\le z_{\alpha/2}\right)=1-\alpha$$

A little bit of algebra will lead you to...

$$P\left(\bar{x}-z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\le \mu\le \bar{x}+z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right)=1-\alpha$$

In other words, the $$(1-\alpha)$$100% confidence interval for $$\mu$$ is:

$$\bar{x}\pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$$

Notice that for this case, the only condition we need is for the population distribution to be normal.

Note!

The case where $$\sigma$$ is known is unrealistic. We explain it here briefly because it reinforces what we have previously learned. We do not present examples in this case.

## Case 2: $$\sigma$$ is unknown

When the population is normal or when the sample size is large then,

$$Z=\dfrac{\bar{x}-\mu}{\dfrac{\sigma}{\sqrt{n}}}$$

where Z has a standard Normal distribution.

Usually, we don't know $$\sigma$$, so what can we do?

Recall that if X comes from a normal distribution with mean, $$\mu$$, and variance, $$\sigma^2$$, or if $$n\ge 30$$, then the sampling distribution will be approximately normal with mean $$\mu$$ and standard error, $$SE(\bar{X})=\frac{\sigma}{\sqrt{n}}$$.

One way to estimate $$\sigma$$ is by $$s$$, the standard deviation of the sample, and replace $$\sigma$$ by $$s$$ in the above Z-equation. However, this new quotient no longer has a Z-distribution. Instead it has a t-distribution. We call the following a ‘studentized’ version of $$\bar{X}$$:

$$t=\dfrac{\bar{X}-\mu}{\dfrac{s}{\sqrt{n}}}$$

## Constructing the Confidence Interval

1. Step 1: Check the Conditions

One of the following conditions needs to be satisfied:

1. If the sample comes from a Normal distribution, then the sample mean will also be normal. In this case, $$\dfrac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$$ will follow a $$t$$-distribution with $$n-1$$ degrees of freedom.

2. If the sample does not come from a normal distribution but the sample size is large ($$n\ge 30$$), we can apply the Central Limit Theorem and state that $$\bar{X}$$ is approximately normal. Therefore, $$\dfrac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$$ will approximately follow a $$t$$-distribution with $$n-1$$ degrees of freedom.

2. Step 2: Construct the General Form

$$(1-\alpha)$$100% Confidence Interval for the Population Mean, $$\mu$$

$$\bar{x}\pm t_{\alpha/2}\dfrac{s}{\sqrt{n}}$$

where the t-distribution has $$df = n - 1$$. This interval is also known as the one-sample t-interval for the population mean.

3. Step 3: Interpret the Confidence Interval

We are $$(1-\alpha)100\%$$ confident that the population mean, $$\mu$$, is between $$\bar{x}-t_{\alpha/2}\frac{s}{\sqrt{n}}$$ and $$\bar{x}+t_{\alpha/2}\frac{s}{\sqrt{n}}$$.
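The one-sample t-interval can be sketched as follows. The standard library has no t-distribution, so this sketch assumes the multiplier $$t_{\alpha/2}$$ is looked up in a t-table and passed in. Software such as `scipy.stats.t.ppf` could supply it instead. The weekly mileage data and the function name `mean_t_interval` are hypothetical. With only 10 observations, the interval also assumes the population is approximately normal.

```python
from math import sqrt
from statistics import mean, stdev


def mean_t_interval(data, t_multiplier):
    """Return (lower, upper) for x_bar +/- t * s / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)   # point estimate of mu
    s = stdev(data)      # sample standard deviation (divides by n - 1)
    margin = t_multiplier * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical miles run per week by 10 sampled runners
miles = [22, 35, 28, 40, 31, 25, 38, 30, 27, 34]

# Table value t_{0.025} with df = n - 1 = 9, for 95% confidence
T_025_DF9 = 2.262

lower, upper = mean_t_interval(miles, T_025_DF9)
print(f"95% CI for mu: ({lower:.2f}, {upper:.2f})")
```

For these made-up data we would say we are 95% confident that the population mean number of miles run per week is between the two printed values.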

What do we do if the above conditions are not satisfied and we cannot use the t-interval?

1. If you do not know whether the data come from a normally distributed population and the sample size is small (i.e., $$n<30$$), you can use the Normal Probability Plot to check whether the data come from a normal distribution.

2. You may want to consider what are known as nonparametric statistical methods, such as the one-sample Wilcoxon procedure. Lesson 11 introduces nonparametric statistical methods.

## Construct a CI using Minitab

Find the CI for a population mean in Minitab:

1. In Minitab choose Stat > Basic Statistics > 1-Sample t.
2. From the drop down box select the Summarized data option button. (If you have the raw data you would use the default drop down of One or more samples, each in a column.)
3. Enter the sample size, sample mean, and sample standard deviation in their respective text boxes.
4. Click the Options button. The default confidence level is 95. If you desire another confidence level, edit it appropriately.
5. Click OK and OK again.

# 4.6.2 - The t-distribution

In 1908, William Sealy Gosset from Guinness Breweries discovered the t-distribution. His pen-name was Student and thus it is called the "Student's t-distribution."

The t-distribution is different for different sample sizes, n. Thus, tables as detailed as the standard normal table are not provided in the usual statistics books. The graph below shows the t-distribution for degrees of freedom of 10 (blue) and 30 (red dashed).

## Properties of the t-distribution

1. t is symmetric about 0
2. t-distribution is more variable than the Standard Normal distribution
3. t-distributions are different for different degrees of freedom (d.f.).
4. The larger $$n$$ gets (or as $$n$$ goes to infinity), the closer the $$t$$-distribution is to the standard normal ($$z$$) distribution.
5. The meaning of $$t_\alpha$$ is the $$t$$-value having the area "$$\alpha$$" to the right of it.
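Several of these properties can be seen numerically by evaluating the density of the t-distribution. The sketch below codes the standard t density formula by hand, since the standard library does not provide it, and compares its height in the tail with the standard normal density. The function name `t_pdf` and the evaluation point $$x = 3$$ are our own choices.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    # Work on the log scale so large df does not overflow the gamma function
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))


z_density = NormalDist().pdf(3)
for df in (10, 30, 1000):
    print(f"df = {df:>4}: t density at 3 = {t_pdf(3, df):.5f}")
print(f"standard normal density at 3 = {z_density:.5f}")
```

The tail density is largest for 10 degrees of freedom (more variability than the standard normal) and shrinks toward the normal value as the degrees of freedom grow.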

# 4.6.3 - Checking Normality

## Using Normal Probability Plot to Check Normality

If the sample size is less than 30, one needs to use a Normal Probability Plot to check whether the assumption that the data come from a normal distribution is valid.

Normal Probability Plot
The Normal Probability Plot is a graph that allows us to assess whether or not the data come from a normal distribution.
Note!

This plot should be used as a guide for us to assess if the assumption that the data come from a normal distribution is valid or not. It should not be used to “test” an assumption.
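To make the idea concrete, the sketch below computes the points that such a plot displays: each sorted observation is paired with the standard normal quantile expected at its rank. If the data come from a normal distribution, the points fall roughly along a straight line. The plotting positions $$(i-0.375)/(n+0.25)$$ are one common convention and are an assumption here, since software packages differ. The function name `normal_plot_points` and the data are hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(data):
    """Pair each sorted observation with its expected normal quantile."""
    n = len(data)
    std_normal = NormalDist()
    points = []
    for i, value in enumerate(sorted(data), start=1):
        position = (i - 0.375) / (n + 0.25)  # assumed plotting position
        points.append((std_normal.inv_cdf(position), value))
    return points


# Hypothetical small sample (n < 30)
sample = [22, 35, 28, 40, 31, 25, 38, 30, 27, 34]
for quantile, value in normal_plot_points(sample):
    print(f"{quantile:6.2f}  {value}")
```

Plotting the second column against the first gives the Normal Probability Plot, which is then judged by eye as a guide rather than as a formal test.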

# 4.6.4 - Sample Size Computation

## Sample Size Computation for the Population Mean Confidence Interval

Recall that a $$(1-\alpha)$$100% confidence interval for $$\mu$$ is $$\bar{x}\pm t_{\alpha/2}\dfrac{s}{\sqrt{n}}$$ where the multiplier $$t$$ has a t-distribution with $$df = n - 1$$. Thus, the margin of error, E, is equal to:

$$E=t_{\alpha/2}\dfrac{s}{\sqrt{n}}$$

To determine the sample size, one first decides on the confidence level and the half-width of the interval one wants. Then we can find the sample size that yields an interval with that confidence level and with a half-width not more than the specified one. The crude method to find the sample size is:

$$n=\left(\dfrac{z_{\alpha/2}\sigma}{E}\right)^2$$

Then round up to the next whole integer. The method is crude because it uses a $$z$$ multiplier and a value for $$\sigma$$ (known or guessed) in place of the $$t$$ multiplier and $$s$$.
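A minimal sketch of the crude method is shown below. The inputs (a standard deviation of 400, a margin of error of 120, and 90% confidence) are the numbers used in the iterative illustration later in this section, with 400 treated here as a guess for $$\sigma$$. The function name `crude_sample_size_for_mean` is our own.

```python
from math import ceil
from statistics import NormalDist


def crude_sample_size_for_mean(sigma, margin, confidence):
    """n = (z_{alpha/2} * sigma / E)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)


print(crude_sample_size_for_mean(sigma=400, margin=120, confidence=0.90))
```

Because the z multiplier is smaller than the corresponding t multiplier, this crude answer tends to be slightly smaller than the one from the iterative method.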

## The Iterative Method

A more accurate method to estimate the sample size: iteratively evaluate the formula since the t value also depends on n.

$$n=\left(\dfrac{t_{\alpha/2}s}{E}\right)^2$$

Use the example above for illustration. Start with an initial guess for $$n$$, plug it into the formula, and iteratively solve for $$n$$.

If the initial guess for $$n$$ is 20, then the degrees of freedom are 19 and $$t_{0.05} = 1.729$$, so

$$n=\left(\dfrac{t_{\alpha/2}s}{E}\right)^2=\left(\dfrac{1.729(400)}{120}\right)^2=33.21$$

Rounding up gives $$n = 34$$.

For $$n = 34$$, the degrees of freedom are 33 and $$t_{0.05} = 1.692$$, so

$$n=\left(\dfrac{t_{\alpha/2}s}{E}\right)^2=\left(\dfrac{1.692(400)}{120}\right)^2=31.81$$

If we use $$n = 32$$, the result is the same. Thus, the more accurate answer to the example is to sample 32 students.
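The iteration can be automated. The sketch below is one possible implementation under stated assumptions, not a standard routine. Because the standard library has no t-distribution, it obtains $$t_{\alpha/2}$$ numerically by integrating the t density with Simpson's rule and searching by bisection. The function names, the starting guess, and the search range of 0 to 50 are our own choices. Computed quantiles can differ from rounded table values in the third decimal place.

```python
from math import ceil, exp, lgamma, log, pi


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))


def t_upper_quantile(area, df, steps=2000):
    """t-value with `area` (< 0.5) to its right; assumes it lies in [0, 50]."""
    def right_tail(t):
        # 0.5 minus the Simpson's-rule integral of the density over [0, t]
        h = t / steps
        total = t_pdf(0, df) + t_pdf(t, df)
        for k in range(1, steps):
            total += (4 if k % 2 else 2) * t_pdf(k * h, df)
        return 0.5 - total * h / 3

    lo, hi = 0.0, 50.0
    for _ in range(50):  # bisection: halve the bracket each pass
        mid = (lo + hi) / 2
        if right_tail(mid) > area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def iterative_sample_size(s, margin, confidence, n_start=20, max_iter=50):
    """Repeat n = ceil((t * s / E)^2) with df = n - 1 until n stops changing."""
    alpha = 1 - confidence
    n = n_start
    for _ in range(max_iter):
        t = t_upper_quantile(alpha / 2, n - 1)
        n_next = ceil((t * s / margin) ** 2)
        if n_next == n:
            break
        n = n_next
    return n


print(iterative_sample_size(s=400, margin=120, confidence=0.90))
```

Starting from a guess of 20, the iteration moves to 34 and then settles at 32, in agreement with the hand calculation.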

# 4.7 - Summary

So now, Ellie can work from her sample statistics (the mean and standard deviation from her sample) to an estimate of the true population parameter using a confidence interval. In doing so, she leveraged the information about the size of the sample, utilized good sampling techniques, and employed some useful descriptive statistics about her sample. Ellie can now make a confident guess as to how much running the marathoners put in to train for the big event!
