# 4: Estimating with Confidence

## Overview Section

#### Case-Study: Marathon Runners

Imagine the start of the Boston Marathon: the swell of runners, all dressed to begin the 26.2-mile trek. Let’s help Ellie estimate the average number of miles each runner runs per week. How will she know?

When we want to know something about a population, like the population of runners running the marathon, we are tasked with a monumental challenge: asking everyone. Ellie cannot practically ask every marathon runner, so instead, as most researchers do, she uses a sample of marathon runners, as we discussed in the Lesson 3 content. Assuming good sampling techniques are used, Ellie can ask each person in the sample how many miles per week they run. A simple question. Now, given that she has this information, how does she turn it into information about her population of all marathon runners?

The distinction between the sample information we have and the population we seek is very important to keep track of. We KNOW the information about the sample. Ellie can calculate the average number of miles run per week for her sample. She can also calculate the standard deviation of the sample and make the appropriate graphs from her sample data. But what does this tell her about the population of runners?

Directly, nothing. Ellie will need to infer information about the population of runners from her sample. But first we need to point out the relationship between samples and populations.

If Ellie were to take many, many equal-sized samples from her population (without replacement, meaning each runner could only be in one sample), eventually she would include every person in the population. While each sample would have its own mean and standard deviation, the mean of all the sample means would equal the population mean (remember, at the end of this hypothetical exercise, Ellie has the information on every runner, so the mean of all the runners is the population mean).
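The hypothetical exercise can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the population of 1,000 weekly mileages is made up (drawn from a normal distribution with mean 35 and standard deviation 10), the helper name `partition_means` is hypothetical, and every sample has the same size so that the mean of the sample means lines up exactly with the population mean.

```python
import random
import statistics


def partition_means(population, sample_size, seed=0):
    """Split the population into disjoint, equal-sized samples
    (sampling without replacement) and return each sample's mean."""
    if len(population) % sample_size != 0:
        raise ValueError("sample_size must divide the population size evenly")
    rng = random.Random(seed)
    shuffled = list(population)
    rng.shuffle(shuffled)  # random assignment of runners to samples
    return [
        statistics.fmean(shuffled[i:i + sample_size])
        for i in range(0, len(shuffled), sample_size)
    ]


# Hypothetical population: weekly mileage for 1,000 runners (made-up values).
rng = random.Random(42)
population = [rng.gauss(35, 10) for _ in range(1000)]

sample_means = partition_means(population, sample_size=25)
population_mean = statistics.fmean(population)
mean_of_means = statistics.fmean(sample_means)

print(f"number of samples:    {len(sample_means)}")
print(f"population mean:      {population_mean:.4f}")
print(f"mean of sample means: {mean_of_means:.4f}")
```

The individual sample means differ from one another, but once every runner has been placed in exactly one sample, their average matches the population mean up to floating-point rounding.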

This hypothetical exercise produces something referred to as the sampling distribution of the means. Remember, this is a hypothetical exercise. There is no reason a researcher would actually take many, many samples, eventually arriving at the total population, unless, of course, that researcher sets out to take a census of the entire population (in which case inferential statistics are not needed at all because the researcher already knows the population values!).

Ellie wants to know, on average, how far marathon runners run in any given week. She knows she can sample a portion of the larger population of runners to represent all runners in her area. She also knows that she needs to be careful in obtaining the sample to ensure it is randomly selected and represents the population that she wants. She conducts the observational study and records the number of miles each person runs per week. Next, she uses Minitab to calculate the average number of miles for her sample, as well as the standard deviation. But she isn’t quite sure how to use this sample information to answer her original question about the population of runners in her area. Many questions arise for Ellie, including...

• Did she sample the right people?
• How close are her sample mean and standard deviation to the actual population mean and standard deviation?
• What can she ‘safely’ conclude about the population based on this sample?

So this gets confusing, right? We are working with POPULATIONS, SAMPLES, and now SAMPLING DISTRIBUTIONS. We have already defined populations and samples. This lesson will take a deeper look at sampling distributions.

Sampling Distribution
The sampling distribution of a statistic is the probability distribution of that statistic's values across a large number of samples of size $$n$$ from a given population.
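One way to get a feel for this definition is to approximate a sampling distribution by simulation. The sketch below is illustrative only: the population of 10,000 weekly mileages is made up, the helper name `simulate_sample_means` is hypothetical, and the sample size of 25 and the 2,000 repetitions are arbitrary choices. It draws many random samples of size `n`, records each sample mean, and compares the spread of those means (the standard error) with the textbook value of the population standard deviation divided by the square root of `n`.

```python
import random
import statistics


def simulate_sample_means(population, n, num_samples, seed=0):
    """Draw num_samples random samples of size n and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, n))  # no repeats within a sample
        for _ in range(num_samples)
    ]


# Hypothetical population: weekly mileage for 10,000 runners (made-up values).
rng = random.Random(1)
population = [rng.gauss(35, 10) for _ in range(10_000)]

n = 25
sample_means = simulate_sample_means(population, n, num_samples=2000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

center = statistics.fmean(sample_means)   # center of the sampling distribution
se_sim = statistics.stdev(sample_means)   # its spread, i.e. the standard error
se_theory = pop_sd / n ** 0.5             # sigma / sqrt(n)

print(f"population mean:            {pop_mean:.3f}")
print(f"mean of the sample means:   {center:.3f}")
print(f"simulated standard error:   {se_sim:.3f}")
print(f"theoretical standard error: {se_theory:.3f}")
```

The sample means cluster around the population mean, and they vary much less than the individual runners do, which is why the standard error is smaller than the population standard deviation.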

## Objectives

Upon completion of this lesson, you should be able to:

• Identify the possibility of many samples within a sampling distribution.
• Equate the sum of all samples with the population in a sampling distribution.
• Identify the standard deviation of the sampling distribution as the standard error of the sample.
• Compute and interpret a confidence interval for means (quantitative data).
• Compute and interpret a confidence interval for proportions (categorical data).