Lesson 8: The Diversity of Samples

Lesson Overview

The Law of Large Numbers (section 7.2) tells us that the results of a chance process repeated independently over and over again are unpredictable in the short run but follow a regular and predictable pattern in the long run. Averages and proportions from a random sample tend to home in on their corresponding population values. More trials bring more stability; fewer trials bring more volatility.

The result is not surprising. If we think about the Expected Value of a measurement as the long-run average value, then it makes sense that we should get closer to that long-run average as we get closer to the idealized "long run".

Example 8.1

According to the most recent census, 38.4% of the adults over 25 years old in Centre County Pennsylvania have a Bachelor's Degree. A survey of Centre County residents is about to be taken to examine their opinions about the "student loan crisis" and the importance of a college education. The researcher's plans call for a random sample to be taken in the hopes that the sample will be representative of the Centre County population. For example, the researcher hopes the proportion of people over 25 years old in her sample with a Bachelor's degree will come close to the known proportion of 0.384 in the population. Since the survey method is unbiased, the sample proportion is expected to come out around the population value give or take a random error.

sample proportion = population proportion + random error

This lesson studies the random error that tells us how far off the sample is from the population. A random sample makes all possible samples of size n equally likely.

  • n = 1: If n = 1 in Example 8.1, we would just pick one person randomly from the population and the possible proportions we might get are just 0 (if the person we pick doesn't have a Bachelor's degree) or 1 (if the person we pick has a degree). A histogram of the proportions we might get from all possible samples of size n = 1 would look like Figure 8.1A.
  • n = 10: Similarly, if we use a sample of size 10, then the possible sample proportions we might get are 0 (if none of the ten have a degree) or 0.1 (if one of the ten has a degree) or 0.2 (if 2 of the ten have degrees), etc... A histogram of all of the different sample proportions we might get with n = 10 is given in Figure 8.1B.
  • n = 100: Finally, a histogram of the sample proportions we might get with all possible samples of size 100 is given in Figure 8.1C.

Figure 8.1. Probability histograms of all possible proportions when n = 1, 10, or 100 in Example 8.1.

With n = 1, the sample proportions go all the way from 0 to 1. With n = 10, the preponderance of sample proportions is between 0.2 and 0.6. With n = 100, the bulk of the proportions are between 0.33 and 0.43. As the sample size grows there is less variation in the proportions we might get - that's the Law of Large Numbers. But there is another pattern emerging in Figure 8.1. As the sample size grows, the histogram of the possible sample proportions takes a familiar shape - with large samples it looks like the normal curve!
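The ranges quoted above can be checked with a short calculation. The Python sketch below assumes each sampled person independently has a degree with probability 0.384 (a binomial model, which should be a close stand-in for a random sample from a large population). The function names are illustrative and not part of the lesson.

```python
from math import comb

P_DEGREE = 0.384  # population proportion from Example 8.1


def count_distribution(n, p):
    """Map each possible number of degree holders k (0..n) to its probability.

    Assumes a binomial model: n independent picks, each a degree holder
    with probability p.
    """
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def chance_count_between(n, p, k_low, k_high):
    """Chance that the number of degree holders lies in [k_low, k_high]."""
    dist = count_distribution(n, p)
    return sum(prob for k, prob in dist.items() if k_low <= k <= k_high)


# Ranges quoted in the text, expressed as counts out of n:
# 0 to 1 (n = 1), 0.2 to 0.6 (n = 10), 0.33 to 0.43 (n = 100).
for n, k_low, k_high in [(1, 0, 1), (10, 2, 6), (100, 33, 43)]:
    chance = chance_count_between(n, P_DEGREE, k_low, k_high)
    print(f"n = {n:3d}: proportion between {k_low / n:.2f} and "
          f"{k_high / n:.2f} with chance {chance:.3f}")
```

Working with counts rather than proportions keeps the comparisons exact; dividing a count by n gives the corresponding sample proportion.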

Objectives

After successfully completing this lesson, you should be able to:

  • Identify and avoid the gambler's fallacy.
  • Understand the concept of a sampling distribution and how it relates the population parameter to the sample statistic.
  • Apply the normal approximation to sample proportions and means.
  • Examine real-world problems and decide when the normal approximation does and does not apply.

8.1 - There is no "Law of Small Numbers"

Cautions with small samples

Watch out for over-interpreting the results from small samples. Our brains seem hard-wired to over-interpret the information in small samples and to believe they hold more knowledge about how the real world works than they possibly can. Believable anecdotes are often interpreted as if they give substantial data about a topic; coincidences are given mystical or causal interpretations; and patterns seen in small samples are thought to be the forerunner of a shift in paradigm.

The Gambler's Fallacy

The Law of Large Numbers does not work by compensating for a run of bad luck. Yet expecting that an independent sequence of trials will be self-correcting is a common misconception. You go to a casino and see red come up five times in a row at the roulette table - should you now bet on black because "it is due"? No: the roulette wheel has no memory, and each spin provides the same chances for red or black. Succumbing to the misconception that the wheel will somehow compensate for an unusual run of events is known as the "Gambler's Fallacy." It is the flip side of another misconception, exemplified by seeing red come up five times in a row at the roulette table and deciding to bet on red because "it is hot."
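A simulation can make the wheel's lack of memory concrete. The sketch below assumes an American-style wheel with 18 red, 18 black, and 2 green pockets; the function name, the seed, and the number of spins are arbitrary choices made for illustration. It looks only at spins that immediately follow five or more reds in a row and reports how often those spins come up red.

```python
import random


def red_rate_after_streak(spins, streak=5, seed=0):
    """Share of reds among spins that follow at least `streak` reds in a row.

    Assumes an American wheel: 18 red, 18 black, and 2 green pockets,
    with every spin independent of the ones before it.
    """
    rng = random.Random(seed)
    pockets = ["red"] * 18 + ["black"] * 18 + ["green"] * 2
    run = 0        # length of the current run of consecutive reds
    followed = 0   # spins that came right after a long-enough red run
    reds = 0       # how many of those spins were red
    for _ in range(spins):
        color = rng.choice(pockets)
        if run >= streak:
            followed += 1
            reds += color == "red"
        run = run + 1 if color == "red" else 0
    return reds / followed


rate = red_rate_after_streak(500_000)
print(f"Red after five reds in a row: {rate:.3f}")
print(f"Red on any single spin:       {18 / 38:.3f}")
```

If the wheel were "due" for black, the first number would sit well below the second. Under independence the two should agree up to simulation noise.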

Outside of the casino, the Gambler's Fallacy is often seen in the habits of small investors in the stock market. If the market is down five days in a row - should you now buy into the market because it is due for an upswing? If the market is up five days in a row - should you now sell your stocks because it is due for a downturn? The chance that the stock market goes up on a randomly chosen day is a little better than 50% (perhaps closer to 51%). However, whether the market will rise on a given day is essentially independent of what happened the day before - it is still 51% regardless of whether the market was up or down the day before. Like the roulette wheel, seeing stocks rise 5 days in a row does not seem to change the chances on what will happen on that sixth day.

Beware of the Anecdote

Anecdotes will often seem compelling, since telling the story of an individual case can make an issue come to life. But moving from a few anecdotes to the general principle is tricky on two fronts - the sample is both small and not chosen randomly from the population.

Coincidences Happen

Coincidence?

A friend has just purchased a new car - a silver-colored Subaru. A few days later she is stopped at a stoplight and the car in front of her is also a silver-colored Subaru. She looks at that car's license plate and sees that the letters are her initials and the numbers are her ATM PIN. If you hear about an event like that sometime this year that seems like it has a miraculously small probability of occurring, don't be too shocked - such things are bound to happen someday to someone somewhere. In evaluating the rarity of surprising events, keep in mind all of the times you had the opportunity to hear about an unusual event and did not. You probably have many friends or acquaintances who might have told you about a weird coincidence, and 364 days in the year when you did not hear about one. Silver Subarus may be a low percentage of the cars on the road, but having just purchased one, you become much more likely to notice them in traffic. The letters and numbers on a license plate being her initials and ATM PIN may seem extremely rare, but what if the letters had been a three-letter word that she had just used in conversation (about 6% of all three-letter combinations form words), and what if the numbers were her birthday month and day, or the last four digits of her social security number, or the hour and minute on the clock, and so on? Any of these would have been seen as remarkable, and the totality of remarkable things may be unremarkable.

Our mind likes to find patterns but the chance processes that generate the pattern are usually independent of the past pattern of results. A stretch of six tosses of a coin that land HHHHHH is just as likely as the sequence TTHHTH. The coin does not care that we see a pattern in the first sequence but not in the second one.
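The claim about the two coin sequences can be verified by direct counting. The sketch below assumes a fair coin with independent tosses and uses exact fractions; the function names are made up for this illustration.

```python
from fractions import Fraction
from itertools import product


def sequence_probability(sequence, p_heads=Fraction(1, 2)):
    """Probability of one exact run of independent tosses, e.g. "TTHHTH"."""
    prob = Fraction(1)
    for toss in sequence:
        prob *= p_heads if toss == "H" else 1 - p_heads
    return prob


def chance_of_heads_after(prefix):
    """Share of equally likely sequences starting with `prefix` whose next toss is H.

    Assumes a fair coin, so every sequence of a given length is equally likely.
    """
    length = len(prefix) + 1
    candidates = ("".join(s) for s in product("HT", repeat=length))
    matches = [s for s in candidates if s.startswith(prefix)]
    heads_next = [s for s in matches if s.endswith("H")]
    return Fraction(len(heads_next), len(matches))


print(sequence_probability("HHHHHH"))   # 1/64
print(sequence_probability("TTHHTH"))   # 1/64
print(chance_of_heads_after("HHHHH"))   # 1/2
```

Both sequences have probability 1/64, and after five heads in a row the sixth toss is still heads half the time.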

There is no "Law of Small Numbers" as small samples are subject to unpredictable volatility and hence not necessarily representative of the population.


8.2 - The Normal Approximation

While the behavior of small samples is unpredictable, the behavior of large samples is not. Statistical summaries like proportions and means arising from random samples tend to home in on the true population value. Further, as we saw in Figure 8.1, a frequency curve (probability histogram) showing the distribution of a sample proportion or mean across all possible samples approximately follows the normal curve when the sample is large.

Here's the rule:

Normal Approximation:

The sampling distribution of averages or proportions from a large number of independent trials approximately follows the normal curve. The expectation of a sample proportion or average is the corresponding population value.

The standard deviation of a sample mean is:

\(\dfrac{\text{population standard deviation}}{\sqrt{n}} = \dfrac{\sigma}{\sqrt{n}}\)

The standard deviation of a sample proportion is:

\(\sqrt{\dfrac{\text{population proportion}(1-\text{population proportion})}{n}} =\sqrt{\dfrac{p(1-p)}{n}}\)
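As a minimal sketch, the two formulas translate into Python as follows. The function names are illustrative, and the sample sizes in the loop are hypothetical values chosen to show how the standard deviation shrinks as n grows.

```python
from math import sqrt


def sd_of_sample_mean(sigma, n):
    """Standard deviation of a sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


def sd_of_sample_proportion(p, n):
    """Standard deviation of a sample proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


# Quadrupling the sample size halves the standard deviation.
p = 0.384  # population proportion from Example 8.1
for n in (100, 400, 1600):  # hypothetical sample sizes
    print(n, round(sd_of_sample_proportion(p, n), 4))
```

Because n sits under a square root, cutting the random error in half takes four times as many observations, not twice as many.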

Example 8.2

The amount of gas purchased by customers at a gas station averages 12 gallons with a standard deviation of 5 gallons. The average amount purchased by the next 100 customers is then around \(\mu\) = 12 gallons with a standard deviation of about \(\frac{\text{population standard deviation}}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5\) (interpretation: 5 gallons gives the variation from customer to customer; 0.5 gallons gives the variation in the average purchase of 100 customers).

The chance that the next 100 customers average between 11 and 13 gallons is then about 95% (using the empirical rule since between 11 and 13 gallons corresponds to being within two standard deviations of what's expected). This calculation makes the reasonable assumption that the amount of gas purchased by one customer is independent of the amount purchased by another customer.
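The empirical-rule answer can be compared with a direct normal-curve calculation. This sketch uses `NormalDist` from Python's standard `statistics` module and takes the normal approximation for granted; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12, 5, 100        # gallons, gallons, customers (Example 8.2)
sd_of_mean = sigma / sqrt(n)     # 0.5 gallons

# Approximate sampling distribution of the average purchase.
sampling_dist = NormalDist(mu=mu, sigma=sd_of_mean)
chance = sampling_dist.cdf(13) - sampling_dist.cdf(11)
print(f"Chance the average is between 11 and 13 gallons: {chance:.4f}")
```

The result is about 0.954, in line with the 95% given by the empirical rule.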

Example 8.3

In the 2012 Presidential Election, President Obama received 52% of the vote in Pennsylvania. On the day of the election the outcome in Pennsylvania was important to the national election outcome so before all of the votes were counted, several pollsters conducted "exit polls" to gauge how the vote turned out and the reasons why people voted as they did. Suppose you conduct an exit poll of 1000 Pennsylvania voters leaving their precinct voting stations or after they had voted by mail. What is the probability that a majority of your sample did not vote for President Obama?

Solution

We know the true population proportion is p = 0.52. So the question is asking about the chances that the sample proportion would come out less than 0.5.

The standard deviation of the sample proportion would be:

\(\sqrt{\dfrac{p(1-p)}{n}} = \sqrt{\dfrac{0.52(0.48)}{1000}} \approx 0.0158\)

Since the population situation is roughly symmetric (0.52 versus 0.48) the distribution of the sample proportion would follow the normal curve. Thus to compute the probability, we calculate the standard score...

\(z = \dfrac{(0.5 - 0.52)}{0.0158} \approx -1.27\)

Finally, using Table 8.1, we find the desired probability is about 10%. Here is a visual representation of what this solution space looks like:

Figure 8.2. Finding the possible sample proportions of voters that did not vote for Obama using the normal distribution.
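The steps of this solution can be reproduced without a table. The sketch below assumes the normal approximation holds and uses `NormalDist` from Python's standard `statistics` module in place of Table 8.1; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.52, 1000                      # true proportion and exit-poll size

sd_of_proportion = sqrt(p * (1 - p) / n)
z = (0.5 - p) / sd_of_proportion       # standard score of a 50% sample result

# Chance that the sample proportion for Obama comes out below 0.5.
chance = NormalDist().cdf(z)
print(f"SD = {sd_of_proportion:.4f}, z = {z:.2f}, chance = {chance:.3f}")
```

The computed chance is close to the 10% read from the table.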


8.3 - The Quality of the Normal Approximation

In section 8.2 we saw that the Normal Approximation works for calculating probabilities about sample proportions or sample means that arise from a large number of independent trials. How large is large? A few rules of thumb are useful:

For sample proportions: the normal approximation works pretty well as long as \((n)(p)\) and \((n)(1-p)\) are both at least five.

For sample means: the normal approximation works pretty well if \(n > 15\) as long as the distribution isn’t too strongly skewed and there are no outliers. It will typically work well for \( n > 25\) or \(30\) if the population distribution is moderately skewed or at least 40 if it is more strongly skewed.

 The key point: The normal approximation works better when the original population distribution is closer to a normal curve.
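The rule of thumb for proportions is easy to turn into a quick check. The helper below is a hypothetical illustration, not a standard library function, and it covers only the sample-size condition. It says nothing about whether the trials are independent, which has to be judged separately.

```python
def counts_large_enough(n, p, minimum=5):
    """Rule of thumb for proportions: n*p and n*(1-p) are both at least `minimum`."""
    return n * p >= minimum and n * (1 - p) >= minimum


print(counts_large_enough(1000, 0.52))   # True: 520 and 480
print(counts_large_enough(10, 0.384))    # False: n*p is only 3.84
print(counts_large_enough(100, 0.75))    # True: 75 and 25
```

A result of `True` means the sample is large enough for the approximation to be reasonable, provided the other conditions also hold.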

Example 8.4

The median household income in Franklin County Ohio is about \$43,000 although the average household income is closer to \$54,000. The standard deviation of household incomes is about \$30,000. You pick a random sample of 50 households. What is the chance the average household income in your sample is over \$60,000?

Solution

The mean is well above the median, which suggests that the population distribution of household incomes is quite skewed to the right. However, even though the distribution is not bell-shaped, this is a random sample and the sample size of 50 households is large enough for us to use the normal approximation.

The sampling distribution of the sample mean is approximately normal with mean \$54,000 and...

standard deviation =\(\dfrac{\$30,000}{\sqrt{50}}\) = \$4,243.

The standard score of \$60,000 is...

\(z = \dfrac{(60,000 - 54,000)}{4243} \approx +1.41\)

From Table 8.1, the probability of being larger than +1.4 is about 1 - 0.92 = 0.08, or about 8%.

Figure 8.3. Finding average household incomes above \$60,000 using the normal distribution.
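For an upper-tail question like this one, the same calculation can be sketched in Python. It assumes the normal approximation is adequate at n = 50, as argued in the solution, and uses `NormalDist` from the standard `statistics` module; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean_income, sd_income, n = 54_000, 30_000, 50   # dollars, dollars, households

sd_of_mean = sd_income / sqrt(n)                 # about $4,243
z = (60_000 - mean_income) / sd_of_mean          # standard score of $60,000

# Upper tail: chance the sample average exceeds $60,000.
chance = 1 - NormalDist().cdf(z)
print(f"SD of mean = {sd_of_mean:,.0f}, z = {z:.2f}, chance = {chance:.3f}")
```

The upper-tail area is one minus the area to the left of z, which matches the 1 - 0.92 step in the table lookup.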

Example 8.5

You pick a random sample of 4 households. What is the chance the average household income in your sample is over \$60,000?

Solution

Because the distribution is so right-skewed (mean much bigger than median), the normal approximation may not be appropriate for such a small sample size.

Example 8.6

Seventy-five percent (75%) of the flights from Philadelphia International airport leave on time. Would it be appropriate to use the Normal Approximation to calculate the chance that more than 80 of the next 100 flights leave on time?

Solution

No; whether different flights leave on time will not be independent since the reasons for delayed flights often affect many flights (e.g. the weather is bad in Philadelphia or at a major connection).

Example 8.7

The GPAs of the students in a large statistics class average 3.0 with a standard deviation of 0.5 and roughly follow the normal curve. What’s the chance that 4 randomly selected students have an average GPA between 2.85 and 3.15?

Solution

Even though n is only 4, it is okay to use the normal approximation since the population distribution follows the normal distribution already. Here, the standard deviation of the sample mean is \(\frac{0.50}{\sqrt{4}} = 0.25\). Thus the standard scores are

\(z = \dfrac{(2.85-3) }{ 0.25} = -0.6\) and \(z = \dfrac{(3.15-3)}{ 0.25} = 0.6\)

From Table 8.1 the answer is 72.5% - 27.5% = 45%.

Figure 8.4. Finding the chance that 4 students have an average GPA between 2.85 and 3.15 using the normal distribution.

8.4 - Test Yourself!



8.5 - Have Fun With It!

cartoon about sampling, "Every homeowner's nightmare. Do you support the mayor's housing plan?"

J.B. Landers ©

