4: Bell-Shaped Curves and Statistical Pictures

Lesson Overview

In Lesson 4 we continue our discussion of describing data through numerical summaries and also think about statistical pictures.  An overview of some key questions addressed is given in the table below.

| Question Addressed | Statistical Summary |
|---|---|
| Where are values located along the number line? | Median; Mean |
| How variable are the numbers? | IQR; Standard Deviation |
| What is the relative standing of an individual value compared with other numbers on a list? | Percentiles; Standardized Scores |

| Question Addressed | Statistical Picture |
|---|---|
| How are all of the numbers on a list distributed over the number line? | Dot plots; Boxplots; Histograms (the normal or bell-shaped curve is a special case) |
| How is a categorical variable distributed? How do categorical variables compare? | Bar graphs; Comparative bar graphs |
| How do the distributions of two measurement variables compare? | Comparative boxplots; Comparative dotplots |
| How do percentages or averages change over time? | Line graph or time series plot |
| How are two measurement variables associated? | Scatterplot |

Objectives

After successfully completing this lesson, you should be able to:

  • Interpret standard scores as a measure of relative standing on a list.
  • Apply the relationship between standard scores and the percentiles of a normal distribution.
  • Interpret graphs used with categorical data.
  • Interpret scatter plots.
  • Interpret time series plots and recognize trends and seasonality.
  • Critique whether graphical presentations provide a fair summary of the data.

4.1 - Standard Scores

Standardized Scores (also called "standard scores" or "z-scores")

In Lesson 3 we learned that the standard deviation provides a measure of variability about the mean. Generally, most observations are within one standard deviation of the mean, and observations more than three standard deviations away from the mean are very rare. Thus, the standard deviation provides a natural yardstick by which to gauge where an observation stands relative to others. If you are five standard deviations above the mean, then you know you are at the top of the list; if you are one standard deviation below the mean, you know you are on the low side but not too far down. The standardized score is a measure of relative standing on a list: it is simply the number of standard deviations a value lies above (+) or below (-) the mean. To compute the standardized score of a value, you take

Standardized Score (Z-score) formula
\(z = \text{standardized score} = \dfrac{\text{(value - mean)}}{\text{standard deviation}}\)

These numbers are called "standardized" because the list of standardized scores itself always has a mean of 0 and a standard deviation of 1.0. That's because subtracting the mean from every value makes the new mean equal zero and dividing every value by the standard deviation makes the new standard deviation equal to 1.
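
The claim that a standardized list always has mean 0 and standard deviation 1 can be checked numerically. The following is a minimal sketch in Python; the five values are made up for illustration and are not from the lesson's data.

```python
from statistics import mean, pstdev

# Hypothetical list of values, chosen only for illustration.
values = [18.0, 20.5, 22.0, 23.5, 26.0]

m = mean(values)
# Population SD is used here; the sample SD (statistics.stdev) gives the
# same conclusion as long as the same formula is used in both places.
s = pstdev(values)

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(v - m) / s for v in values]

print("mean of z-scores:", round(mean(z_scores), 10))
print("SD of z-scores:  ", round(pstdev(z_scores), 10))
```

Up to floating-point rounding, the two printed numbers should be 0 and 1, whatever list of values you start with (provided the values are not all identical).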

Example 4.1

According to EPA data, the gas mileage for compact SUVs in the 2013 model year has a mean of approximately 22 mpg and a standard deviation of about 3 mpg. One SUV gets 25 mpg. Thus, its standardized score is z = (25 - 22)/3 = 1. It is one standard deviation above the mean. Another SUV gets 20.5 mpg. Thus, its standardized score is z = (20.5 - 22)/3 = -0.5. It is one-half of a standard deviation below the mean.

The standardized scores give you a way to compare relative standing of values on different lists where the distributions might have roughly similar shapes.

Example 4.2

According to EPA data, 4-cylinder 2013 model year cars have CO2 emissions that average 333 ppm (parts per million) with a standard deviation of 51 ppm, while 6-cylinder cars made that year average 431 ppm with a standard deviation of 44 ppm. Which vehicle has higher CO2 emissions relative to other cars with the same number of cylinders: the 4-cylinder Honda Civic that emits 284 ppm or the 6-cylinder Toyota Camry that emits 358 ppm?

The Honda Civic has a z-score of \(\dfrac{(284-333)}{51}=-0.96\), while the Toyota Camry has a z-score of \(\dfrac{(358-431)}{44}=-1.66\). Both cars emit less than the average for their group, but the Civic's z-score is closer to zero, so the Honda Civic has the higher CO2 emissions relative to cars with the same number of cylinders.
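
The comparison in Example 4.2 can be sketched in a few lines of Python. The helper name `z_score` is our own choice for illustration; the numbers are the ones quoted in the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Each car is compared with its own group (same number of cylinders).
civic = z_score(284, mean=333, sd=51)   # 4-cylinder group
camry = z_score(358, mean=431, sd=44)   # 6-cylinder group

print(f"Honda Civic  z = {civic:.2f}")
print(f"Toyota Camry z = {camry:.2f}")

# The larger z-score marks the higher emissions relative to its own group.
higher = "Honda Civic" if civic > camry else "Toyota Camry"
print("Higher relative emissions:", higher)
```

Comparing the raw values (284 versus 358) would answer a different question; the z-scores put both cars on the same "standard deviations from the group mean" scale.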

4.2 - The Normal Curve

Many measurement variables found in nature follow a predictable pattern. The predictable pattern of interest is a type of symmetry where much of the distribution of the data is clumped around the center and few observations are found on the extremes. Data that have this pattern are said to be bell-shaped or to have a normal distribution. It can be shown that variables that arise as the sum or average of a fixed number of individual smaller components of a similar nature will have approximately this shape. Thus, the distribution of the weights of cartons of large eggs at a grocery store will look like a normal curve because the weight of a carton arises from the sum of the weights of the dozen eggs inside. Many measures used by psychologists to gauge levels of characteristics like stress or anxiety or happiness are based on questionnaires that score your answers to lots of individual questions and then sum them up to get a final measure. The distributions of such measures within a homogeneous group of people will then approximately follow a normal curve.
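
The egg-carton idea can be explored with a small simulation. This is only a sketch under made-up assumptions: each egg's weight is drawn so that every value between 56 and 63 grams is equally likely (a flat shape, not a bell), and the carton weight is the sum of 12 eggs. None of these numbers come from real egg data.

```python
import random
from statistics import mean, median

random.seed(1)  # fixed seed so the run is repeatable

def carton_weight():
    # Assumption for illustration: egg weights are spread evenly
    # between 56 and 63 grams; a carton holds 12 eggs.
    return sum(random.uniform(56, 63) for _ in range(12))

cartons = [carton_weight() for _ in range(10_000)]

# Crude text histogram of carton weights using 5-gram bins.
counts = {}
for w in cartons:
    bin_start = int(w // 5) * 5
    counts[bin_start] = counts.get(bin_start, 0) + 1

for bin_start in sorted(counts):
    print(f"{bin_start:4d}-{bin_start + 5:<4d} {'#' * (counts[bin_start] // 50)}")

print("mean  :", round(mean(cartons), 1))
print("median:", round(median(cartons), 1))
```

Even though the individual egg weights are not bell-shaped in this setup, the printed histogram of carton weights should pile up in the middle and thin out symmetrically on both sides, with the mean and median close together.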

Example 4.3: Normal Curves

Consider the following three variables from data that was collected from a sample of n = 198 Stat 100 students:

  • Variable #1: Heights (inches)
  • Variable #2: Grade Point Average
  • Variable #3: Number of Tattoos

[Figure: histogram with Height (inches) on the horizontal axis and Frequency (count) on the vertical axis]

Figure 4.1. Histogram of Height (Mean = 66.3 inches & Median = 66 inches)

The Heights Variable is a great example of a histogram that looks approximately like a normal distribution as shown in Figure 4.1. Since a normal distribution is a type of symmetric distribution, you would expect the mean and median to be very close in value. With this example, the mean is 66.3 inches and the median is 66 inches.

[Figure: histogram with GPA on the horizontal axis and Frequency (count) on the vertical axis]

Figure 4.2. Histogram of GPA (Mean = 3.25 & Median = 3.3)

The GPA Variable that gives the Grade Point Averages of these 198 Stat 100 students is slightly skewed left and could only very roughly be said to follow a normal distribution as shown in Figure 4.2. Notice the upper tail where the data is clumped. This can be partially explained by the fact that GPAs at Penn State cannot exceed 4.0. However, the mean and median are still pretty close, and using the normal curve (to calculate percentiles for example) should give very rough approximations. It is likely that the GPA variable would look more like a normal curve if the data were restricted to a more homogeneous group with a similar number of credit hours taken.

[Figure: histogram with Number of tattoos on the horizontal axis and Frequency (count) on the vertical axis]

Figure 4.3. Number of Tattoos (Mean = 0.23 & Median = 0)

The Tattoo Variable is not normally distributed at all as shown in Figure 4.3. The major problem with this variable is that it is extremely skewed to the right since most people have no tattoos at all. Also, the graph has gaps because this variable is discrete with only a few values in the data set. Thus, the normal curve should not be used to make even rough approximations for data about the number of tattoos.

The Empirical Rule

The empirical rule is a guideline that can be applied when you know that the sample is approximately normally distributed. The empirical rule also helps one to understand what the standard deviation represents.

The empirical rule says that for any normal (bell-shaped) curve, approximately:

  • 68% of the values (data) fall within 1 standard deviation of the mean in either direction
  • 95% of the values (data) fall within 2 standard deviations of the mean in either direction
  • 99.7% of the values (data) fall within 3 standard deviations of the mean in either direction

[Figure: normal curve marked at the mean and at mean ± 1s, mean ± 2s, and mean ± 3s, with the central 68%, 95%, and 99.7% regions labeled]

Figure 4.4 The Empirical Rule

Example 4.4: Empirical Rule

Recall the heights variable used in Example 4.3. Since the histogram shows that these data are approximately normally distributed, the empirical rule can be applied. The mean and standard deviation (SD) for this sample are 66.3 inches and 4 inches, respectively. Below are the calculations for the sample of heights.

Mean ± 1(SD) = 66.3 ± 4 inches = (62.3 to 70.3 inches)
Mean ± 2(SD) = 66.3 ± 2(4) inches = 66.3 ± 8 inches = (58.3 to 74.3 inches)
Mean ± 3(SD) = 66.3 ± 3(4) inches = 66.3 ± 12 inches = (54.3 to 78.3 inches)

Because the sample of heights is approximately normally distributed, one can say that roughly

  • 68% of the heights lie between 62.3 and 70.3 inches
  • 95% of the heights lie between 58.3 and 74.3 inches
  • 99.7% of the heights lie between 54.3 and 78.3 inches

One would expect it to be very unusual for someone in this sample to be shorter than 54.3 inches or taller than 78.3 inches. Since 68% of the heights are within one standard deviation of the mean, the remaining 32% would fall outside of that. Further, since the distribution is symmetric we would have 16% (half of the 32%) falling below 62.3 inches and another 16% falling above 70.3 inches.
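
The intervals and percentages in Example 4.4 can be reproduced with Python's standard library. This sketch assumes the heights follow an exactly normal curve with the stated mean and SD, which is only an approximation for real data; `NormalDist` is part of the standard `statistics` module.

```python
from statistics import NormalDist

mean_height, sd_height = 66.3, 4.0   # values from Example 4.4
heights = NormalDist(mean_height, sd_height)  # idealized normal curve

shares = {}
for k in (1, 2, 3):
    low = mean_height - k * sd_height
    high = mean_height + k * sd_height
    # Share of the curve between low and high.
    shares[k] = heights.cdf(high) - heights.cdf(low)
    print(f"within {k} SD: {low:.1f} to {high:.1f} inches, {shares[k]:.1%} of values")

# Share of the curve more than one SD below the mean.
below = heights.cdf(mean_height - sd_height)
print(f"below {mean_height - sd_height:.1f} inches: {below:.1%}")
```

The exact normal-curve shares differ slightly from the rounded 68%, 95%, and 99.7% of the empirical rule, which is why the rule is stated as "approximately".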

Note!

An important feature of the normal curve is that percentiles are completely determined by the standardized scores. Table 8.1 in Chapter 8 of the textbook (page 175) shows the standard scores that align with various percentiles. As examples, examine the table to check that the 23rd percentile goes with a standard score of z = -0.74 and the 97th percentile goes with a standard score of z = 1.88.
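
If the textbook table is not at hand, the same lookup can be sketched with Python's standard library. This is a stand-in for the table, not the table itself, so values may differ from it in the last rounded digit; the helper name `z_for_percentile` is our own.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # normal curve with mean 0 and SD 1

def z_for_percentile(percentile):
    """Standard score below which the given percentage of a normal curve falls."""
    return standard_normal.inv_cdf(percentile / 100)

for p in (23, 50, 97):
    print(f"{p}th percentile -> z = {z_for_percentile(p):.2f}")
```

Because the lookup depends only on the standard score, the same function works for any normally distributed variable once its values are converted to z-scores.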

Example 4.5

A histogram of the highway gas mileage for the 171 compact SUVs sold in the United States and tested by the EPA in 2013 is shown in Figure 4.5. The mean mileage was 22.20 mpg with a standard deviation of 2.85 mpg. General Motors' 2013 Encore compact SUV got 28 mpg. What percentage of the compact SUVs got worse mileage than the Encore?

[Figure: histogram with Highway Mileage (mpg) on the horizontal axis]

Figure 4.5 Histogram of Highway Mileage for 2013 compact SUVs

To solve this, we first have to compute the standard score of the value of interest which is found by:

\(z=\dfrac{(28 - 22.2)}{2.85} \approx 2.04\) (this says that 28 mpg is 2.04 standard deviations above the mean).

Next, we look at Table 8.1 to find that this standard score corresponds to approximately the 98th percentile of a normal distribution. Thus, about 98% of the 2013 compact SUVs got worse mileage than the Encore (and only about 2% of them got better mileage).
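
The two steps of Example 4.5 (standardize, then convert to a percentile) can be sketched as follows. The calculation assumes the mileages follow a normal curve, which the histogram supports only approximately.

```python
from statistics import NormalDist

mean_mpg, sd_mpg = 22.20, 2.85   # values from Example 4.5
encore_mpg = 28

# Step 1: standard score of the Encore.
z = (encore_mpg - mean_mpg) / sd_mpg

# Step 2: share of a normal curve that lies below that standard score.
share_below = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"share of compact SUVs with worse mileage: about {share_below:.0%}")
```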


4.3 - Statistical Pictures

In this section, we examine a few important types of statistical pictures: bar graphs, time series plots, and scatterplots.

Before turning to these specific types of statistical pictures, it is important to note that regardless of the type of picture being used, there are some basic features that a good graph will possess:

  • The data should be clearly recognizable from the background
  • The picture should be clearly labeled, showing
    • the title and purpose or origin of the data,
    • what is being plotted on each axis, bar, or segment of the plot (i.e. the variables being presented)
    • the scale including starting points and units of the measurement
  • The picture should have as little extraneous material as possible (i.e. a high "information content")

Section 9.5 of the text (pages 190 to 194) provides illustrations of the difficulties that arise with poor statistical graphics that do not follow these basic guidelines and so create ambiguity and confusion in their interpretation.

Example 4.6: Life Satisfaction

The Gallup World Poll takes random samples of the adults in 132 different countries. In many of those countries, Gallup asks respondents to think about an overall evaluation of their lives and to specify how satisfied they are on a four-point scale (very dissatisfied, dissatisfied, satisfied, or very satisfied). Figure 4.6 shows a comparative bar graph giving the results from two surveys, one taking a random sample of adults in the United States and one using a random sample of adults in China. Bar graphs are often used to show the results for categorical variables and, as in Figure 4.6, can be used to compare categorical variables in two or more circumstances.

[Figure: comparative bar graph of Satisfaction with Lives; bars for China and the US show the Percent of Respondents answering Very Dissatisfied, Somewhat Dissatisfied, Somewhat Satisfied, and Very Satisfied]

Figure 4.6 Life Satisfaction Results from Two Countries (Gallup World Poll)

Example 4.7: Consumer Spending

Each day the Gallup Poll takes a random sample of about 500 American adults nationally and asks them about a variety of issues, including how much money they spent the day before, not counting the purchase of a home or car or normal household bills such as electricity or phone service. Figure 4.7 shows a line graph of the data presented in two ways: as a 3-day rolling average (the dark green line) and as a 14-day rolling average (the light green line). For example, each point on the dark green line represents the average amount spent by the roughly 1500 American adults who responded to the survey over the 3-day period leading up to the day of the survey. This type of line graph is called a time series plot because the points represent the variable being measured across time.

[Figure: line graph of Dollars per day over time, showing the 3-day rolling average and the 14-day rolling average]

Figure 4.7 Time Series Plot of Consumer Spending from February 2008 to February 2015 (Gallup Poll)

 

When looking at a time series plot like Figure 4.7 it is important to examine and interpret some key basic features:

  • Is there a long-term trend? For example, in the consumer spending data, we can see a generally upward long-term trend since the end of the recession in June 2009. One note of caution when looking at economic data that extend over decades: check whether they have been adjusted for inflation. An apparent upward trend may reflect nothing more than a change in the value of the dollar.
  • Are there seasonal components? While temperature data is dramatically affected by regular seasonal cycles, many other variables change in predictable patterns because of people's behavioral changes in certain months or seasons. For example, have a close look at Figure 4.7. You should be able to see a bump in consumer spending each year associated with the holidays in December. There are other cyclic effects in this data. If you look really closely at the 3-day averages, you can see that there is increased spending on weekends compared with weekdays.
  • What is the nature of the random fluctuations? We know that every measurement is subject to natural variability and that averages will be more reliable if they are based on larger sample sizes. Have a look at Figure 4.7 and see how the 3-day averages based on surveys of about 1500 people show random fluctuations much larger than the 14-day averages based on surveys of about 7000 people.
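
The effect of the averaging window on random fluctuations can be seen in a small simulation. This is a sketch on made-up numbers, not Gallup data: the daily values are simulated as a flat $90 level plus random noise, and the function name `rolling_average` is our own.

```python
import random
from statistics import mean, pstdev

random.seed(4)  # fixed seed so the run is repeatable

# Hypothetical daily average spending: a flat level of $90 plus random noise.
daily = [90 + random.gauss(0, 15) for _ in range(365)]

def rolling_average(series, window):
    """Average of each run of `window` consecutive values, ending on each day."""
    return [
        mean(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]

three_day = rolling_average(daily, 3)
fourteen_day = rolling_average(daily, 14)

# The SD of each rolling series measures how much it bounces around.
print("SD of 3-day rolling averages: ", round(pstdev(three_day), 2))
print("SD of 14-day rolling averages:", round(pstdev(fourteen_day), 2))
```

The 14-day series should show noticeably smaller fluctuations than the 3-day series, since each of its points averages over more days (and hence more respondents).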

Example 4.8: Blood Alcohol Content (BAC)

An experiment was carried out to see how Blood Alcohol Content (BAC), as measured by a breathalyzer, changes with the number of 12-ounce beers you drink (the experiment is discussed in the Electronic Encyclopedia of Statistics Examples and Exercises). In the experiment, 16 subjects each drew a number out of a hat to determine how many beers to drink. For example, if the number was a 3, then that subject drank 3 beers. A half-hour after the subject finished the last assigned beer, a police officer used a breathalyzer, like the ones used in the field, to measure the subject's BAC level. Figure 4.8 shows a scatterplot of the results. Each point represents a different subject. For example, one subject drank 6 beers and had a BAC of 0.10, which is over the legal limit for driving.

[Figure: scatterplot; axes are Number of beers consumed and Blood alcohol content (BAC)]

Figure 4.8 Number of 12-ounce Beers Consumed versus BAC for 16 Subjects

Scatterplots are used for displaying the relationship between two measurement variables. Examining Figure 4.8, we can see a clear trend: there is an obvious positive association between the number of beers and the BAC, meaning that increases in one variable are associated with increases in the other. The data here were based on a randomized experiment and the causal nature of this particular relationship is quite well established. Positive associations are reflected in a cloud of points in the scatterplot that goes "uphill" as you move from left to right. A negative association, like the one we would see if we plotted the weights of cars versus their gas mileage, shows a cloud of points going "downhill".


4.5 - Have Fun With It!

[Cartoon about bias and reliability: "The day I realized that I could cook bacon whenever I wanted!" XKCD.com ©]

Standard Score

lyrics ©2005-2006 Lawrence Mark Lesser (to the tune of "Row, Row, Row Your Boat")

Find a standard score,
Also known as "z":
It is the number of standard devs
You're above the mean!
