13.1 - Histograms

Example 13-1

The material on this page should look awfully familiar, as we briefly investigated histograms in the first lesson of the course. We review them again here.

The following numbers are the measured nose lengths (in millimeters) of 60 students:

38 50 38 40 35 32 45 50 40 32 40 47 70 55 51
43 40 45 45 55 37 50 45 45 55 50 45 35 52 32
45 50 40 40 50 41 41 40 40 46 45 40 43 45 42
45 45 48 45 45 35 45 45 40 45 40 40 45 35 52

Recall that although the numbers look discrete, they are technically continuous. The measuring tools, which consisted of a piece of string and a ruler, were the limiting factors in getting more refined measurements. In most cases, it appears as if nose lengths come in five-millimeter increments (35, 40, 45, 50, 55, ...), but that's, again, just a result of measurement error. To create a histogram of these continuous measurements, we will use the following guidelines.

To create a histogram of continuous data

First, you have to group the data into a set of classes, typically of equal length. There are many, many sets of rules for defining the classes. For our purposes, we'll just rely on our common sense — having too few classes is as bad as having too many.

  1. Determine the number, \(n\), in the sample.
  2. Define \(k\) class intervals \((c_0, c_1], (c_1, c_2], \ldots, (c_{k-1}, c_k]\) .
  3. Determine the frequency, \(f_i\), of each class \(i\).
  4. Calculate the relative frequency (proportion) of each class by dividing the class frequency by the total number in the sample — that is, \(\frac{f_i}{n}\).
  5. For a frequency histogram: draw a rectangle for each class with the class interval as the base and the height equal to the frequency of the class.
  6. For a relative frequency histogram: draw a rectangle for each class with the class interval as the base and the height equal to the relative frequency of the class.
  7. For a density histogram: draw a rectangle for each class with the class interval as the base and the height equal to \(h(x)=\dfrac{f_i}{n(c_i-c_{i-1})}\) for \(c_{i-1} < x \le c_i\).
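
The seven steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_table`, the sample, and the class boundaries are hypothetical and do not come from the lesson. Classes are treated as left-open, right-closed intervals \((c_{i-1}, c_i]\), matching step 2.

```python
def histogram_table(data, edges):
    """Return (frequencies, relative frequencies, density heights) for the
    classes (c_0, c_1], (c_1, c_2], ..., (c_{k-1}, c_k] defined by `edges`."""
    n = len(data)                                 # step 1: sample size
    classes = list(zip(edges[:-1], edges[1:]))    # step 2: class intervals
    # Step 3: frequency of each class (left-open, right-closed).
    # Values outside (c_0, c_k] fall in no class, so pick edges that cover the data.
    freqs = [sum(1 for x in data if lo < x <= hi) for lo, hi in classes]
    # Step 4: relative frequency f_i / n (also the height used in step 6).
    rel_freqs = [f / n for f in freqs]
    # Step 7: density height f_i / (n * (c_i - c_{i-1})).
    densities = [f / (n * (hi - lo)) for f, (lo, hi) in zip(freqs, classes)]
    return freqs, rel_freqs, densities


# Hypothetical sample and class boundaries, chosen only for illustration.
sample = [1, 2, 2, 3, 4, 5, 5, 6, 6, 6]
edges = [0, 2, 4, 6]

freqs, rel_freqs, densities = histogram_table(sample, edges)
print(freqs)      # [3, 2, 5]
print(rel_freqs)  # [0.3, 0.2, 0.5]
print(densities)  # [0.15, 0.1, 0.25]
```

The three returned lists are the rectangle heights for the frequency, relative frequency, and density histograms, respectively; only the vertical scale differs between them when all classes have the same width.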

Example 13-1 Continued

Here's what the work would look like for our nose length example if we used 5 mm classes centered at 30, 35, ..., 70:

 
Class Interval   Tally                       Frequency   Relative Frequency   Density Height
27.5-32.5        ||                          2           0.033                0.0066
32.5-37.5        |||||                       5           0.083                0.0166
37.5-42.5        ||||| ||||| ||||| ||        17          0.283                0.0566
42.5-47.5        ||||| ||||| ||||| ||||| |   21          0.350                0.0700
47.5-52.5        ||||| ||||| |               11          0.183                0.0366
52.5-57.5        |||                         3           0.050                0.0100
57.5-62.5                                    0           0                    0
62.5-67.5                                    0           0                    0
67.5-72.5        |                           1           0.017                0.0034
Total                                        60          0.999 (rounding)
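
As a rough check, the relative frequency and density columns can be recomputed from the class boundaries and frequencies listed in the table. The sketch below assumes nothing beyond those values; the variable names are illustrative. Because it divides unrounded values, the last printed digit of a density can differ from a table that appears to divide the already-rounded relative frequencies by 5.

```python
# Class boundaries and frequencies copied from the table above.
edges = [27.5 + 5 * i for i in range(10)]   # 27.5, 32.5, ..., 72.5
freqs = [2, 5, 17, 21, 11, 3, 0, 0, 1]
n = sum(freqs)                              # total number of students

rows = []
for lo, hi, f in zip(edges[:-1], edges[1:], freqs):
    rel_freq = f / n                        # f_i / n
    density = f / (n * (hi - lo))           # f_i / (n * (c_i - c_{i-1}))
    rows.append((lo, hi, f, rel_freq, density))
    print(f"{lo:4.1f}-{hi:4.1f}  {f:2d}  {rel_freq:.3f}  {density:.4f}")
```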

And, here is what the density histogram would look like:

[Figure: density histogram of the nose lengths. Horizontal axis: nose lengths (mm), 30 to 70. Vertical axis: density, 0.00 to 0.07.]

Note that a density histogram is just a modified relative frequency histogram. A density histogram is defined so that:

  • the area of each rectangle equals the relative frequency of the corresponding class, and
  • the area of the entire histogram equals 1.
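
Both properties follow from the height formula: a rectangle of width \(c_i-c_{i-1}\) and height \(\frac{f_i}{n(c_i-c_{i-1})}\) has area \(\frac{f_i}{n}\), and the relative frequencies sum to 1. A short numerical check, using the frequencies and the 5 mm class width from the nose length example (variable names are illustrative):

```python
import math

freqs = [2, 5, 17, 21, 11, 3, 0, 0, 1]   # class frequencies from the example
width = 5.0                               # each class is 5 mm wide
n = sum(freqs)

heights = [f / (n * width) for f in freqs]   # density heights
areas = [h * width for h in heights]         # area of each rectangle
rel_freqs = [f / n for f in freqs]

# Each rectangle's area equals the relative frequency of its class ...
areas_match = all(math.isclose(a, r, abs_tol=1e-12) for a, r in zip(areas, rel_freqs))
# ... and the areas add up to 1.
total_area = sum(areas)

print(areas_match)                       # True
print(math.isclose(total_area, 1.0))     # True
```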

Empirical Rule

We've previously learned that the sample mean can be thought of as the "center" of a set of data, while the sample standard deviation indicates "how spread out" the data are from the sample mean. Now, if a histogram is "mound-shaped" or "bell-shaped," then we can use the sample mean, sample standard deviation, and what is called the Empirical Rule to determine three intervals within which we would expect approximately 68%, 95%, and 99.7% of the data to fall.

The Empirical Rule tells us that if a histogram is at least approximately bell-shaped, then:

  1. Approximately 68% of the data are in the interval:

    \((\bar{x}-s,\bar{x}+s)\)

  2. Approximately 95% of the data are in the interval:

    \((\bar{x}-2s,\bar{x}+2s)\)

  3. Approximately 99.7% of the data are in the interval:

    \((\bar{x}-3s,\bar{x}+3s)\)
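
To see the rule in action, one can compute the three intervals for a sample and count how much of the data actually lands inside each. The sketch below is an illustration only: the function name `empirical_rule_check` is hypothetical, and the sample is simulated from a normal distribution (a bell-shaped case) rather than taken from the lesson.

```python
import random
from statistics import mean, stdev


def empirical_rule_check(data):
    """For k = 1, 2, 3, return (lower, upper, observed proportion) for the
    interval (x_bar - k*s, x_bar + k*s)."""
    x_bar = mean(data)
    s = stdev(data)                      # sample standard deviation (n - 1 divisor)
    results = {}
    for k in (1, 2, 3):
        lower, upper = x_bar - k * s, x_bar + k * s
        inside = sum(1 for x in data if lower < x < upper)
        results[k] = (lower, upper, inside / len(data))
    return results


# Simulated, hypothetical bell-shaped sample (mean 50, standard deviation 10).
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(10_000)]

check = empirical_rule_check(sample)
for k, (lower, upper, proportion) in check.items():
    print(f"within {k} s: ({lower:.1f}, {upper:.1f}) holds {proportion:.1%} of the data")
```

For a sample like this one, the observed proportions should land close to 68%, 95%, and 99.7%; for skewed or multi-peaked data they can be noticeably different, which is why the rule is stated only for approximately bell-shaped histograms.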

Example 13-2

In fiscal year 1991, the federal government's per capita income from federal income taxes, averaged over the 50 states, was \$1252.44, with a standard deviation of \$393.75. Assuming the data are approximately bell-shaped, use the Empirical Rule to determine three intervals within which we would expect approximately 68%, 95%, and 99.7% of the data to fall.

Solution

The Empirical Rule tells us that we can expect approximately 68% of the per capita taxes to fall between:

\(\bar{x}-s=\$ 1252.44-\$ 393.75=\$ 858.69\) and \(\bar{x}+s=\$ 1252.44+\$ 393.75=\$ 1646.19\)

The Empirical Rule also tells us that we can expect approximately 95% of the per capita taxes to fall between:

\(\bar{x}-2s=\$ 1252.44-2(\$ 393.75)=\$ 464.94\) and \(\bar{x}+2s=\$ 1252.44+2(\$ 393.75)=\$ 2039.94\)

The Empirical Rule also tells us that we can expect approximately 99.7% (virtually all!) of the per capita taxes to fall between:

\(\bar{x}-3s=\$ 1252.44-3(\$ 393.75)=\$ 71.19\) and \(\bar{x}+3s=\$ 1252.44+3(\$ 393.75)=\$ 2433.69\)
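
The arithmetic in this solution is easy to reproduce. The following sketch uses only the mean and standard deviation given in the example; the function name `empirical_rule_intervals` is a hypothetical helper, not part of the lesson.

```python
def empirical_rule_intervals(x_bar, s):
    """Return {coverage label: (lower, upper)} for x_bar ± s, ± 2s, and ± 3s,
    rounded to two decimal places (cents)."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {
        coverage[k]: (round(x_bar - k * s, 2), round(x_bar + k * s, 2))
        for k in (1, 2, 3)
    }


# Mean and standard deviation of the per capita taxes from Example 13-2.
intervals = empirical_rule_intervals(1252.44, 393.75)
for label, (lower, upper) in intervals.items():
    print(f"about {label}: ${lower:.2f} to ${upper:.2f}")
# about 68%: $858.69 to $1646.19
# about 95%: $464.94 to $2039.94
# about 99.7%: $71.19 to $2433.69
```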