3.3 - Numbers: Summarizing Measurement Data

We will discuss two important ways to summarize measurement data. These include:

  1. measures of center (where the data are located along the number line)
  2. measures of spread (how much variation there is about the center)

To represent the center of a list of measurement data we focus on:

  1. The mean (the numerical average)
  2. The median (the 50th percentile)

Example 3.5: Measures of Center

Consider the following sample for 5 Selected PSU Students (n = 5)

Monthly Movie Rentals
1 5 1 4 2

Suppose you want to find a number to represent the center of the data. The first choice would be the mean. The mean is also known as the average. The mean is found by obtaining a sum of all the observations and dividing by the sample size (n). In this instance:

mean = (1 + 5 + 1 + 4 + 2) / 5 = 13 / 5 = 2.6 movie rentals/month

Another possibility is the median. The median is the middle value of a sample when the observations are sorted from smallest to largest.

Monthly Movie Rentals (sorted)
1 1 2 4 5

In this example, the middle observation is 2 so the median = 2.0 movie rentals/month.
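The two calculations in Example 3.5 can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration and are not part of the original example.

```python
from statistics import mean, median

# Monthly movie rentals for the 5 students in Example 3.5
rentals = [1, 5, 1, 4, 2]

sample_mean = mean(rentals)      # sum of the observations divided by n
sample_median = median(rentals)  # middle value after sorting

print(sorted(rentals))  # [1, 1, 2, 4, 5]
print(sample_mean)      # 2.6
print(sample_median)    # 2
```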

As you examine how the mean and median were calculated, you will notice that the two methods are very different. The mean is an example of a sensitive measure: because every observation is used in the calculation, it is sensitive to unusually large or small values that lie far from most of the other values. In contrast, the median is an example of a resistant measure because only the middle observation is used to determine its value.
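One way to see the difference between a sensitive and a resistant measure is to change a single observation and compare the results. In the sketch below, the second list is hypothetical: it is the Example 3.5 data with the 5 replaced by 50, a value invented here purely for illustration.

```python
from statistics import mean, median

rentals = [1, 5, 1, 4, 2]        # data from Example 3.5
with_outlier = [1, 50, 1, 4, 2]  # hypothetical: the 5 replaced by 50

# How far each measure of center moves when one value becomes extreme
mean_shift = mean(with_outlier) - mean(rentals)
median_shift = median(with_outlier) - median(rentals)

print(mean(rentals), mean(with_outlier))      # 2.6 11.6
print(median(rentals), median(with_outlier))  # 2 2
```

The mean moves by 9 rentals per month, while the median does not move at all, because the middle value of the sorted list is still 2.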

Example 3.6: Which Measure of Center to Use

Consider the following sample of Annual Salaries for 20 Selected Employees at a Local Company

Salaries (Sorted)
30000 32000 32000 33000 33000 34000 34000 38000 38000 38000 42000
43000 45000 45000 48000 50000 55000 55000 65000 110000    

The mean for this sample is \$45,000 while the median is \$40,000. (Note: because the sample size is an even number, the median is the average of the middle two numbers, which in this case are \$38,000 and \$42,000.) Both measures can always be computed, but when there is a large difference between them, one must decide which measure of center is more appropriate to use. In this instance, the difference is \$5,000. To help you understand what is happening, look at the histogram found below.
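The figures quoted for Example 3.6 can be reproduced directly from the 20 salaries. The following is a small sketch using the standard `statistics` module; with an even number of observations, `median` averages the two middle values, as described in the note above.

```python
from statistics import mean, median

# Annual salaries for the 20 employees in Example 3.6 (already sorted)
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]

salary_mean = mean(salaries)
salary_median = median(salaries)  # average of the 10th and 11th values

print(len(salaries))                # 20
print(salaries[9], salaries[10])    # 38000 42000
print(salary_mean, salary_median)
```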

The histogram of salaries is right skewed: most of the salaries fall at the lower end, with a long tail extending toward the highest salary.

Figure 3.5. Histogram (Salaries)

As you can see, the histogram is right-skewed: most of the salaries are concentrated at the lower end, and a long tail extends to the right. The very large salary of \$110,000 is largely responsible for the skew. With right-skewed histograms, the mean will typically be greater than the median, because the mean is sensitive to the large salary of \$110,000 and is pulled in the direction of the unusually large observation. In contrast, the median, which is the middle value of the data set, is resistant to extreme observations because their exact values are not used to determine it. Table 3.3 summarizes the link between the two measures of center and histogram shape.
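The shape of the salary data can also be seen by counting how many salaries fall into each bin. The sketch below assumes bins that are \$10,000 wide; this width is chosen here for illustration and may not match the bins used in Figure 3.5.

```python
from collections import Counter

salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]

bin_width = 10_000  # assumed bin width, not taken from the original figure

# Map each salary to the lower edge of its bin and count the members
counts = Counter((s // bin_width) * bin_width for s in salaries)

# Print one row of stars per bin, including empty bins
for start in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{start:>6}-{start + bin_width - 1:<6} {'*' * counts[start]}")
```

Half of the salaries land in the lowest bin, the counts fall off quickly after that, and the single \$110,000 salary sits alone far to the right after several empty bins.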

Table 3.3. Link between Measures of Center and Histogram Shape

Histogram Shape     Comparison of the Two Measures of Center
Symmetric           mean and median are approximately equal
Right skewed        mean is greater than the median
Left skewed         mean is less than the median
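Table 3.3 can be read in reverse as a rough rule of thumb: comparing the mean with the median suggests the likely shape of the histogram. The function below is only a sketch of that idea. The name `likely_shape` and the 5% cutoff for "approximately equal" are assumptions made here, not part of the course material, and the last two data lists are made up for illustration.

```python
from statistics import mean, median

def likely_shape(data, tolerance=0.05):
    """Guess the histogram shape from mean vs. median (rule of thumb only)."""
    m, med = mean(data), median(data)
    # "Approximately equal" is taken to mean a difference of at most
    # `tolerance` times the median; this cutoff is an arbitrary assumption.
    if abs(m - med) <= tolerance * abs(med):
        return "roughly symmetric"
    return "right skewed" if m > med else "left skewed"

salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]

print(likely_shape(salaries))           # right skewed
print(likely_shape([1, 2, 3, 4, 5]))    # roughly symmetric
print(likely_shape([1, 8, 9, 9, 10]))   # left skewed
```

A guess like this is no substitute for looking at the histogram itself; it only indicates the direction in which the mean has been pulled.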

The left graph shows the symmetric distribution with mean equal to median and a single peak. The middle graph shows the right-skewed distribution with mean greater than median. The right graph shows the left-skewed distribution with mean less than median.

Figure 3.6. Different Distributions

So, let us get back to the question of which measure of center is more appropriate to use. When you have skewed data, the mean is somewhat misleading as a representative value, because it can be pulled in one direction or the other by outliers. Generally, when the data are skewed, the median is more appropriate to use as the measure of a typical value. We generally use the mean as the measure of center when the data are fairly symmetric. In deciding which measure to use, we must also confront the issue of validity, that is, what is most relevant for the problem at hand. For example, if we are interested in the total income for a country, we would look to per-capita income (the mean), since the mean multiplied by the population gives the total. But if we are interested in the income of a typical citizen, we would look to the median income.

It is often important to be given both measures of center. The direction and magnitude of the difference between the mean and median help a person envision the likely shape of the histogram, as indicated in Table 3.3 and in the plots shown in Figure 3.6.

As stated above, the question being asked can also affect which measure of the center can be considered more typical and therefore, more appropriate. Although we would normally use the median with skewed data, there may be cases where we might use the mean as a more typical measure of center. It all depends on the question being asked and on the shape of the data. For example, given the right-skewed data for the company in Example 3.6:

  • If you want to know how much employees at the company in Example 3.6 pay in social security taxes, the mean might represent the typical salary figure better than the median, since it accounts for the total pay on which the taxes are based. In this case, the median salary figure may not be as appropriate as the mean salary figure.
  • However, if you are applying for an entry-level position within the company in Example 3.6, and want to know what a typical employee makes, the median salary figure would represent the typical salary figure better than the mean and be more appropriate to use.
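The first case above rests on a simple fact: the mean multiplied by the number of employees recovers the total payroll exactly, while the median does not. The sketch below checks this for the Example 3.6 salaries. It deliberately stops at the total pay and does not apply any tax rate, since no rate is given in the example.

```python
from statistics import mean, median

salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]

n = len(salaries)
total_payroll = sum(salaries)

from_mean = mean(salaries) * n      # reproduces the total exactly
from_median = median(salaries) * n  # understates the total for these data

print(total_payroll)  # 900000
print(from_mean)      # 900000
print(from_median)    # 800000.0
```

Using the median here would understate the total pay by \$100,000, which is why the mean is the better figure for questions about totals, even with skewed data.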
