3.4 - Five Useful Numbers (Percentiles)

A percentile is the position of an observation in the data set relative to the other observations in the data set. Specifically, the percentile represents the percentage of the sample that falls below this observation. For example, the median is also known as the 50th percentile because half of the data, or 50% of the observations, lie below the median. There are three percentiles that will be of interest to us. Figure 3.7 shows these percentiles (quartiles) graphically.

Percentiles of Interest

Percentile         Alternate Names                              Interpretation
25th percentile    Lower Quartile (QL), First Quartile (Q1)     25% of the data falls below this percentile
50th percentile    Median, Second Quartile (Q2)                 50% of the data falls below this percentile
75th percentile    Upper Quartile (QU), Third Quartile (Q3)     75% of the data falls below this percentile

The graph illustrates the quartiles of a distribution. Q1, Q2 and Q3 divide the whole distribution into four equal parts.

Figure 3.7. Quartiles for a Distribution

A five-number summary is a useful summary of a data set that is partially based on selected percentiles. The five numbers found in a five-number summary are:

Minimum (lowest), Lower Quartile, Median, Upper Quartile, Maximum (highest)

Figure 3.8. Five-Number Summary

Example 3.7. Five-Number Summary

Recall the sample that was used in the previous example.

Sample: The Annual Salaries for 20 Selected Employees at a Local Company

Salaries (Sorted)
30000 32000 32000 33000 33000 34000 34000 38000 38000 38000 42000
43000 45000 45000 48000 50000 55000 55000 65000 110000    

Table 3.5. Five-Number Summary of Salaries

Lowest Lower Quartile (QL) Median Upper Quartile (QU) Highest
$30,000 $33,500 $40,000 $49,000 $110,000
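
The values in Table 3.5 can be reproduced with a few lines of Python. This is a minimal sketch, not necessarily the method used to build the table: the text does not say which quartile rule was used, so the function below assumes that each quartile is the median of the lower or upper half of the sorted data. The name `five_number_summary` is made up for this illustration.

```python
from statistics import median

# Salaries from the sample above
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumption: the quartiles are the medians of the lower and upper
    halves of the sorted data (the middle value is left out of both
    halves when the number of observations is odd).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    upper_half = data[n - half:]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


print(five_number_summary(salaries))
# The five values equal 30000, 33500, 40000, 49000 and 110000, as in Table 3.5
```

Statistical software often uses other quartile rules (for example, interpolating between neighboring values), so a package may report slightly different quartiles for the same data.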

Below are possible questions that can be answered with this five-number summary.

Five-Number Summary

  1. What percent of the salaries lie below $49,000?

    Answer: 75%
    Reason: $49,000 represents the 75th percentile or upper quartile

  2. What percent of the salaries lie above $40,000?

    Answer: 50%
    Reason: $40,000 represents the 50th percentile so 50% of the observations lie below this percentile and 50% lie above this percentile

  3. What percent of the salaries lie between 33,500 and 49,000 dollars?

    Answer: 50%
    Reason: The question asks for the percent of observations that lie between the 25th percentile and the 75th percentile (75% - 25% = 50%)
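
The three answers above can also be checked by counting observations directly. The sketch below uses helper names (`percent_below`, `percent_above`, `percent_between`) that are made up for this illustration. The counts match the percentile interpretation exactly here because no salary in the sample is equal to any of the cutoffs.

```python
# Salaries from the sample above
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]


def percent_below(values, cutoff):
    """Percent of observations strictly below the cutoff."""
    return 100 * sum(v < cutoff for v in values) / len(values)


def percent_above(values, cutoff):
    """Percent of observations strictly above the cutoff."""
    return 100 * sum(v > cutoff for v in values) / len(values)


def percent_between(values, low, high):
    """Percent of observations strictly between low and high."""
    return 100 * sum(low < v < high for v in values) / len(values)


print(percent_below(salaries, 49000))           # question 1: 75.0
print(percent_above(salaries, 40000))           # question 2: 50.0
print(percent_between(salaries, 33500, 49000))  # question 3: 50.0
```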

Boxplots

The five-number summary is also of value because it is the basis of the boxplot. Figure 3.9 below is a boxplot of the variable salaries. The first thing to consider in this graph is the box. The ends of the box locate the lower quartile and upper quartile, which in this case are 33,500 and 49,000 dollars, respectively. The line in the middle of the box is the median. As you examine the box, you should notice that the median is closer to the lower quartile than to the upper quartile. This suggests that the data set is skewed, specifically skewed to the right.

In this instance the largest observation is represented with an asterisk. Since this observation is an unusually large salary of $110,000, the graph identifies it as an outlier, or unusual observation. A statistical criterion is used to determine whether or not an observation is an outlier. Lines called 'whiskers' extend from the box out to the lowest and highest observations that are not outliers. Notice that the whisker on the low end is much shorter than the whisker on the high end of this boxplot. This is another hallmark of a distribution that is skewed to the right, because the first 25% of the data covers a narrow length on the number line while the last 25% is more spread out.

The horizontal boxplot shows the distribution of salaries. The box is drawn between Q1 and Q3, and the line inside the box marks the median. One whisker extends from the box to the largest data value within the upper limit, and the other extends to the smallest data value within the lower limit. An asterisk indicates a potential outlier.

Figure 3.9. Horizontal Boxplot of Salaries
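
The text does not say which criterion the graph uses to flag outliers. One widely used convention, assumed here purely for illustration, is the 1.5 x IQR rule: an observation is flagged if it lies more than 1.5 times the length of the box beyond either quartile. The sketch below applies that assumed rule to the salary data, using the quartiles from Table 3.5, and finds where the whiskers would end.

```python
# Salaries from the sample above
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]

# Quartiles taken from Table 3.5
q1, q3 = 33500, 49000

# ASSUMPTION: the 1.5 x IQR rule; the text does not name its criterion
iqr = q3 - q1                  # length of the box: 15500
lower_fence = q1 - 1.5 * iqr   # 10250.0
upper_fence = q3 + 1.5 * iqr   # 72250.0

outliers = [v for v in salaries if v < lower_fence or v > upper_fence]
inside = [v for v in salaries if lower_fence <= v <= upper_fence]

# Whiskers end at the most extreme observations that are not outliers
whisker_low, whisker_high = min(inside), max(inside)

print("outliers:", outliers)
print("whiskers end at:", whisker_low, "and", whisker_high)
print("whisker lengths:", q1 - whisker_low, "and", whisker_high - q3)
```

Under this assumed rule only the $110,000 salary is flagged, and the whisker on the low end (3,500 dollars long) is much shorter than the whisker on the high end (16,000 dollars long), which agrees with the description of the boxplot above.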

One of the most important uses of the boxplot is to compare two or more samples of one measurement variable.

Example 3.8. Using Boxplots for Comparisons

Recall Example 1.7 from Lesson 1. Consider two different wordings for a particular question:

Wording 1: Knowing that the population of the U.S. is 270 million, what is the population of Canada?

Wording 2: Knowing that the population of Australia is 15 million, what is the population of Canada?

The results from these questions are displayed as side-by-side boxplots in Figure 3.10.

Two boxplots show the distributions of answers under each of the two wordings of the question about Canada's population.

Figure 3.10. Boxplots of Canada's Population by Wording

Four comparisons can be made with side-by-side boxplots. One can compare the

  1. centers: medians
  2. amount of spread (variation): lengths of the boxes
  3. shape: the position of the median in the box relative to the quartiles shows whether the data are skewed left, skewed right, or symmetric
  4. number of outliers

With this example, the median for those who had Wording 1 is larger than the median for those who had Wording 2. The box for Wording 1 is also longer than the box for Wording 2, which suggests that there is more spread, or variation, in the responses for Wording 1. The median is not positioned in the same place in each box, which indicates that the two samples do not have the same shape. Finally, there are two outliers with Wording 2, while there are none with Wording 1. Overall, these findings suggest that the wording of the question does affect the responses that are obtained.
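
The four comparisons listed above can also be made numerically. The sketch below is only an illustration: the survey responses are not reproduced in this section, so `group_a` and `group_b` are made-up numbers, not the Canada data. The function name `box_stats`, the median-of-halves quartile rule, and the 1.5 x IQR outlier rule are likewise assumptions.

```python
from statistics import median


def box_stats(values):
    """Summarize one sample by the quantities a boxplot displays."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])        # assumed rule: median of the lower half
    med = median(data)
    q3 = median(data[n - half:])    # assumed rule: median of the upper half
    box_length = q3 - q1
    # ASSUMPTION: 1.5 x IQR rule for flagging outliers
    outliers = [v for v in data
                if v < q1 - 1.5 * box_length or v > q3 + 1.5 * box_length]
    # Near 0: median close to the lower quartile; near 1: close to the upper
    position = (med - q1) / box_length if box_length else None
    return {"median": med, "box_length": box_length,
            "median_position": position, "outliers": outliers}


# Made-up samples for illustration only (NOT the survey responses)
group_a = [2, 4, 5, 6, 7, 8, 9, 10, 12, 14]
group_b = [1, 2, 2, 3, 3, 3, 4, 4, 5, 15]

stats_a = box_stats(group_a)
stats_b = box_stats(group_b)
for name, stats in (("A", stats_a), ("B", stats_b)):
    print(name, stats)
```

In these made-up samples, group A has the larger median and the longer box, while group B has one flagged outlier.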

While boxplots do not show the whole distribution the way a histogram does, they are particularly useful for comparing groups, since they are thin graphs that can easily be laid side by side. However, they have limits. They cannot show whether a distribution is bimodal or whether there are spikes in the histogram at selected values. For example, if you ask a group of adults their heights, you might see a bimodal distribution: the heights of the women form a group with a lower peak, and the heights of the men form an overlapping group with a higher peak. Likewise, the spikes in a histogram created by people's tendency to round off would not show up in a boxplot of the same data (for example, many men say they are six feet tall when they are really 5'11").
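
To see why a boxplot can hide a bimodal shape, consider the sketch below. Both data sets are invented for illustration, and the quartiles are again assumed to be medians of the lower and upper halves. The two samples have exactly the same five-number summary, so their boxplots would look the same, yet a simple count of values shows one peak in the first sample and two peaks in the second.

```python
from collections import Counter
from statistics import median


def five_number_summary(values):
    """(min, Q1, median, Q3, max), quartiles as medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return (data[0], median(data[:half]), median(data),
            median(data[n - half:]), data[-1])


# Invented samples for illustration only
one_peak = [0, 2, 3, 4, 5, 5, 5, 6, 7, 8, 10]
two_peaks = [0, 3, 3, 3, 3, 5, 7, 7, 7, 7, 10]

for name, data in (("one peak", one_peak), ("two peaks", two_peaks)):
    print(name, five_number_summary(data))
    counts = Counter(data)
    # Crude text histogram: one asterisk per observation at each value
    for value in range(min(data), max(data) + 1):
        print(f"{value:>3} {'*' * counts[value]}")
```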