3.4  Five Useful Numbers (Percentiles)
A percentile is the position of an observation in the data set relative to the other observations. Specifically, the percentile represents the percentage of the sample that falls below this observation. For example, the median is also known as the 50th percentile because half of the data, or 50% of the observations, lie below the median. Table 3.4 displays three percentiles that will be of interest to us. Figure 3.7 shows these percentiles (quartiles) graphically.
Table 3.4. Percentiles of Interest
Percentile       Alternate Names      Interpretation
25th percentile  Lower Quartile (QL)  25% of the data falls below this percentile
50th percentile  Median               50% of the data falls below this percentile
75th percentile  Upper Quartile (QU)  75% of the data falls below this percentile
Figure 3.7. Quartiles for a Distribution
A five-number summary is a useful summary of a data set that is partially based on selected percentiles. Below are the five numbers that are found in a five-number summary.
Figure 3.8. Five-Number Summary
Example 3.7. Five-Number Summary
Recall the sample that was used in the previous example.
Sample: The Annual Salaries for 20 Selected Employees at a Local Company
30000  32000  32000  33000  33000  34000  34000  38000  38000  38000  42000 
43000  45000  45000  48000  50000  55000  55000  65000  110000 
Table 3.5. Five-Number Summary of Salaries
Lowest   Lower Quartile (QL)  Median   Upper Quartile (QU)  Highest
$30,000  $33,500              $40,000  $49,000              $110,000
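The values in Table 3.5 can be reproduced with a short Python sketch. The text does not say which quartile convention was used, so the code below assumes one common choice: each quartile is the median of the lower or upper half of the sorted data. The function name five_number_summary is illustrative only. Statistical software often interpolates quartiles differently and may report slightly different values for QL and QU.

```python
from statistics import median

# Annual salaries for the 20 employees in Example 3.7
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]


def five_number_summary(data):
    """Return (lowest, lower quartile, median, upper quartile, highest).

    Assumed convention: the quartiles are the medians of the lower and
    upper halves of the sorted data. When the sample size is odd, the
    middle value is left out of both halves.
    """
    if len(data) < 2:
        raise ValueError("need at least two observations")
    values = sorted(data)
    half = len(values) // 2
    lower_half = values[:half]
    upper_half = values[-half:]
    return (
        values[0],
        median(lower_half),
        median(values),
        median(upper_half),
        values[-1],
    )


lowest, ql, med, qu, highest = five_number_summary(salaries)
print(lowest, ql, med, qu, highest)
```

Under this convention the lower quartile is the average of the 5th and 6th sorted salaries, the median is the average of the 10th and 11th, and the upper quartile is the average of the 15th and 16th.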
Below are possible questions that can be answered with this five number summary.
What percent of the salaries lie below $49,000?
Answer: 75%
Reason: $49,000 represents the 75th percentile or upper quartile 
What percent of the salaries lie above $40,000?
Answer: 50%
Reason: $40,000 represents the 50th percentile so 50% of the observations lie below this percentile and 50% lie above this percentile 
What percent of the salaries lie between 33,500 and 49,000 dollars?
Answer: 50%
Reason: this asks for the percent of observations that lie between the 25th percentile and the 75th percentile (75% - 25% = 50%)
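These three answers can be checked by counting directly in the sample. The sketch below is a minimal check, not part of the original example; the helper names percent_below, percent_above, and percent_between are hypothetical. The counts come out exact here because no salary is equal to a quartile or to the median. With tied values the percentages would only be approximate.

```python
# Annual salaries for the 20 employees in Example 3.7
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]


def percent_below(data, value):
    """Percent of observations strictly below value."""
    return 100 * sum(x < value for x in data) / len(data)


def percent_above(data, value):
    """Percent of observations strictly above value."""
    return 100 * sum(x > value for x in data) / len(data)


def percent_between(data, low, high):
    """Percent of observations strictly between low and high."""
    return 100 * sum(low < x < high for x in data) / len(data)


print(percent_below(salaries, 49000))           # salaries below the upper quartile
print(percent_above(salaries, 40000))           # salaries above the median
print(percent_between(salaries, 33500, 49000))  # salaries between the quartiles
```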
Boxplots
The five-number summary is also of value because it is the basis of the boxplot. Figure 3.9 below is a vertical boxplot of the variable salaries. The first thing to consider in this graph is the box. The ends of the box locate the lower quartile and upper quartile, which in this case are $33,500 and $49,000, respectively. The line in the middle of the box is the median. As you examine the box portion of the boxplot, you should notice that the median is closer to the lower quartile than to the upper quartile. This suggests that the data set is skewed, and specifically skewed to the right. In this instance the largest observation is represented with an asterisk. Since this observation is an unusually large salary of $110,000, the graph identifies it as an outlier, or unusual observation. An appropriate statistical criterion is used to determine whether or not an observation is an outlier. Lines called 'whiskers' extend from the box out to the lowest and highest observations that are not outliers. Notice that the whisker on the bottom is much shorter than the whisker on the top of this boxplot. This is another hallmark of a distribution that is skewed to the right, because the first 25% of the data covers a narrow length on the number line while the last 25% is more spread out.
Figure 3.9. Vertical Boxplot of Salaries
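The text does not name the criterion used to flag outliers. A widely used one is the 1.5 × IQR rule, where IQR is the interquartile range QU - QL, and the sketch below assumes that rule for illustration. It may not be the rule that produced Figure 3.9. The quartiles are taken from Table 3.5, and the variable names are illustrative.

```python
# Annual salaries for the 20 employees in Example 3.7
salaries = [
    30000, 32000, 32000, 33000, 33000, 34000, 34000, 38000, 38000, 38000,
    42000, 43000, 45000, 45000, 48000, 50000, 55000, 55000, 65000, 110000,
]

# Quartiles as reported in Table 3.5
lower_quartile = 33500
upper_quartile = 49000

# Assumed criterion: an observation is an outlier if it lies more than
# 1.5 interquartile ranges beyond either end of the box.
iqr = upper_quartile - lower_quartile
lower_fence = lower_quartile - 1.5 * iqr
upper_fence = upper_quartile + 1.5 * iqr

outliers = [x for x in salaries if x < lower_fence or x > upper_fence]
non_outliers = [x for x in salaries if lower_fence <= x <= upper_fence]

# Whiskers run to the lowest and highest observations that are not outliers
whisker_low = min(non_outliers)
whisker_high = max(non_outliers)

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```

Under this assumed rule the only flagged salary is $110,000. The lower whisker spans $30,000 to $33,500 and the upper whisker spans $49,000 to $65,000, which agrees with the description of a short bottom whisker and a long top whisker.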
One of the most important uses of the boxplot is to compare two or more samples of one measurement variable.
Example 3.8. Using Boxplots for Comparisons
Recall Example 1.7 from Lesson 1. Consider two different wordings for a particular question:
Wording 1: Knowing that the population of the U.S. is 270 million, what is the population of Canada?
Wording 2: Knowing that the population of Australia is 15 million, what is the population of Canada?
The results from these questions are displayed in the side-by-side boxplots found in Figure 3.10.
Figure 3.10. Boxplots of Canada's Population by Wording
Four comparisons can be made with side-by-side boxplots. One can compare the
 centers: the medians
 amount of spread (variation): the lengths of the boxes
 shape: the position of the median in the box relative to the quartiles shows whether the data are skewed left, skewed right, or symmetric
 number of outliers
With this example, the median for those who had Wording 1 is larger than the median found with Wording 2. The box for Wording 1 is also longer than the box for Wording 2, which suggests that there is more spread or variation in the responses for Wording 1. The median is not positioned in the same place in each box, which indicates that the two samples do not have the same shape. Finally, there are two outliers with Wording 2 while there are none with Wording 1. Overall, these findings suggest that the wording of the question does affect the responses that are obtained.
While boxplots do not show the whole distribution the way a histogram does, they are particularly useful for comparing groups, since they are thin graphs that can easily be laid side by side. However, they have limits. They cannot show whether a distribution is bimodal or whether there are spikes in the histogram at selected values. For example, if you ask a group of adults their heights, you might see a bimodal distribution, with the heights of the women forming a group with a lower peak and the heights of the men forming an overlapping group with a higher peak. Likewise, the spikes in a histogram that come from people's tendency to round off would not show up in a boxplot of the same data (for example, many men say they are six feet tall when they are really 5'11").
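This limitation can be illustrated with two small samples that share the same five-number summary but have different shapes. The values below are made up purely for illustration and are not data from this lesson. The quartile convention (median of each half of the sorted data) and the function name five_number_summary are assumptions, and statistics.multimode requires Python 3.8 or later.

```python
from statistics import median, multimode


def five_number_summary(data):
    """Return (lowest, lower quartile, median, upper quartile, highest).

    Assumed convention: the quartiles are the medians of the lower and
    upper halves of the sorted data, leaving out the middle value when
    the sample size is odd.
    """
    values = sorted(data)
    half = len(values) // 2
    return (
        values[0],
        median(values[:half]),
        median(values),
        median(values[-half:]),
        values[-1],
    )


# Made-up illustrative samples (not real measurements)
one_peak = [0, 2, 4, 5, 5, 5, 6, 8, 10]   # values pile up in the middle
two_peaks = [0, 3, 3, 3, 5, 7, 7, 7, 10]  # values pile up at 3 and at 7

print(five_number_summary(one_peak))
print(five_number_summary(two_peaks))

# The most frequent values reveal the difference that the summaries hide
print(multimode(one_peak))
print(multimode(two_peaks))
```

Because both samples give the same five numbers, their boxplots would look identical, even though a histogram of the first shows one peak and a histogram of the second shows two.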