A percentile is the position of an observation in the data set relative to the other observations in the data set. Specifically, the percentile represents the percentage of the sample that falls below this observation. For example, the median is also known as the 50th percentile because half of the data, or 50% of the observations, lie below the median. Table 3.4 displays three percentiles that will be of interest to us. Figure 3.7 shows these percentiles (quartiles) graphically.
Table 3.4. Percentiles of Interest
|25th percentile (lower quartile, QL)|25% of the data falls below this percentile|
|50th percentile (median)|50% of the data falls below this percentile|
|75th percentile (upper quartile, QU)|75% of the data falls below this percentile|
Figure 3.7. Quartiles for a Distribution
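The definition of a percentile as "the percentage of the sample that falls below this observation" can be sketched in a few lines of Python. This is a minimal illustration rather than a standard routine: the function name `percent_below` and the data are hypothetical, and the sketch counts only values strictly below the observation.

```python
def percent_below(values, observation):
    """Percentage of the values that are strictly less than `observation`."""
    if not values:
        raise ValueError("values must not be empty")
    count_below = sum(1 for v in values if v < observation)
    return 100 * count_below / len(values)

# Hypothetical data for illustration only (not taken from the text).
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percent_below(scores, 60))  # 50.0, since 5 of the 10 values lie below 60
```

Statistical software may treat ties or values equal to the observation differently, so results for small samples can vary slightly from package to package.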
A five-number summary is a useful summary of a data set that is partially based on selected percentiles. Below are the five numbers that are found in a five-number summary.
Figure 3.8. Five-Number Summary
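As a rough sketch, the five numbers (lowest, lower quartile, median, upper quartile, highest) can be computed with Python's standard library. The function name `five_number_summary` and the data are hypothetical. The sketch also assumes the "inclusive" quartile method of `statistics.quantiles`; textbooks and software use several slightly different quartile conventions, so the quartiles it reports may differ a little from those produced by other tools.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (lowest, lower quartile, median, upper quartile, highest)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # n=4 splits the data into quarters and returns the three cut points.
    # method="inclusive" is one of several quartile conventions (an assumption here).
    q_l, med, q_u = quantiles(data, n=4, method="inclusive")
    return data[0], q_l, med, q_u, data[-1]

# Hypothetical values for illustration; this is NOT the salary sample from the text.
sample = [12, 15, 17, 20, 22, 25, 28, 31, 40]
print(five_number_summary(sample))  # (12, 17.0, 22.0, 28.0, 40)
```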
Example 3.7. Five-Number Summary
Recall the sample that was used in the previous example.
Sample: The Annual Salaries for 20 Selected Employees at a Local Company
Table 3.5. Five-Number Summary of Salaries
|Lowest||Lower Quartile (QL)||Median||Upper Quartile (QU)||Highest|
Below are possible questions that can be answered with this five-number summary.
Five-Number Summary
What percent of the salaries lie below $49,000?
Answer: 75%
Reason: $49,000 represents the 75th percentile, or upper quartile.
What percent of the salaries lie above $40,000?
Answer: 50%
Reason: $40,000 represents the 50th percentile, so 50% of the observations lie below this percentile and 50% lie above it.
What percent of the salaries lie between $33,500 and $49,000?
Answer: 50%
Reason: The question asks for the percent of observations that lie between the 25th percentile and the 75th percentile (75% - 25% = 50%).
The five-number summary is also of value because it is the basis of the boxplot. Figure 3.9 below is a vertical boxplot of the variable salaries. The first thing to consider in this graph is the box. The ends of the box locate the lower quartile and upper quartile, which in this case are $33,500 and $49,000, respectively. The line in the middle of the box is the median. As you examine the box portion of the boxplot, you should notice that the median is closer to the lower quartile than to the upper quartile. This suggests that the data set is skewed, and specifically skewed to the right. In this instance the largest observation is represented with an asterisk. Since this observation is an unusually large salary of $110,000, the graph identifies it as an outlier, or unusual observation. An appropriate statistical criterion is used to determine whether or not an observation is an outlier. Lines called 'whiskers' extend from the box out to the lowest and highest observations that are not outliers. Notice that the whisker on the bottom is much shorter than the whisker on the top of this boxplot. This is another hallmark of a distribution that is skewed to the right (because the first 25% of the data covers a narrow length on the number line while the last 25% is more spread out).
Figure 3.9. Vertical Boxplot of Salaries
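The text does not say which criterion the software applied to flag the $110,000 salary. One widely used convention, assumed here purely for illustration, marks an observation as an outlier when it lies more than 1.5 times the length of the box (the interquartile range) beyond either quartile. The function name `outlier_fences` and the multiplier `k` are hypothetical choices; the quartiles are the ones quoted in the text.

```python
def outlier_fences(q_l, q_u, k=1.5):
    """Return (low_fence, high_fence) under the assumed k * IQR rule."""
    iqr = q_u - q_l  # length of the box
    return q_l - k * iqr, q_u + k * iqr

# Quartiles quoted in the text for the salary sample.
low_fence, high_fence = outlier_fences(33_500, 49_000)
print(low_fence, high_fence)  # 10250.0 72250.0

# Under this assumed rule, the $110,000 salary lies beyond the upper fence.
print(110_000 > high_fence)  # True
```

Under this assumption, any salary above $72,250 would be drawn as a separate point, and the upper whisker would stop at the largest salary that does not exceed that value.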
One of the most important uses of the boxplot is to compare two or more samples of one measurement variable.
Example 3.8. Using Boxplots for Comparisons
Recall Example 1.7 from Lesson 1. Consider two different wordings for a particular question:
Wording 1: Knowing that the population of the U.S. is 270 million, what is the population of Canada?
Wording 2: Knowing that the population of Australia is 15 million, what is the population of Canada?
The results from these questions are displayed on side-by-side boxplots found in Figure 3.10.
Figure 3.10. Boxplots of Canada's Population by Wording
Four comparisons can be made with side-by-side boxplots. One can compare the
- centers: medians
- amount of spread (variation): lengths of the boxes
- shape: the position of the median in the box relative to the quartiles shows whether the data are skewed left, skewed right, or symmetric
- number of outliers
With this example, the median for those who had Wording 1 is larger than the median found with Wording 2. The length of the box for Wording 1 is also larger than that found with Wording 2, which suggests that there is more spread or variation in the responses for Wording 1. The median is not positioned in the same place in each box, which indicates that the two samples do not have the same shape. Finally, there are two outliers with Wording 2 while there are none with Wording 1. Overall, these findings suggest that the wording of the question does affect the responses that are obtained.
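The four comparisons can also be made numerically. The sketch below computes, for each group, the median, the length of the box, the position of the median within the box, and the outliers. Everything in it is an assumption made for illustration: the responses are invented (they are not the survey data behind Figure 3.10), the quartiles use the "inclusive" method of `statistics.quantiles`, and outliers are flagged with a 1.5 times box length rule that the text does not specify.

```python
from statistics import quantiles

def boxplot_stats(values, k=1.5):
    """Summarize one group with the quantities read off a boxplot."""
    data = sorted(values)
    q_l, med, q_u = quantiles(data, n=4, method="inclusive")
    box_length = q_u - q_l
    low_fence = q_l - k * box_length
    high_fence = q_u + k * box_length
    return {
        "median": med,
        "box_length": box_length,
        # 0.5 means the median sits mid-box; values near 0 suggest right skew.
        "median_position": (med - q_l) / box_length if box_length else None,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical responses in millions, invented for illustration only.
wording_1 = [20, 30, 40, 50, 60, 100, 140, 160, 180]
wording_2 = [5, 8, 10, 12, 15, 18, 20, 60, 75]

for name, group in [("Wording 1", wording_1), ("Wording 2", wording_2)]:
    print(name, boxplot_stats(group))
```

With these invented numbers, the first group has the larger median and the longer box, and the second group has two outliers, which mirrors the pattern described for the figure.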
While boxplots do not show the whole distribution the way a histogram does, they are particularly useful for comparing groups, since they are thin graphs that can easily be laid side by side. However, they have limits. They cannot show whether a distribution is bimodal or whether there are spikes in the histogram at selected values. For example, if you ask a group of adults their heights, you might see a bimodal distribution, with the heights of the women forming a group with a lower peak and the heights of the men forming an overlapping group with a higher peak. Similarly, the tendency for people to round off creates spikes in a histogram that would not show up in a boxplot of the same data (for example, many men say they are six feet tall when they are really 5'11").