1.4 - Measures of Central Tendency

Visually summarizing data is effective, but someone like Maria will probably also need numerical summaries of her data to use in her reporting. The most common measures used to describe data are measures of central tendency.

Mean, Median, Mode

A measure of central tendency is an important aspect of quantitative data. It is an estimate of a “typical” value. Maria may be asked for the typical number of children seen per month.

Three of the many ways to measure central tendency are the mean, median and mode.

There are other measures, such as a trimmed mean, that we do not discuss here.

Mean: The mean is the arithmetic average of the data.

NOTE: At this point, we are going to start using some basic notation to represent numbers as we present formulas and ways of calculating. When you read "Let (some confusing symbols) represent...", we are trying to convey the formula in a "generic" way. If this gets confusing, skim over the formulas and pay more attention to the detailed example below!

Let \(x_1, x_2, \ldots, x_n\) be our sample. (As per the previous note, all we are doing is letting \(x_1, x_2, \ldots, x_n\) represent numbers. We could just as easily have illustrated this with real values such as 1, 2, 3, 4, and 5.)

The sample mean is usually denoted by \(\bar{x}\). (If you are following this correctly, for the values 1, 2, 3, 4, and 5, \(\bar{x}\) would be 3!)

\(\bar{x}=\sum_{i=1}^n \dfrac{x_i}{n}=\dfrac{1}{n}\sum_{i=1}^n x_i\)

where n is the sample size and \(x_i\) are the measurements. One may need to use the sample mean to estimate the population mean since usually only a random sample is drawn and we don't know the population mean.

Is this notation confusing you? Don't let it get to you. If it is not intuitive, focus on what the formula is doing. In this example, we are adding all of the numbers (the addition is represented by \(\Sigma\), the "big squiggly E", which is the Greek capital letter sigma) and dividing by the total number of observations!

Quite simply, Maria would calculate the average number of children seen per month.
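As a minimal sketch in Python, the formula amounts to one sum and one division. The function name sample_mean is our own choice for illustration and is not part of these notes.

```python
def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(values)  # n is the sample size
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# The illustrative values 1, 2, 3, 4, 5 used in the text above.
print(sample_mean([1, 2, 3, 4, 5]))  # 3.0
```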

The sample mean (\(\bar{x}\)) is a statistic and a population mean (\(\mu\)) is a parameter.

Note on Notation

What if we used \(y_i\) for our measurements instead of \(x_i\)? Is this a problem? No. The formula would simply look like this: \(\bar{y}=\sum_{i=1}^n \dfrac{y_i}{n}=\dfrac{1}{n}\sum_{i=1}^n y_i\)

The formulas are exactly the same. The letters that you select to denote the measurements are up to you. For instance, many textbooks use \(y\) instead of \(x\) to denote the measurements. The point is to understand how the calculation expressed in the formula works. In this case, the formula calculates the mean by summing all of the observations and dividing by the number of observations. There is some notation that you will come to see as standard; for example, n will always denote the sample size. We will make a point of letting you know what these are. However, when it comes to the variables, the labels can (and do) vary.

Median: The median is the middle value of the ordered data. Maria might be asked to report the median if she had one or two months with extremely large or small numbers of children seen at the agency.

The most important step in finding the median is to first order the data from smallest to largest.

Steps to finding the median for a set of data:

1. Arrange the data in increasing order, i.e., smallest to largest.
2. Find the location of the median in the ordered data by \(\frac{n+1}{2}\), where n is the sample size.
3. The value at the location found in Step 2 is the median.

Note on Odd or Even Sample Sizes
If the sample size is an odd number, then the location is a whole number and the median is an observed value. If the sample size is an even number, then the location falls halfway between two positions, and the median is the mean of the two values at those positions. The result may or may not be an observed value, as the example below illustrates.
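The steps for finding the median, including the odd and even cases, can be sketched in Python as follows. The function name sample_median and the small input lists are hypothetical and chosen only for illustration.

```python
def sample_median(values):
    """Return the median using the (n + 1) / 2 location rule."""
    ordered = sorted(values)  # Step 1: order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    location = (n + 1) / 2  # Step 2: 1-based location of the median
    lower = ordered[int(location) - 1]  # value at the whole-number part
    if location == int(location):
        return lower  # odd n: the median is an observed value
    upper = ordered[int(location)]  # even n: the next value up
    return (lower + upper) / 2  # mean of the two middle values


print(sample_median([3, 1, 2]))     # 2
print(sample_median([4, 1, 3, 2]))  # 2.5
```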

Mode: The mode is the value that occurs most often in the data. It is important to note that there may be more than one mode in the dataset. For Maria, the mode would be the monthly number of children seen that occurs most often, which is not necessarily the largest monthly number.
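Because a dataset can have more than one mode, a small Python sketch helps to show how ties are handled. The function name sample_modes and the monthly counts below are hypothetical; they are not Maria's actual data.

```python
from collections import Counter


def sample_modes(values):
    """Return every value that occurs most often, in increasing order."""
    counts = Counter(values)  # how many times each value occurs
    if not counts:
        return []
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical monthly counts: 12 and 15 each occur twice.
print(sample_modes([12, 15, 12, 18, 15, 20]))  # [12, 15]
```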

Example 1-2: SAT Data

From an SAT data set, we get the following participation rates for the nine South Atlantic states (Region is SA): 74, 79, 65, 75, 71, 74, 64, 73, and 20. In order to find the median we must first rank the data from smallest to largest:

20, 64, 65, 71, 73, 74, 74, 75, 79

To find the middle point we take the number of observations plus one and divide by two. Mathematically, where n is the total number of observations, this is:

\(\dfrac{n+1}{2}=\dfrac{9+1}{2}=5\)

Returning to the ordered string of data, the fifth observation is 73. Thus the median of this distribution is 73. The interpretation of the median is that 50% of the observations fall at or below this value and 50% fall at or above this value. In this example, this would mean that 50% of the observations are at or below 73 and 50% are at or above 73. If another value were observed, say 88, this would bring the number of observations to ten. Using the formula above, the middle point would be at 5.5 (10 plus 1, divided by 2). Here we would find the median by taking the average of the fifth and sixth observations, which would be the average of 73 and 74. The new median for these ten observations would be 73.5. As you can see, the median value is not always an observed value of the data set.

To find the mean, we simply add all of the numbers and then divide this total by the number of values summed. Mathematically, where again n is the number of observations, this is:

\(\bar{x}=\dfrac{\sum^n_{i=1}x_i}{n}=\dfrac{74+79+65+75+71+74+64+73+20}{9}=\dfrac{595}{9}\approx 66.11\)
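The calculations in this example can be checked with Python's standard statistics module. This is only a verification sketch; the variable name rates is ours, and the mode is reported here even though the example above does not compute it.

```python
import statistics

# The nine South Atlantic participation rates from the example.
rates = [74, 79, 65, 75, 71, 74, 64, 73, 20]

print(round(statistics.mean(rates), 2))  # 66.11
print(statistics.median(rates))          # 73
print(statistics.multimode(rates))       # [74]

# Adding a tenth observation, 88, puts the median between two observed values.
print(statistics.median(rates + [88]))   # 73.5
```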

Effects of Outliers

One shortcoming of the mean is that it is easily affected by extreme values. Measures that are not strongly affected by extreme values are called resistant. Measures that are affected by extreme values are called sensitive. As stated, Maria would use the median if she felt her numbers could be impacted by outliers, because the median is resistant to outliers.
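To see the difference between a sensitive and a resistant measure, the following sketch reuses the nine SAT participation rates from Example 1-2 and drops the value 20. Treating 20 as the extreme value is our own reading of the data, and the variable names are hypothetical.

```python
import statistics

rates = [74, 79, 65, 75, 71, 74, 64, 73, 20]
without_low = [r for r in rates if r != 20]  # drop the one extreme value

# How far each measure moves when the extreme value is removed.
mean_shift = statistics.mean(without_low) - statistics.mean(rates)
median_shift = statistics.median(without_low) - statistics.median(rates)

print(round(mean_shift, 2))  # 5.76
print(median_shift)          # 0.5
```

The mean moves by almost six points, while the median moves by only half a point.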

Adding and Multiplying Constants

What happens to the mean and median if we add a constant to each observation in a data set, or multiply each observation by a constant?

Consider, for example, an instructor who curves an exam by adding five points to each student’s score. What effect does this have on the mean and the median? Adding a constant to each value has the intended effect: the mean and the median both shift by that constant.

For example, returning to the 9 participation rates for the South Atlantic states, if 5 were added to each participation rate, the mean of this new data set would be 71.11 (the original mean of 66.11 plus 5) and the new median would be 78 (the original median of 73 plus 5).

Similarly, if each observed data value were multiplied by a constant, the new mean and median would change by a factor of this constant. Returning to the 9 participation rates, if all of the original rates were multiplied by 1.20 (a 20 percent increase), then the new mean and new median would be found by multiplying the original mean and median by 1.20, giving about 79.33 and 87.6, respectively. As we will learn shortly, the effect is not the same on the variance!
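Both effects can be checked numerically. The sketch below applies the two transformations to the nine participation rates; the names shifted and scaled are ours, and results are rounded to two decimals because multiplying by 1.20 introduces small floating-point error.

```python
import statistics

rates = [74, 79, 65, 75, 71, 74, 64, 73, 20]

shifted = [r + 5 for r in rates]    # add 5 to every rate
scaled = [r * 1.20 for r in rates]  # multiply every rate by 1.20

# Adding a constant shifts both measures by that constant.
print(round(statistics.mean(shifted), 2), statistics.median(shifted))  # 71.11 78

# Multiplying by a constant scales both measures by that constant.
print(round(statistics.mean(scaled), 2), round(statistics.median(scaled), 2))  # 79.33 87.6
```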

Shape and Central Tendency

The shape of the data helps us to determine the most appropriate measure of central tendency. The three most important descriptions of shape are Symmetric, Left-skewed, and Right-skewed. Skewness is a measure of the degree of asymmetry of the distribution. Maria might want to examine the shape of the distribution of the number of children seen.

Symmetric

mean, median, and mode are all the same here
no skewness is apparent
the distribution is described as symmetric

Left-Skewed or Skewed Left

mean < median
long tail on the left

Right-skewed or Skewed Right

mean > median
long tail on the right

Note! When one has very skewed data, it is better to use the median as a measure of central tendency since the median is not much affected by extreme values.
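The comparisons of mean and median listed above can be turned into a rough check. This is only a rule-of-thumb sketch, not a formal measure of skewness: the function name skew_hint is hypothetical, the second and third data sets are made up for illustration, and equal mean and median do not by themselves prove that a distribution is symmetric.

```python
import statistics


def skew_hint(values):
    """Compare the mean and median as a rough hint of skew direction."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean < median:
        return "left-skewed"   # mean pulled toward a long left tail
    if mean > median:
        return "right-skewed"  # mean pulled toward a long right tail
    return "no skew indicated"


# SAT participation rates from Example 1-2: mean 66.11 is below median 73.
print(skew_hint([74, 79, 65, 75, 71, 74, 64, 73, 20]))  # left-skewed
print(skew_hint([1, 2, 2, 3, 3, 4, 30]))                # right-skewed
print(skew_hint([1, 2, 3, 4, 5]))                       # no skew indicated
```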

Uses and Abuses of Summaries

Descriptive statistics allow Maria to show her data using pictures. However, as pointed out with the pie chart, not all presentations accurately portray the data. Since Maria is also balancing her reporting obligations against her funding needs, she might be tempted to present her data in a way that conveys very high usage rates or successes for her services. To avoid the temptation to misuse or misrepresent data, Maria needs to consider some of the ethics in statistics.