2.2 - The Empirical Rule

While the standard deviation gives us a measure of variability (i.e., a variable with a standard deviation of zero has no variability), there are other important ways we can think about the standard deviation.

First, we start with a normal distribution, which is symmetric and bell-shaped.

The Empirical Rule describes where observations lie in a normal distribution. It states that approximately 68% of the data fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7% fall within three standard deviations of the mean.
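The three percentages can be checked against the standard normal distribution. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is introduced here only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints the share as a percentage with one decimal place.
    print(f"within {k} SD: {share_within(k):.1%}")
```

The exact shares are slightly different from the rounded figures in the rule (for example, a little over 68% within one standard deviation), which is why the rule is stated with "approximately" and "about".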

Example of Empirical Rule

Suppose Susie observes a sample of n = 200 hospitals with a more or less bell-shaped distribution, a sample mean of \(\bar{x} = 35\) heart attacks per month, and a standard deviation of s = 6.

About 68% of the hospitals report heart attack rates in the interval 35 ± 6, which is 29 to 41.
About 95% of the hospitals report heart attack rates in the interval 35 ± (2 × 6), which is 23 to 47.
About 99.7% of the hospitals report heart attack rates in the interval 35 ± (3 × 6), which is 17 to 53.
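The three intervals above are simple arithmetic on the mean and standard deviation. As a small sketch (the function name `empirical_intervals` is hypothetical, chosen for this illustration), they could be computed like this:

```python
def empirical_intervals(mean: float, sd: float) -> dict:
    """Return the mean ± k*sd intervals for k = 1, 2, 3 as (low, high) pairs."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Susie's hospital sample: mean of 35 heart attacks per month, s = 6.
intervals = empirical_intervals(35, 6)
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

for k, (low, high) in intervals.items():
    print(f"about {coverage[k]} of hospitals: {low} to {high}")
```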

Z scores

While the Empirical Rule is a great tool for understanding the variability of our data and how extreme any one observation is, calculating the 68%, 95%, and 99.7% intervals, as in the example, is somewhat labor intensive. An easier way to describe where a score falls is to use a standardized score, or Z score.

We can convert any normal distribution into the standard normal distribution in order to find probabilities and apply the properties of the standard normal. To do this, we use the z-value.

Z-value, Z-score, or Z

The Z-value (sometimes referred to as the Z-score or simply Z) represents the number of standard deviations an observation is from the mean for a set of data. To find the z-score for a particular observation, we apply the following formula:

\(Z = \dfrac{(observed\ value\ - mean)}{SD}\)

Let's take a look at the idea of a z-score within context.

For a recent final exam in STAT 800, the mean was 68.55 with a standard deviation of 15.45.

If you scored an 80%: \(Z = \dfrac{(80 - 68.55)}{15.45} = 0.74\), which means your score of 80 was 0.74 SD above the mean.
If you scored a 60%: \(Z = \dfrac{(60 - 68.55)}{15.45} = -0.55\), which means your score of 60 was 0.55 SD below the mean.
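The two exam calculations can be reproduced with a short function. This is only a sketch of the formula given earlier; the name `z_score` and the check on the standard deviation are choices made for this illustration.

```python
def z_score(observed: float, mean: float, sd: float) -> float:
    """Number of standard deviations that `observed` lies from `mean`."""
    if sd <= 0:
        # A z-score is undefined when there is no variability.
        raise ValueError("standard deviation must be positive")
    return (observed - mean) / sd


# STAT 800 final exam: mean 68.55, standard deviation 15.45.
z_80 = z_score(80, 68.55, 15.45)
z_60 = z_score(60, 68.55, 15.45)

print(round(z_80, 2))  # 0.74
print(round(z_60, 2))  # -0.55
```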

Is it always good to have a positive Z-score? It depends on the question. For exams, you would want a positive Z-score (it indicates you scored higher than the mean). However, if one were analyzing days of missed work, then a negative Z-score would be more appealing, as it would indicate the person missed fewer days than the mean.

Characteristics of Z-scores

The scores can be positive or negative.
For data that are symmetric and bell-shaped, or nearly so, a common application of Z-scores is identifying potential outliers: any observation with a Z-score beyond ± 3.
The maximum possible Z-score for a set of n observations (computed with the sample standard deviation) is \(\dfrac{(n-1)}{\sqrt{n}}\).
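The bound on the largest Z-score can be illustrated numerically. The sketch below assumes Z-scores are computed with the sample standard deviation (the n − 1 divisor, as `statistics.stdev` uses); the data sets and the name `max_z` are made up for this example. The bound is reached when every observation but one shares the same value.

```python
from math import isclose, sqrt
from statistics import mean, stdev


def max_z(data: list) -> float:
    """Largest z-score in `data`, using the sample standard deviation."""
    m = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    return max((x - m) / s for x in data)


# Hypothetical data: all observations equal except one.
extreme = [0, 0, 0, 0, 1]
n = len(extreme)
bound = (n - 1) / sqrt(n)

print(isclose(max_z(extreme), bound))   # the bound is attained here
print(max_z([1, 2, 3, 4, 5]) < bound)   # more typical data stay below it
```

One consequence is that in small samples no observation can have a Z-score beyond ± 3: for n = 5 the bound is below 2, so the ± 3 outlier check is only informative for larger data sets.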

From Z-score to Probability and Percentiles

A more frequent application of the standard normal curve is expressing percentiles or probabilities. Because a z score corresponds to a specific position within the standard normal distribution, it is also possible to find the equivalent percentile for that observation.

For example, an observation that has a z score of zero would be at the 50th percentile of the data.
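Software can do this conversion directly. As a minimal sketch, assuming the data are normally distributed, the standard normal cumulative distribution function gives the percentile for any z score; `percentile_from_z` is a name introduced here for illustration.

```python
from statistics import NormalDist


def percentile_from_z(z: float) -> float:
    """Percentile (0 to 100) of a z score under the standard normal distribution."""
    return 100 * NormalDist().cdf(z)


print(percentile_from_z(0))      # 50.0
print(percentile_from_z(0.74))   # a score 0.74 SD above the mean
print(percentile_from_z(-0.55))  # a score 0.55 SD below the mean
```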

We can also use these properties to make probability statements about observations. We will use this information extensively as we progress into hypothesis testing in future modules. For now, a simple application for Susie illustrates the point. If Susie knows a hospital's z score, she can use software or consult a Standard Normal Table to convert it to a probability. For example, with a mean of 50 heart attacks and a standard deviation of 10, a hospital reporting 100 heart attacks has a z score of \(\dfrac{(100 - 50)}{10} = +5.00\). Because 100 heart attacks lies far above the 99th percentile, the probability of a hospital reporting more than 100 heart attacks is very low.
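Susie's calculation can be sketched in software as follows. The sketch assumes hospital heart attack counts are normally distributed with a mean of 50 and a standard deviation of 10, under which 100 heart attacks works out to a z score of 5; the variable names are illustrative only.

```python
from statistics import NormalDist

# Assumed model: heart attack counts ~ Normal(mean 50, SD 10).
hospital_counts = NormalDist(mu=50, sigma=10)

observed = 100
z = (observed - hospital_counts.mean) / hospital_counts.stdev

# Upper-tail probability: the chance of a count greater than `observed`.
p_more_than_observed = 1 - hospital_counts.cdf(observed)

print(f"z score: {z}")
print(f"P(more than {observed}): {p_more_than_observed:.2e}")
```

The resulting probability is well under one in a million, which is the sense in which such a report would be very unlikely.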

We will not reference the standard normal tables in these notes, and instead rely on software to generate probabilities for us.