There are three key caveats that must be recognized with regard to correlation.
- It is impossible to prove causal relationships with correlation. However, the strength of the evidence for such a relationship can be evaluated by examining and eliminating important alternate explanations for the correlation seen.
- Outliers can substantially inflate or deflate the correlation.
- Correlation describes the strength and direction of the linear association between variables. It does not describe non-linear relationships.
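The third caveat can be illustrated with a short sketch. The data are made up for illustration (they do not come from this lesson), and the variable names are hypothetical. In both cases y is completely determined by x, yet only the linear case produces a correlation near 1.

```python
import numpy as np

# Hypothetical data: x is symmetric around zero and y is an exact function of x.
x = np.linspace(-3, 3, 61)
y_linear = 2 * x + 1   # perfectly linear relationship
y_curved = x ** 2      # perfectly predictable from x, but not linear

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear relationship: r = {r_linear:.3f}")
print(f"curved relationship: r = {r_curved:.3f}")
```

The curved case gives a correlation of essentially zero even though y depends entirely on x, which is why a scatterplot should be examined alongside the correlation.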
Correlation and Causation
It is often tempting to suggest that, when the correlation is statistically significant, the change in one variable causes the change in the other variable. However, outside of randomized experiments, there are numerous other possible reasons that might underlie the correlation. Thus, it is crucial to evaluate and eliminate the key alternative (non-causal) relationships outlined in section 6.2 to build evidence toward causation.
- Check for the possibility that the response might be directly affecting the explanatory variable (rather than the other way around). For example, you might suspect that the number of times children wash their hands might be causally related to the number of cases of the common cold amongst the children at a pre-school. However, it is also possible that children who have colds are made to wash their hands more often. In this example, it would also be important to evaluate the timing of the measured variables - does an increase in the amount of hand washing precede a decrease in colds or did it happen at the same time?
- Check whether changes in the explanatory variable contribute, along with other variables, to changes in the response. For example, the amount of dry brush in a forest does not cause a forest fire; but it will contribute to it if a fire is ignited.
- Check for confounders or common causes that may affect both the explanatory and response variables. For example, there is a moderate association between whether a baby is breastfed or bottle-fed and the number of incidences of gastroenteritis recorded on medical charts (with the breastfed babies showing more cases). But it turns out that breastfed babies also have, on average, more routine medical visits to pediatricians. Thus, there are more opportunities for mild cases of gastroenteritis to be recorded on the medical charts of breastfed babies, which makes the number of medical visits a clear confounder.
- Check whether both variables may have changed together over time or space. For example, data on the number of cases of internet fraud and on the amount spent on election campaigns in the United States taken over the last 30 years would have a strong association merely because they have both increased over time. As another example, if you examine the percent of the population with home computers and the life expectancy for every country in the world, there will be a positive association merely because richer countries have both higher life expectancy and greater computer use. Tyler Vigen's website lists thousands of spurious correlations that result from variables that coincidentally change the same way over time.
- Check whether the association between the variables might be just a matter of coincidence. This is where a check for the degree of statistical significance would be important. However, it is also important to consider whether the search for significance was a priori or a posteriori. For example, a story in the national news one year reported that at a hospital in Potsdam, New York, 15 babies in a row were all boys. Does that indicate that something at that hospital was causing more male than female births? Clearly, the answer is no, even if the chance of having 15 boys in a row is quite low (about 1 chance in 33,000). But there are over 5,000 hospitals in the United States, and the story would be just as newsworthy if it happened at any one of them, at any time of the year, and for either 15 boys in a row or 15 girls in a row. Thus, it turns out that we actually expect a story like this to happen once or twice a year somewhere in the United States.
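The arithmetic behind the coincidence example can be sketched as follows. This is a simplified illustration that assumes each birth is independently a boy with probability 0.5. The function name and the numbers of opportunities tried are hypothetical, chosen only to show how quickly a rare event becomes likely when there are many chances for it to occur. They are not estimates of how many runs of births actually take place.

```python
# Probability that 15 consecutive births are all boys, assuming each birth is
# independently a boy with probability 0.5 (a simplifying assumption).
p_boys = 0.5 ** 15
p_either = 2 * p_boys  # 15 boys in a row OR 15 girls in a row


def prob_at_least_one(p, n_opportunities):
    """Chance that an event with probability p occurs at least once
    in n_opportunities independent tries."""
    return 1 - (1 - p) ** n_opportunities


print(f"15 boys in a row: 1 chance in {1 / p_boys:,.0f}")
print(f"15 of either sex in a row: 1 chance in {1 / p_either:,.0f}")

# Hypothetical counts of independent opportunities (not real data).
for n in (1, 5_000, 50_000):
    print(f"{n:>6} opportunities: P(at least one) = {prob_at_least_one(p_either, n):.3f}")
```

The probability for a single pre-specified run is tiny, but the probability of seeing such a run somewhere grows quickly with the number of places and times it could have happened. That is the difference between an a priori and an a posteriori search.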
Example 5.4: Effect of Outliers on Correlation
Below is a scatterplot of the relationship between the Infant Mortality Rate and the Percent of Juveniles Not Enrolled in School for each of the 50 states plus the District of Columbia. The correlation is 0.73, but looking at the plot one can see that for the 50 states alone the relationship is not nearly as strong as a 0.73 correlation would suggest. Here, the District of Columbia (identified by the X) is a clear outlier in the scatterplot, lying several standard deviations above the other values for both the explanatory (x) variable and the response (y) variable. Without the District of Columbia in the data, the correlation drops to about 0.5.
Figure 5.5. Scatterplot with outlier
Correlation and Outliers
Correlations measure linear association - the degree to which relative standing on the x list of numbers (as measured by standard scores) is associated with relative standing on the y list. Since means and standard deviations, and hence standard scores, are very sensitive to outliers, the correlation will be as well.
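The link between standard scores and the correlation can be checked numerically. The sketch below uses simulated data (not data from this lesson) and a hypothetical helper named standard_scores. It relies on the standard identity that the sample correlation equals the sum of the products of paired standard scores divided by n - 1, and compares that with NumPy's built-in calculation.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)


def standard_scores(values):
    """Convert values to standard scores using the sample standard deviation."""
    return (values - values.mean()) / values.std(ddof=1)


# Correlation as the sum of products of paired standard scores, divided by n - 1.
r_from_z = np.sum(standard_scores(x) * standard_scores(y)) / (len(x) - 1)

# Correlation as computed directly by NumPy.
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"from standard scores: {r_from_z:.6f}")
print(f"from np.corrcoef:     {r_numpy:.6f}")
```

Because every standard score depends on the mean and the standard deviation, one extreme point shifts all of the standard scores at once, and the correlation moves with them.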
In general, the correlation will either increase or decrease, depending on where the outlier lies relative to the other points in the data set. An outlier in the upper right or lower left of a scatterplot will tend to increase the correlation, while an outlier in the upper left or lower right will tend to decrease the correlation.
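A small simulation can illustrate how the position of a single outlier matters. The data, the helper name corr_with_extra_point, and the outlier coordinates are all hypothetical, chosen only to make the effect easy to see.

```python
import numpy as np

# Simulated data with a weak positive linear relationship (illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)


def corr_with_extra_point(x, y, point):
    """Correlation after appending one extra (x, y) point to the data."""
    xs = np.append(x, point[0])
    ys = np.append(y, point[1])
    return np.corrcoef(xs, ys)[0, 1]


r_base = np.corrcoef(x, y)[0, 1]
r_upper_right = corr_with_extra_point(x, y, (8.0, 8.0))   # outlier in the upper right
r_upper_left = corr_with_extra_point(x, y, (-8.0, 8.0))   # outlier in the upper left

print(f"no outlier:           r = {r_base:.2f}")
print(f"upper-right outlier:  r = {r_upper_right:.2f}")
print(f"upper-left outlier:   r = {r_upper_left:.2f}")
```

In this sketch a single added point out of 51 is enough to pull the correlation well above or well below its value for the original 50 points.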
Watch the two videos below. They are similar to the video in section 5.2 except that a single point (shown in red) in one corner of the plot stays fixed while the relationship amongst the other points changes. Compare each with the video in section 5.2 and see how much that single point changes the overall correlation as the remaining points take on different linear relationships.
Even though outliers may exist, you should not just quickly remove these observations from the data set in order to change the value of the correlation. As with outliers in a histogram, these data points may be telling you something very valuable about the relationship between the two variables. For example, in a scatterplot of in-town gas mileage versus highway gas mileage for all 2015 model year cars, you will find that hybrid cars are all outliers in the plot (unlike gas-only cars, a hybrid will generally get better mileage in-town than on the highway).