Cautions about Correlation and Regression

Influential Outliers

In most practical circumstances an outlier decreases the value of a correlation coefficient and weakens the regression relationship, but in some circumstances an outlier may instead increase the correlation and strengthen the apparent regression relationship. Points in a data set that substantially change the regression equation and the correlation in this way are called influential outliers. Figure 1 below provides an example.

Figure 1 represents data gathered on a person's Age and Blood Pressure, with Age as the explanatory variable. [Note: the regression plots were obtained in Minitab by Stat > Regression > Fitted Line Plot.] The top graph in Figure 1 shows the complete set of 10 data points. One point stands out in the upper right corner: (75, 220). The bottom graph shows the regression with this point removed. The correlation for the original 10 data points is 0.694, found by taking the square root of 0.481 (the R-sq of 48.1%) and giving it the sign of the slope. With the outlier removed, the correlation drops to 0.032, the square root of 0.001 (an R-sq of 0.1%). Notice also that the original regression equation has a slope greater than 0, but with the outlier removed the slope is practically 0, i.e. the fitted line is nearly horizontal.

This example is somewhat exaggerated, but it illustrates the effect a single outlier can have on the correlation and the regression equation. Typically, influential points are far removed from the remaining data points in at least the horizontal direction. Here, the age of 75 and the blood pressure of 220 are both well beyond the range of the remaining data.

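[Figure 1: fitted line plots of Blood Pressure versus Age. Top: all 10 data points. Bottom: the point (75, 220) removed.]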
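The effect described above can be reproduced with a short sketch. The data below are hypothetical and are not the values plotted in Figure 1; only the outlier (75, 220) is taken from the text. The helper names `pearson_r` and `slope` are illustrative as well. The sketch computes the correlation and the least-squares slope with and without the outlier, using only the Python standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: nine ordinary points plus the outlier (75, 220).
ages = [30, 32, 35, 38, 40, 42, 45, 48, 50, 75]
blood_pressure = [125, 118, 130, 115, 128, 120, 116, 129, 122, 220]

# All ten points, outlier included.
r_all = pearson_r(ages, blood_pressure)
slope_all = slope(ages, blood_pressure)

# Same data with the last point (the outlier) dropped.
r_trimmed = pearson_r(ages[:-1], blood_pressure[:-1])
slope_trimmed = slope(ages[:-1], blood_pressure[:-1])

print(f"with outlier:    r = {r_all:.3f}, R-sq = {r_all ** 2:.1%}, slope = {slope_all:.3f}")
print(f"without outlier: r = {r_trimmed:.3f}, R-sq = {r_trimmed ** 2:.1%}, slope = {slope_trimmed:.3f}")
```

With these made-up values the correlation is about 0.85 when the outlier is included and essentially 0 when it is removed, and the slope falls from about 2 to nearly 0, the same pattern as in Figure 1.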

Correlation and Causation

If we conduct a study and establish a strong correlation, does this mean we have also established causation? That is, if two variables are related, does that imply that one variable causes the other? Consider smoking cigarettes and lung cancer: does smoking cause lung cancer? Initially the answer given was yes, but it rested only on a strong correlation between smoking and lung cancer. Not until scientific research verified that smoking can lead to lung cancer was causation established. If you review the history of cigarette warning labels, the first mandated label mentioned only that smoking was hazardous to your health. Not until 1981 did the label mention that smoking causes lung cancer. (See warning labels.) To establish causation, one must rule out the possibility of lurking variable(s). The best way to accomplish this is through a solid experimental design, preferably one that uses a control group.
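To see how a lurking variable can produce correlation without causation, consider the following simulation sketch. Everything in it is an assumption made for illustration: the variable names, the sample size, the seed, and the model itself, in which a lurking variable z drives both x and y while x and y have no direct effect on each other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# Lurking variable z influences both x and y; x never influences y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

# x and y are clearly correlated even though neither causes the other.
r_xy = pearson_r(x, y)

# Subtracting z's contribution (possible here only because the data
# were simulated) leaves independent noise with correlation near 0.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_resid, y_resid)

print(f"correlation of x and y:            {r_xy:.3f}")
print(f"correlation after accounting for z: {r_resid:.3f}")
```

In real studies the lurking variable is usually unobserved, so it cannot simply be subtracted out as it is here. That is why a well-designed experiment with a control group is preferred.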