2.6 - (Pearson) Correlation Coefficient r

The correlation coefficient r is directly related to the coefficient of determination r². If r² is expressed in decimal form, e.g. 0.39 or 0.87, then taking its square root gives the magnitude of r:

\[r= \pm \sqrt{r^2}\]

The sign of r depends on the sign of the estimated slope coefficient b₁:

  • If b₁ is negative, then r takes a negative sign.
  • If b₁ is positive, then r takes a positive sign.

That is, the estimated slope and the correlation coefficient r always share the same sign. Furthermore, because r² is always a number between 0 and 1, the correlation coefficient r is always a number between -1 and 1.
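As a minimal sketch of this rule, the helper below recovers r from r² and the sign of the estimated slope. The function name `r_from_r_squared` and the slope values are hypothetical, chosen only for illustration; the r² values 0.39 and 0.87 are the ones mentioned above.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Return r given r^2 (a decimal in [0, 1]) and the estimated slope b1."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    if slope == 0:
        # A zero slope means there is no linear trend, so r is 0.
        return 0.0
    # The magnitude comes from the square root; the sign comes from the slope.
    return math.copysign(math.sqrt(r_squared), slope)

# Hypothetical slopes, used only to supply a sign.
r_negative = r_from_r_squared(0.39, slope=-2.5)  # about -0.624
r_positive = r_from_r_squared(0.87, slope=1.2)   # about 0.933
print(r_negative, r_positive)
```

Whatever the inputs, the result stays between -1 and 1 because the square root of a number in [0, 1] is itself in [0, 1].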

One advantage of r is that it is unitless, allowing researchers to make sense of correlation coefficients calculated on different data sets with different units. The "unitless-ness" of the measure can be seen from an alternative formula for r, namely:

\[r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

If x is the height of an individual measured in inches and y is the weight of the individual measured in pounds, then the units of the numerator are inches × pounds. The denominator is the square root of inches² × pounds², so its units are also inches × pounds. Because they are the same, the units in the numerator and denominator cancel, yielding a "unitless" measure.
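One way to see the unitless property in practice is to compute r from the sums-of-products formula, convert the data to different units, and compute it again. The sketch below does this in plain Python; the function name `pearson_r` and the height and weight values are made up for illustration and are not taken from any course data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from the sums-of-products formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up heights (inches) and weights (pounds).
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [110, 125, 140, 155, 170, 190]

# The same measurements expressed in centimeters and kilograms.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)

# The two values agree up to floating-point rounding.
print(r_imperial, r_metric)
```

The conversion factors multiply the numerator and the denominator by the same amount, so they cancel just as the units do.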

Another formula for r that you might see in the regression literature is one that illustrates how the correlation coefficient r is a function of the estimated slope coefficient b₁:

\[r=\frac{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\times b_1\]

We are readily able to see from this version of the formula that:

  • The estimated slope b₁ of the regression line and the correlation coefficient r always share the same sign. The factor multiplying b₁ is a ratio of square roots and is therefore positive, so r takes its sign from b₁.
  • The correlation coefficient r is a unitless measure. The slope b₁ has units of y per unit of x, while the factor multiplying it has units of x per unit of y, so the units cancel.
  • If the estimated slope b₁ of the regression line is 0, then the correlation coefficient r must also be 0.
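The agreement between the two formulas can be checked numerically. The sketch below computes the least-squares slope b₁ as the sum of cross-products divided by the sum of squared x deviations (the usual simple linear regression estimate), then compares r computed directly with r computed from the slope. The function name `slope_and_r` and the data are hypothetical.

```python
import math

def slope_and_r(x, y):
    """Return (b1, r computed directly, r computed from the slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = s_xy / s_xx                                  # least-squares slope
    r_direct = s_xy / math.sqrt(s_xx * s_yy)          # sums-of-products formula
    r_via_slope = math.sqrt(s_xx) / math.sqrt(s_yy) * b1  # slope-based formula
    return b1, r_direct, r_via_slope

# Made-up data with a downward trend.
x = [1, 2, 3, 4, 5]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

b1, r_direct, r_via_slope = slope_and_r(x, y)
print(b1, r_direct, r_via_slope)
```

For these data the slope is negative, and both versions of r come out negative and equal up to rounding.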

That's enough with the formulas! As always, we will let statistical software such as R or Minitab do the dirty calculations for us. For the skin cancer mortality and latitude example (skincancer.txt), the correlation between skin cancer mortality and latitude is -0.825. The order in which you specify the variables does not matter, so the correlation between latitude and skin cancer mortality is also -0.825.

What does this correlation coefficient tell us? That is, how do we interpret the Pearson correlation coefficient r? There is no nice practical operational interpretation for r as there is for r². You can only use r to make a statement about the direction and strength of the linear relationship between x and y. In general:

  • If r = -1, then there is a perfect negative linear relationship between x and y.
  • If r = 1, then there is a perfect positive linear relationship between x and y.
  • If r = 0, then there is no linear relationship between x and y.

All other values of r tell us that the relationship between x and y is not perfect. The closer r is to 0, the weaker the linear relationship. The closer r is to -1, the stronger the negative linear relationship. And, the closer r is to 1, the stronger the positive linear relationship. As is the case for the r² value, what is deemed a "large" correlation coefficient r value depends greatly on the research area.
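The three benchmark values can be illustrated with small constructed data sets. In the sketch below, all data are made up: two sets lie exactly on a line, and the third follows y = x², a case where x and y are clearly related but not linearly, so r comes out as 0. The helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from the sums-of-products formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [-2, -1, 0, 1, 2]

y_falling = [-3 * xi + 5 for xi in x]  # points exactly on a falling line
y_rising = [2 * xi + 1 for xi in x]    # points exactly on a rising line
y_curved = [xi ** 2 for xi in x]       # a parabola symmetric about x = 0

r_falling = pearson_r(x, y_falling)  # perfect negative linear relationship
r_rising = pearson_r(x, y_rising)    # perfect positive linear relationship
r_curved = pearson_r(x, y_curved)    # no linear relationship

print(r_falling, r_rising, r_curved)
```

The parabola is a reminder that r = 0 rules out only a linear relationship, not every kind of relationship between x and y.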

So, what does the correlation of -0.825 between skin cancer mortality and latitude tell us? It tells us:

  • The relationship is negative. As the latitude increases, the skin cancer mortality rate decreases (linearly).
  • The relationship is quite strong (since the value is pretty close to -1).