# 2.6 - (Pearson) Correlation Coefficient r

The correlation coefficient *r* is directly related to the coefficient of determination *r*^{2} in the obvious way. If *r*^{2} is represented in decimal form, *e.g.* 0.39 or 0.87, then all we have to do to obtain *r* is to take the square root of *r*^{2}:

\[r= \pm \sqrt{r^2}\]

The sign of *r* depends on the sign of the estimated slope coefficient *b*_{1}:

- If *b*_{1} is negative, then *r* takes a negative sign.
- If *b*_{1} is positive, then *r* takes a positive sign.

That is, the estimated slope and the correlation coefficient *r* always share the same sign. Furthermore, because *r*^{2} is always a number between 0 and 1, the correlation coefficient *r* is always a number between -1 and 1.
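The sign rule above can be sketched in a few lines of Python. This is only an illustration: the function name `r_from_r_squared` and the slope values passed to it are hypothetical, while 0.39 and 0.87 are the example *r*^{2} values mentioned earlier.

```python
import math


def r_from_r_squared(r_squared, b1):
    """Return r given r^2 (in decimal form) and the estimated slope b1."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must be a number between 0 and 1")
    magnitude = math.sqrt(r_squared)
    # copysign gives the magnitude the same sign as the slope b1.
    return math.copysign(magnitude, b1)


# Hypothetical slopes, chosen only to show both signs.
r_negative = r_from_r_squared(0.39, b1=-2.0)
r_positive = r_from_r_squared(0.87, b1=1.5)
print(r_negative, r_positive)
```

Whatever the slope, the result always lies between -1 and 1, because the square root of a number between 0 and 1 is itself between 0 and 1.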

One advantage of *r* is that it is unitless, allowing researchers to make sense of correlation coefficients calculated on different data sets with different units. The "unitless-ness" of the measure can be seen from an alternative formula for *r*, namely:

\[r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

If *x* is the height of an individual measured in inches and *y* is the weight of the individual measured in pounds, then the units of the numerator are inches × pounds. Similarly, the units of the denominator are inches × pounds, since it is the square root of inches² × pounds². Because they are the same, the units in the numerator and denominator cancel each other out, yielding a "unitless" measure.
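One way to check the unit argument numerically is to compute *r* from the formula above in two different unit systems. The sketch below uses only the Python standard library. The function name `pearson_r` and the five height and weight values are made up for illustration and do not come from any data set in these notes.

```python
import math


def pearson_r(x, y):
    """Compute r from the sums of cross-products and squared deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    # Raises ZeroDivisionError if x or y is constant (r is undefined then).
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements: heights in inches, weights in pounds.
heights_in = [60.0, 63.0, 66.0, 69.0, 72.0]
weights_lb = [115.0, 130.0, 128.0, 155.0, 170.0]

# The same individuals in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(r_imperial, r_metric)  # the two values agree up to rounding error
```

The conversion factors multiply the numerator and the denominator by the same amount, so they cancel just as the units do.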

Another formula for *r* that you might see in the regression literature is one that illustrates how the correlation coefficient *r* is a function of the estimated slope coefficient *b*_{1}:

\[r=\frac{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\times b_1\]

We are readily able to see from this version of the formula that:

- The estimated slope *b*_{1} of the regression line and the correlation coefficient *r* always share the same sign, because the ratio of square roots that multiplies *b*_{1} is positive.
- The correlation coefficient *r* is a unitless measure: the ratio of square roots has the units of *x* divided by the units of *y*, while *b*_{1} has the units of *y* divided by the units of *x*, so the units cancel.
- If the estimated slope *b*_{1} of the regression line is 0, then the correlation coefficient *r* must also be 0.
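The slope-based formula can be checked against the direct formula for *r* with a short Python sketch. It assumes the usual least-squares slope, *b*_{1} = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)², and uses a small made-up data set. The helper name `sums_of_squares` is hypothetical.

```python
import math


def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy): the sums of squared deviations and cross-products."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxx, syy, sxy


# Hypothetical data with a decreasing trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

sxx, syy, sxy = sums_of_squares(x, y)

# Assumed least-squares estimate of the slope.
b1 = sxy / sxx

# r as a function of the slope, and r from the direct formula.
r_via_slope = math.sqrt(sxx) / math.sqrt(syy) * b1
r_direct = sxy / math.sqrt(sxx * syy)

print(b1, r_via_slope, r_direct)
```

Both routes give the same value, and the sign of `r_via_slope` follows the sign of `b1` because the factor in front of it is a ratio of positive square roots.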

That's enough with the formulas! As always, we will let statistical software such as R or Minitab do the dirty calculations for us. For the skin cancer mortality and latitude example (skincancer.txt), the correlation between skin cancer mortality and latitude is -0.825. The order in which you specify the variables doesn't matter, so the correlation between latitude and skin cancer mortality is also -0.825.

What does this correlation coefficient tell us? That is, how do we interpret the Pearson correlation coefficient *r*? There is no nice practical operational interpretation for *r* as there is for *r*^{2}. You can only use *r* to make a statement about the strength of the linear relationship between *x* and *y*. In general:

- If *r* = -1, then there is a perfect negative linear relationship between *x* and *y*.
- If *r* = 1, then there is a perfect positive linear relationship between *x* and *y*.
- If *r* = 0, then there is no linear relationship between *x* and *y*.

All other values of *r* tell us that the relationship between *x* and *y* is not perfect. The closer *r* is to 0, the weaker the linear relationship. The closer *r* is to -1, the stronger the negative linear relationship. And, the closer *r* is to 1, the stronger the positive linear relationship. As is the case for the *r*^{2} value, what is deemed a "large" correlation coefficient *r* value depends greatly on the research area.
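The three boundary cases can be illustrated with a small Python sketch. All of the data here are invented for illustration, and `pearson_r` is a hypothetical helper that implements the sum-of-products formula for *r*. The last check also shows that swapping the two variables leaves *r* unchanged.

```python
import math


def pearson_r(x, y):
    """Compute r from the sums of cross-products and squared deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Points exactly on an increasing line and on a decreasing line.
y_increasing = [2.0 * xi + 1.0 for xi in x]
y_decreasing = [-3.0 * xi + 4.0 for xi in x]

# A curved relationship (y = x^2) over x values symmetric about 0:
# y depends on x, but not linearly.
y_curved = [xi ** 2 for xi in x]

r_increasing = pearson_r(x, y_increasing)
r_decreasing = pearson_r(x, y_decreasing)
r_curved = pearson_r(x, y_curved)
print(r_increasing, r_decreasing, r_curved)

# Swapping the roles of the two variables gives the same correlation.
y_noisy = [1.2, 0.4, 0.9, -0.3, -1.1]
r_xy = pearson_r(x, y_noisy)
r_yx = pearson_r(y_noisy, x)
print(r_xy, r_yx)
```

The curved case is a reminder that *r* = 0 speaks only to the *linear* relationship: here *y* is completely determined by *x*, yet the correlation is 0.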

So, what does the correlation of -0.825 between skin cancer mortality and latitude tell us? It tells us:

- The relationship is negative. As the latitude increases, the skin cancer mortality rate decreases (linearly).
- The relationship is quite strong, since the value is pretty close to -1.