2.6 - (Pearson) Correlation Coefficient r

The correlation coefficient r is directly related to the coefficient of determination r². If r² is expressed in decimal form, e.g. 0.39 or 0.87, then we obtain r by taking the square root of r²:

\[r= \pm \sqrt{r^2}\]

The sign of r depends on the sign of the estimated slope coefficient b₁:

If b₁ is negative, then r takes a negative sign.
If b₁ is positive, then r takes a positive sign.

That is, the estimated slope and the correlation coefficient r always share the same sign. Furthermore, because r² is always a number between 0 and 1, the correlation coefficient r is always a number between -1 and 1.
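As a minimal sketch of the relationship above, the following Python snippet fits a least-squares line to a small made-up data set, computes r², and then recovers r by attaching the sign of b₁ to the square root. The data values and variable names (`sxx`, `sxy`, `ssto`, `sse`) are assumptions chosen for illustration; they do not come from any data set in this lesson.

```python
import math

# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ssto = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Coefficient of determination: r^2 = 1 - SSE / SSTO.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / ssto

# r is the square root of r^2 carrying the sign of the slope b1.
r = math.copysign(math.sqrt(r_squared), b1)

print(f"b1 = {b1:.3f}, r^2 = {r_squared:.4f}, r = {r:.4f}")
```

Because the slope for this toy data is negative, the recovered r is negative as well, and it stays within the range from -1 to 1.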

One advantage of r is that it is unitless, allowing researchers to make sense of correlation coefficients calculated on different data sets with different units. The "unitless-ness" of the measure can be seen from an alternative formula for r, namely:

\[r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

If x is the height of an individual measured in inches and y is the weight of the individual measured in pounds, then the units of the numerator are inches × pounds. Similarly, the units of the denominator are inches × pounds. Because they are the same, the units in the numerator and denominator cancel each other out, yielding a "unitless" measure.
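One way to check the "unitless" property numerically is to compute r twice on the same individuals, once in inches and pounds and once in centimeters and kilograms. The sketch below does this with the sums-of-deviations formula; the helper name `pearson_r` and the height and weight values are hypothetical, made up purely for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from the sums-of-deviations formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds), invented for illustration.
height_in = [62.0, 65.0, 68.0, 70.0, 73.0]
weight_lb = [120.0, 140.0, 155.0, 165.0, 190.0]

# The same measurements converted to centimeters and kilograms.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)

# The two values agree up to floating-point rounding.
print(r_imperial, r_metric)
```

The conversion factors multiply the numerator and the denominator by the same amount, so they cancel in exactly the way the units do.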

Another formula for r that you might see in the regression literature is one that illustrates how the correlation coefficient r is a function of the estimated slope coefficient b₁:

\[r=\frac{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\times b_1\]

We are readily able to see from this version of the formula that:

The estimated slope b₁ of the regression line and the correlation coefficient r always share the same sign, because the ratio of square roots that multiplies b₁ is positive.

The correlation coefficient r is a unitless measure. The slope b₁ carries units of y per unit of x, while the ratio of square roots carries units of x per unit of y, so the units cancel.

If the estimated slope b₁ of the regression line is 0, then the correlation coefficient r must also be 0.
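The three observations above can be checked with a short sketch. It computes r from the slope-based formula, compares it with the sums-of-deviations formula, and includes one data set whose fitted slope is exactly 0. The function name `slope_and_r` and both data sets are hypothetical, introduced only for this illustration.

```python
import math

def slope_and_r(x, y):
    """Return (b1, r from the slope-based formula, r from the direct formula)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                              # least-squares slope
    r_from_slope = math.sqrt(sxx) / math.sqrt(syy) * b1
    r_direct = sxy / math.sqrt(sxx * syy)
    return b1, r_from_slope, r_direct

# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_down = [9.0, 8.5, 6.0, 5.5, 3.0]   # decreasing trend
y_flat = [3.0, 1.0, 2.0, 1.0, 3.0]   # no linear trend: fitted slope is 0

b1_down, r_down, r_down_direct = slope_and_r(x, y_down)
b1_flat, r_flat, r_flat_direct = slope_and_r(x, y_flat)

print(b1_down, r_down, r_down_direct)
print(b1_flat, r_flat, r_flat_direct)
```

For the decreasing data both the slope and r are negative, and for the flat data both are 0, which matches what the formula suggests.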

That's enough with the formulas! As always, we will let statistical software such as R or Minitab do the dirty calculations for us. For the skin cancer mortality and latitude example (skincancer.txt), the correlation between skin cancer mortality and latitude is -0.825. The order in which you specify the variables does not matter, so the correlation between latitude and skin cancer mortality is also -0.825. What does this correlation coefficient tell us? That is, how do we interpret the Pearson correlation coefficient r? Unlike r², r has no nice practical operational interpretation. You can only use r to make a statement about the direction and strength of the linear relationship between x and y. In general:

If r = -1, then there is a perfect negative linear relationship between x and y.
If r = 1, then there is a perfect positive linear relationship between x and y.
If r = 0, then there is no linear relationship between x and y.

All other values of r tell us that the relationship between x and y is not perfect. The closer r is to 0, the weaker the linear relationship. The closer r is to -1, the stronger the negative linear relationship. And, the closer r is to 1, the stronger the positive linear relationship. As is the case for the r² value, what is deemed a "large" correlation coefficient r value depends greatly on the research area.
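To make the benchmark values concrete, here is a small sketch that evaluates r on a perfectly increasing line, a perfectly decreasing line, and a symmetric curve. The curve is an added illustration (an assumption, not an example from this lesson) of why r = 0 means only that there is no linear relationship: y is completely determined by x, yet r is 0. The sketch also checks that the order of the variables does not matter. The helper `pearson_r` and all data values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from the sums-of-deviations formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical x values, symmetric about 0.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]

y_up = [2 * xi + 1 for xi in x]      # perfect positive linear relationship
y_down = [-3 * xi + 5 for xi in x]   # perfect negative linear relationship
y_curve = [xi ** 2 for xi in x]      # perfect, but nonlinear, relationship

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)

# Swapping the roles of x and y leaves r unchanged.
r_swapped = pearson_r(y_down, x)

print(r_up, r_down, r_curve, r_swapped)
```

The curved case is a reminder to look at a scatterplot before reading a small r as evidence of no relationship at all.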

So, what does the correlation of -0.825 between skin cancer mortality and latitude tell us? It tells us:

The relationship is negative. As the latitude increases, the skin cancer mortality rate decreases (linearly).
The relationship is quite strong (since the value is pretty close to -1).
