7.3 - Visualizing Correlation

The following four graphs illustrate four possible situations for the value of r. Pay particular attention to graph (d), which shows a strong relationship between y and x even though r = 0. The absence of a linear relationship does not imply that no relationship exists!

a) \(r > 0\)

b) \(r < 0\)

c) \(r = 0\)

d) \(r=0\)
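The situation in graph (d) can be reproduced with a few numbers. The sketch below uses made-up data (not the data behind the graphs) in which y is completely determined by x, yet the relationship is a curve rather than a line. The helper name `pearson_r` is an illustrative choice, not part of the lesson.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed directly from the deviations.

    The (N - 1) terms in the covariance and the standard deviations
    cancel, so only the sums of products and squares are needed.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y = x^2, a perfect but non-linear relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_curve = pearson_r(x, y)
print(r_curve)  # essentially zero, even though y depends entirely on x
```

The positive and negative products of deviations cancel because the curve is symmetric around the mean of x, which is why r misses the relationship entirely.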

With a correlation coefficient in hand, Clark can now draw a conclusion about the association between home and away batting averages. Because the scatterplot suggests the relationship is linear, the correlation is an appropriate measure, and Clark can conclude that the correlation (r = 0.280) is not statistically significant (p > .05). In other words, the data do not provide evidence that a player’s home batting average is linearly associated with how well that player bats away.

Let’s take another look at Clark’s data. Now that we have a good understanding of correlation, we can take a closer look at another measure of association for quantitative variables, the covariance. Let’s revisit the covariance formula.

The variance is a measure of how much the observations deviate, on average, from their mean. In Clark’s example, the variance of the home batting average starts from each player’s home batting average minus the mean home batting average; these deviations are squared, summed, and divided by N − 1.

\(\text { Variance }=\dfrac{\sum\left(x_{i}-\bar{x}\right)^{2}}{N-1}\)
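As a rough illustration of the variance formula, the sketch below works through it step by step. The five batting averages are hypothetical values chosen for easy arithmetic, not Clark's data, and `sample_variance` is an illustrative helper name.

```python
import statistics

# Hypothetical home batting averages for five players (not Clark's data).
home = [0.250, 0.300, 0.275, 0.325, 0.350]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(v - mean) ** 2 for v in values]
    return sum(squared_devs) / (n - 1)

var_home = sample_variance(home)
sd_home = var_home ** 0.5  # the standard deviation is the square root

print(var_home, sd_home)
# The standard library uses the same N - 1 formula:
print(statistics.variance(home))
```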

Now, the covariance is a measure of how two variables vary together. In Clark’s example, for each player we multiply that player’s home batting average minus the mean of all home batting averages by the same player’s away batting average minus the mean of all away batting averages. The covariance is the sum of these products divided by N − 1.

\(Cov(x,y)=\dfrac{\sum(x_i-\bar{x})(y_i-\bar{y})}{N-1}\)
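The covariance formula can be sketched the same way. The paired home and away averages below are hypothetical (each position in the two lists is assumed to be the same player), and `sample_covariance` is an illustrative helper name.

```python
# Hypothetical paired batting averages (not Clark's data).
# Position i in each list is assumed to belong to the same player.
home = [0.250, 0.300, 0.275, 0.325, 0.350]
away = [0.240, 0.260, 0.290, 0.300, 0.310]

def sample_covariance(x, y):
    """Sum of products of paired deviations, divided by N - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

cov_home_away = sample_covariance(home, away)
print(cov_home_away)

# The covariance of a variable with itself is just its variance.
var_home = sample_covariance(home, home)
print(var_home)
```

A positive covariance means players who are above the mean at home tend to be above the mean away as well, but the size of the number depends on the units of both variables.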

Now, if you look back at the formula for the correlation, you can see that the correlation is simply the covariance divided by the product of the standard deviations of the two variables. In other words, just as taking the square root of the variance gives the standard deviation, a measure of spread that is easier to interpret, dividing the covariance by the two standard deviations gives the correlation, a unit-free measure that always falls between −1 and 1.

\(r=\dfrac{Cov(x,y)}{s_xs_y}=\dfrac{\sum(x_i-\bar{x})(y_i-\bar{y})}{(N-1)s_xs_y}\)
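To see what this standardization buys us, the sketch below computes r as the covariance divided by the product of the standard deviations, then re-expresses one variable in different units. The data are hypothetical, and the helper names are illustrative only.

```python
import math

# Hypothetical paired batting averages (not Clark's data).
home = [0.250, 0.300, 0.275, 0.325, 0.350]
away = [0.240, 0.260, 0.290, 0.300, 0.310]

def sample_covariance(x, y):
    """Sum of products of paired deviations, divided by N - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_sd(x):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_covariance(x, x))

def correlation(x, y):
    """r = Cov(x, y) / (s_x * s_y)."""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))

cov_xy = sample_covariance(home, away)
r = correlation(home, away)

# Re-express home averages as "hits per 1000 at-bats" (a change of units).
home_per_1000 = [h * 1000 for h in home]
cov_scaled = sample_covariance(home_per_1000, away)
r_scaled = correlation(home_per_1000, away)

print(cov_xy, cov_scaled)  # the covariance changes with the units
print(r, r_scaled)         # the correlation does not
```

The covariance grows by a factor of 1000 when the units change, while r stays the same, which is why the correlation is the easier of the two to interpret.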

Now, why do we even bother talking about the covariance when the correlation is much easier to understand (just as the standard deviation is easier to understand than the variance)? The reason is that, as statistical techniques become more advanced, the covariance of two variables can tell us a lot about how well our statistics are doing. While we will not go far into this here, it is important to understand this foundational idea of covariance if you progress further into statistics.

To summarize, the correlation is used to test whether a significant linear relationship exists between two quantitative variables. Because it measures only linear association, we should check a scatterplot for linearity before computing it. While you may rarely see or use the covariance directly, it is an important diagnostic tool for more advanced statistical techniques.