7: Correlation - What it really means

Overview

Case Study: Baseball

Clark is a big Boston Red Sox fan. Noticing that the Red Sox had a slightly better home record than away record, he wanted to see whether there is an association between players' batting averages at away (y) versus home (x) games in the 2018 regular season. Clark included stats from players who played at least 3 away and 3 home games (one player had a batting average of .000 for home games). He got his data from mlb.com/stats and used Minitab to create a scatterplot of the data.

From his scatterplot, he was surprised to see that players with higher batting averages at home did not always have higher batting averages away. He also noticed that two of the top hitters, Mookie Betts and JD Martinez, appeared in the upper right-hand corner.

In looking at the data, Clark could not decide what to conclude. He decided to run a correlation in Minitab. He got the following output: r = 0.280, p-value = 0.261. What should he conclude about the batting averages home vs away?

Clark is off to a great start. He realizes that he is working with two quantitative variables (home batting average and away batting average). Let’s take a closer look at the statistical methods he used to come up with his results.

Objectives

Upon completion of this lesson, you should be able to:

Use a scatterplot to appropriately graph two quantitative variables
Apply a correlation technique to two quantitative variables
Define the difference between a correlation and a covariance
Identify the magnitude, direction, and linearity of a correlation coefficient.

7.1 - Scatterplots

With Clark’s interest in investigating the relationship between two quantitative variables, he correctly started with a scatterplot.

Scatterplot: A graphical representation of two quantitative variables where the explanatory variable is on the x-axis and the response variable is on the y-axis.

When we look at the scatterplot, keep in mind the following questions:

What is the direction of the relationship?
Is the relationship linear or nonlinear?
Is the relationship weak, moderate, or strong?
Are there any outliers or extreme values?

We describe the direction of the relationship as positive or negative. A positive relationship means that as the value of the explanatory variable increases, the value of the response variable increases, in general. A negative relationship implies that as the value of the explanatory variable increases, the value of the response variable tends to decrease.

Looking at Clark’s scatterplot, we see a positive direction. As home batting average is increasing in value (moving toward the right on the X axis) away batting average is also increasing in value (moving up the Y axis).

While Clark does not have a lot of data, it is possible to see that the pattern of the data suggests a linear (straight line) pattern. There is no discernable curve to the pattern. While this is not as obvious as we might like it to be, for real-world data, this is pretty typical.

Finally, we want to ask about the magnitude of the relationship. We do this by looking at the degree of the slope (think running up a hill). The red line has a slope that is closer to a flat, zero line (no hill running) than to a steep slope (hard hill running).

Now that Clark has taken a closer look at describing his data, let’s take a closer look at analyzing the relationship between home and away batting averages!

7.2 - Correlation

The Correlation Coefficient

If we want to provide a measure of the strength of the linear relationship between home and away batting averages (two quantitative variables), a good way is to report the correlation coefficient between them.

The sample correlation coefficient is typically denoted as \(r\). It is also known as Pearson’s \(r\). The population correlation coefficient is generally denoted as \(\rho\), pronounced “rho.”

Sample Correlation Coefficient

The sample correlation coefficient, \(r\), is calculated using the following formula:

\( r=\dfrac{\sum (x_i-\bar{x})(y_i-\bar{y}) }{\sqrt{\sum (x_i-\bar{x})^2}\sqrt{\sum (y_i-\bar{y})^2}} \)

If you have a solid foundation in the material covered in this course up to this point, you should notice that the terms \(x_i-\bar{x}\) and \(y_i-\bar{y}\) are simple deviation scores. As you (should) know, the deviation score is the starting point for calculating the variance of a variable. Thus a correlation coefficient is built on the covariance of two variables: its numerator sums the products of the paired deviation scores.

The advantage of the correlation coefficient is that the denominator provides a standardization of the value of the correlation coefficient because it divides the covariance by the product of the standard deviations of the two variables.
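The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the routine Minitab uses; the function name `pearson_r` and the sample values are made up for the example and are not Clark's batting data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient computed directly from the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Deviation scores for each variable
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    # Numerator: sum of the products of paired deviation scores
    numerator = sum(a * b for a, b in zip(dx, dy))
    # Denominator: product of the square roots of the sums of squared deviations
    # (this is zero, and the division fails, if either variable is constant)
    denominator = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return numerator / denominator

# Made-up illustrative values, NOT Clark's data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(round(r, 3))  # 0.8
```

Here the sum of deviation products is 8 and each sum of squared deviations is 10, so r = 8 / 10 = 0.8.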

Properties of the Correlation Coefficient, r

To summarize, some important properties of the correlation coefficient, r :

\(-1\le r\le 1\), i.e. \(r\) takes values between -1 and +1, inclusive.
The sign of the correlation provides the direction of the linear relationship. The sign indicates whether the two variables are positively or negatively related.
A correlation of 0 means there is no linear relationship.
There are no units attached to \(r\).
The closer the magnitude of \(r\) is to 1, the stronger the linear relationship.
The closer the magnitude of \(r\) is to 0, the weaker the linear relationship.
If we fit the simple linear regression model between Y and X, then \(r\) has the same sign as \(\beta_1\), which is the coefficient of X in the linear regression equation (more on this later).
The correlation value would be the same regardless of which variable we defined as X and Y.
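Several of these properties can be checked numerically. The sketch below uses made-up values (not Clark's data) and two hypothetical helper functions, `pearson_r` and `slope`, where `slope` is the usual least-squares slope of y on x. It checks that swapping X and Y leaves r unchanged, that reversing the direction of one variable flips the sign of r, and that r and the fitted slope share a sign.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of y on x: sum of deviation products over sum of squared x deviations."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Made-up illustrative values, NOT Clark's data
x = [1, 2, 3, 4, 5]
y = [4, 2, 8, 6, 10]

r_xy = pearson_r(x, y)                      # X and Y as given
r_yx = pearson_r(y, x)                      # roles of X and Y swapped
r_flipped = pearson_r(x, [-v for v in y])   # direction of y reversed
b1 = slope(x, y)                            # fitted slope of y on x

print(r_xy, r_yx, r_flipped, b1)
```

With these values the slope (1.6) and the correlation (0.8) differ in size but agree in sign, which is all the property claims.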

7.3 - Visualizing Correlation

The following four graphs illustrate four possible situations for the values of r. Pay particular attention to graph (d) which shows a strong relationship between y and x but where r = 0. Note that no linear relationship does not imply no relationship exists!

a) \(r > 0\)

b) \(r < 0\)

c) \(r = 0\)

d) \(r=0\)
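A situation like graph (d) is easy to reproduce. In the sketch below (made-up values, with a hypothetical helper `pearson_r`), y is completely determined by x through a curve, yet the correlation comes out to zero because the relationship is not linear.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up values: y is an exact function of x, but the pattern is a U shape
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # 0.0
```

The positive and negative deviation products cancel exactly, so r gives no hint of the strong curved relationship. This is why the scatterplot comes first.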

With a correlation coefficient, Clark can now draw a conclusion about the association between home and away batting averages. With a scatterplot that suggests linearity, Clark can conclude that the correlation (r = 0.280) is not statistically significant (p = 0.261 > .05). This indicates that Clark's data do not provide enough evidence of a linear association between a player's home batting average and how well that player bats away.
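
For readers wondering where a p-value for a correlation comes from: one common approach is a t test of the null hypothesis \(\rho=0\). This is an assumption here, since the lesson does not state which test Minitab applied. Under that approach, the test statistic is

\( t=\dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \)

which is compared with a t distribution with \(n-2\) degrees of freedom, where \(n\) is the number of players. The lesson does not report \(n\), so the statistic is not evaluated here.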

Let’s take another look at Clark’s data. Now that we have a good understanding of correlation, we can take a closer look at another measure of association for quantitative variables, the covariance. Let’s revisit the covariance formula.

The variance is a measure of how much the observations deviate from their mean, on average. In Clark’s example, the variance of the home batting average starts from each player’s home batting average minus the mean home batting average; these deviations are squared, summed, and divided by \(N-1\).

\(\text { Variance }=\dfrac{\sum\left(x_{i}-\bar{x}\right)^{2}}{N-1}\)

Now, the covariance is simply a measure of how two variables vary together. In Clark’s example, we multiply each player’s home batting average minus the mean of all home batting averages by that same player’s away batting average minus the mean of all away batting averages, then sum these products over all players and divide by \(N-1\).

\(Cov(x,y)=\dfrac{\sum(x_i-\bar{x})(y_i-\bar{y})}{N-1}\)

Now, if you look back at the formula for the correlation, you can see that the correlation is simply the covariance divided by the product of the standard deviations of the two variables. In other words, just as we standardized the variance to create the standard deviation, we standardize the covariance to create the correlation.

\(r=\dfrac{Cov(x,y)}{s_xs_y}=\dfrac{\sum(x_i-\bar{x})(y_i-\bar{y})}{(N-1)s_xs_y}\)
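
To see why the standardization matters, the sketch below computes the covariance and the correlation for the same made-up data expressed in two different units. The helper names (`covariance`, `variance`, `correlation`) and the values are illustrative assumptions, not Clark's data or Minitab's routines.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: sum of paired deviation products divided by N - 1."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    """Sample variance, i.e. the covariance of a variable with itself."""
    return covariance(xs, xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (sqrt(variance(xs)) * sqrt(variance(ys)))

# Made-up illustrative values, NOT Clark's data
x = [1, 2, 3, 4, 5]
y = [4, 2, 8, 6, 10]
x_scaled = [1000 * v for v in x]   # same measurements in a unit 1000 times smaller

cov_xy = covariance(x, y)
cov_scaled = covariance(x_scaled, y)
r = correlation(x, y)
r_scaled = correlation(x_scaled, y)

print(cov_xy, cov_scaled)                   # covariance changes with the units
print(round(r, 3), round(r_scaled, 3))      # correlation does not
```

Changing the units multiplies the covariance by 1000, while the correlation stays the same, which is why the correlation is easier to interpret.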

Now, why do we even bother talking about the covariance when the correlation is much easier to understand (just like the standard deviation is much easier to understand)? The reason is that as statistical techniques become more advanced the covariance of two variables can tell us a lot about how well our statistics are doing. While we will not get too far into this, it is important to understand this foundational idea of covariance if you progress further into statistics.

To summarize, the correlation is used to test whether a significant linear relationship exists between two quantitative variables. Because it measures only linear association, we should check a scatterplot for linearity before computing a correlation. While you may never see or use the covariance directly, it is an important diagnostic tool for more advanced statistical techniques.

7.4 - Summary

Case-Study: Baseball

Clark correctly identified his data as two quantitative variables. His question about these two variables was simple: is there any relationship between them? By examining the magnitude, direction, and linearity of the scatterplot, we were able to describe the relationship between home and away batting average. Next, the correlation coefficient allowed Clark to conclude that there is not enough evidence of a linear relationship between home and away batting averages. So we may not know what to expect from the Red Sox on the road!
