18.4 - More on Understanding Rho

Although we have started investigating the meaning of the correlation coefficient, we've still been dancing quite a bit around the question of what exactly the correlation coefficient:

\(Corr(X,Y)=\rho=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{\sum_x\sum_y (x-\mu_X)(y-\mu_Y)f(x,y)}{\sigma_X \sigma_Y}\)

tells us. Since this is the last page of the lesson, I guess there is no more procrastinating! Let's spend this page, then, trying to come up with answers to the following questions:

  1. How does \(\rho_{XY}\) get its sign?
  2. Why is \(\rho_{XY}\) a measure of linear relationship?
  3. Why is \(-1 \leq \rho_{XY} \leq 1\)?
  4. Why does \(\rho_{XY}\) close to −1 or +1 indicate a strong linear relationship?

Question #1

Let's tackle the first question. How does \(\rho_{XY}\) get its sign? Well, we can get a good feel for the answer to that question by simply studying the formula for the correlation coefficient:

\(Corr(X,Y)=\rho=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{\sum_x\sum_y \overbrace{(x-\mu_X)(y-\mu_Y)}^{<0 \text{ OR } >0}f(x,y)}{\underbrace{\sigma_X \sigma_Y}_{>0}}\)

The standard deviations \(\sigma_X\) and \(\sigma_Y\) are positive. Therefore, the product \(\sigma_X\sigma_Y\) must also be positive (>0). And, the joint probability mass function must be nonnegative... well, positive (>0) for at least some elements of the support. It is the product:

\((x-\mu_X)(y-\mu_Y)\)

that can be either positive (>0) or negative (<0). That is, the correlation coefficient gets its sign, negative (−) or positive (+), depending on how most of the \((x,y)\) points in the support relate to the \(x=\mu_X\) and \(y=\mu_Y\) lines. Let's take a look at two examples.

Suppose we were interested in studying the relationship between atmospheric pressure \(X\) and the boiling point \(Y\) of water. Then, our plot might look something like this:

[Figure: scatterplot of boiling point (\(Y\)) versus atmospheric pressure (\(X\)), with the lines \(x=\mu_X\) and \(y=\mu_Y\) drawn in.]

The plot suggests that as the atmospheric pressure increases, so does the boiling point of water. Now, what does the plot tell us about the product:

\((x-\mu_X)(y-\mu_Y)\)

Well, it tells us this:

[Figure: the same scatterplot, divided into quadrants by the \(x=\mu_X\) and \(y=\mu_Y\) lines, with the regions labeled \(x-\mu_X<0\) or \(x-\mu_X>0\), and \(y-\mu_Y<0\) or \(y-\mu_Y>0\).]

That is, in the upper right quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is positive; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is positive. Therefore, any \((x,y)\) data point in the upper right quadrant produces a positive product \((x-\mu_X)(y-\mu_Y)\). Now for the lower left quadrant, where the remaining points lie. In the lower left quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is negative; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is negative. Therefore, any \((x,y)\) data point in the lower left quadrant also produces a positive product \((x-\mu_X)(y-\mu_Y)\). So, regardless... every data point in this plot produces a positive product \((x-\mu_X)(y-\mu_Y)\). Therefore, when we add up those positive products over all \(x\) and \(y\), we're going to get a positive correlation coefficient. In general, when there is a positive linear relationship between \(X\) and \(Y\), the sign of the correlation coefficient is going to be positive. Makes intuitive sense!
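
By the way, if you'd like to see that arithmetic in action, here is a minimal Python sketch using made-up pressure and boiling-point pairs (the numbers are purely illustrative, not the data behind the plot):

```python
# Hypothetical (made-up) pressure/boiling-point pairs that rise together.
pressures = [20.8, 22.1, 23.5, 24.8, 26.4, 27.9, 29.5]
boiling   = [194.5, 197.0, 199.8, 202.5, 205.6, 208.4, 211.2]

mu_x = sum(pressures) / len(pressures)   # plays the role of mu_X
mu_y = sum(boiling) / len(boiling)       # plays the role of mu_Y

# Each point sits in the upper right or lower left quadrant, so each
# product of deviations is positive, and so is their sum.
products = [(x - mu_x) * (y - mu_y) for x, y in zip(pressures, boiling)]
print(products)
print(sum(products) > 0)   # True: a positive relationship gives a positive sign
```

A mirror-image check with a decreasing relationship, like the IQ example that follows, would produce mostly negative products and hence a negative sum.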

Now, let's take a look at an example in which the relationship between \(X\) and \(Y\) is negative. Suppose we were interested in studying the relationship between a person's IQ \(X\) and the delinquency index \(Y\) of the person. Well, one researcher investigated the relationship, and published a plot that looked something like this:

[Figure: scatterplot of delinquency index (\(Y\)) versus IQ (\(X\)), with the lines \(x=\mu_X\) and \(y=\mu_Y\) drawn in.]

The plot suggests that as IQs increase, the delinquency indices decrease. That is, there is an inverse or negative relationship. Now, what does the plot tell us about the product:

\((x-\mu_X)(y-\mu_Y)\)

Well, it tells us this:

[Figure: the same scatterplot, divided into quadrants by the \(x=\mu_X\) and \(y=\mu_Y\) lines, with the regions labeled \(x-\mu_X<0\) or \(x-\mu_X>0\), and \(y-\mu_Y<0\) or \(y-\mu_Y>0\).]

That is, in the upper left quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is negative; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is positive. Therefore, any \((x,y)\) data point in the upper left quadrant produces a negative product. Now for the lower right quadrant, where most of the remaining points lie. In the lower right quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is positive; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is negative. Therefore, any \((x,y)\) data point in the lower right quadrant also produces a negative product. Now, there are a few data points that lie in the upper right and lower left quadrants that would produce a positive product. But, since most of the data points produce negative products, the sum of the products would still be negative. In general, when there is a negative linear relationship between \(X\) and \(Y\), the sign of the correlation coefficient is going to be negative. Again, makes intuitive sense!

Questions #2, #3, #4

As it turns out, answering the last three questions is going to take a bit of preliminary work before we arrive at the final answers. To make our work concrete, let's suppose that the random variables \(X\) and \(Y\) have a trinomial distribution with \(n=2, p_1=\frac{1}{4}, p_2=\frac{1}{2}\), and \(0\le x+y\le 2\). For trinomial random variables, we typically represent the joint probability mass function as a formula. In this case, let's represent the joint probability mass function as a graph:

[Figure: the six support points \((x,y)\) with \(x,y\in\{0,1,2\}\) and \(0\le x+y\le 2\), each labeled with its probability: \(f(0,0)=\frac{1}{16}\), \(f(1,0)=\frac{2}{16}\), \(f(2,0)=\frac{1}{16}\), \(f(0,1)=\frac{4}{16}\), \(f(1,1)=\frac{4}{16}\), \(f(0,2)=\frac{4}{16}\).]

Each of the black dots (•) represents an element of the joint support \(S\). As we should expect with a trinomial, the support is triangular. The probabilities that \(X=x\) and \(Y=y\) are indicated in blue. For example, the probability that \(X=0\) and \(Y=1\) is \(\frac{4}{16}\). You can verify these probabilities, if you are so inclined, using the formula for the trinomial p.m.f.
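
If you'd rather let a computer do that verification, here is a minimal Python sketch, assuming nothing beyond the standard library (the helper `f` simply mirrors the p.m.f. notation):

```python
from fractions import Fraction as F
from math import factorial

# Trinomial parameters from the example: n = 2, p1 = 1/4, p2 = 1/2.
n, p1, p2 = 2, F(1, 4), F(1, 2)

# Trinomial p.m.f.: f(x,y) = n!/(x! y! (n-x-y)!) p1^x p2^y (1-p1-p2)^(n-x-y)
def f(x, y):
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coef * p1**x * p2**y * (1 - p1 - p2)**(n - x - y)

# Print the probability attached to each support point; they sum to 1.
for x in range(n + 1):
    for y in range(n + 1 - x):
        print(f"f({x},{y}) = {f(x, y)}")   # e.g. f(0,1) = 1/4, that is, 4/16
```

What we want to do here, though, is explore the correlation between \(X\) and \(Y\). Now, we'll soon see that we can learn something about the correlation \(\rho_{XY}\) by considering the best fitting line through the \((x,y)\) points in the support. Specifically, consider the best fitting line passing through the point \((\mu_X, \mu_Y)\). We don't yet know what the best fitting line is, but we could "eyeball" such a line on our graph. That's what the red line is here, an "eyeballed" best fitting line: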

[Figure: the same graph, with a red "eyeballed" best fitting line drawn through the point \((\mu_X,\mu_Y)=(\frac{1}{2},1)\).]

As the plot suggests, the mean of \(X\) is \(\frac{1}{2}\) and the mean of \(Y\) is 1 (that's because \(X\) is binomial with \(n=2\) and \(p_1=\frac{1}{4}\), and \(Y\) is binomial with \(n=2\) and \(p_2=\frac{1}{2}\)). Now, what we want to do is find the formula for the best fitting (red) line passing through the point \((\mu_X, \mu_Y)\). Well, we know that two points determine a line. So, along with the \((\mu_X, \mu_Y)\) point, let's pick an arbitrary point \((x,y)\) on the line:

[Figure: the same graph, with an arbitrary point \((x,y)\) on the red line, and the rise \(y-\mu_Y\) and run \(x-\mu_X\) marked.]

Then, we know that the slope of the line is rise over run. That is:

\(\text{slope}=\dfrac{\text{rise}}{\text{run}}=\dfrac{y-\mu_Y}{x-\mu_X}=b\)

and the line is therefore of the form:

\(y-\mu_Y=b(x-\mu_X)\) or \(y=\mu_Y+b(x-\mu_X)\)

Now to find the best fitting line, we'll use the principle of least squares. That is, we'll find the slope \(b\) that minimizes the expected squared vertical distance between a point \((x_0, y_0)\) in the joint support \(S\) and the corresponding point on the line:

\((x_0,\mu_Y+b(x_0-\mu_X))\)

as illustrated here in green:

[Figure: the same graph, with the vertical distance between a support point \((x_0,y_0)\) and the corresponding point on the line, \((x_0,\mu_Y+b(x_0-\mu_X))\), shown in green.]

That is, we need to find the \(b\) that minimizes:

\(K(b)=E\{[(Y-\mu_Y)-b(X-\mu_X)]^2\}\)

The resulting line is called the least squares regression line. What is the least squares regression line?

Solution

Before differentiating, let's start with simplifying the thing that we are trying to minimize:

\(K(b)=E\{[(Y-\mu_Y)-b(X-\mu_X)]^2\}\)

getting:

\begin{align} K(b) &= E[(Y-\mu_Y)^2]-2bE[(X-\mu_X)(Y-\mu_Y)]+b^2E[(X-\mu_X)^2]\\ &= \sigma^2_Y-2b\,Cov(X,Y)+b^2\sigma^2_X\\ &= \sigma^2_Y-2b\rho \sigma_X\sigma_Y +b^2 \sigma^2_X \end{align}

upon recalling that \(Cov(X,Y)=\rho \sigma_X \sigma_Y\).

Now, to find the slope \(b\) that minimizes \(K(b)\), the expected squared vertical distances, we need to differentiate \(K(b)\) with respect to \(b\), and set the resulting derivative to 0. Doing so, we get:

\(K'(b)=-2\rho \sigma_X\sigma_Y + 2b\sigma^2_X \equiv 0\)

Then, solving for \(b\), we first get:

\(b \sigma^2_X=\rho \sigma_X\sigma_Y \)

and then finally:

\(b=\rho \dfrac{\sigma_Y}{\sigma_X}\)

Note that \(b\) does indeed minimize \(K(b)\), because the second derivative of \(K(b)\) is positive. That is:

\(K''(b)=2\sigma^2_X >0\)

Now, we can substitute what we have found for the slope \(b\) into our equation:

\(y=\mu_Y +b(x-\mu_X)\)

getting the least squares regression line:

\(y=\mu_Y + \rho \left(\dfrac{\sigma_Y}{\sigma_X}\right) (x-\mu_X)\)

By the way, note that, because the standard deviations of \(X\) and \(Y\) are positive, if the correlation coefficient \(\rho_{XY}\) is positive, then the slope of the least squares line is also positive. Similarly, if the correlation coefficient \(\rho_{XY}\) is negative, then the slope of the least squares line is also negative.
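
If you'd like to check both the sign claim and the slope formula numerically, here is a sketch that computes the moments of our trinomial example directly from its p.m.f. (the same hypothetical helper `f` as before; nothing beyond the standard library is assumed):

```python
from fractions import Fraction as F
from math import factorial, sqrt

# Same trinomial example as above: n = 2, p1 = 1/4, p2 = 1/2.
n, p1, p2 = 2, F(1, 4), F(1, 2)

def f(x, y):
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coef * p1**x * p2**y * (1 - p1 - p2)**(n - x - y)

support = [(x, y) for x in range(n + 1) for y in range(n + 1 - x)]

mu_X  = sum(x * f(x, y) for x, y in support)                        # 1/2
mu_Y  = sum(y * f(x, y) for x, y in support)                        # 1
var_X = sum((x - mu_X) ** 2 * f(x, y) for x, y in support)          # 3/8
var_Y = sum((y - mu_Y) ** 2 * f(x, y) for x, y in support)          # 1/2
cov   = sum((x - mu_X) * (y - mu_Y) * f(x, y) for x, y in support)  # -1/4

rho = float(cov) / sqrt(var_X * var_Y)  # -1/sqrt(3), about -0.577: negative
b   = cov / var_X                       # equals rho * sigma_Y / sigma_X = -2/3
print(rho, b)                           # negative rho, negative slope
```

For this example the covariance works out to \(-\frac{1}{4}\), so \(\rho=-\frac{1}{\sqrt{3}}\approx -0.577\) and the least squares slope is \(b=-\frac{2}{3}\): a negative correlation coefficient paired with a negative slope, just as claimed.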

Now that we've found the \(b\) that minimizes \(K(b)\), what is the value of \(K(b)\) at its minimum \(b=\rho \dfrac{\sigma_Y}{\sigma_X}\)?

Solution

Substituting \(b=\rho \dfrac{\sigma_Y}{\sigma_X}\) into our simplified formula for \(K(b)\):

\(K(b)=\sigma^2_Y-2b\rho \sigma_X\sigma_Y +b^2 \sigma^2_X \)

we get:

\begin{align} K\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right) &= \sigma^2_Y-2\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right)\rho \sigma_X\sigma_Y+\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right)^2 \sigma^2_X \\ &= \sigma^2_Y-2\rho^2 \sigma^2_Y+\rho^2 \sigma^2_Y\\ &= \sigma^2_Y(1-\rho^2) \end{align}

That is:

\(K\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right) =\sigma^2_Y(1-\rho^2)\)
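
To make that concrete: for the trinomial example above, \(\sigma^2_Y=\frac{1}{2}\) and \(\rho^2=\frac{1}{3}\) (as the earlier sketch suggests), so the minimum expected squared distance works out to:

\(K\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right)=\dfrac{1}{2}\left(1-\dfrac{1}{3}\right)=\dfrac{1}{3}\)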

Okay, have we lost sight of what we are doing here? Remember that we started way back when trying to answer three questions. Well, all of our hard work now makes the answers to the three questions rather straightforward. Let's take a look!

Why is \(-1 \leq \rho_{XY} \leq 1\)? Well, \(K(b)\) is an expectation of squared terms, so \(K(b)\) is necessarily non-negative. That is:

\(K(b)=\sigma^2_Y(1-\rho^2)\geq 0\)

And because the variance \(\sigma^2_Y\) is positive, that implies that:

\((1-\rho^2)\geq 0\)

which implies that:

\(-\rho^2 \geq -1\)

which implies that:

\(\rho^2 \leq 1\)

and which finally implies that:

\(-1 \leq \rho \leq 1\)

Phew! Done! We have now answered the third question. Let's now tackle the second and fourth questions. Why is \(\rho_{XY}\) a measure of linear relationship? And why does \(\rho_{XY}\) close to −1 or +1 indicate a strong linear relationship? Well, we defined \(K(b)\) so that it measures the distance of the points \((x_0,y_0)\) in the joint support \(S\) to a line. Therefore, \(\rho_{XY}\) necessarily must concern a linear relationship, and no other. Now, we can take it a step further. The smaller \(K(b)\) is, the closer the points are to the line:

  • \(K(b)\) is smallest, 0, when \(\rho_{XY}\) is −1 or +1. In that case, the points fall right on the line, indicating a perfect linear relationship.
  • \(K(b)\) is largest, \(\sigma^2_Y\) , when \(\rho_{XY}\) is 0. In that case, the points fall far away from the line, indicating a weak linear relationship.

So, there we have it! All four questions posed, and all four questions answered! We should all now have a fairly good understanding of the value of knowing the correlation between two random variables \(X\) and \(Y\).