Lesson 18: The Correlation Coefficient

Overview

In the previous lesson, we learned about the joint probability distribution of two random variables \(X\) and \(Y\). In this lesson, we'll extend our investigation of the relationship between two random variables by learning how to quantify the extent or degree to which two random variables \(X\) and \(Y\) are associated or correlated. For example, suppose \(X\) denotes the number of cups of hot chocolate sold daily at a local café, and \(Y\) denotes the number of apple cinnamon muffins sold daily at the same café. Then, the manager of the café might benefit from knowing whether \(X\) and \(Y\) are highly correlated or not. If the random variables are highly correlated, then the manager would know to make sure that both are available on a given day. If the random variables are not highly correlated, then the manager would know that it would be okay to have one of the items available without the other. As the title of the lesson suggests, the correlation coefficient is the statistical measure that is going to allow us to quantify the degree of correlation between two random variables \(X\) and \(Y\).


  • To learn a formal definition of the covariance between two random variables \(X\) and \(Y\).
  • To learn how to calculate the covariance between any two random variables \(X\) and \(Y\).
  • To learn a shortcut, or alternative, formula for the covariance between two random variables \(X\) and \(Y\).
  • To learn a formal definition of the correlation coefficient between two random variables \(X\) and \(Y\).
  • To learn how to calculate the correlation coefficient between any two random variables \(X\) and \(Y\).
  • To learn how to interpret the correlation coefficient between any two random variables \(X\) and \(Y\).
  • To learn that if \(X\) and \(Y\) are independent random variables, then the covariance and correlation between \(X\) and \(Y\) are both zero.
  • To learn that if the correlation between \(X\) and \(Y\) is 0, then \(X\) and \(Y\) are not necessarily independent.
  • To learn how the correlation coefficient gets its sign.
  • To learn that the correlation coefficient measures the strength of the linear relationship between two random variables \(X\) and \(Y\).
  • To learn that the correlation coefficient is necessarily a number between −1 and +1.
  • To understand the steps involved in each of the proofs in the lesson.
  • To be able to apply the methods learned in the lesson to new problems.

18.1 - Covariance of X and Y

Here, we'll begin our attempt to quantify the dependence between two random variables \(X\) and \(Y\) by investigating what is called the covariance between the two random variables. We'll jump right in with a formal definition of the covariance.

Covariance

Let \(X\) and \(Y\) be random variables (discrete or continuous!) with means \(\mu_X\) and \(\mu_Y\). The covariance of \(X\) and \(Y\), denoted \(\text{Cov}(X,Y)\) or \(\sigma_{XY}\), is defined as:

\(Cov(X,Y)=\sigma_{XY}=E[(X-\mu_X)(Y-\mu_Y)]\)

That is, if \(X\) and \(Y\) are discrete random variables with joint support \(S\), then the covariance of \(X\) and \(Y\) is:

\(Cov(X,Y)=\mathop{\sum\sum}\limits_{(x,y)\in S} (x-\mu_X)(y-\mu_Y) f(x,y)\)

And, if \(X\) and \(Y\) are continuous random variables with supports \(S_1\) and \(S_2\), respectively, then the covariance of \(X\) and \(Y\) is:

\(Cov(X,Y)=\int_{S_2} \int_{S_1} (x-\mu_X)(y-\mu_Y) f(x,y)dxdy\)

Example 18-1

Suppose that \(X\) and \(Y\) have the following joint probability mass function:

\( \begin{array}{cc|ccc|c} & f(x, y) & 1 & 2 & 3 & f_{X}(x) \\ \hline x & 1 & 0.25 & 0.25 & 0 & 0.5 \\ & 2 & 0 & 0.25 & 0.25 & 0.5 \\ \hline & f_{Y}(y) & 0.25 & 0.5 & 0.25 & 1 \end{array} \)

so that \(\mu_X=3/2\), \(\mu_Y=2\), \(\sigma_X=1/2\), and \(\sigma_Y=\sqrt{1/2}\).

What is the covariance of \(X\) and \(Y\)?

Solution
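
One way to carry out the calculation is to apply the definition directly. Only four cells of the joint probability mass function have positive probability, so the double sum reduces to four terms:

\begin{align} Cov(X,Y) &=\mathop{\sum\sum}\limits_{(x,y)\in S} (x-\mu_X)(y-\mu_Y) f(x,y)\\ &= \left(1-\tfrac{3}{2}\right)(1-2)(0.25)+\left(1-\tfrac{3}{2}\right)(2-2)(0.25)+\left(2-\tfrac{3}{2}\right)(2-2)(0.25)+\left(2-\tfrac{3}{2}\right)(3-2)(0.25)\\ &= 0.125+0+0+0.125=0.25 \end{align}

That is, the covariance of \(X\) and \(Y\) is \(\frac{1}{4}\).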

Two questions you might have right now: 1) What does the covariance mean? That is, what does it tell us? and 2) Is there a shortcut formula for the covariance just as there is for the variance? We'll be answering the first question in the pages that follow. Well, sort of! In reality, we'll use the covariance as a stepping stone to yet another statistical measure known as the correlation coefficient. And, we'll certainly spend some time learning what the correlation coefficient tells us. In regards to the second question, let's answer that one now by way of the following theorem.

Theorem

For any random variables \(X\) and \(Y\) (discrete or continuous!) with means \(\mu_X\) and \(\mu_Y\), the covariance of \(X\) and \(Y\) can be calculated as:

\(Cov(X,Y)=E(XY)-\mu_X\mu_Y\)

Proof

In order to prove this theorem, we'll need to use the fact (which you are asked to prove in your homework) that, even in the bivariate situation, expectation is still a linear or distributive operator:
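
A sketch of the argument, using only that linearity and the fact that \(\mu_X=E(X)\) and \(\mu_Y=E(Y)\) are constants:

\begin{align} Cov(X,Y) &=E[(X-\mu_X)(Y-\mu_Y)]\\ &= E[XY-\mu_X Y-\mu_Y X+\mu_X\mu_Y]\\ &= E(XY)-\mu_X E(Y)-\mu_Y E(X)+\mu_X\mu_Y\\ &= E(XY)-\mu_X\mu_Y-\mu_Y\mu_X+\mu_X\mu_Y\\ &= E(XY)-\mu_X\mu_Y \end{align}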

Example 18-1 (continued)

Suppose again that \(X\) and \(Y\) have the following joint probability mass function:

\( \begin{array}{cc|ccc|c} & f(x, y) & 1 & 2 & 3 & f_{X}(x) \\ \hline x & 1 & 0.25 & 0.25 & 0 & 0.5 \\ & 2 & 0 & 0.25 & 0.25 & 0.5 \\ \hline & f_{Y}(y) & 0.25 & 0.5 & 0.25 & 1 \end{array} \)

Use the theorem we just proved to calculate the covariance of \(X\) and \(Y\).

Solution
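
A sketch of the calculation: the expected value of the product \(XY\) is a sum over the four cells with positive probability:

\(E(XY)=(1)(1)(0.25)+(1)(2)(0.25)+(2)(2)(0.25)+(2)(3)(0.25)=0.25+0.5+1+1.5=3.25\)

The marginal distributions give \(\mu_X=\frac{3}{2}\) and \(\mu_Y=2\), so the shortcut formula yields:

\(Cov(X,Y)=E(XY)-\mu_X\mu_Y=3.25-\left(\tfrac{3}{2}\right)(2)=3.25-3=0.25\)

which is the same value, \(\frac{1}{4}\), that the definition of the covariance produces.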

Now that we know how to calculate the covariance between two random variables, \(X\) and \(Y\), let's turn our attention to seeing how the covariance helps us calculate what is called the correlation coefficient.


18.2 - Correlation Coefficient of X and Y

The covariance of \(X\) and \(Y\) necessarily reflects the units of both random variables. It is helpful instead to have a dimensionless measure of dependency, such as the correlation coefficient.

Correlation Coefficient

Let \(X\) and \(Y\) be any two random variables (discrete or continuous!) with standard deviations \(\sigma_X\) and \(\sigma_Y\), respectively. The correlation coefficient of \(X\) and \(Y\), denoted \(\text{Corr}(X,Y)\) or \(\rho_{XY}\) (the Greek letter "rho"), is defined as:

\(\rho_{XY}=Corr(X,Y)=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}\)

Example 18-1 (continued)

Suppose that \(X\) and \(Y\) have the following joint probability mass function:

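\( \begin{array}{cc|ccc|c} & f(x, y) & 1 & 2 & 3 & f_{X}(x) \\ \hline x & 1 & 0.25 & 0.25 & 0 & 0.5 \\ & 2 & 0 & 0.25 & 0.25 & 0.5 \\ \hline & f_{Y}(y) & 0.25 & 0.5 & 0.25 & 1 \end{array} \)

so that \(\mu_X=3/2\), \(\mu_Y=2\), \(\sigma_X=1/2\), and \(\sigma_Y=\sqrt{1/2}\).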

What is the correlation coefficient of \(X\) and \(Y\)?

On the last page, we determined that the covariance between \(X\) and \(Y\) is \(\frac{1}{4}\). And, we are given that the standard deviation of \(X\) is \(\frac{1}{2}\), and the standard deviation of \(Y\) is the square root of \(\frac{1}{2}\). Therefore, it is a straightforward exercise to calculate the correlation between \(X\) and \(Y\) using the formula:

\(\rho_{XY}=\dfrac{\frac{1}{4}}{\left(\frac{1}{2}\right)\left(\sqrt{\frac{1}{2}}\right)}=0.71\)
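
For readers who like to check such arithmetic by machine, here is a minimal Python sketch that recomputes the covariance and the correlation coefficient for this example directly from the joint p.m.f. The dictionary `joint_pmf` and the helper `expect` are illustrative names chosen here, not part of the lesson.

```python
from math import sqrt

# Joint p.m.f. of Example 18-1, keyed by (x, y); zero-probability cells are omitted.
joint_pmf = {(1, 1): 0.25, (1, 2): 0.25, (2, 2): 0.25, (2, 3): 0.25}


def expect(g, pmf):
    """Return E[g(X, Y)] for a discrete joint p.m.f. given as {(x, y): prob}."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())


mu_x = expect(lambda x, y: x, joint_pmf)
mu_y = expect(lambda x, y: y, joint_pmf)
sigma_x = sqrt(expect(lambda x, y: (x - mu_x) ** 2, joint_pmf))
sigma_y = sqrt(expect(lambda x, y: (y - mu_y) ** 2, joint_pmf))

# Shortcut formula: Cov(X, Y) = E(XY) - mu_X * mu_Y
cov_xy = expect(lambda x, y: x * y, joint_pmf) - mu_x * mu_y
rho_xy = cov_xy / (sigma_x * sigma_y)

print(round(cov_xy, 4), round(rho_xy, 2))  # 0.25 0.71
```

Storing the p.m.f. as a dictionary keeps the code close to the double-sum definition: every expectation is one pass over the support.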

So now the natural question is "what does that tell us?". Well, we'll be exploring the answer to that question in depth on the page titled More on Understanding Rho, but for now let the following interpretation suffice.

Interpretation of Correlation

On the page titled More on Understanding Rho, we will show that \(-1 \leq \rho_{XY} \leq 1\). Then, the correlation coefficient is interpreted as:

  1. If \(\rho_{XY}=1\), then \(X\) and \(Y\) are perfectly, positively, linearly correlated.
  2. If \(\rho_{XY}=-1\), then \(X\) and \(Y\) are perfectly, negatively, linearly correlated.
  3. If \(\rho_{XY}=0\), then \(X\) and \(Y\) have no linear correlation at all. That is, \(X\) and \(Y\) may be perfectly related in some other manner, in a parabolic manner, perhaps, but not in a linear manner.
  4. If \(0<\rho_{XY}<1\), then \(X\) and \(Y\) are positively, linearly correlated, but not perfectly so.
  5. If \(-1<\rho_{XY}<0\), then \(X\) and \(Y\) are negatively, linearly correlated, but not perfectly so.

So, for our example above, we can conclude that \(X\) and \(Y\) are positively, linearly correlated, but not perfectly so.


18.3 - Understanding Rho

On this page, we'll begin our investigation of what the correlation coefficient tells us. All we'll be doing here is getting a handle on what we can expect of the correlation coefficient if \(X\) and \(Y\) are independent, and what we can expect of the correlation coefficient if \(X\) and \(Y\) are dependent. On the next page, we'll take a more in-depth look at understanding the correlation coefficient. Let's start with the following theorem.

Theorem

If \(X\) and \(Y\) are independent random variables (discrete or continuous!), then:

\(Corr(X,Y)=Cov(X,Y)=0\)

Proof

For the sake of this proof, let us assume that \(X\) and \(Y\) are discrete. (The proof that follows can be easily modified if \(X\) and \(Y\) are continuous.) Let's start with the expected value of \(XY\). That is, let's see what we can say about the expected value of \(XY\) if \(X\) and \(Y\) are independent:
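
A sketch of that calculation, in which \(S_1\) and \(S_2\) denote the supports of \(X\) and \(Y\) (notation borrowed here from the definition of covariance), and in which independence is what allows the joint p.m.f. to be factored in the second line:

\begin{align} E(XY) &=\sum\limits_{x\in S_1}\sum\limits_{y\in S_2} xyf(x,y)\\ &= \sum\limits_{x\in S_1}\sum\limits_{y\in S_2} xyf_X(x)f_Y(y)\\ &= \sum\limits_{x\in S_1} xf_X(x)\left[\sum\limits_{y\in S_2} yf_Y(y)\right]\\ &= E(Y)\sum\limits_{x\in S_1} xf_X(x)\\ &= E(X)E(Y) \end{align}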

That is, we have shown that if \(X\) and \(Y\) are independent, then \(E(XY)=E(X)E(Y)\). Now the rest of the proof follows. If \(X\) and \(Y\) are independent, then:

\begin{align} Cov(X,Y) &=E(XY)-\mu_X\mu_Y\\ &= E(X)E(Y)-\mu_X\mu_Y\\ &= \mu_X\mu_Y-\mu_X\mu_Y=0 \end{align}

and therefore:

\(Corr(X,Y)=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{0}{\sigma_X \sigma_Y}=0\)

Let's take a look at an example of the theorem in action. That is, in the example that follows, we see a case in which \(X\) and \(Y\) are independent and the correlation between \(X\) and \(Y\) is 0.

Example 18-2

Let \(X\) = outcome of a fair, black, 6-sided die. Because the die is fair, we'd expect each of the six possible outcomes to be equally likely. That is, the p.m.f. of \(X\) is:

\(f_X(x)=\dfrac{1}{6},\quad x=1,\ldots,6.\)

Let \(Y\) = outcome of a fair, red, 4-sided die. Again, because the die is fair, we'd expect each of the four possible outcomes to be equally likely. That is, the p.m.f. of \(Y\) is:

\(f_Y(y)=\dfrac{1}{4},\quad y=1,\ldots,4.\)

If we toss the pair of dice, the 24 possible outcomes are (1, 1) (1, 2) ... (1, 4) ... (6, 1) ... (6, 4), with each of the 24 outcomes being equally likely. That is, the joint p.m.f. of \(X\) and \(Y\) is:

\(f(x,y)=\dfrac{1}{24},\quad x=1,2,\ldots,6,\quad y=1,\ldots,4.\)

Although we intuitively feel that the outcome of the black die is independent of the outcome of the red die, we can formally show that \(X\) and \(Y\) are independent:

\(f(x,y)=\dfrac{1}{24}=f_X(x)f_Y(y)=\dfrac{1}{6} \cdot \dfrac{1}{4} \qquad \forall x,y\)

What is the covariance of \(X\) and \(Y\)? What is the correlation of \(X\) and \(Y\)?

Solution

Well, the mean of \(X\) is:

\(\mu_X=E(X)=\sum\limits_x xf(x)=1\left(\dfrac{1}{6}\right)+\cdots+6\left(\dfrac{1}{6}\right)=\dfrac{21}{6}=3.5\)

And, the mean of \(Y\) is:

\(\mu_Y=E(Y)=\sum\limits_y yf(y)=1\left(\dfrac{1}{4}\right)+\cdots+4\left(\dfrac{1}{4}\right)=\dfrac{10}{4}=2.5\)

The expected value of the product \(XY\) is:

\(E(XY)=\sum\limits_x\sum\limits_y xyf(x,y)=(1)(1)\left(\dfrac{1}{24}\right)+(1)(2)\left(\dfrac{1}{24}\right)+\cdots+(6)(4)\left(\dfrac{1}{24}\right)=\dfrac{210}{24}=8.75\)

Therefore, the covariance of \(X\) and \(Y\) is:

\(Cov(X,Y)=E(XY)-\mu_X\mu_Y=8.75-(3.5)(2.5)=8.75-8.75=0\)

and therefore, the correlation between \(X\) and \(Y\) is 0:

\(Corr(X,Y)=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{0}{\sigma_X \sigma_Y}=0\)

Again, this example illustrates a situation in which \(X\) and \(Y\) are independent, and the correlation between \(X\) and \(Y\) is 0, just as the theorem states it should be.
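
The same numbers can be reproduced with exact rational arithmetic. The following is a minimal Python sketch; the variable names are illustrative choices, not notation from the lesson.

```python
from fractions import Fraction
from itertools import product

# Joint p.m.f. of Example 18-2: every (x, y) pair has probability 1/24.
support = list(product(range(1, 7), range(1, 5)))
prob = Fraction(1, 24)

mu_x = sum(x * prob for x, y in support)      # 7/2
mu_y = sum(y * prob for x, y in support)      # 5/2
e_xy = sum(x * y * prob for x, y in support)  # 35/4

# Shortcut formula: Cov(X, Y) = E(XY) - mu_X * mu_Y
cov_xy = e_xy - mu_x * mu_y

print(mu_x, mu_y, e_xy, cov_xy)  # 7/2 5/2 35/4 0
```

Using `Fraction` rather than floating point makes the zero covariance exact, with no rounding error to explain away.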

Note! The converse of the theorem is not necessarily true! That is, zero correlation and zero covariance do not imply independence. Let's take a look at an example that illustrates this claim.

Example 18-3

Let \(X\) and \(Y\) be discrete random variables with the following joint probability mass function:

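\( \begin{array}{cc|ccc|c} & & & y & & \\ & f(x, y) & -1 & 0 & 1 & f_{X}(x) \\ \hline & -1 & 0.2 & 0 & 0.2 & 0.4 \\ x & 0 & 0 & 0.2 & 0 & 0.2 \\ & 1 & 0.2 & 0 & 0.2 & 0.4 \\ \hline & f_{Y}(y) & 0.4 & 0.2 & 0.4 & 1 \end{array} \)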

What is the correlation between \(X\) and \(Y\)? And, are \(X\) and \(Y\) independent?

Solution

The mean of \(X\) is:

\(\mu_X=E(X)=\sum xf(x)=(-1)\left(\dfrac{2}{5}\right)+(0)\left(\dfrac{1}{5}\right)+(1)\left(\dfrac{2}{5}\right)=0\)

And the mean of \(Y\) is:

\(\mu_Y=E(Y)=\sum yf(y)=(-1)\left(\dfrac{2}{5}\right)+(0)\left(\dfrac{1}{5}\right)+(1)\left(\dfrac{2}{5}\right)=0\)

The expected value of the product \(XY\) is also 0:

\(E(XY)=(-1)(-1)\left(\dfrac{1}{5}\right)+(-1)(1)\left(\dfrac{1}{5}\right)+(0)(0)\left(\dfrac{1}{5}\right)+(1)(-1)\left(\dfrac{1}{5}\right)+(1)(1)\left(\dfrac{1}{5}\right)\)

\(E(XY)=\dfrac{1}{5}-\dfrac{1}{5}+0-\dfrac{1}{5}+\dfrac{1}{5}=0\)

Therefore, the covariance of \(X\) and \(Y\) is 0:

\(Cov(X,Y)=E(XY)-\mu_X\mu_Y=0-(0)(0)=0\)

and therefore the correlation between \(X\) and \(Y\) is necessarily 0.

Yet, \(X\) and \(Y\) are not independent, since the product space is not rectangular! That is, we can find an \(x\) and a \(y\) for which the joint probability mass function \(f(x,y)\) can't be written as the product of \(f(x)\), the probability mass function of \(X\), and \(f(y)\), the probability mass function of \(Y\). For example, when \(x=0\) and \(y=-1\):

\(f(0,-1)=0 \neq f_X(0)f_Y(-1)=(1/5)(2/5)=2/25\)

In summary, again, this example illustrates that if the correlation between \(X\) and \(Y\) is 0, it does not necessarily mean that \(X\) and \(Y\) are independent. On the contrary, we've shown a case here in which the correlation between \(X\) and \(Y\) is 0, and yet \(X\) and \(Y\) are dependent!
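
Both halves of this conclusion, zero covariance and dependence, can be checked mechanically. The sketch below is one way to do it in Python; `joint_pmf`, `f_x`, `f_y`, and `violations` are names made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Joint p.m.f. of Example 18-3: five support points, each with probability 1/5.
fifth = Fraction(1, 5)
joint_pmf = {(-1, -1): fifth, (-1, 1): fifth, (0, 0): fifth,
             (1, -1): fifth, (1, 1): fifth}

values = (-1, 0, 1)
# Marginal p.m.f.s, obtained by summing the joint p.m.f. over the other variable.
f_x = {x: sum(joint_pmf.get((x, y), 0) for y in values) for x in values}
f_y = {y: sum(joint_pmf.get((x, y), 0) for x in values) for y in values}

mu_x = sum(x * p for x, p in f_x.items())
mu_y = sum(y * p for y, p in f_y.items())
e_xy = sum(x * y * p for (x, y), p in joint_pmf.items())
cov_xy = e_xy - mu_x * mu_y

# Independence requires f(x, y) = f_X(x) * f_Y(y) at every (x, y) pair.
violations = [(x, y) for x, y in product(values, values)
              if joint_pmf.get((x, y), 0) != f_x[x] * f_y[y]]

print(cov_xy, len(violations) > 0)  # 0 True
```

A single pair in `violations` is enough to rule out independence; the pair \((0,-1)\) used in the text is one of them.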

The contrapositive of the theorem is always true! That is, if the correlation is not zero, then \(X\) and \(Y\) are dependent. Let's take a look at an example that illustrates this claim.

Example 18-4

A quality control inspector for a t-shirt manufacturer inspects t-shirts for defects. She labels each t-shirt she inspects as either:

  • "good"
  • a "second," which could be sold at a reduced price, or
  • "defective," in which case the t-shirt could not be sold at all

The quality control inspector inspects \(n=2\) t-shirts:

  • Let \(X\) = # of good t-shirts. Historically, the probability that a t-shirt is good is \(p_1=0.6\).
  • Let \(Y\) = # of second t-shirts. Historically, the probability that a t-shirt is labeled as a second is \(p_2=0.2\).
  • Let \(2-X-Y\)= # of defective t-shirts. Historically, the probability that a t-shirt is labeled as defective is \(1-p_1-p_2=1-0.6-0.2=0.2\)

Then, the joint probability mass function of \(X\) and \(Y\) is the trinomial distribution. That is:

\(f(x,y)=\dfrac{2!}{x!y!(2-x-y)!} 0.6^x 0.2^y 0.2^{2-x-y},\qquad 0 \leq x+y \leq 2\)

Are \(X\) and \(Y\) independent? And, what is the correlation between \(X\) and \(Y\)?

Solution

First, \(X\) and \(Y\) are indeed dependent, since the support is triangular. Now, for calculating the correlation between \(X\) and \(Y\). The random variable \(X\) is binomial with \(n=2\) and \(p_1=0.6\). Therefore, the mean and standard deviation of \(X\) are 1.2 and 0.69, respectively:

\begin{array}{lcll} X \sim b(2,0.6) & \qquad & \mu_X &=np_1=2(0.6)=1.2\\ & \qquad & \sigma_X &=\sqrt{np_1(1-p_1)}=\sqrt{2(0.6)(0.4)}=0.69 \end{array}

The random variable \(Y\) is binomial with \(n=2\) and \(p_2=0.2\). Therefore, the mean and standard deviation of \(Y\) are 0.4 and 0.57, respectively:

\begin{array}{lcll} Y \sim b(2,0.2) & \qquad & \mu_Y &=np_2=2(0.2)=0.4\\ & \qquad & \sigma_Y &=\sqrt{np_2(1-p_2)}=\sqrt{2(0.2)(0.8)}=0.57 \end{array}

The expected value of the product \(XY\) is:

\begin{align} E(XY)&= \sum\limits_x \sum\limits_y xyf(x,y)\\ &= (1)(1)\dfrac{2!}{1!1!0!} 0.6^1 0.2^1 0.2^0=2(0.6)(0.2)=0.24\\ \end{align}

Therefore, the covariance of \(X\) and \(Y\) is −0.24:

\(Cov(X,Y)=E(XY)-\mu_X\mu_Y=0.24-(1.2)(0.4)=0.24-0.48=-0.24\)

and the correlation between \(X\) and \(Y\) is −0.61:

\(Corr(X,Y)=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{-0.24}{(0.69)(0.57)}=-0.61\)

In summary, again, this is an example in which the correlation between \(X\) and \(Y\) is not 0, and \(X\) and \(Y\) are dependent.
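
As a cross-check, the covariance and correlation can also be computed by brute force from the trinomial p.m.f., without appealing to the binomial marginals. This is a minimal Python sketch under that approach; `trinomial_pmf` and `expect` are illustrative helper names.

```python
from math import factorial, sqrt

n, p1, p2 = 2, 0.6, 0.2
p3 = 1 - p1 - p2


def trinomial_pmf(x, y):
    """Joint p.m.f. of Example 18-4 at a point (x, y) with 0 <= x + y <= n."""
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coef * p1**x * p2**y * p3**(n - x - y)


# Triangular support: nonnegative integers with x + y <= n.
support = [(x, y) for x in range(n + 1) for y in range(n + 1 - x)]


def expect(g):
    """Return E[g(X, Y)] by summing over the joint support."""
    return sum(g(x, y) * trinomial_pmf(x, y) for x, y in support)


mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
sigma_x = sqrt(expect(lambda x, y: (x - mu_x) ** 2))
sigma_y = sqrt(expect(lambda x, y: (y - mu_y) ** 2))
cov_xy = expect(lambda x, y: x * y) - mu_x * mu_y
rho_xy = cov_xy / (sigma_x * sigma_y)

print(round(cov_xy, 2), round(rho_xy, 2))  # -0.24 -0.61
```

Without rounding the standard deviations first, the correlation is about −0.612, which still rounds to the −0.61 reported above.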


18.4 - More on Understanding Rho

Although we started investigating the meaning of the correlation coefficient, we've still been dancing quite a bit around what exactly the correlation coefficient:

\(Corr(X,Y)=\rho=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{\sum_x\sum_y (x-\mu_X)(y-\mu_Y)f(x,y)}{\sigma_X \sigma_Y}\)

tells us. Since this is the last page of the lesson, I guess there is no more procrastinating! Let's spend this page, then, trying to come up with answers to the following questions:

  1. How does \(\rho_{XY}\) get its sign?
  2. Why is \(\rho_{XY}\) a measure of linear relationship?
  3. Why is \(-1 \leq \rho_{XY} \leq 1\)?
  4. Why does \(\rho_{XY}\) close to −1 or +1 indicate a strong linear relationship?

Question #1

Let's tackle the first question. How does \(\rho_{XY}\) get its sign? Well, we can get a good feel for the answer to that question by simply studying the formula for the correlation coefficient:

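\(Corr(X,Y)=\rho=\dfrac{Cov(X,Y)}{\sigma_X \sigma_Y}=\dfrac{\sum_x\sum_y \overbrace{(x-\mu_X)(y-\mu_Y)}^{>0 \text{ or } <0}\,\overbrace{f(x,y)}^{>0}}{\underbrace{\sigma_X \sigma_Y}_{>0}}\)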

The standard deviations \(\sigma_X\) and \(\sigma_Y\) are positive. Therefore, the product \(\sigma_X\sigma_Y\) must also be positive (>0). And, the joint probability mass function must be nonnegative... well, positive (>0) for at least some elements of the support. It is the product:

\((x-\mu_X)(y-\mu_Y)\)

that can be either positive (>0) or negative (<0). That is, the correlation coefficient gets its sign, that is, it is either negative − or positive +, depending on how most of the \((x,y)\) points in the support relate to the \(x=\mu_X\) and \(y=\mu_Y\) lines. Let's take a look at two examples.

Suppose we were interested in studying the relationship between atmospheric pressure \(X\) and the boiling point \(Y\) of water. Then, our plot might look something like this:

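[Figure: scatter plot of boiling point (vertical axis) against atmospheric pressure (horizontal axis), with the lines \(x=\mu_X\) and \(y=\mu_Y\) marked.]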

The plot suggests that as the atmospheric pressure increases, so does the boiling point of water. Now, what does the plot tell us about the product:

\((x-\mu_X)(y-\mu_Y)\)

Well, it tells us this:

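[Figure: the pressure versus boiling point scatter plot, divided into four quadrants by the lines \(x=\mu_X\) and \(y=\mu_Y\). Points to the left of \(x=\mu_X\) have \(x-\mu_X<0\) and points to the right have \(x-\mu_X>0\); points above \(y=\mu_Y\) have \(y-\mu_Y>0\) and points below have \(y-\mu_Y<0\).]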

That is, in the upper right quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is positive; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is positive. Therefore, any \((x,y)\) data point in the upper right quadrant produces a positive product \((x-\mu_X)(y-\mu_Y)\). Now for the lower left quadrant, where the remaining points lie. In the lower left quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is negative; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is negative. Therefore, any \((x,y)\) data point in the lower left quadrant also produces a positive product \((x-\mu_X)(y-\mu_Y)\). So, regardless... every data point in this plot produces a positive product \((x-\mu_X)(y-\mu_Y)\). Therefore, when we add up those positive products over all \(x\) and \(y\), we're going to get a positive correlation coefficient. In general, when there is a positive linear relationship between \(X\) and \(Y\), the sign of the correlation coefficient is going to be positive. Makes intuitive sense!

Now, let's take a look at an example in which the relationship between \(X\) and \(Y\) is negative. Suppose we were interested in studying the relationship between a person's IQ \(X\) and the delinquency index \(Y\) of the person. Well, one researcher investigated the relationship, and published a plot that looked something like this:

[Figure: scatter plot of delinquency index (vertical axis) against IQ (horizontal axis), with the lines \(x=\mu_X\) and \(y=\mu_Y\) marked.]

The plot suggests that as IQs increase, the delinquency indices decrease. That is, there is an inverse or negative relationship. Now, what does the plot tell us about the product:

\((x-\mu_X)(y-\mu_Y)\)

Well, it tells us this:

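[Figure: the IQ versus delinquency index scatter plot, divided into four quadrants by the lines \(x=\mu_X\) and \(y=\mu_Y\). Points to the left of \(x=\mu_X\) have \(x-\mu_X<0\) and points to the right have \(x-\mu_X>0\); points above \(y=\mu_Y\) have \(y-\mu_Y>0\) and points below have \(y-\mu_Y<0\).]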

That is, in the upper left quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is negative; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is positive. Therefore, any \((x,y)\) data point in the upper left quadrant produces a negative product. Now for the lower right quadrant, where most of the remaining points lie. In the lower right quadrant, the difference between any \((x,y)\) data point and the \(x=\mu_X\) line is positive; and the difference between any \((x,y)\) data point and the \(y=\mu_Y\) line is negative. Therefore, any \((x,y)\) data point in the lower right quadrant also produces a negative product. Now there are a few data points that lie in the upper right and lower left quadrants that would produce a positive product. But, since most of the data points produce negative products, the sum of the products would still be negative. In general, when there is a negative linear relationship between \(X\) and \(Y\), the sign of the correlation coefficient is going to be negative. Again, makes intuitive sense!

Questions #2, #3, #4

As it turns out, answering the last three questions is going to take a bit of preliminary work before we arrive at the final answers. To make our work concrete, let's suppose that the random variables \(X\) and \(Y\) have a trinomial distribution with \(n=2, p_1=\frac{1}{4}, p_2=\frac{1}{2}\), and \(0\le x+y\le 2\). For trinomial random variables, we typically represent the joint probability mass function as a formula. In this case, let's represent the joint probability mass function as a graph:

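[Figure: graph of the trinomial p.m.f., with \(x\) on the horizontal axis and \(y\) on the vertical axis. The six points of the support are labeled with their probabilities: \(f(0,0)=1/16\), \(f(0,1)=4/16\), \(f(0,2)=4/16\), \(f(1,0)=2/16\), \(f(1,1)=4/16\), and \(f(2,0)=1/16\).]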

Each of the black dots (•) represents an element of the joint support \(S\). As we should expect with a trinomial, the support is triangular. The probabilities that \(X=x\) and \(Y=y\) are indicated in blue. For example, the probability that \(X=0\) and \(Y=1\) is \(\frac{4}{16}\). You can verify these probabilities, if you are so inclined, using the formula for the trinomial p.m.f. What we want to do here, though, is explore the correlation between \(X\) and \(Y\). Now, we'll soon see that we can learn something about the correlation \(\rho_{XY}\) by considering the best fitting line through the \((x,y)\) points in the support. Specifically, consider the best fitting line passing through the point \((\mu_X, \mu_Y)\). We don't yet know what the best fitting line is, but we could "eyeball" such a line on our graph. That's what the red line is here, an "eyeballed" best fitting line:

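[Figure: the trinomial p.m.f. graph with an "eyeballed" best fitting line drawn in red through the point \((\mu_X,\mu_Y)\), where \(\mu_X=1/2\) and \(\mu_Y=1\).]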

As the plot suggests, the mean of \(X\) is \(\frac{1}{2}\) and the mean of \(Y\) is 1 (that's because \(X\) is binomial with \(n=2\) and \(p_1=\frac{1}{4}\), and \(Y\) is binomial with \(n=2\) and \(p_2=\frac{1}{2}\)). Now, what we want to do is find the formula for the best fitting (red) line passing through the point \((\mu_X, \mu_Y)\). Well, we know that two points determine a line. So, along with the \((\mu_X, \mu_Y)\) point, let's pick an arbitrary point \((x,y)\) on the line:

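[Figure: the trinomial p.m.f. graph with an arbitrary point \((x,y)\) marked on the red line, showing the run \(x-\mu_X\) and the rise \(y-\mu_Y\) measured from the point \((\mu_X,\mu_Y)\).]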

Then, we know that the slope of the line is rise over run. That is:

\(slope=\dfrac{rise}{run}=\dfrac{y-\mu_Y}{x-\mu_X}=b\)

and the line is therefore of the form:

\(y-\mu_Y=b(x-\mu_X)\) or \(y=\mu_Y+b(x-\mu_X)\)

Now to find the best fitting line, we'll use the principle of least squares. That is, we'll find the slope \(b\) that minimizes the squared vertical distances between every point \((x_0, y_0)\) in the joint support \(S\) and the point on the line:

\((x_0,\mu_Y+b(x_0-\mu_X))\)

as illustrated here in green:

[Figure: the trinomial p.m.f. graph showing, in green, the vertical distance between a support point \((x_0,y_0)\) and the point \((x_0,\mu_Y+b(x_0-\mu_X))\) on the line.]

That is, we need to find the \(b\) that minimizes:

\(K(b)=E\{[(Y-\mu_Y)-b(X-\mu_X)]^2\}\)

The resulting line is called the least squares regression line. What is the least squares regression line?

Solution

Before differentiating, let's start with simplifying the thing that we are trying to minimize:

\(K(b)=E\{[(Y-\mu_Y)-b(X-\mu_X)]^2\}\)

getting:
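
Expanding the square, applying the linearity of expectation, and recalling that \(E[(X-\mu_X)(Y-\mu_Y)]=Cov(X,Y)=\rho \sigma_X\sigma_Y\), the simplification can be sketched as:

\begin{align} K(b) &=E\{(Y-\mu_Y)^2-2b(X-\mu_X)(Y-\mu_Y)+b^2(X-\mu_X)^2\}\\ &= E[(Y-\mu_Y)^2]-2bE[(X-\mu_X)(Y-\mu_Y)]+b^2E[(X-\mu_X)^2]\\ &= \sigma^2_Y-2b\rho \sigma_X\sigma_Y +b^2 \sigma^2_X \end{align}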

Now, to find the slope \(b\) that minimizes \(K(b)\), the expected squared vertical distances, we need to differentiate \(K(b)\) with respect to \(b\), and set the resulting derivative to 0. Doing so, we get:

\(K'(b)=-2\rho \sigma_X\sigma_Y + 2b\sigma^2_X \equiv 0\)

Then, solving for \(b\), we first get:

\(b \sigma^2_X=\rho \sigma_X\sigma_Y \)

and then finally:

\(b=\rho \dfrac{\sigma_Y}{\sigma_X}\)

Note that \(b\) does indeed minimize \(K(b)\), because the second derivative of \(K(b)\) is positive. That is:

\(K''(b)=2\sigma^2_X >0\)

Now, we can substitute what we have found for the slope \(b\) into our equation:

\(y=\mu_Y +b(x-\mu_X)\)

getting the least squares regression line:

\(y=\mu_Y + \rho \left(\dfrac{\sigma_Y}{\sigma_X}\right) (x-\mu_X)\)

By the way, note that, because the standard deviations of \(X\) and \(Y\) are positive, if the correlation coefficient \(\rho_{XY}\) is positive, then the slope of the least squares line is also positive. Similarly, if the correlation coefficient \(\rho_{XY}\) is negative, then the slope of the least squares line is also negative.

Now that we've found the \(b\) that minimizes \(K(b)\), what is the value of \(K(b)\) at its minimum \(b=\rho \dfrac{\sigma_Y}{\sigma_X}\)?

Solution

Substituting \(b=\rho \dfrac{\sigma_Y}{\sigma_X}\) into our simplified formula for \(K(b)\):

\(K(b)=\sigma^2_Y-2b\rho \sigma_X\sigma_Y +b^2 \sigma^2_X \)

we get:

\begin{align} K\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right) &= \sigma^2_Y-2\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right)\rho \sigma_X\sigma_Y+\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right)^2 \sigma^2_X \\ &= \sigma^2_Y-2\rho^2 \sigma^2_Y+\rho^2 \sigma^2_Y\\ &= \sigma^2_Y(1-\rho^2) \end{align}

That is:

\(K\left(\rho \dfrac{\sigma_Y}{\sigma_X}\right) =\sigma^2_Y(1-\rho^2)\)
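
To see these formulas at work on the trinomial example with \(n=2\), \(p_1=\frac{1}{4}\), and \(p_2=\frac{1}{2}\), here is a minimal Python sketch that evaluates \(K(b)\) by brute force over the joint support and compares it with \(\sigma^2_Y(1-\rho^2)\). The helper names `pmf`, `expect`, and `K`, and the variable `b_star`, are illustrative. The code uses the fact that \(\rho \sigma_Y/\sigma_X\) equals \(Cov(X,Y)/\sigma^2_X\), which avoids square roots and keeps everything in exact fractions.

```python
from fractions import Fraction
from math import factorial

n = 2
p1, p2 = Fraction(1, 4), Fraction(1, 2)
p3 = 1 - p1 - p2


def pmf(x, y):
    """Trinomial joint p.m.f. with n = 2, p1 = 1/4, p2 = 1/2."""
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coef * p1**x * p2**y * p3**(n - x - y)


# Triangular support: nonnegative integers with x + y <= n.
support = [(x, y) for x in range(n + 1) for y in range(n + 1 - x)]


def expect(g):
    """Return E[g(X, Y)] by summing over the joint support."""
    return sum(g(x, y) * pmf(x, y) for x, y in support)


mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))


def K(b):
    """Expected squared vertical distance to the line through (mu_X, mu_Y) with slope b."""
    return expect(lambda x, y: ((y - mu_y) - b * (x - mu_x)) ** 2)


# Minimizing slope: b = rho * sigma_Y / sigma_X = Cov(X, Y) / sigma_X^2.
b_star = cov_xy / var_x
rho_squared = cov_xy**2 / (var_x * var_y)

print(b_star, K(b_star), var_y * (1 - rho_squared))  # -2/3 1/3 1/3
```

For this distribution the least squares line has slope \(-\frac{2}{3}\), and the minimum of \(K(b)\) is \(\frac{1}{3}\), matching \(\sigma^2_Y(1-\rho^2)\) with \(\sigma^2_Y=\frac{1}{2}\) and \(\rho^2=\frac{1}{3}\).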

Okay, have we lost sight of what we are doing here? Remember that we started way back when by trying to answer three questions. Well, all of our hard work now makes the answers to the three questions rather straightforward. Let's take a look!

Why is \(-1 \leq \rho_{XY} \leq 1\)? Well, \(K(b)\) is an expectation of squared terms, so \(K(b)\) is necessarily non-negative. That is:

\(K(b)=\sigma^2_Y(1-\rho^2)\geq 0\)

And because the variance \(\sigma^2_Y\) is necessarily positive (the correlation coefficient is not defined when \(\sigma_Y=0\)), that implies that:

\((1-\rho^2)\geq 0\)

which implies that:

\(-\rho^2 \geq -1\)

which implies that:

\(\rho^2 \leq 1\)

and which finally implies that:

\(-1 \leq \rho \leq 1\)

Phew! Done! We have now answered the third question. Let's now tackle the second and fourth questions. Why is \(\rho_{XY}\) a measure of linear relationship? And why does \(\rho_{XY}\) close to −1 or +1 indicate a strong linear relationship? Well, we defined \(K(b)\) so that it measures the distance of the points \((x_0,y_0)\) in the joint support \(S\) to a line. Therefore, \(\rho_{XY}\) necessarily must concern a linear relationship, and no other. Now, we can take it a step further. The smaller \(K(b)\) is, the closer the points are to the line:

  • \(K(b)\) is smallest, 0, when \(\rho_{XY}\) is −1 or +1. In that case, the points fall right on the line, indicating a perfect linear relationship.
  • \(K(b)\) is largest, \(\sigma^2_Y\), when \(\rho_{XY}\) is 0. In that case, the points fall far away from the line, indicating a weak linear relationship.

So, there we have it! All four questions posed, and all four questions answered! We should all now have a fairly good understanding of the value of knowing the correlation between two random variables \(X\) and \(Y\).

