## Measures of Association: Covariance, Correlation

Association is concerned with how each variable is related to the other variable(s). The first measure that we will consider is the covariance between two variables *j* and *k*.

**Population covariance** is a measure of the association between pairs of variables in a population.

- Population Covariance
- The population covariance between variables *j* and *k* is
- \(\sigma_{jk} = E\{(X_{ij}-\mu_j)(X_{ik}-\mu_k)\}\quad\mbox{for }i=1,\ldots,n\)

Note that the product of the residuals \( \left( X_{ij} - \mu_{j} \right) \) and \( \left(X_{ik} - \mu_{k} \right) \) for variables *j* and *k*, respectively, is a function of the random variables \(X_{ij}\) and \(X_{ik}\). Therefore, \(\left( X _ { i j } - \mu _ { j } \right) \left( X _ { i k } - \mu _ { k } \right)\) is itself random, and has a population mean. The population covariance is defined to be the population mean of this product of residuals. We see that if either both variables are greater than their respective means, or if they are both less than their respective means, then the product of the residuals will be positive. Thus, if the value of variable *j* tends to be greater than its mean when the value of variable *k* is larger than its mean, and if the value of variable *j* tends to be less than its mean when the value of variable *k* is smaller than its mean, then the covariance will be positive. Positive population covariances mean that the two variables are positively associated; variable *j* tends to increase with increasing values of variable *k*.

A negative association can also occur. If one variable tends to be greater than its mean when the other variable is less than its mean, the product of the residuals will be negative, and you will obtain a negative population covariance. Variable *j* will tend to decrease with increasing values of variable *k*.

The population covariance \(\sigma_{jk}\) between variables *j* and *k* can be estimated by the sample covariance.

- Sample Covariance
- This can be calculated by
- \begin{align} s_{jk} &= \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)(X_{ik}-\bar{x}_k)\\&=\frac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{n-1} \end{align}

Just like in the formula for variance we have two expressions that make up this formula. The first half of the formula is most suitable for understanding the interpretation of the sample covariance, and the second half of the formula is used for calculation.

Looking at the first half of the expression, the product inside the sum is the residual of variable *j* from its mean times the residual of variable *k* from its mean. If both variables tend to be greater than their respective means together, or less than their respective means together, then the product of the residuals will tend to be positive, leading to a positive sample covariance.

Conversely, if one variable tends to take values greater than its mean when the other variable takes values less than its mean, then the product will tend to be negative. When this product is added up over all of the observations, the result is a negative covariance.

So, in effect, a positive covariance indicates a positive association between the variables *j* and *k*, and a negative covariance indicates a negative association.

For computational purposes, we will use the second half of the formula. For each subject, the product of the two variables is obtained, and then the products are summed to obtain the first term in the numerator. The second term in the numerator is obtained by taking the product of the sums of the variables over the *n* subjects, then dividing the result by the sample size *n*. The difference between the first and second terms is then divided by *n* - 1 to obtain the covariance value.
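The two halves of the sample covariance formula can be checked against each other numerically. The following is a minimal Python sketch; the function names `cov_definitional` and `cov_computational` and the small data set are illustrative assumptions, not part of the lesson.

```python
def cov_definitional(x, y):
    """First half: sum of products of deviations from the means, over n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def cov_computational(x, y):
    """Second half: (sum of products - product of sums / n), over n - 1."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


# Made-up illustrative data (not from the lesson)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

print(cov_definitional(x, y))   # both are 14/3, up to rounding
print(cov_computational(x, y))
```

Both functions return the same value here. In floating-point arithmetic the second form can lose precision when the means are large relative to the spread of the data, so the first form is often the safer choice in software.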

Again, the sample covariance is a function of the random data, and hence, is random itself. As before, the population mean of the sample covariance \(s_{jk}\) is equal to the population covariance \(\sigma_{jk}\); i.e.,

\(E(s_{jk})=\sigma_{jk}\)

That is, the sample covariance \(s_{jk}\) is unbiased for the population covariance \(\sigma_{jk}\).

The sample covariance is a measure of the association between a pair of variables:

### \(s_{jk}\) = 0

This implies that the two variables are uncorrelated. (Note that this does not necessarily imply independence; we'll get back to this later.)
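A small numerical illustration of why zero covariance does not imply independence: in the sketch below, `y` is completely determined by `x`, yet the sample covariance is exactly zero. The data and the helper name `sample_cov` are made up for illustration.

```python
def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


# Made-up data: x is symmetric about 0 and y = x^2, so y depends entirely on x
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]

print(sample_cov(x, y))  # 0.0
```

The positive and negative products of residuals cancel because the relationship is symmetric rather than increasing or decreasing.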

### \(s_{jk}\) > 0

This implies that the two variables are positively correlated; i.e., values of variable *j* tend to increase with increasing values of variable *k*. The larger the covariance, the stronger the positive association between the two variables.

### \(s_{jk}\) < 0

This implies that the two variables are negatively correlated; i.e., values of variable *j* tend to decrease with increasing values of variable *k*. The more negative the covariance, the stronger the negative association between the two variables.

Recall that we collected all of the population means of the *p* variables into a mean vector. Likewise, the population variances and covariances can be collected into the **population variance-covariance matrix**, also known as the **population dispersion matrix**:

- Population variance-covariance matrix
- \(\Sigma = \left(\begin{array}{cccc}\sigma^2_1 & \sigma_{12} & \dots & \sigma_{1p}\\ \sigma_{21} & \sigma^2_{2} & \dots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \dots &\sigma^2_p\end{array}\right)\)

Note that the population variances appear along the diagonal of this matrix, and the covariances appear in the off-diagonal elements. So, the covariance between variables *j* and *k* appears in row *j* and column *k* of this matrix.

The population variance-covariance matrix may be estimated by the sample variance-covariance matrix. The population variances and covariances in the above population variance-covariance matrix are replaced by the corresponding sample variances and covariances to obtain the **sample variance-covariance matrix**:

- Sample variance-covariance matrix
- \( S = \left(\begin{array}{cccc}s^2_1 & s_{12} & \dots & s_{1p}\\ s_{21} & s^2_2 & \dots & s_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \dots & s^2_{p}\end{array}\right)\)

Note that the sample variances appear along the diagonal of this matrix, and the covariances appear in the off-diagonal elements. So, the covariance between variables *j* and *k* appears in the *jk*-th element of this matrix.

### Note!

**S** (the sample variance-covariance matrix) is symmetric; i.e., \(s_{jk} = s_{kj}\).

**S** is unbiased for the population variance-covariance matrix \(\Sigma\); i.e.,

\(E(S) = \left(\begin{array}{cccc} E(s^2_1) & E(s_{12}) & \dots & E(s_{1p}) \\ E(s_{21}) & E(s^2_{2}) & \dots & E(s_{2p})\\ \vdots & \vdots & \ddots & \vdots \\ E(s_{p1}) & E(s_{p2}) & \dots & E(s^2_p)\end{array}\right)=\Sigma\).

Because this matrix is a function of our random data, its elements are random, and the matrix as a whole is random as well. The statement '**S** is unbiased' means that the mean of each element of **S** is equal to the corresponding element of the population matrix \(\Sigma\).

In matrix notation, the sample variance-covariance matrix may be computed using the following expressions:

\begin{align} S &= \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{x})(X_i-\bar{x})'\\ &= \frac{\sum_{i=1}^{n}X_iX_i'-(\sum_{i=1}^{n}X_i)(\sum_{i=1}^{n}X_i)'/n}{n-1} \end{align}

Just as we have seen in the previous formulas, the first half of the formula is used in interpretation, and the second half of the formula is what is used for calculation purposes.

Looking at the second expression, you can see that the first term in the numerator involves taking the data vector for each subject and multiplying by its transpose. The resulting matrices are then added over the *n* subjects. To obtain the second term in the numerator, first compute the sum of the data vectors over the *n* subjects, then take the resulting vector and multiply by its transpose; then divide the resulting matrix by the number of subjects *n*. Take the difference between the two terms in the numerator and divide by *n* - 1.
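The steps just described can be sketched in plain Python. The function name `sample_cov_matrix` is an assumption made for illustration; the data are the four observations used in Example 1-2 below.

```python
def sample_cov_matrix(rows):
    """rows: list of n observations, each a list of p values."""
    n = len(rows)
    p = len(rows[0])
    # First term: sum over subjects of the outer products X_i X_i'
    sum_outer = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
                 for j in range(p)]
    # Sum of the data vectors over the n subjects
    totals = [sum(r[j] for r in rows) for j in range(p)]
    # (first term - totals totals' / n) / (n - 1), element by element
    return [[(sum_outer[j][k] - totals[j] * totals[k] / n) / (n - 1)
             for k in range(p)]
            for j in range(p)]


# Observations (x1, x2) from Example 1-2
data = [[6, 3], [10, 4], [12, 7], [12, 6]]
S = sample_cov_matrix(data)
for row in S:
    print([round(v, 2) for v in row])
# [8.0, 4.67]
# [4.67, 3.33]
```

The diagonal entries are the sample variances of the two variables, and the off-diagonal entry matches the covariance worked out by hand in Example 1-2.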

## Example 1-2

Suppose that we have observed the following *n* = 4 observations for variables \(x_{1}\) and \(x_{2}\).

| \(x_{1}\) | \(x_{2}\) |
|---|---|
| 6 | 3 |
| 10 | 4 |
| 12 | 7 |
| 12 | 6 |

The sample means are \(\bar{x}_1\) = 10 and \(\bar{x}_2\) = 5. The sample covariance is the sum of the products of deviations from the means, divided by *n* - 1:

\begin{align} s_{12}&=\dfrac{(6-10)(3-5)+(10-10)(4-5)+(12-10)(7-5)+(12-10)(6-5)}{4-1}\\&=\dfrac{8+0+4+2}{4-1}=4.67 \end{align}

The positive value reflects the fact that as \(x_{1}\) increases, \(x_{2}\) also tends to increase.

**Note**! The magnitude of the covariance value is not particularly helpful as it is a function of the magnitudes (scales) of the two variables. This quantity is a function of the variability of the two variables, and so, it is hard to tease out the effects of the association between the two variables from the effects of their dispersions.
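To see the dependence on scale concretely, the sketch below recomputes the covariance from Example 1-2 after multiplying \(x_{1}\) by 10, as a hypothetical change of units would. The helper name `sample_cov` and the factor 10 are assumptions for illustration.

```python
def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


x1 = [6, 10, 12, 12]
x2 = [3, 4, 7, 6]
x1_rescaled = [10 * v for v in x1]  # hypothetical change of units

print(round(sample_cov(x1, x2), 2))           # 4.67
print(round(sample_cov(x1_rescaled, x2), 2))  # 46.67
```

The association between the two variables is unchanged, yet the covariance is ten times larger.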

Note, however, that the covariance between variables *i* and *j* must lie between the negative and the positive product of the standard deviations of the two variables:

\(-s_i s_j \le s_{ij} \le s_i s_j\)
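As a quick check of this bound on the Example 1-2 data, the following sketch computes the two standard deviations and the covariance and compares them. The helper name `sample_cov` is an assumption for illustration.

```python
import math


def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


x1 = [6, 10, 12, 12]
x2 = [3, 4, 7, 6]

s12 = sample_cov(x1, x2)
s1 = math.sqrt(sample_cov(x1, x1))  # the covariance of a variable with itself is its variance
s2 = math.sqrt(sample_cov(x2, x2))

print(round(s1 * s2, 3))              # 5.164
print(-s1 * s2 <= s12 <= s1 * s2)     # True
```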

## Example 1-3: Body Measurements (Covariance)

In an undergraduate statistics class, *n* = 30 females reported their heights (inches), and also measured their left forearm length (cm), left foot length (cm), and head circumference (cm). The sample variance-covariance matrix is the following:

| | Height | LeftArm | LeftFoot | HeadCirc |
|---|---|---|---|---|
| Height | 8.740 | 3.022 | 2.772 | 0.289 |
| LeftArm | 3.022 | 2.402 | 1.234 | 0.223 |
| LeftFoot | 2.772 | 1.234 | 1.908 | 0.118 |
| HeadCirc | 0.289 | 0.223 | 0.118 | 3.434 |

Notice that the matrix has four rows and four columns because there are four variables being considered. Also, notice that the matrix is symmetric.

Here are a few examples of the information in the matrix:

- The variance of the height variable is 8.74. Thus the standard deviation is \(\sqrt{8.74} = 2.956\).
- The variance of the left foot measurement is 1.908 (in the 3rd diagonal element). Thus the standard deviation for this variable is \(\sqrt{1.908}=1.381\).
- The covariance between height and left arm is 3.022, found in the 1st row, 2nd column and also in the 2nd row, 1st column.
- The covariance between left foot and left arm is 1.234, found in the 3rd row, 2nd column and also in the 2nd row, 3rd column.

All covariance values are positive, so all pairwise associations are positive. However, the magnitudes do not tell us about the strength of the associations. This suggests an alternative measure of association: to assess the strength of an association, we use correlation values.

## Correlation Matrix

- Correlation
- The population correlation is defined to be equal to the population covariance divided by the product of the population standard deviations:
- \(\rho_{jk} = \dfrac{\sigma_{jk}}{\sigma_j\sigma_k}\)

The population correlation may be estimated by substituting into the formula the sample covariances and standard deviations:

\(r_{jk}=\dfrac{s_{jk}}{s_js_k}=\dfrac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{\sqrt{\{\sum_{i=1}^{n}X^2_{ij}-(\sum_{i=1}^{n}X_{ij})^2/n\}\{\sum_{i=1}^{n}X^2_{ik}-(\sum_{i=1}^{n}X_{ik})^2/n\}}}\)

It is essential to note that the population and the sample correlation must lie between -1 and 1.

\(-1 \le \rho_{jk} \le 1\)

\(-1 \le r_{jk} \le 1\)

Therefore:

- \(\rho_{jk}\) = 0 indicates, as you might expect, the two variables are uncorrelated.
- \(\rho_{jk}\) close to +1 will indicate a strong positive dependence
- \(\rho_{jk}\) close to -1 indicates a strong negative dependence

Sample correlation coefficients also have a similar interpretation.
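The computational form of the sample correlation can be sketched as follows, using the data from Example 1-2. The function name `sample_corr` and the rescaling factor are assumptions for illustration.

```python
import math


def sample_corr(x, y):
    """Sample correlation via the computational formula for r_jk."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    numerator = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return numerator / math.sqrt(ss_x * ss_y)


x1 = [6, 10, 12, 12]
x2 = [3, 4, 7, 6]

r = sample_corr(x1, x2)
print(round(r, 3))  # 0.904

# Unlike the covariance, the correlation is unchanged by a change of scale
r_rescaled = sample_corr([10 * v for v in x1], x2)
print(round(r_rescaled, 3))  # 0.904
```

The factors of *n* - 1 in the covariance and in the two standard deviations cancel, which is why they do not appear in the function.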

For a collection of *p* variables, the correlation matrix is a *p* × *p* matrix that displays the correlations between pairs of variables. For instance, the value in the \(j^{th}\) row and \(k^{th}\) column gives the correlation between variables \(x_{j}\) and \(x_{k}\). The correlation matrix is symmetric so that the value in the \(k^{th}\) row and \(j^{th}\) column is also the correlation between variables \(x_{j}\) and \(x_{k}\). The diagonal elements of the correlation matrix are all identically equal to 1.

- Sample Correlation Matrix
- The **sample correlation matrix** is denoted as **R**.
- \(\textbf{R} = \left(\begin{array}{cccc} 1 & r_{12} & \dots & r_{1p}\\ r_{21} & 1 & \dots & r_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & 1\end{array}\right)\)

## Example 1-4: Body Measurements (Correlations)

The following covariance matrix shows the pairwise covariances for the height, left forearm, left foot, and head circumference measurements of *n* = 30 female college students.

| | Height | LeftArm | LeftFoot | HeadCirc |
|---|---|---|---|---|
| Height | 8.740 | 3.022 | 2.772 | 0.289 |
| LeftArm | 3.022 | 2.402 | 1.234 | 0.223 |
| LeftFoot | 2.772 | 1.234 | 1.908 | 0.118 |
| HeadCirc | 0.289 | 0.223 | 0.118 | 3.434 |

Here are two examples of calculating a correlation coefficient:

- The correlation between height and left forearm is \(\dfrac{3.022}{\sqrt{8.74}\sqrt{2.402}}=0.66\).
- The correlation between head circumference and left foot is \(\dfrac{0.118}{\sqrt{3.434}\sqrt{1.908}}=0.046\).

The complete sample correlation matrix for this example is the following:

| | Height | LeftArm | LeftFoot | HeadCirc |
|---|---|---|---|---|
| Height | 1 | 0.66 | 0.68 | 0.053 |
| LeftArm | 0.66 | 1 | 0.58 | 0.078 |
| LeftFoot | 0.68 | 0.58 | 1 | 0.046 |
| HeadCirc | 0.053 | 0.078 | 0.046 | 1 |
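The whole correlation matrix can be obtained from the covariance matrix by dividing each entry by the product of the two corresponding standard deviations. The sketch below does this in plain Python; the variable names are assumptions for illustration, and the covariance values are taken from the lower triangle (including the diagonal) of the covariance table in this example.

```python
import math

names = ["Height", "LeftArm", "LeftFoot", "HeadCirc"]

# Lower triangle of the sample variance-covariance matrix, diagonal included
lower = [
    [8.740],
    [3.022, 2.402],
    [2.772, 1.234, 1.908],
    [0.289, 0.223, 0.118, 3.434],
]

p = len(names)
# Fill in the full symmetric matrix S from its lower triangle
S = [[lower[max(j, k)][min(j, k)] for k in range(p)] for j in range(p)]

# Standard deviations are the square roots of the diagonal entries
sd = [math.sqrt(S[j][j]) for j in range(p)]

# r_jk = s_jk / (s_j * s_k)
R = [[S[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]

for name, row in zip(names, R):
    print(name, [round(v, 3) for v in row])
```

Rounded to the precision shown in the table, the computed entries agree with the sample correlation matrix above.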

Overall, we see moderately strong linear associations among the variables height, left arm, and left foot and relatively weak (almost 0) associations between head circumference and the other three variables.

In practice, use scatter plots of the variables to understand the associations between variables fully. It is not a good idea to rely on correlations without seeing the plots. Correlation values are affected by outliers and curvilinearity.
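The effect of a single outlier can be illustrated with a small made-up data set (not from the lesson): four points with no linear association at all, to which one extreme point is added. The function name `sample_corr` is an assumption for illustration.

```python
import math


def sample_corr(x, y):
    """Sample correlation from sums of squares and cross-products of deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up data: the four corners of a square, so there is no linear association
x = [1, 1, 2, 2]
y = [1, 2, 1, 2]
r_without = sample_corr(x, y)
print(round(r_without, 3))  # 0.0

# Adding one extreme point produces a correlation close to 1
r_with = sample_corr(x + [10], y + [10])
print(round(r_with, 3))     # 0.983
```

A scatter plot of the five points would make it obvious that the large correlation is driven by one observation, which the single number alone does not reveal.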