5.3 - Inferences for Correlations

Let us consider testing the null hypothesis that there is zero correlation between two variables \(X_{j}\) and \(X_{k}\). Mathematically we write this as shown below:

\(H_0\colon \rho_{jk}=0\) against \(H_a\colon \rho_{jk} \ne 0 \)

Recall that the correlation is estimated by the sample correlation \(r_{jk}\), given in the expression below:

\(r_{jk} = \dfrac{s_{jk}}{\sqrt{s^2_js^2_k}}\)

Here we have the sample covariance between the two variables divided by the square root of the product of the individual variances.
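As a minimal sketch of the formula above, the sample correlation can be computed directly from paired observations. The function name `sample_correlation` and the five pairs of scores are illustrative assumptions; they are not the Wechsler data.

```python
import math

def sample_correlation(x, y):
    """Return r = s_xy / sqrt(s_x^2 * s_y^2) for paired observations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample variances, each with divisor n - 1.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    s_yy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical scores for five subjects (made up for illustration).
info = [7, 8, 16, 8, 6]
sim = [5, 8, 18, 3, 3]
r = sample_correlation(info, sim)
print(round(r, 4))
```

Because the divisor n - 1 appears in both the numerator and the denominator, it cancels, so the same value results from using raw sums of squares and cross-products.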

We shall assume throughout this discussion that the pairs of observations on \(X_{j}\) and \(X_{k}\) are independently sampled from a bivariate normal distribution; that is, the pairs

\(\left(\begin{array}{c}X_{1j}\\X_{1k} \end{array}\right)\), \(\left(\begin{array}{c}X_{2j}\\X_{2k} \end{array}\right)\), \(\dots\), \(\left(\begin{array}{c}X_{nj}\\X_{nk} \end{array}\right)\)

are independent draws from the same bivariate normal distribution.

To test the null hypothesis, we form the test statistic \(t\) as shown below:

\(t = r_{jk}\sqrt{\frac{n-2}{1-r^2_{jk}}}\)  \(\dot{\sim}\)  \( t_{n-2}\)

Under the null hypothesis, \(H_0\), this test statistic will be approximately distributed as t with n - 2 degrees of freedom.

Note! This approximation holds for larger samples. We will reject the null hypothesis, \(H_0\), at level \(\alpha\) if the absolute value of the test statistic, t, is greater than the critical value from the t-table with n - 2 degrees of freedom; that is, if:

\(|t| > t_{n-2, \alpha/2}\)
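The test can be sketched in a few lines of Python. The function names below are hypothetical, and the p-value routine is an assumption of this sketch: it integrates the t density numerically with Simpson's rule so that only the standard library is needed. A statistical package would normally supply the p-value or critical value instead.

```python
import math

def correlation_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), referred to t with n - 2 df."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, intervals=2000):
    """Approximate P(|T| > |t|) using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / intervals  # intervals must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_area)

def reject_null(r, n, critical_value):
    """Apply the rule |t| > t_{n-2, alpha/2} with a tabled critical value."""
    return abs(correlation_t_statistic(r, n)) > critical_value

# Hypothetical inputs: r = 0.5 from n = 20 pairs, with the tabled
# critical value 2.101 for 18 df at alpha = 0.05.
t_stat = correlation_t_statistic(0.5, 20)
print(t_stat, two_sided_p_value(t_stat, 18), reject_null(0.5, 20, 2.101))
```

Comparing \(|t|\) with a tabled critical value and comparing the two-sided p-value with \(\alpha\) lead to the same decision; the sketch offers both so either style of table or output can be checked.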

To illustrate these concepts let's return to our example dataset, the Wechsler Adult Intelligence Scale.

Example 5-5: Wechsler Adult Intelligence Scale

Using SAS

These data were analyzed using the SAS program in our last lesson (Multivariate Normal Distribution), which yielded the computer output below.

Recall that these are data on n = 37 subjects taking the Wechsler Adult Intelligence Test. This test was broken up into four components:

  • Information
  • Similarities
  • Arithmetic
  • Picture Completion

Looking at the computer output we have summarized the correlations among variables in the table below:

 
               Information   Similarities   Arithmetic   Picture
Information        1.00000        0.77153      0.56583   0.31816
Similarities       0.77153        1.00000      0.51295   0.08135
Arithmetic         0.56583        0.51295      1.00000   0.27988
Picture            0.31816        0.08135      0.27988   1.00000

For example, the correlation between Similarities and Information is 0.77153.

Let's consider testing the null hypothesis that there is no correlation between Information and Similarities. This would be written mathematically as shown below:

\(H_0\colon \rho_{12}=0\)

We can then substitute the values from this example into the formula to compute the test statistic:

\begin{align} t &= r_{jk}\sqrt{\frac{n-2}{1-r^2_{jk}}}\\[10pt] &= 0.77153 \sqrt{\frac{37-2}{1-0.77153^2}}\\[10pt] &= 7.175 \end{align}
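For readers who want to check the arithmetic, the intermediate values in this calculation work out as follows (each rounded for display):

\begin{align} 1-r^2_{12} &= 1-0.77153^2 \approx 1-0.59526 = 0.40474\\[5pt] \frac{n-2}{1-r^2_{12}} &\approx \frac{35}{0.40474} \approx 86.475\\[5pt] \sqrt{86.475} &\approx 9.2992\\[5pt] t &\approx 0.77153 \times 9.2992 \approx 7.175 \end{align}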

Looking at our t-table for 35 degrees of freedom and an \(\alpha\) level of 0.005, we need the critical value \(t_{(df, 1-\alpha/2)} = t_{35, 0.9975}\), which is found in the column for a right-tail probability of \(\alpha/2 = 0.0025\). Because 35 degrees of freedom does not appear in the table, we use the closest df that does not exceed 35, which is 30; the tabled value is 3.030. The critical value \(t_{35, 0.9975}\) is therefore close to 3.030, and slightly smaller, since critical values decrease as the degrees of freedom increase, so using 3.030 is conservative.

Note! Some text tables provide the right tail probability (the graph at the top will have the area in the right tail shaded in) while other texts will provide a table with the cumulative probability - the graph will be shaded into the left. The concept is the same. For example, if alpha was 0.01 then using the first text you would look under 0.005 and in the second text look under 0.995.

Because

\(|t| = 7.175 > 3.030 \approx t_{35, 0.9975}\),

we can reject the null hypothesis that Information and Similarities scores are uncorrelated at the \(\alpha = 0.005\) level.

Our conclusion here is that Similarities scores increase with increasing Information scores (t = 7.175; d.f. = 35; p < 0.0001). You will note here that we are not simply concluding that the results are significant. When drawing conclusions it is never adequate to simply state that the results are significant. In all cases, you should seek to describe what the results tell you about the data.

In this case, because we rejected the null hypothesis, we can conclude that the correlation is not equal to zero. Furthermore, because the actual sample correlation is greater than zero and our p-value is so small, we can conclude that there is a positive association between the two variables, and hence our conclusion that Similarities scores tend to increase with increasing values of Information scores.

You will also note that the conclusion includes information from the test. You should always back up your conclusions with the appropriate evidence: the test statistic, degrees of freedom (if appropriate), and p-value. Here the appropriate evidence is given by the test statistic, t = 7.175; the degrees of freedom for the test, 35; and the p-value, less than 0.0001, as indicated by the computer printout. The p-value appears below each correlation coefficient in the SAS output.

Confidence Interval for \(\rho_{jk}\)

Once we conclude that there is a positive or negative correlation between two variables, the next thing we might want to do is compute a confidence interval for the correlation. This confidence interval will give us a range of reasonable values for the correlation itself.

The sample correlation, because it is bounded between -1 and 1, is typically not normally distributed or even approximately so. If the population correlation is near zero, the distribution of sample correlations may be approximately bell-shaped around zero. However, if the population correlation is near +1 or -1, the distribution of sample correlations will be skewed. For example, if \(\rho_{jk} = 0.9\), the sample correlations will be concentrated near 0.9. Because they cannot exceed 1, they have more room to spread out to the left of 0.9, which causes a left-skewed shape.

To adjust for this skewness, we apply a transformation to the correlation coefficient. In particular, we are going to apply Fisher's transformation, which is given in Step 1 of our procedure for computing confidence intervals for the correlation coefficient.
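The skewness described above can be seen in a small simulation. This is only an illustrative sketch: the sample size of 20, the 2000 replications, the seed, and all function names are assumptions made here. It draws bivariate normal samples with \(\rho = 0.9\) and compares the skewness of the sample correlations with the skewness of their Fisher transforms; `math.atanh(r)` equals \(\frac{1}{2}\log\frac{1+r}{1-r}\).

```python
import math
import random

def simulate_pair_sample(rho, n, rng):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho * rho) * z2)
    return xs, ys

def sample_correlation(x, y):
    """Sample correlation from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(505)  # fixed seed so the run is repeatable
rs = [sample_correlation(*simulate_pair_sample(0.9, 20, rng))
      for _ in range(2000)]
zs = [math.atanh(r) for r in rs]  # Fisher's transformation of each r
skew_r = skewness(rs)
skew_z = skewness(zs)
print(f"skewness of r: {skew_r:.2f}, skewness of Fisher z: {skew_z:.2f}")
```

A clearly negative skewness for r alongside a skewness near zero for the transformed values is the pattern that motivates working on the transformed scale.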

Steps

  1. Compute Fisher’s transformation

    \(z_{jk}=\frac{1}{2}\log\dfrac{1+r_{jk}}{1-r_{jk}}\)

    Here we have one half of the natural log of the ratio of 1 plus the correlation to 1 minus the correlation.

    Note! In this course, whenever log is mentioned, unless specified otherwise, log stands for the natural log.

    For large samples, this transformed correlation coefficient z is approximately normally distributed, with mean equal to the same transformation of the population correlation, as shown below, and variance equal to 1 over the sample size minus 3.

    \(z_{jk}\) \(\dot{\sim}\) \(N\left(\dfrac{1}{2}\log\dfrac{1+\rho_{jk}}{1-\rho_{jk}}, \dfrac{1}{n-3}\right)\)

  2. Compute a \((1 - \alpha) \times 100\%\) confidence interval for the Fisher transform of the population correlation,

    \(\dfrac{1}{2}\log \dfrac{1+\rho_{jk}}{1-\rho_{jk}}\)

    That is, one half of the log of the ratio of 1 plus the population correlation to 1 minus the population correlation. This confidence interval is given by the expression below:

    \(\left(\underset{Z_l}{\underbrace{Z_{jk}-\frac{Z_{\alpha/2}}{\sqrt{n-3}}}}, \underset{Z_U}{\underbrace{Z_{jk}+\frac{Z_{\alpha/2}}{\sqrt{n-3}}}}\right)\)

    Here we take the value of Fisher's transform Z, plus and minus the critical value from the z table divided by the square root of n - 3. The lower bound we will call \(Z_{l}\) and the upper bound we will call \(Z_{U}\).

  3. Back-transform the confidence limits to obtain the desired confidence interval for \(\rho_{jk}\). This is given in the expression below:

    \(\left(\dfrac{e^{2Z_l}-1}{e^{2Z_l}+1},\dfrac{e^{2Z_U}-1}{e^{2Z_U}+1}\right)\)

    The first term is a function of the lower bound, \(Z_{l}\). The second term is a function of the upper bound, \(Z_{U}\).
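The three steps above can be sketched as a single Python function. The name `fisher_confidence_interval` and its arguments are illustrative assumptions. The sketch relies on two identities: `math.atanh(r)` is \(\frac{1}{2}\log\frac{1+r}{1-r}\), and `math.tanh(z)` is \(\frac{e^{2z}-1}{e^{2z}+1}\). The normal critical value comes from the standard library's `statistics.NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist

def fisher_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for a population correlation via Fisher's transform."""
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    alpha = 1.0 - confidence
    # Step 1: Fisher's transformation of the sample correlation.
    z = math.atanh(r)
    # Step 2: interval on the transformed scale, z -/+ z_{alpha/2} / sqrt(n - 3).
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit / math.sqrt(n - 3)
    z_lower, z_upper = z - half_width, z + half_width
    # Step 3: back-transform both limits to the correlation scale.
    return math.tanh(z_lower), math.tanh(z_upper)

# Hypothetical inputs: r = 0.5 observed on n = 50 pairs.
lower, upper = fisher_confidence_interval(0.5, 50)
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

Because the back-transform is nonlinear, the resulting interval is generally not symmetric about r, and its limits always stay inside (-1, 1).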

Let's return to the Wechsler Adult Intelligence Data to see how these procedures are carried out.

Example 5-6: Wechsler Adult Intelligence Data

Recall that the sample correlation between Similarities and Information was \(r_{12} = 0.77153\).

Step 1: Compute the Fisher transform:

\begin{align} Z_{12} &= \frac{1}{2}\log \frac{1+r_{12}}{1-r_{12}}\\[5pt] &= \frac{1}{2}\log\frac{1+0.77153}{1-0.77153}\\[5pt] &= 1.024 \end{align}

You should confirm this value on your own.
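One way to carry out that check by hand is to evaluate the ratio first, then the natural log, then the factor of one half (values rounded for display):

\begin{align} \frac{1+r_{12}}{1-r_{12}} &= \frac{1.77153}{0.22847} \approx 7.7539\\[5pt] \log 7.7539 &\approx 2.0482\\[5pt] Z_{12} &\approx \frac{2.0482}{2} = 1.0241 \end{align}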

Step 2: Next, compute the 95% confidence interval for the Fisher transform, \(\frac{1}{2}\log \frac{1+\rho_{12}}{1-\rho_{12}}\) :

\begin{align} Z_l &=  Z_{12}-Z_{0.025}/\sqrt{n-3} \\ &= 1.024 - \frac{1.96}{\sqrt{37-3}} \\ &= 0.6880 \end{align}

\begin{align} Z_U &=  Z_{12}+Z_{0.025}/\sqrt{n-3} \\&= 1.024 + \frac{1.96}{\sqrt{37-3}} \\&= 1.3602 \end{align}

In other words, we take the Fisher transform, 1.024 (1.0241 before rounding), and add or subtract the critical value from the normal table at \(\alpha/2 = 0.025\), which in this case is 1.96, divided by the square root of n minus 3. Subtracting this quantity from the unrounded transform yields the lower bound of 0.6880. Adding it yields the upper bound of 1.3602.

Step 3: Carry out the back-transform to obtain the 95% confidence interval for \(\rho_{12}\). This is shown in the expression below:

\(\left(\dfrac{\exp\{2Z_l\}-1}{\exp\{2Z_l\}+1},\dfrac{\exp\{2Z_U\}-1}{\exp\{2Z_U\}+1}\right)\) 

\(\left(\dfrac{\exp\{2 \times 0.6880\}-1}{\exp\{2 \times 0.6880\}+1},\dfrac{\exp\{2\times 1.3602\}-1}{\exp\{2\times 1.3602\}+1}\right)\)

\((0.5967,0.8764)\)

This yields the interval from 0.5967 to 0.8764.

Conclusion: In this case, we can conclude that we are 95% confident that the interval (0.5967, 0.8764) contains the population correlation between Information and Similarities scores.

Note! Pay attention to the interpretation of this interval. We did not say that, with 95% probability, the correlation between Information and Similarities lies within the interval. That statement would imply that the population correlation is random. In fact, the population correlation is a fixed quantity. The only random quantities are the bounds of the confidence interval, which are functions of the random data and so are themselves random before the data are observed. After the data are observed and the interval is computed, there is no longer any random quantity. Therefore, we use the word "confidence", rather than "probability", to describe the parameter falling inside the interval.
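The idea that the bounds are random while the parameter is fixed can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the Wechsler analysis: it fixes a hypothetical population correlation of 0.75, repeatedly draws samples of n = 37 pairs from a bivariate normal distribution, computes the Fisher interval each time, and records how often the interval covers the fixed value. All names, the seed, and the number of repetitions are chosen here for illustration.

```python
import math
import random
from statistics import NormalDist

def draw_correlation(rho, n, rng):
    """Sample correlation of n bivariate normal pairs with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho * rho) * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_interval(r, n, confidence=0.95):
    """Fisher-transform confidence interval for the population correlation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_crit / math.sqrt(n - 3)
    z = math.atanh(r)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def coverage(rho, n, repetitions, seed):
    """Fraction of simulated intervals that contain the fixed rho."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        lower, upper = fisher_interval(draw_correlation(rho, n, rng), n)
        hits += lower <= rho <= upper
    return hits / repetitions

rate = coverage(rho=0.75, n=37, repetitions=2000, seed=505)
print(f"fraction of intervals covering rho: {rate:.3f}")
```

Each simulated interval either contains 0.75 or it does not; the 95% describes the long-run proportion of intervals that do, which is what "confidence" refers to.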
