Lesson 5: Sample Mean Vector and Sample Correlation and Related Inference Problems

Overview

In this lesson, we consider the properties of the sample mean vector and the sample correlations defined earlier. We also consider estimation and hypothesis testing problems for the population mean and the population correlation coefficients.

This lesson will explore the following questions...

Sample Mean Vectors

  • What is the distribution of  \(\bar{x}\) when data come from a multivariate normal distribution?
  • What are the properties of \(\bar{x}\) when data do NOT come from a multivariate normal distribution but the sample size n is large?
  • How do we construct a confidence interval for a single multivariate normal population mean?
  • How do we construct confidence intervals for several multivariate normal population means simultaneously?

Sample Correlations

  • How can we test the null hypothesis that there is zero correlation between two variables?
  • What can we conclude from such hypothesis tests?
  • How can we assess uncertainty regarding estimated correlation coefficients using confidence intervals?
  • What is the appropriate interpretation for confidence intervals?

Objectives

Upon completion of this lesson, you should be able to:

  • Describe the distribution of \(\bar{x}\) when data come from a Multivariate Normal distribution
  • Construct simultaneous confidence intervals for several multivariate normal population means
  • Conduct hypothesis testing on the correlation between two variables
  • Assess confidence intervals for correlation between two variables

5.1 - Distribution of Sample Mean Vector

As noted previously \(\bar{\textbf{x}}\) is a function of random data, and hence \(\bar{\textbf{x}}\) is also a random vector with a mean, a variance-covariance matrix, and a distribution. We have already seen that the mean of the sample mean vector is equal to the population mean vector \(\boldsymbol{\mu}\).

Variance

Before considering the sample variance-covariance matrix for the mean vector \(\bar{\textbf{x}}\), let us revisit the univariate setting.

Univariate Setting

You should recall from introductory statistics that the population variance of the sample mean, generated from independent samples of size n, is equal to the population variance, \(\sigma^{2}\) divided by n.

\(\text{var}(\bar{x}) = \dfrac{\sigma^2}{n}\)

This, of course, is a function of the unknown population variance \(\sigma^{2}\). We can estimate it by substituting the sample variance \(s^{2}\) for the population variance \(\sigma^{2}\), yielding our estimate of the variance of the sample mean as shown below:

\(\widehat{\text{var}}(\bar{x}) = \dfrac{s^2}{n}\)

If we were to take the square root of this quantity we would obtain the standard error of the mean. The standard error of the mean is a measure of the uncertainty of our estimate of the population mean. If the standard error is large, then we are less confident of our estimate of the mean. Conversely, if the standard error is small, then we are more confident in our estimate. What is meant by large or small depends on the application at hand. But in any case, because the standard error is a decreasing function of sample size, the larger our sample the more confident we can be of our estimate of the population mean.

Multivariate Setting

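In the multivariate setting, the population variance-covariance matrix \(\Sigma\) takes the place of the variance \(\sigma^{2}\). For sample mean vectors generated from independent samples of size n, the variance-covariance matrix of \(\bar{\textbf{x}}\) takes a form similar to the univariate setting, as shown below: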

\(\text{var}(\bar{\textbf{x}}) = \dfrac{1}{n}\Sigma\)
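
As a brief check of where the factor \(\frac{1}{n}\) comes from, the result can be sketched in a few lines. This sketch assumes only that the data vectors \(\textbf{X}_1, \textbf{X}_2, \dots, \textbf{X}_n\) are independent and share the variance-covariance matrix \(\Sigma\); independence is what allows the variance of the sum to be written as the sum of the variances in the second line.

\begin{align} \text{var}(\bar{\textbf{x}}) &= \text{var}\left(\frac{1}{n}\sum_{i=1}^{n}\textbf{X}_i\right)\\[5pt] &= \frac{1}{n^2}\sum_{i=1}^{n}\text{var}(\textbf{X}_i)\\[5pt] &= \frac{1}{n^2}\, n\Sigma = \frac{1}{n}\Sigma \end{align}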

Again, this is a function of the unknown population variance-covariance matrix \(\Sigma\). An estimate of the variance-covariance matrix of \(\bar{\textbf{x}}\) can be obtained by substituting the sample variance-covariance matrix S for the population variance-covariance matrix \(\Sigma\), yielding the estimate as shown below:

\(\widehat{\text{var}}(\bar{\textbf{x}}) = \dfrac{1}{n}\textbf{S}\)
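
To make the estimate concrete, the following is a minimal Python sketch (Python is used here only for illustration; the lesson itself works in SAS, Excel, and Minitab). It computes the sample variance-covariance matrix S with divisor n - 1 and then divides by n. The function names and the tiny data set are hypothetical and are not part of the lesson's data.

```python
from statistics import mean


def sample_covariance(rows):
    """Sample variance-covariance matrix S (divisor n - 1) for a list of data rows."""
    n = len(rows)
    p = len(rows[0])
    means = [mean(row[j] for row in rows) for j in range(p)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in rows) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


def estimated_var_of_mean(rows):
    """Estimated variance-covariance matrix of the sample mean vector, S / n."""
    n = len(rows)
    return [[s_jk / n for s_jk in s_row] for s_row in sample_covariance(rows)]


# Hypothetical toy data: n = 4 observations on p = 2 variables.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 9.0]]
var_xbar = estimated_var_of_mean(toy)
for row in var_xbar:
    print([round(value, 4) for value in row])
```

The square roots of the diagonal entries of the result are the standard errors of the individual sample means.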

Distribution

Let's consider the distribution of the sample mean vector, first looking at the univariate setting and comparing this to the multivariate setting.

Univariate Setting

Here we are going to make the additional assumption that \(X_1, X_2, \dots, X_n\) are independently sampled from a normal distribution with mean \(\mu\) and variance \(\sigma^{2}\). In this case, \(\bar{x}\) is normally distributed as

 

\(\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)

Multivariate Setting

Similarly, for the multivariate setting, we are going to assume that the data vectors \(\textbf{X}_1, \textbf{X}_2, \dots, \textbf{X}_n\) are independently sampled from a multivariate normal distribution with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\Sigma\). In this case, the sample mean vector, \(\bar{\textbf{x}}\), is distributed as multivariate normal with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\frac{1}{n}\Sigma\). In statistical notation we write:

 

\(\bar{\textbf{x}} \sim N \left( \boldsymbol {\mu}, \frac{1}{n}\Sigma \right)\)

Law of Large Numbers

At this point, we will drop the assumption that the individual observations are sampled from a normal distribution and look at the laws of large numbers. These will hold regardless of the distribution of the individual observations.

Univariate Setting

In the univariate setting, if the data are independently sampled, then the sample mean, \(\bar{x}\), converges (in probability) to the population mean \(\mu\). What does this mean exactly? It means that as the sample size gets larger and larger, the sample mean will tend to approach the true population mean \(\mu\).

 

Multivariate Setting

A similar result holds in the multivariate setting: the sample mean vector, \(\bar{\textbf{x}}\), also converges (in probability) to the mean vector \(\boldsymbol{\mu}\). As our sample size gets larger and larger, each of the individual components of that vector, \(\bar{x}_{j}\), converges to the corresponding mean, \(\mu_{j}\).

 

\(\bar{x}_j \stackrel{p}\rightarrow \mu_j\)

Central Limit Theorem

Just as in the univariate setting we also have a multivariate Central Limit Theorem. But first, let's review the univariate Central Limit Theorem.

Univariate Setting

If all of our individual observations, \(X_1, X_2, \dots, X_n\), are independently sampled from a population with mean \(\mu\) and variance \(\sigma^{2}\), then, for large n, the sample mean, \(\bar{x}\), is approximately normally distributed with mean \(\mu\) and variance \(\frac{1}{n}\sigma^2\).

Note! In the distribution property described above, normality was a requirement. Under normality, the sample mean is exactly normally distributed, even for small samples. The Central Limit Theorem is a more general result that holds regardless of the distribution of the original data. The significance of the CLT lies in the fact that the sample mean is approximately normally distributed for large samples, whatever the distribution of the individual observations.
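
A small simulation can make the univariate Central Limit Theorem tangible. The sketch below is illustrative only: the Exponential(1) population (mean 1, variance 1), the sample size of 50, the 2000 replications, and the function name are all assumptions chosen for the demonstration, not part of the lesson. It checks that the simulated sample means center near \(\mu\) with variance near \(\sigma^2/n\), even though the population is strongly skewed.

```python
import random
from statistics import mean, variance


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n from an Exponential(1) population; return their means."""
    rng = random.Random(seed)
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]


n = 50
sample_means = simulate_sample_means(n=n, reps=2000)
center = mean(sample_means)      # should be close to mu = 1
spread = variance(sample_means)  # should be close to sigma^2 / n = 1 / 50
print(f"mean of sample means:     {center:.3f} (theory: 1.000)")
print(f"variance of sample means: {spread:.4f} (theory: {1 / n:.4f})")
```

A histogram of `sample_means` would look roughly bell-shaped, which is the approximate normality the theorem describes.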

Multivariate Setting

A similar result is available in the multivariate setting. If our data vectors \(\textbf{X}_1, \textbf{X}_2, \dots, \textbf{X}_n\) are independently sampled from a population with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\Sigma\), then for large n the sample mean vector, \(\bar{\textbf{x}}\), is approximately normally distributed with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\frac{1}{n}\Sigma\).

This Central Limit Theorem is a key result that we will take advantage of later on in this course when we talk about hypothesis tests for individual mean vectors or collections of mean vectors under different treatment regimens.


5.2 - Interval Estimate of Population Mean

Here we consider the joint estimation of a multivariate set of population means. That is, we have observed a set of p X-variables and may wish to estimate the population mean for each variable. In some instances, we may also want to estimate one or more linear combinations of population means. Our basic tool for estimating the unknown value of a population parameter is a confidence interval, an interval of values that is likely to include the unknown value of the parameter.

General Format for a Confidence Interval

The general format of a confidence interval estimate of a population mean is:

\(\text{Sample mean} \pm \text{Multiplier} \times \text{Standard error of mean}\)

For variable \(X_{j}\), a confidence interval estimate of its population mean \(\mu_{j}\) is

\(\bar{x}_j \pm \text{Multiplier}\dfrac{s_j}{\sqrt{n}}\)

In this formula, \(\bar{x}_{j}\) is the sample mean, \(s_{j}\) is the sample standard deviation and n is the sample size. The multiplier value is a function of the confidence level, the sample size, and the strategy used for dealing with the multiple inference issue.

Strategies for Determining the Multiplier

The following list covers some common strategies:

  1. One-at-a-Time Confidence Intervals: This strategy essentially considers each mean separately and uses the desired confidence level (usually 95%) for every single interval.
  2. Bonferroni Method: With this method, we set a family-wide error rate and then divide this family error rate by the number of intervals to be computed to determine the error rate (and hence confidence level) for each individual interval.
  3. Simultaneous Confidence Region: This strategy uses properties of the multivariate normal distribution to define joint confidence intervals. The multiplier for this method is conservative because the family error rate applies to the family of all possible linear combinations of population means.

One at a Time Intervals

For a \(1 - \alpha\) confidence interval, the “one at a time” multiplier is the t-value such that the probability is \(1 - \alpha\) between –t and +t under a t-distribution with n - 1 degrees of freedom. Said another way, the value of t is such that the probability greater than +t is \(\alpha/2\).

Notationally, the one-at-a-time multiplier is:

\(\text{Multiplier} = t_{n-1}(\alpha/2)\)

With this notation, a confidence interval for \(\mu_{j}\) is computed as:

\(\bar{x}_j \pm t_{n-1}(\alpha/2)\frac{s_j}{\sqrt{n}}\)

Note! The notation for the t-multiplier can be confusing because it varies between textbooks and statistical software. For instance, Excel’s command to determine the t-value requires that you give the two-tailed probability \(\alpha\), whereas SAS requires that you give the cumulative probability \(1 - \alpha / 2\) for the desired t-value.

Example 5-1: One at a Time Intervals

Suppose that the sample size is n = 25 and we want a 95% confidence interval for the population mean. Thus \(\alpha = 0.05\). Our textbook would write the multiplier as \(t_{24}(.025)\). In Excel, the command =TINV(.05,24) will give the multiplier (value = 2.064). In SAS, a command such as t1=tinv(.975,24) will make the variable t1 that contains the desired multiplier.

Bonferroni Method Multiplier

When we determine confidence intervals for the population means of several variables, we are creating a family of confidence intervals. The family-wide error rate is the probability that at least one of the confidence intervals in the family will not capture the population mean. The family-wide confidence level = 1 – family-wide error rate.

Suppose that we have a family of p confidence intervals and the error rates for the individual intervals are \(\alpha_1, \alpha_2, \dots, \alpha_p\). The Bonferroni Inequality states that the family-wide error rate is less than or equal to the sum of \(\alpha_1, \alpha_2, \dots, \alpha_p\). That is, family-wide error rate \(\leq \sum \alpha_i\). In terms of the family-wide confidence that all intervals capture their population means, we can write this as \(1 - \sum \alpha_i \leq\) family-wide confidence level.

Most often, we divide the desired family-wide error rate equally across the intervals that we will compute. If we are computing p confidence intervals with a desired family-wide error rate of \(\alpha\), we use an error rate of \(\alpha / p\) for each individual interval (so each interval has confidence \(1 - \alpha / p\)). This guarantees that the family-wide confidence level will be greater than or equal to \(1 - \alpha\).

Suppose that we are calculating p intervals with a family error rate equal to \(\alpha\).

Notationally, the Bonferroni method multiplier is:

\(\text{Multiplier} = t_{n-1}(\alpha/(2p))\)

A confidence interval for \(\mu_{j}\) is computed as:

\(\bar{x}_j \pm t_{n-1}(\alpha/(2p))\frac{s_j}{\sqrt{n}}\)

Example 5-2: Bonferroni Method Multiplier

Suppose that n = 25. The family-wide error rate is 5%, for a family confidence of 95%. We are computing intervals for p = 5 means. The error rate for each interval will be .05/5 = 1%. We might use the Excel command =TINV(.01,24) to find that the multiplier = 2.797. In SAS, we use the cumulative probability \(1 - \alpha/(2p) = 0.995\), so the command for finding the t-multiplier in this instance is something like t1=tinv(.995, 24).
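
Because the two software conventions are easy to mix up, the arithmetic behind the arguments can be sketched as follows. This is only an illustration in Python; the function name is hypothetical, and the code computes the probabilities to pass to a t-quantile routine rather than the t-value itself.

```python
def bonferroni_tail_arguments(alpha, p):
    """Return (per-interval error rate, upper-tail probability, cumulative probability)."""
    per_interval = alpha / p         # two-tailed error rate for each interval
    upper_tail = alpha / (2 * p)     # probability above +t
    cumulative = 1 - upper_tail      # probability below +t
    return per_interval, upper_tail, cumulative


per_interval, upper_tail, cumulative = bonferroni_tail_arguments(alpha=0.05, p=5)
print(per_interval, upper_tail, cumulative)
```

With a family error rate of .05 and p = 5, the first value corresponds to the two-tailed argument used in the Excel command above, and the third to the cumulative probability used in the SAS command.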

Simultaneous Confidence Region Multiplier

This method is derived from the properties of the multivariate normal distribution. The multiplier applies to the family of all possible linear combinations of the population means considered, including the individual means. It is conservative (meaning that the multiplier tends to be larger than absolutely necessary). When family confidence is used, compare the value of this multiplier to the Bonferroni method multiplier and use the smaller of the two.

Notationally, the simultaneous confidence region multiplier is:

\(\text{Multiplier}=\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\)

\(F _ { p , n - p } ( \alpha )\) represents a value of F such that the probability greater than this value is α under an F-distribution with p and n - p degrees of freedom.

Example 5-3: Simultaneous Confidence Region Multiplier

Suppose that we have a sample size of n = 25 and we have p = 3 variables. With a 5% family error rate (and 95% family confidence), the F-value can be found in Excel using = FINV(.05, 3, 22) = 3.049. SAS uses cumulative probabilities so in this case, a command like f1= FINV(.95,3, 22) would make f1 the F-value. The multiplier in this example is

\(\sqrt{\frac{3(25-1)}{25-3}3.049}=3.159\)
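
The same arithmetic can be checked with a short Python sketch. The function name is hypothetical, and the F critical value 3.049 is simply taken from the example above rather than computed, since obtaining it requires an F-distribution quantile routine.

```python
import math


def simultaneous_multiplier(n, p, f_value):
    """sqrt(p(n-1)/(n-p) * F), where f_value is the upper-alpha F critical value with (p, n-p) df."""
    return math.sqrt(p * (n - 1) / (n - p) * f_value)


multiplier = simultaneous_multiplier(n=25, p=3, f_value=3.049)
print(round(multiplier, 3))
```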

This multiplier could be used for all confidence intervals for parameters that are linear combinations of the three population means (and for the three individual means).

Summary of Multipliers

The following table summarizes the three different multipliers and gives notes about using Excel and SAS.

| Method | Textbook notation for multiplier | Excel notes | SAS notes |
|---|---|---|---|
| One at a time: confidence = \(1 - \alpha\) for each interval | \(t_{n-1}(\alpha/2)\) | To determine the t-value, enter the equation TINV(\(\alpha\), df) | To determine the t-value, create t1 = tinv(\(1 - \alpha/2\), n-1) |
| Bonferroni method: confidence = \(1 - \alpha\) for whole family | \(t_{n-1}(\alpha/(2p))\) | To determine the t-value, enter the equation TINV(\(\alpha/p\), df) | To determine the t-value, create t2 = tinv(\(1 - \alpha/(2p)\), n-1) |
| Multivariate simultaneous intervals | \(\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\) | To determine the F-value, enter the equation FINV(\(\alpha\), num df, denom df) | To determine the F-value, create F = finv(\(1 - \alpha\), p, n-p) |

Example 5-4

This example uses the dataset that includes mineral content measurements at three different arm bone locations for n = 25 women. We’ll determine confidence intervals for the three different population means. Sample means and standard deviations for the three variables are:

Dataset: mineral.csv

Simple Statistics
Variable N Mean Std Dev
domradius 25 0.84380 0.11402
domhumerus 25 1.79268 0.28347
domulna 25 0.70440 0.10756

The solutions using each of the three methods follow.

One-at-a-Time Confidence Intervals

We’ll use a .95 confidence level for each interval. With n = 25, df = 24 and \(t_{24}(.025) = 2.064\). This can be found in Excel as =TINV(.05,24).

The confidence intervals have the form \(\bar{x}_j \pm 2.064\dfrac{s_j}{\sqrt{n}}\). Intervals are the following.

  • For dominant radius:

    \(0.84380 \pm 2.064 \dfrac{0.11402}{\sqrt{25}}\) which is 0.797 to 0.891

  • For dominant humerus:

    \(1.79268 \pm 2.064 \dfrac{0.28347}{\sqrt{25}}\) which is 1.676 to 1.910

  • For dominant ulna:

    \(0.70440 \pm 2.064\dfrac{0.10756}{\sqrt{25}}\) which is 0.660 to 0.749

Bonferroni Method

We’ll use a family-wide confidence level of .95, so the family error rate = .05. For each interval, the error rate = .05/3 = 0.0167, which puts .05/(2 × 3) = .008333 in each tail. The multiplier is \(t_{24}(.008333) = 2.574\), which can be found in Excel as =TINV(.05/3,24).

The confidence intervals have the form \(\bar{x}_j \pm 2.574\dfrac{s_j}{\sqrt{n}}\). Intervals are the following.

  • For dominant radius:

    \(0.84380 \pm 2.574 \dfrac{0.11402}{\sqrt{25}}\) which is 0.785 to 0.902

  • For dominant humerus:

    \(1.79268 \pm 2.574 \dfrac{0.28347}{\sqrt{25}}\) which is 1.647 to 1.939

  • For dominant ulna:

    \(0.70440 \pm 2.574 \dfrac{0.10756}{\sqrt{25}}\) which is 0.649 to 0.760

Simultaneous Confidence Region

The necessary multiplier is \(\sqrt{\dfrac{3(25-1)}{25-3}3.049} = 3.159\), where 3.049 is the F-value. (See Example 5-3 above for details.)

The confidence intervals have the form \(\bar{x}_j \pm 3.159 \dfrac{s_j}{\sqrt{n}}\). Intervals are the following.

  • For dominant radius:

    \(0.84380 \pm 3.159 \dfrac{0.11402}{\sqrt{25}}\) which is 0.772 to 0.916

  • For dominant humerus:

    \(1.79268 \pm 3.159 \dfrac{0.28347}{\sqrt{25}}\) which is 1.614 to 1.972

  • For dominant ulna:

    \(0.70440 \pm 3.159 \dfrac{0.10756}{\sqrt{25}}\) which is 0.636 to 0.772
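
The hand calculations above can be reproduced with a short Python sketch. It is offered only as an illustration: the means and standard deviations are those in the Simple Statistics table, the three multipliers are the rounded values quoted in this example (they are not recomputed here), and the variable and function names are hypothetical.

```python
import math

n = 25

# (mean, standard deviation) for each variable, from the Simple Statistics table.
stats = {
    "domradius": (0.84380, 0.11402),
    "domhumerus": (1.79268, 0.28347),
    "domulna": (0.70440, 0.10756),
}

# Multipliers quoted in the example for 95% confidence with n = 25 and p = 3.
multipliers = {"one-at-a-time": 2.064, "Bonferroni": 2.574, "simultaneous": 3.159}


def interval(xbar, s, n, multiplier):
    """Confidence limits: sample mean +/- multiplier * s / sqrt(n)."""
    half_width = multiplier * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


intervals = {
    method: {name: interval(xbar, s, n, m) for name, (xbar, s) in stats.items()}
    for method, m in multipliers.items()
}

for method, by_variable in intervals.items():
    for name, (lower, upper) in by_variable.items():
        print(f"{method:14s} {name:11s} {lower:.3f} to {upper:.3f}")
```

For every variable the intervals widen from one-at-a-time to Bonferroni to simultaneous, reflecting the increasing multipliers.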

Steve Rathbun, formerly of Penn State, wrote SAS code to generate confidence intervals for population means using the three methods discussed here. The code reads a dataset, reshapes it to have a data line for each variable value, determines means and standard deviations, and then calculates and prints the three types of intervals. To use this code for different situations, you need only change the third line, where the value of p is set, and the data step where the data set is read and reshaped.

The output of the program is below. It includes the sample mean and variance for each variable and the three confidence intervals. Limits for the one-at-a-time intervals are given as loone and upone. Limits for the Bonferroni method are given as lobon and upbon. Limits for the simultaneous confidence region method are given as losim and upsim.

 
Obs variable _TYPE_ _FREQ_ n xbar s2 t1 tb f loone upone lobon upbon losim upsim
1 domhumeru 0 25 25 1.79268 0.080357 2.06390 2.57364 3.04912 1.67567 1.90969 1.64677 1.93859 1.61358 1.97178
2 domradius 0 25 25 0.84380 0.013002 2.06390 2.57364 3.04912 0.79673 0.89087 0.78511 0.90249 0.77176 0.91584
3 dumulna 0 25 25 0.70440 0.011568 2.06390 2.57364 3.04912 0.66000 0.74880 0.64904 0.75976 0.63645 0.77235

Walkthrough of the Three Methods using Minitab

To calculate 95% one-at-a-time confidence intervals:

  1. Open the ‘mineral’ data set in a new worksheet.
  2. Stat > Basic Statistics > 1-Sample t
    1. Choose ‘One or more sample’ in the first window.
    2. Highlight and select ‘domradius’ and any other variables of interest to move them into the window on the right.
    3. Under ‘Options’, choose 95.0 for the confidence level and select Mean not equal to hypothesized mean.
  3. Select ‘OK’ twice. The one-at-a-time intervals are displayed in the results area.

To calculate simultaneous 95% confidence intervals via the Bonferroni method:

  1. Open the ‘mineral’ data set in a new worksheet.
  2. Stat > Basic Statistics > 1-Sample t
    1. Choose ‘One or more sample’ in the first window.
    2. Highlight and select ‘domradius’, ‘domhumerus’, and ‘domulna’ to move them into the window on the right.
    3. Under ‘Options’, enter 0.9833, which corresponds to 1-0.05/3, the adjusted individual confidence level for simultaneous 95% confidence with the Bonferroni method.
    4. Select Mean not equal to hypothesized mean.
  3. Select ‘OK’ twice. The 95% Bonferroni intervals are displayed in the results area.

To calculate simultaneous 95% confidence intervals with the F-multipliers (based on the T-squared distribution):

  1. Open the ‘mineral’ data set in a new worksheet.
  2. Find the mean, standard deviation, and sample size needed for the calculations.
    1. Stat > Basic Statistics > Display Basic Statistics
    2. Highlight and select ‘domradius’, ‘domhumerus’, and ‘domulna’ to move them into the window on the right.
    3. Under ‘Statistics’, choose the mean, standard deviation, and ‘N nonmissing’.
    4. Select ‘OK’ twice. The statistics are displayed in the results area.
  3. Find the F-multiplier for simultaneous 95% confidence.
    1. Calc > Probability Distributions > Inverse Cumulative Distribution Function
    2. Choose ‘A single value’ and enter 0.95 in the ‘Value’ window.
    3. Select ‘F distribution’, and enter 3 (the number of variables) and 22 (sample size minus the number of variables) for the numerator and denominator degrees of freedom, respectively
    4. Select ‘Display a table of inverse cumulative probabilities’
    5. Select ‘OK’. The F-multiplier 3.049 is displayed in the results area.
  4. Create two new columns in the worksheet labeled ‘losim’ and ‘upsim’
  5. Calc > Calculator
    1. Highlight and select ‘losim’ to move it to the first window.
    2. In the Expression window, enter the formula 0.8438 - 0.1140/sqrt(25) * sqrt(3*24/22*3.049) for the lower confidence interval limit for domradius. Note that the values 0.8438, 0.1140, and 25 are the mean, standard deviation, and sample size obtained above, and that 3*24/22 is p(n - 1)/(n - p).
    3. Select ‘OK’. The lower confidence limit is displayed in the worksheet under ‘losim’.
    4. Repeat sub-steps 1. through 3. above but use the formula 0.8438 + 0.1140/sqrt(25) * sqrt(3*24/22*3.049) for the upper limit, and choose ‘upsim’ for the result location.
  6. Repeat steps 4. and 5. above for domhumerus and domulna by substituting the corresponding means and standard deviations into the expressions.

5.3 - Inferences for Correlations

Let us consider testing the null hypothesis that there is zero correlation between two variables \(X_{j}\) and \(X_{k}\). Mathematically we write this as shown below:

\(H_0\colon \rho_{jk}=0\) against \(H_a\colon \rho_{jk} \ne 0 \)

Recall that the correlation is estimated by sample correlation \(r_{jk}\) given in the expression below:

\(r_{jk} = \dfrac{s_{jk}}{\sqrt{s^2_js^2_k}}\)

Here we have the sample covariance between the two variables divided by the square root of the product of the individual variances.

We shall assume throughout this discussion that the pairs of observations on \(X_{j}\) and \(X_{k}\) are independently sampled from a bivariate normal distribution; that is:

\(\left(\begin{array}{c}X_{1j}\\X_{1k} \end{array}\right)\), \(\left(\begin{array}{c}X_{2j}\\X_{2k} \end{array}\right)\), \(\dots\), \(\left(\begin{array}{c}X_{nj}\\X_{nk} \end{array}\right)\)

are independently sampled from a bivariate normal distribution.

To test the null hypothesis, we form the test statistic, t as below

\(t = r_{jk}\sqrt{\frac{n-2}{1-r^2_{jk}}}\)  \(\dot{\sim}\)  \( t_{n-2}\)

Under the null hypothesis, \(H_{0}\), this test statistic will be approximately distributed as t with n - 2 degrees of freedom.

Note! This approximation holds for larger samples. We will reject the null hypothesis, \(H_{0}\), at level \(\alpha\) if the absolute value of the test statistic, t, is greater than the critical value from the t-table with n - 2 degrees of freedom; that is if:

\(|t| > t_{n-2, \alpha/2}\)

To illustrate these concepts let's return to our example dataset, the Wechsler Adult Intelligence Scale.

Example 5-5: Wechsler Adult Intelligence Scale

This data was analyzed using the SAS program in our last lesson, (Multivariate Normal Distribution), which yielded the computer output below.

Find the Correlation Matrix of the Wechsler Adult Intelligence Scale Data

To find the correlation matrix:

  1. Open the ‘wechsler’ data set in a new worksheet
  2. Stat > Basic Statistics > Correlation
  3. Highlight and select ‘info’, ‘sim’, ‘arith’, and ‘pict’ to move them into the variables window
  4. Select ‘OK’. The matrix of correlations, along with scatterplots, is displayed in the results area

Recall that these are data on n = 37 subjects taking the Wechsler Adult Intelligence Test. This test was broken up into four components:

  • Information
  • Similarities
  • Arithmetic
  • Picture Completion

Looking at the computer output we have summarized the correlations among variables in the table below:

 
| | Information | Similarities | Arithmetic | Picture |
|---|---|---|---|---|
| Information | 1.00000 | 0.77153 | 0.56583 | 0.31816 |
| Similarities | 0.77153 | 1.00000 | 0.51295 | 0.08135 |
| Arithmetic | 0.56583 | 0.51295 | 1.00000 | 0.27988 |
| Picture | 0.31816 | 0.08135 | 0.27988 | 1.00000 |

For example, the correlation between Similarities and Information is 0.77153.

Let's consider testing the null hypothesis that there is no correlation between Information and Similarities. This would be written mathematically as shown below:

\(H_0\colon \rho_{12}=0\)

We can then substitute values into the formula to compute the test statistic using the values from this example:

\begin{align} t &= r_{jk}\sqrt{\frac{n-2}{1-r^2_{jk}}}\\[10pt] &= 0.77153 \sqrt{\frac{37-2}{1-0.77153^2}}\\[10pt] &= 7.175 \end{align}

For an \(\alpha\) level of .005, we need the critical value \(t_{(df, 1 - \alpha/2)} = t_{35, 0.9975}\), which has a right-tail probability of \(\alpha/2 = 0.0025\). Because 35 degrees of freedom does not appear in our t-table, we use the closest df that does not exceed 35, which is 30. The table value under 0.0025 with 30 df is 3.030. The exact value of \(t_{35, 0.9975}\) is slightly smaller than 3.030, so using 3.030 as the critical value is slightly conservative.

Note! Some text tables provide the right tail probability (the graph at the top will have the area in the right tail shaded in) while other texts will provide a table with the cumulative probability - the graph will be shaded into the left. The concept is the same. For example, if the alpha was 0.01 then using the first text you would look under 0.005, and in the second text look under 0.995.

 Because

\(7.175 > 3.030 = t_{35, 0.9975}\),

we can reject the null hypothesis that Information and Similarities scores are uncorrelated at the \(\alpha = 0.005\) level.
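
The test statistic and the decision can be verified with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the critical value 3.030 is the table value quoted in the lesson rather than something the code computes.

```python
import math


def correlation_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)) for testing H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


t = correlation_t_statistic(r=0.77153, n=37)
critical_value = 3.030  # table value quoted in the lesson (30 df, right-tail probability 0.0025)
reject = abs(t) > critical_value
print(f"t = {t:.3f}, reject H0: {reject}")
```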

Our conclusion is that Similarities scores increase with increasing Information scores (t = 7.175; d.f. = 35; p < 0.0001). You will note here that we are not simply concluding that the results are significant. When drawing conclusions it is never adequate to simply state that the results are significant. In all cases, you should seek to describe what the results tell you about the data. In this case, because we rejected the null hypothesis we can conclude that the correlation is not equal to zero. Furthermore, because the sample correlation is greater than zero and our p-value is so small, we can conclude that there is a positive association between the two variables. Hence, our conclusion is that Similarities scores tend to increase with increasing values of Information scores.

You will also note that the conclusion includes information from the test. You should always back up your findings with the appropriate evidence: the test statistic, degrees of freedom (if appropriate), and p-value. Here the appropriate evidence is given by the test statistic t = 7.175; the degrees of freedom for the test, 35, and the p-value, less than 0.0001 as indicated by the computer printout. The p-value appears below each correlation coefficient in the SAS output.

Confidence Interval for \(\rho_{jk}\)

Once we conclude that there is a positive or negative correlation between two variables, the next thing we might want to do is compute a confidence interval for the correlation. This confidence interval will give us a range of reasonable values for the correlation itself. Because the sample correlation is bounded between -1 and 1, it is typically not normally distributed or even approximately so. If the population correlation is near zero, the distribution of sample correlations may be approximately bell-shaped around zero. However, if the population correlation is near +1 or -1, the distribution of sample correlations will be skewed. For example, if \(\rho_{jk} = .9\), the sample correlations will be concentrated near .9. Because they cannot exceed 1, they have more room to spread out to the left of .9, which causes a left-skewed shape. To adjust for this skewness, we apply a transformation to the correlation coefficient. In particular, we apply Fisher's transformation, which is given in Step 1 of our procedure for computing confidence intervals for the correlation coefficient.

Steps

  1. Compute Fisher’s transformation

    \(z_{jk}=\frac{1}{2}\log\dfrac{1+r_{jk}}{1-r_{jk}}\)

    Here we have one-half of the natural log of 1 plus the correlation, divided by one minus the correlation.

    Note! In this course, whenever log is mentioned, unless specified otherwise, log stands for the natural log.

    For large samples, the transformed correlation coefficient \(z_{jk}\) is approximately normally distributed with mean equal to the same transformation of the population correlation, as shown below, and variance equal to 1 over the sample size minus 3.

    \(z_{jk}\) \(\dot{\sim}\) \(N\left(\dfrac{1}{2}\log\dfrac{1+\rho_{jk}}{1-\rho_{jk}}, \dfrac{1}{n-3}\right)\)

  2. Compute a (1 - \(\alpha\)) x 100% confidence interval for the Fisher transform of the population correlation.

    \(\dfrac{1}{2}\log \dfrac{1+\rho_{jk}}{1-\rho_{jk}}\)

    That is one-half log of 1 plus the correlation divided by 1 minus the correlation. In other words, this confidence interval is given by the expression below:

    \(\left(\underset{Z_l}{\underbrace{Z_{jk}-\frac{Z_{\alpha/2}}{\sqrt{n-3}}}}, \underset{Z_U}{\underbrace{Z_{jk}+\frac{Z_{\alpha/2}}{\sqrt{n-3}}}}\right)\)

    Here we take the value of Fisher's transform from Step 1, written \(Z_{jk}\), plus and minus the critical value from the z table divided by the square root of n - 3. We will call the lower bound \(Z_{l}\) and the upper bound \(Z_{U}\).

  3. Back-transform the confidence limits to obtain the desired confidence interval for \(\rho_{jk}\). This is given in the expression below:

    \(\left(\dfrac{e^{2Z_l}-1}{e^{2Z_l}+1},\dfrac{e^{2Z_U}-1}{e^{2Z_U}+1}\right)\)

    The first term is a function of the lower bound, \(Z_{l}\). The second term is a function of the upper bound, \(Z_{U}\).

Let's return to the Wechsler Adult Intelligence Data to see how these procedures are carried out.

Example 5-6: Wechsler Adult Intelligence Data

Recall that the sample correlation between Similarities and Information was \(r_{12} = 0.77153\).

Step 1: Compute the Fisher transform:

\begin{align} Z_{12} &= \frac{1}{2}\log \frac{1+r_{12}}{1-r_{12}}\\[5pt] &= \frac{1}{2}\log\frac{1+0.77153}{1-0.77153}\\[5pt] &= 1.024 \end{align}

You should confirm this value on your own.

Step 2: Next, compute the 95% confidence interval for the Fisher transform, \(\frac{1}{2}\log \frac{1+\rho_{12}}{1-\rho_{12}}\) :

\begin{align} Z_l &=  Z_{12}-Z_{0.025}/\sqrt{n-3} \\ &= 1.024 - \frac{1.96}{\sqrt{37-3}} \\ &= 0.6880 \end{align}

\begin{align} Z_U &=  Z_{12}+Z_{0.025}/\sqrt{n-3} \\&= 1.024 + \frac{1.96}{\sqrt{37-3}} \\&= 1.3602 \end{align}

In other words, we take the critical value from the normal table at \(\alpha/2 = 0.025\), which is 1.96, and divide it by the square root of n minus 3. Subtracting the result from 1.024 yields the lower bound of 0.6880. Adding the result to 1.024 yields the upper bound of 1.3602.

Step 3: Carry out the back-transform to obtain the 95% confidence interval for \(\rho_{12}\). This is shown in the expression below:

\(\left(\dfrac{\exp\{2Z_l\}-1}{\exp\{2Z_l\}+1},\dfrac{\exp\{2Z_U\}-1}{\exp\{2Z_U\}+1}\right)\) 

\(\left(\dfrac{\exp\{2 \times 0.6880\}-1}{\exp\{2 \times 0.6880\}+1},\dfrac{\exp\{2\times 1.3602\}-1}{\exp\{2\times 1.3602\}+1}\right)\)

\((0.5967,0.8764)\)

This yields the interval from 0.5967 to 0.8764.

Conclusion: In this case, we can conclude that we are 95% confident that the interval (0.5967, 0.8764) contains the correlation between Information and Similarities scores.
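
The three steps can be collected into one small Python function, shown here as a sketch rather than a definitive implementation. The function name is hypothetical. It relies on the fact that Fisher's transform \(\frac{1}{2}\log\frac{1+r}{1-r}\) is the inverse hyperbolic tangent, so the back-transform in Step 3 is the hyperbolic tangent. The normal critical value is computed with the standard library instead of being read from a table, so the result may differ from the hand calculation in the last decimal place.

```python
import math
from statistics import NormalDist


def fisher_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a population correlation via Fisher's transform."""
    z = math.atanh(r)  # Step 1: 0.5 * log((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_crit / math.sqrt(n - 3)
    z_lower, z_upper = z - half_width, z + half_width  # Step 2
    # Step 3: tanh(z) = (e^{2z} - 1) / (e^{2z} + 1) undoes the transform.
    return math.tanh(z_lower), math.tanh(z_upper)


lower, upper = fisher_confidence_interval(r=0.77153, n=37)
print(f"({lower:.4f}, {upper:.4f})")
```

Applied to the Information and Similarities correlation, the function should return limits that agree with the interval above to about three decimal places.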

Note! Pay attention to the interpretation of this interval. We did not say that, with 95% probability, the correlation between Information and Similarities lies in the interval. That statement would imply that the population correlation is random. In fact, the population correlation is a fixed quantity. The only random quantities are the bounds of the confidence interval, which are functions of the random data, and they are random only before the data are observed. After the data are observed and the interval is computed, there is no longer any random quantity. Therefore, we use the word "confidence", rather than "probability", to describe the parameter falling inside the interval.

5.4 - Summary

In this lesson we learned about:

  • The statistical properties of the sample mean vector, including its variance-covariance matrix and its distribution;
  • The multivariate central limit theorem, which states that the sample mean vector will be approximately normally distributed for large samples;
  • Construction of confidence intervals one at a time and simultaneously;
  • How to test the hypothesis that the population correlation between two variables is equal to zero, and how to draw conclusions regarding the results of that hypothesis test;
  • How to compute confidence intervals for the correlation, and what conclusions can be drawn from such intervals.
