Lesson 5: Sample Mean Vector and Sample Correlation and Related Inference Problems
Overview
In this lesson, we consider the properties of the sample mean vector and the sample correlations which we had defined earlier. We will also consider estimation and hypothesis testing problems on the population mean and correlation coefficients.
This lesson will explore the following questions...
Sample Mean Vectors
 What is the distribution of \(\bar{x}\) when data come from a multivariate normal distribution?
 What are the properties of \(\bar{x}\) when data do NOT come from a multivariate normal distribution but the sample size n is large?
 How to construct a confidence interval for a single multivariate normal population mean
 How to construct confidence intervals for several multivariate normal population means simultaneously
Sample Correlations
 How can we test the null hypothesis that there is zero correlation between two variables?
 What can we conclude from such hypothesis tests?
 How can we assess uncertainty regarding estimated correlation coefficients using confidence intervals?
 What is the appropriate interpretation for confidence intervals?
Objectives
 Describe the distribution of \(\bar{x}\) when data come from a Multivariate Normal distribution
 Construct simultaneous confidence intervals for several multivariate normal population means
 Conduct hypothesis testing on the correlation between two variables
 Assess confidence intervals for correlation between two variables
5.1  Distribution of Sample Mean Vector
As noted previously, \(\bar{\textbf{x}}\) is a function of random data, and hence \(\bar{\textbf{x}}\) is also a random vector with a mean, a variance-covariance matrix, and a distribution. We have already seen that the mean of the sample mean vector is equal to the population mean vector \(\boldsymbol{\mu}\).
Variance
Before considering the variance-covariance matrix of the sample mean vector \(\bar{\textbf{x}}\), let us revisit the univariate setting.
Univariate Setting
You should recall from introductory statistics that the population variance of the sample mean, generated from independent samples of size n, is equal to the population variance, \(\sigma^{2}\) divided by n.
\(\text{var}(\bar{x}) = \dfrac{\sigma^2}{n}\)
This, of course, is a function of the unknown population variance \(\sigma^{2}\). We can estimate it by simply substituting the sample variance \(s^{2}\) for \(\sigma^{2}\), yielding our estimate of the variance of the sample mean as shown below:
\(\widehat{\text{var}}(\bar{x}) = \dfrac{s^2}{n}\)
If we were to take the square root of this quantity we would obtain the standard error of the mean. The standard error of the mean is a measure of the uncertainty of our estimate of the population mean. If the standard error is large, then we are less confident of our estimate of the mean. Conversely, if the standard error is small, then we are more confident in our estimate. What is meant by large or small depends on the application at hand. But in any case, because the standard error is a decreasing function of sample size, the larger our sample the more confident we can be of our estimate of the population mean.
Multivariate Setting
In the multivariate setting, the population variance-covariance matrix \(\Sigma\) takes the place of the variance \(\sigma^{2}\). For sample mean vectors generated from independent samples of size n, the result takes a form similar to the univariate setting, as shown below:
\(\text{var}(\bar{\textbf{x}}) = \dfrac{1}{n}\Sigma\)
Again, this is a function of the unknown population variance-covariance matrix \(\Sigma\). An estimate of the variance-covariance matrix of \(\bar{\textbf{x}}\) can be obtained by substituting the sample variance-covariance matrix \(\textbf{S}\) for the population variance-covariance matrix \(\Sigma\), yielding the estimate as shown below:
\(\widehat{\text{var}}(\bar{\textbf{x}}) = \dfrac{1}{n}\textbf{S}\)
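As a minimal sketch of this estimate, the Python snippet below computes the sample mean vector, the sample variance-covariance matrix \(\textbf{S}\) (with divisor \(n - 1\)), and \(\textbf{S}/n\). The data values and the function names are made up for illustration; they do not come from the lesson's data sets.

```python
def mean_vector(data):
    """Return the sample mean of each column of a list of rows."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]


def sample_covariance(data):
    """Return the p x p sample variance-covariance matrix S (divisor n - 1)."""
    n = len(data)
    p = len(data[0])
    xbar = mean_vector(data)
    return [
        [
            sum((row[j] - xbar[j]) * (row[k] - xbar[k]) for row in data) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


def estimated_cov_of_mean(data):
    """Return S / n, the estimated variance-covariance matrix of the mean vector."""
    n = len(data)
    return [[s_jk / n for s_jk in row] for row in sample_covariance(data)]


# Hypothetical data: n = 3 observations on p = 2 variables.
toy_data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
print(mean_vector(toy_data))
print(sample_covariance(toy_data))
print(estimated_cov_of_mean(toy_data))
```

The square roots of the diagonal elements of the last matrix are the standard errors of the individual sample means.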
Distribution
Let's consider the distribution of the sample mean vector, first looking at the univariate setting and comparing this to the multivariate setting.
Univariate Setting
Here we are going to make the additional assumption that \(X_1, X_2, \dots, X_n\) are independently sampled from a normal distribution with mean \(\mu\) and variance \(\sigma^{2}\). In this case, \(\bar{x}\) is normally distributed as
\(\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)
Multivariate Setting
Similarly, for the multivariate setting, we are going to assume that the data vectors \(\textbf{X}_1, \textbf{X}_2, \dots, \textbf{X}_n\) are independently sampled from a multivariate normal distribution with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\Sigma\). Then, in this case, the sample mean vector, \(\bar{\textbf{x}}\), is distributed as multivariate normal with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\frac{1}{n}\Sigma\), the variance-covariance matrix for \(\bar{\textbf{x}}\). In statistical notation we write:
\(\bar{\textbf{x}} \sim N \left( \boldsymbol {\mu}, \frac{1}{n}\Sigma \right)\)
Law of Large Numbers
At this point, we will drop the assumption that the individual observations are sampled from a normal distribution and look at the laws of large numbers. These will hold regardless of the distribution of the individual observations.
Univariate Setting
In the univariate setting, we see that if the data are independently sampled, then the sample mean, \(\bar{x}\), is going to converge (in probability) to the population mean \(\mu\). What does this mean exactly? It means that as the sample size gets larger and larger, the sample mean will tend to approach the true value of the population mean \(\mu\).
Multivariate Setting
A similar result holds in the multivariate setting: the sample mean vector, \(\bar{\textbf{x}}\), will also converge (in probability) to the mean vector \(\boldsymbol{\mu}\). That is, as our sample size gets larger and larger, each of the individual components of that vector, \(\bar{x}_{j}\), will converge to the corresponding mean, \(\mu_{j}\).
\(\bar{x}_j \stackrel{p}\rightarrow \mu_j\)
Central Limit Theorem
Just as in the univariate setting we also have a multivariate Central Limit Theorem. But first, let's review the univariate Central Limit Theorem.
Univariate Setting
If all of our individual observations, \(X_1, X_2, \dots, X_n\), are independently sampled from a population with mean \(\mu\) and variance \(\sigma^{2}\), then, for large n, the sample mean, \(\bar{x}\), is approximately normally distributed with mean \(\mu\) and variance \(\sigma^{2}/n\).
Multivariate Setting
A similar result is available in the multivariate setting. If our data vectors \(\textbf{X}_1, \textbf{X}_2, \dots, \textbf{X}_n\) are independently sampled from a population with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\Sigma\), then, for large n, the sample mean vector, \(\bar{\textbf{x}}\), is going to be approximately normally distributed with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\frac{1}{n}\Sigma\).
This Central Limit Theorem is a key result that we will take advantage of later on in this course when we talk about hypothesis tests for individual mean vectors or collections of mean vectors under different treatment regimens.
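The results of this section can be checked by simulation. The sketch below is only an illustration under stated assumptions: it draws bivariate data from a deliberately non-normal population (an exponential variable, and that variable plus uniform noise, both chosen arbitrarily here), and compares the average and the covariance matrix of many simulated mean vectors with \(\boldsymbol{\mu}\) and \(\frac{1}{n}\Sigma\). It checks only the mean and covariance; it does not examine the shape of the distribution.

```python
import random


def simulate_mean_vectors(n, reps, seed=0):
    """Draw `reps` samples of size n and return the sample mean vector of each."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sum1 = sum2 = 0.0
        for _ in range(n):
            x1 = rng.expovariate(1.0)  # skewed: mean 1, variance 1
            x2 = x1 + rng.random()     # mean 1.5, variance 1 + 1/12, cov(x1, x2) = 1
            sum1 += x1
            sum2 += x2
        means.append((sum1 / n, sum2 / n))
    return means


def center_and_covariance(pairs):
    """Return the average pair and the 2 x 2 sample covariance matrix of the pairs."""
    m = len(pairs)
    c1 = sum(a for a, _ in pairs) / m
    c2 = sum(b for _, b in pairs) / m
    s11 = sum((a - c1) ** 2 for a, _ in pairs) / (m - 1)
    s22 = sum((b - c2) ** 2 for _, b in pairs) / (m - 1)
    s12 = sum((a - c1) * (b - c2) for a, b in pairs) / (m - 1)
    return (c1, c2), [[s11, s12], [s12, s22]]


n = 50
center, cov = center_and_covariance(simulate_mean_vectors(n, reps=4000))
print("average of mean vectors:", center)    # compare with mu = (1, 1.5)
print("covariance of mean vectors:", cov)    # compare with Sigma / n
print("Sigma / n:", [[1 / n, 1 / n], [1 / n, (1 + 1 / 12) / n]])
```

Increasing `n` shrinks the simulated covariance matrix in proportion to \(1/n\), which is the behavior the formula \(\frac{1}{n}\Sigma\) describes.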
5.2  Interval Estimate of Population Mean
Here we consider the joint estimation of a multivariate set of population means. That is, we have observed a set of p X-variables and may wish to estimate the population mean for each variable. In some instances, we may also want to estimate one or more linear combinations of population means. Our basic tool for estimating the unknown value of a population parameter is a confidence interval, an interval of values that is likely to include the unknown value of the parameter.
 General Format for a Confidence Interval

The general format of a confidence interval estimate of a population mean is:

\(\text{Sample mean} \pm \text{Multiplier × Standard error of mean}\)

For variable \(X_{j}\), a confidence interval estimate of its population mean \(\mu_{j}\) is

\(\bar{x}_j \pm \text{Multiplier}\dfrac{s_j}{\sqrt{n}}\)
In this formula, \(\bar{x}_{j}\) is the sample mean, \(s_{j}\) is the sample standard deviation and n is the sample size. The multiplier value is a function of the confidence level, the sample size, and the strategy used for dealing with the multiple inference issue.
Strategies for Determining the Multiplier
The following list covers some common strategies:
 One-at-a-Time Confidence Intervals: This strategy essentially considers each mean separately and uses the desired confidence level (usually 95%) for every single interval.
 Bonferroni Method: With this method, we set a family-wide error rate and then divide this family error rate by the number of intervals to be computed to determine the error rate (and hence confidence level) for each individual interval.
 Simultaneous Confidence Region: This strategy uses properties of the multivariate normal distribution to define joint confidence intervals. The multiplier for this method is conservative because the family error rate applies to the family of all possible linear combinations of population means.
One at a Time Intervals
For a \(1 - \alpha\) confidence interval, the “one at a time” multiplier is the t-value such that the probability is \(1 - \alpha\) between \(-t\) and \(+t\) under a t-distribution with \(n - 1\) degrees of freedom. Said another way, the value of t is such that the probability greater than \(+t\) is \(\alpha/2\).
Notationally, the one-at-a-time multiplier is:
\(\text{Multiplier} = t_{n-1}(\alpha/2)\)
With this notation, a confidence interval for \(\mu_{j}\) is computed as:
\(\bar{x}_j \pm t_{n-1}(\alpha/2)\frac{s_j}{\sqrt{n}}\)
Example 5-1: One at a Time Intervals
Suppose that the sample size is n = 25 and we want a 95% confidence interval for the population mean. Thus \(\alpha = 0.05\). Our textbook would write the multiplier as \(t_{24}(.025)\). In Excel, the command =TINV(.05,24) will give the multiplier (value = 2.064). In SAS, a command such as t1=tinv(.975,24) will make the variable t1 that contains the desired multiplier.
Bonferroni Method Multiplier
When we determine confidence intervals for the population means of several variables, we are creating a family of confidence intervals. The family-wide error rate is the probability that at least one of the confidence intervals in the family will not capture the population mean. The family-wide confidence level = 1 − family-wide error rate.
Suppose that we have a family of p confidence intervals and the error rates for the individual intervals are \(\alpha_1, \alpha_2, \dots, \alpha_p\). The Bonferroni Inequality states that the family-wide error rate is less than or equal to the sum of \(\alpha_1, \alpha_2, \dots, \alpha_p\). That is, family-wide error rate \(\leq \sum \alpha_i\). In terms of the family-wide confidence that all intervals capture their population means, we can write this as \(1 - \sum \alpha_i \leq\) family-wide confidence level.
Most often, we divide the desired family-wide error rate equally across the intervals that we will compute. If we are computing p confidence intervals with a desired family-wide error rate of \(\alpha\), we use an error rate of \(\alpha / p\) (so confidence \(= 1 - (\alpha / p)\)) for each individual interval. This guarantees that the family-wide confidence level will be greater than or equal to \(1 - \alpha\).
Suppose that we are calculating p intervals with a family error rate equal to \(\alpha\).
Notationally, the Bonferroni method multiplier is:
\(\text{Multiplier} = t_{n-1}(\alpha/(2p))\)
A confidence interval for \(\mu_{j}\) is computed as:
\(\bar{x}_j \pm t_{n-1}(\alpha/(2p))\frac{s_j}{\sqrt{n}}\)
Example 5-2: Bonferroni Method Multiplier
Suppose that n = 25. The family-wide error = 5% for a family confidence = 95%. We are computing intervals for p = 5 means. The error rate for each interval will be .05/5 = 1%. We might use the Excel command =TINV(.01,24) to find that the multiplier = 2.797. In SAS, we use the cumulative probability \(= 1 - \alpha/(2p)\), so the command for finding the t-multiplier in this instance is something like t1=tinv(.995, 24).
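The one-at-a-time and Bonferroni multipliers can also be reproduced without Excel or SAS. The sketch below uses only the Python standard library. Since that library has no t-distribution quantile function, the code integrates the t density numerically (Simpson's rule) and solves for the quantile by bisection; this numerical approach and all function names are choices made for this illustration, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_upper_quantile(tail_prob, df):
    """Value t with P(T > t) = tail_prob, found by bisection.

    Assumes the answer lies between 0 and 100, which holds for the
    moderate degrees of freedom and tail probabilities used here.
    """
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def one_at_a_time_multiplier(alpha, n):
    return t_upper_quantile(alpha / 2, n - 1)


def bonferroni_multiplier(alpha, n, p):
    return t_upper_quantile(alpha / (2 * p), n - 1)


# n = 25 with a 5% error rate; p = 5 intervals for the Bonferroni case.
print(round(one_at_a_time_multiplier(0.05, 25), 3))
print(round(bonferroni_multiplier(0.05, 25, 5), 3))
```

The two printed values can be compared with the multipliers 2.064 and 2.797 obtained from Excel and SAS in the examples.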
Simultaneous Confidence Region Multiplier
This method is derived from the properties of the multivariate normal distribution. The multiplier applies to the family of all possible linear combinations of the population means considered, including the individual means. It is conservative (meaning that the multiplier tends to be larger than absolutely necessary). When family confidence is used, compare the value of this multiplier to the Bonferroni method multiplier and use the smaller of the two.
Notationally, the simultaneous confidence region multiplier is:
\(\text{Multiplier}=\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\)
\(F_{p, n-p}(\alpha)\) represents a value of F such that the probability greater than this value is \(\alpha\) under an F-distribution with p and \(n - p\) degrees of freedom.
Example 5-3: Simultaneous Confidence Region Multiplier
Suppose that we have a sample size of n = 25 and we have p = 3 variables. With a 5% family error rate (and 95% family confidence), the F-value can be found in Excel using =FINV(.05, 3, 22) = 3.049. SAS uses cumulative probabilities so in this case, a command like f1=FINV(.95, 3, 22) would make f1 the F-value. The multiplier in this example is
\(\sqrt{\frac{3(25-1)}{25-3} \times 3.049}=3.159\)
This multiplier could be used for all confidence intervals for parameters that are linear combinations of the three population means (and for the three individual means).
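The simultaneous multiplier can be checked in the same spirit. The following standard-library sketch obtains the upper F quantile by integrating the F density numerically and bisecting; the substitution \(x = u^2\) is used only to keep the integrand smooth near zero. The function names and the numerical approach are assumptions of this illustration, not part of the lesson's SAS or Excel workflow.

```python
import math


def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) for an F-distribution with (d1, d2) degrees of freedom.

    The integral of the density over [0, x] is rewritten with x = u**2 and
    evaluated by Simpson's rule on [0, sqrt(x)].
    """
    c = math.exp(
        math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
        + (d1 / 2) * math.log(d1 / d2)
    )

    def integrand(u):
        return 2.0 * c * u ** (d1 - 1) * (1.0 + d1 * u * u / d2) ** (-(d1 + d2) / 2)

    upper = math.sqrt(x)
    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3


def f_upper_quantile(alpha, d1, d2):
    """Value f with P(F > f) = alpha, by bisection on [0, 1000] (assumed bracket)."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1.0 - f_cdf(mid, d1, d2) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def simultaneous_multiplier(alpha, n, p):
    """sqrt( p(n-1)/(n-p) * F_{p, n-p}(alpha) )."""
    f_value = f_upper_quantile(alpha, p, n - p)
    return math.sqrt(p * (n - 1) / (n - p) * f_value)


# n = 25 observations, p = 3 variables, 5% family error rate.
print(round(f_upper_quantile(0.05, 3, 22), 3))
print(round(simultaneous_multiplier(0.05, 25, 3), 3))
```

The printed values can be compared with the F-value 3.049 and the multiplier 3.159 given in the example.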
Summary of Multipliers
The following table summarizes the three different multipliers and gives notes about using Excel and SAS.
Method  Textbook notation for multiplier  Excel notes  SAS notes

One at a time: Confidence = \(1 - \alpha\) for each interval  \(t_{n-1}(\alpha/2)\)  To determine the t-value, enter the equation TINV(\(\alpha\), df)  To determine the t-value, create t1 = tinv(\(1 - \alpha/2\), \(n-1\))
Bonferroni Method: Confidence = \(1 - \alpha\) for whole family  \(t_{n-1}(\alpha/(2p))\)  To determine the t-value, enter the equation TINV(\(\alpha/p\), df)  To determine the t-value, create t2 = tinv(\(1 - \alpha/(2p)\), \(n-1\))
Multivariate Simultaneous Intervals  \(\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\)  To determine the F-value, enter the equation FINV(\(\alpha\), num df, denom df)  To determine the F-value, create F = finv(\(1 - \alpha\), \(p\), \(n-p\))
Example 5-4
This example uses the dataset that includes mineral content measurements at three different arm bone locations for n = 25 women. We’ll determine confidence intervals for the three different population means. Sample means and standard deviations for the three variables are:
Dataset: mineral.csv
Variable  N  Mean  Std Dev 

domradius  25  0.84380  0.11402 
domhumerus  25  1.79268  0.28347 
domulna  25  0.70440  0.10756 
Solutions using each of the three methods follow.
One-at-a-time intervals: We’ll use a .95 confidence level for each interval. With n = 25, df = 24 and \(t_{24}(.025) = 2.064\). This can be found in Excel as =TINV(.05,24).
The confidence intervals have the form \(\bar{x}_j \pm 2.064\dfrac{s_j}{\sqrt{n}}\). Intervals are the following.
 For dominant radius:
\(0.84380 \pm 2.064 \dfrac{0.11402}{\sqrt{25}}\) which is 0.797 to 0.891
 For dominant humerus:
\(1.79268 \pm 2.064 \dfrac{0.28347}{\sqrt{25}}\) which is 1.676 to 1.910
 For dominant ulna:
\(0.70440 \pm 2.064\dfrac{0.10756}{\sqrt{25}}\) which is 0.660 to 0.749
Bonferroni intervals: We’ll use a .95 family-wide confidence level, so the family error = .05. For each interval, the error rate = .05/3 = 0.0167. The multiplier is \(t_{24}(.008333) = 2.574\), which can be found in Excel as =TINV(.05/3,24).
The confidence intervals have the form \(\bar{x}_j \pm 2.574\dfrac{s_j}{\sqrt{n}}\). Intervals are the following.
 For dominant radius:
\(0.84380 \pm 2.574 \dfrac{0.11402}{\sqrt{25}}\) which is 0.785 to 0.902
 For dominant humerus:
\(1.79268 \pm 2.574 \dfrac{0.28347}{\sqrt{25}}\) which is 1.647 to 1.939
 For dominant ulna:
\(0.70440 \pm 2.574 \dfrac{0.10756}{\sqrt{25}}\) which is 0.649 to 0.760
Simultaneous intervals: The necessary multiplier is \(\sqrt{\dfrac{3(25-1)}{25-3} \times 3.049} = 3.159\). (See Example 5-3 above for details.)
The confidence intervals have the form \(\bar{x}_j \pm 3.159 \dfrac{s_j}{\sqrt{n}}\). Intervals are the following.
 For dominant radius:
\(0.84380 \pm 3.159 \dfrac{0.11402}{\sqrt{25}}\) which is 0.772 to 0.916
 For dominant humerus:
\(1.79268 \pm 3.159 \dfrac{0.28347}{\sqrt{25}}\) which is 1.614 to 1.972
 For dominant ulna:
\(0.70440 \pm 3.159 \dfrac{0.10756}{\sqrt{25}}\) which is 0.636 to 0.772
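The arithmetic for all nine intervals can be reproduced with a few lines of Python. This sketch takes the sample means and standard deviations from the summary table and the three multipliers (2.064, 2.574, 3.159) exactly as stated in the text; the variable and function names are illustrative only.

```python
import math

n = 25
# Sample means and standard deviations from the summary table.
summary = {
    "domradius": (0.84380, 0.11402),
    "domhumerus": (1.79268, 0.28347),
    "domulna": (0.70440, 0.10756),
}
# Multipliers as given in the text for each method.
multipliers = {"one-at-a-time": 2.064, "Bonferroni": 2.574, "simultaneous": 3.159}


def interval(mean, sd, n, multiplier):
    """Return (lower, upper) = mean -/+ multiplier * sd / sqrt(n)."""
    half_width = multiplier * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


intervals = {
    (method, variable): interval(mean, sd, n, mult)
    for method, mult in multipliers.items()
    for variable, (mean, sd) in summary.items()
}
for (method, variable), (lower, upper) in intervals.items():
    print(f"{method:14s} {variable:11s} {lower:.3f} to {upper:.3f}")
```

For every variable the intervals widen from one-at-a-time to Bonferroni to simultaneous, reflecting the increasing multipliers.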
Steve Rathbun, formerly of Penn State, wrote the following SAS code (download below) to generate confidence intervals for population means using the three methods discussed here. The code reads a dataset, reshapes it to have a data line for each variable value, determines means and standard deviations, and then calculates and prints the three types of intervals. To use this code for different situations, you need only to change the third line where the value of p is set and the data step where the data set is read and reshaped.
 Dataset: mineral.csv
 Download the SAS program here: CI_pop_means.sas
The output for the program just given is below. It includes the sample mean and variance for each variable and the three confidence intervals. Limits for the one-at-a-time intervals are given as loone and upone. Limits for the Bonferroni method are given as lobon and upbon. Limits for the simultaneous confidence region method are given as losim and upsim.
Obs  variable  _TYPE_  _FREQ_  n  xbar  s2  t1  tb  f  loone  upone  lobon  upbon  losim  upsim 

1  domhumeru  0  25  25  1.79268  0.080357  2.06390  2.57364  3.04912  1.67567  1.90969  1.64677  1.93859  1.61358  1.97178 
2  domradius  0  25  25  0.84380  0.013002  2.06390  2.57364  3.04912  0.79673  0.89087  0.78511  0.90249  0.77176  0.91584 
3  dumulna  0  25  25  0.70440  0.011568  2.06390  2.57364  3.04912  0.66000  0.74880  0.64904  0.75976  0.63645  0.77235 
Walkthrough of the Three Methods using Minitab
To calculate 95% one-at-a-time confidence intervals:
 Open the ‘mineral’ data set in a new worksheet.
 Stat > Basic Statistics > 1-Sample t
 Choose ‘One or more sample’ in the first window.
 Highlight and select ‘domradius’ and any other variables of interest to move them into the window on the right.
 Under ‘Options’, choose 95.0 for the confidence level and select Mean not equal to hypothesized mean.
 Select ‘OK’ twice. The one-at-a-time intervals are displayed in the results area.
To calculate simultaneous 95% confidence intervals via the Bonferroni method:
 Open the ‘mineral’ data set in a new worksheet.
 Stat > Basic Statistics > 1-Sample t
 Choose ‘One or more sample’ in the first window.
 Highlight and select ‘domradius’, ‘domhumerus’, and ‘domulna’ to move them into the window on the right.
 Under ‘Options’, enter 98.33 for the confidence level (that is, 0.9833 expressed as a percentage), which corresponds to 1 − 0.05/3, the adjusted individual confidence level for simultaneous 95% confidence with the Bonferroni method.
 Select Mean not equal to hypothesized mean.
 Select ‘OK’ twice. The 95% Bonferroni intervals are displayed in the results area.
To calculate simultaneous 95% confidence intervals with the F-multipliers (based on the T-squared distribution):
 Open the ‘mineral’ data set in a new worksheet.
 Find the mean, standard deviation, and sample size needed for the calculations.
 Stat > Basic Statistics > Display Basic Statistics
 Highlight and select ‘domradius’, ‘domhumerus’, and ‘domulna’ to move them into the window on the right.
 Under ‘Statistics’, choose the mean, standard deviation, and ‘N nonmissing’.
 Select ‘OK’ twice. The statistics are displayed in the results area.
 Find the F-value needed for the simultaneous 95% confidence multiplier.
 Calc > Probability Distributions > Inverse Cumulative Distribution Function
 Choose ‘A single value’ and enter 0.95 in the ‘Value’ window.
 Select ‘F distribution’, and enter 3 (the number of variables) and 22 (sample size minus the number of variables) for the numerator and denominator degrees of freedom, respectively
 Select ‘Display a table of inverse cumulative probabilities’
 Select ‘OK’. The F-value 3.049 is displayed in the results area.
 Create two new columns in the worksheet labeled ‘losim’ and ‘upsim’
 Calc > Calculator
 Highlight and select ‘losim’ to move it to the first window.
 In the Expression window, enter the formula 0.8438 - 0.1140/sqrt(25) * sqrt(3*24/22*3.049) for the lower confidence interval limit for domradius. Note that the values 0.8438, 0.1140, and 25 are the mean, standard deviation, and sample size obtained above, and that sqrt(3*24/22*3.049) is the simultaneous multiplier with p = 3 and n = 25.
 Select ‘OK’. The lower confidence limit is displayed in the worksheet under ‘losim’.
 Repeat substeps 3. and 4. above but use the formula 0.8438 + 0.1140/sqrt(25) * sqrt(3*24/22*3.049) for the upper limit, and choose ‘upsim’ for the result location.
 Repeat steps 4. and 5. above for domhumerus and domulna by substituting the corresponding means and standard deviations into the expressions.
5.3  Inferences for Correlations
Let us consider testing the null hypothesis that there is zero correlation between two variables \(X_{j}\) and \(X_{k}\). Mathematically we write this as shown below:
\(H_0\colon \rho_{jk}=0\) against \(H_a\colon \rho_{jk} \ne 0 \)
Recall that the correlation is estimated by sample correlation \(r_{jk}\) given in the expression below:
\(r_{jk} = \dfrac{s_{jk}}{\sqrt{s^2_js^2_k}}\)
Here we have the sample covariance between the two variables divided by the square root of the product of the individual variances.
We shall assume that the pair of variables \(X_{j}\) and \(X_{k}\) are independently sampled from a bivariate normal distribution throughout this discussion; that is:
\(\left(\begin{array}{c}X_{1j}\\X_{1k} \end{array}\right)\), \(\left(\begin{array}{c}X_{2j}\\X_{2k} \end{array}\right)\), \(\dots\), \(\left(\begin{array}{c}X_{nj}\\X_{nk} \end{array}\right)\)
are independently sampled from a bivariate normal distribution.
To test the null hypothesis, we form the test statistic, t, as below:
\(t = r_{jk}\sqrt{\frac{n-2}{1-r^2_{jk}}}\) \(\dot{\sim}\) \(t_{n-2}\)
Under the null hypothesis, \(H_{0}\), this test statistic will be approximately distributed as t with \(n - 2\) degrees of freedom.
Note! This approximation holds for larger samples. We will reject the null hypothesis, \(H_{0}\), at level \(\alpha\) if the absolute value of the test statistic, t, is greater than the critical value from the t-table with \(n - 2\) degrees of freedom; that is if:
\(|t| > t_{n-2, \alpha/2}\)
To illustrate these concepts let's return to our example dataset, the Wechsler Adult Intelligence Scale.
Example 5-5: Wechsler Adult Intelligence Scale
This data was analyzed using the SAS program in our last lesson, (Multivariate Normal Distribution), which yielded the computer output below.
Download the:

Dataset: wechsler.csv

SAS program: wechsler.sas

SAS Output: wechsler.lst
Find the Correlation Matrix of the Wechsler Adult Intelligence Scale Data using Minitab
To find the correlation matrix:
 Open the ‘wechsler’ data set in a new worksheet
 Stat > Basic Statistics > Correlation
 Highlight and select ‘info’, ‘sim’, ‘arith’, and ‘pict’ to move them into the variables window
 Select ‘OK’. The matrix of correlations, along with scatterplots, is displayed in the results area
Recall that these are data on n = 37 subjects taking the Wechsler Adult Intelligence Test. This test was broken up into four components:
 Information
 Similarities
 Arithmetic
 Picture Completion
Looking at the computer output we have summarized the correlations among variables in the table below:
Variable  Information  Similarities  Arithmetic  Picture 

Information  1.00000  0.77153  0.56583  0.31816 
Similarities  0.77153  1.00000  0.51295  0.08135 
Arithmetic  0.56583  0.51295  1.00000  0.27988 
Picture  0.31816  0.08135  0.27988  1.00000 
For example, the correlation between Similarities and Information is 0.77153.
Let's consider testing the null hypothesis that there is no correlation between Information and Similarities. This would be written mathematically as shown below:
\(H_0\colon \rho_{12}=0\)
We can then substitute values into the formula to compute the test statistic using the values from this example:
\begin{align} t &= r_{jk}\sqrt{\frac{n-2}{1-r^2_{jk}}}\\[10pt] &= 0.77153 \sqrt{\frac{37-2}{1-0.77153^2}}\\[10pt] &= 7.175 \end{align}
Looking at our t-table for 35 degrees of freedom and an \(\alpha\) level of .005, we need the critical value \(t_{(df, 1 - \alpha/2)} = t_{35, 0.9975}\). Therefore, we look at the critical value under 0.0025 in the table. Since 35 does not appear in the table, we use the closest df that does not exceed 35, which is 30. The tabled value in this case is 3.030, meaning that \(t_{35, 0.9975}\) is close to 3.030.
Note! Some text tables provide the right tail probability (the graph at the top will have the area in the right tail shaded in) while other texts will provide a table with the cumulative probability, in which case the graph will be shaded to the left. The concept is the same. For example, if alpha were 0.01, then using the first kind of table you would look under 0.005, and using the second kind you would look under 0.995.
Because
\(7.175 > 3.030 = t_{35, 0.9975}\),
we can reject the null hypothesis that Information and Similarities scores are uncorrelated at the \(\alpha = 0.005\) level.
Our conclusion is that Similarities scores increase with increasing Information scores (t = 7.175; d.f. = 35; p < 0.0001). You will note here that we are not simply concluding that the results are significant. When drawing conclusions it is never adequate to simply state that the results are significant. In all cases, you should seek to describe what the results tell you about the data. In this case, because we rejected the null hypothesis we can conclude that the correlation is not equal to zero. Furthermore, because the actual sample correlation is greater than zero and our p-value is so small, we can conclude that there is a positive association between the two variables. Hence, our conclusion is that Similarities scores tend to increase with increasing values of Information scores.
You will also note that the conclusion includes information from the test. You should always back up your findings with the appropriate evidence: the test statistic, degrees of freedom (if appropriate), and p-value. Here the appropriate evidence is given by the test statistic, t = 7.175; the degrees of freedom for the test, 35; and the p-value, less than 0.0001 as indicated by the computer printout. The p-value appears below each correlation coefficient in the SAS output.
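The test statistic in this example is easy to verify numerically. The short Python sketch below evaluates the formula for the Information and Similarities correlation and compares it with the tabled critical value 3.030 used in the text; the function name is illustrative, and the p-value itself is not computed here.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, given sample correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


r_info_sim = 0.77153    # Information vs. Similarities, from the correlation table
n_subjects = 37
critical_value = 3.030  # tabled value used in the text for alpha = 0.005

t_stat = correlation_t_statistic(r_info_sim, n_subjects)
print(f"t = {t_stat:.3f}, reject H0: {abs(t_stat) > critical_value}")
```

The same function can be applied to any other entry of the correlation table by changing `r_info_sim`.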
Confidence Interval for \(\rho_{jk}\)
Once we conclude that there is a positive or negative correlation between two variables, the next thing we might want to do is compute a confidence interval for the correlation. This confidence interval will give us a range of reasonable values for the correlation itself. The sample correlation, because it is bounded between -1 and 1, is typically not normally distributed or even approximately so. If the population correlation is near zero, the distribution of sample correlations may be approximately bell-shaped around zero. However, if the population correlation is near +1 or -1, the distribution of sample correlations will be skewed. For example, if \(\rho_{jk}= .9\), the distribution of sample correlations will be concentrated near .9. Because they cannot exceed 1, they have more room to spread out to the left of .9, which causes a left-skewed shape. To adjust for this asymmetry, we apply a transformation to the correlation coefficients. In particular, we are going to apply Fisher's transformation, which is given in Step 1 of our procedure for computing confidence intervals for the correlation coefficient.
Steps
 Compute Fisher’s transformation
\(z_{jk}=\frac{1}{2}\log\dfrac{1+r_{jk}}{1-r_{jk}}\)
Here we have one-half of the natural log of 1 plus the correlation, divided by 1 minus the correlation.
Note! In this course, whenever log is mentioned, unless specified otherwise, log stands for the natural log.
For large samples, this transformed correlation coefficient z is going to be approximately normally distributed with the mean equal to the same transformation of the population correlation, as shown below, and a variance of 1 over the sample size minus 3.
\(z_{jk}\) \(\dot{\sim}\) \(N\left(\dfrac{1}{2}\log\dfrac{1+\rho_{jk}}{1-\rho_{jk}}, \dfrac{1}{n-3}\right)\)

Compute a \((1 - \alpha) \times 100\%\) confidence interval for the Fisher transform of the population correlation.
\(\dfrac{1}{2}\log \dfrac{1+\rho_{jk}}{1-\rho_{jk}}\)
That is, one-half log of 1 plus the correlation divided by 1 minus the correlation. In other words, this confidence interval is given by the expression below:
\(\left(\underset{Z_l}{\underbrace{Z_{jk}-\frac{Z_{\alpha/2}}{\sqrt{n-3}}}}, \underset{Z_U}{\underbrace{Z_{jk}+\frac{Z_{\alpha/2}}{\sqrt{n-3}}}}\right)\)
Here we take the value of Fisher's transform Z, plus and minus the critical value from the z table, divided by the square root of \(n - 3\). The lower bound we will call \(Z_l\) and the upper bound we will call \(Z_U\).

Back transform the confidence values to obtain the desired confidence interval for \(\rho_{jk}\). This is given in the expression below:
\(\left(\dfrac{e^{2Z_l}-1}{e^{2Z_l}+1},\dfrac{e^{2Z_U}-1}{e^{2Z_U}+1}\right)\)
The first term we see is a function of the lower bound, \(Z_l\). The second term is a function of the upper bound, \(Z_U\).
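The three steps above can be sketched as a single Python function. This is a minimal illustration using only the standard library (`statistics.NormalDist` supplies the normal critical value); the function name and its `confidence` argument are assumptions of this sketch. It is applied to the Information and Similarities correlation, r = 0.77153 with n = 37.

```python
import math
from statistics import NormalDist


def fisher_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for rho via Fisher's transformation."""
    # Step 1: Fisher's transformation of the sample correlation.
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))

    # Step 2: interval for the transformed correlation, z -/+ z_crit / sqrt(n - 3).
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z_crit / math.sqrt(n - 3)
    z_lower, z_upper = z - half_width, z + half_width

    # Step 3: back-transform each limit with (e^{2z} - 1) / (e^{2z} + 1).
    def back_transform(z_value):
        e = math.exp(2.0 * z_value)
        return (e - 1.0) / (e + 1.0)

    return back_transform(z_lower), back_transform(z_upper)


lower, upper = fisher_confidence_interval(0.77153, 37)
print(f"{lower:.4f} to {upper:.4f}")
```

Because the back-transform is nonlinear, the resulting interval is not symmetric about the sample correlation; it extends further below r than above it when r is positive.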
Let's return to the Wechsler Adult Intelligence Data to see how these procedures are carried out.
Example 5-6: Wechsler Adult Intelligence Data
Recall that the sample correlation between Similarities and Information was \(r_{12} = 0.77153\).
Step 1: Compute the Fisher transform:
\begin{align} Z_{12} &= \frac{1}{2}\log \frac{1+r_{12}}{1-r_{12}}\\[5pt] &= \frac{1}{2}\log\frac{1+0.77153}{1-0.77153}\\[5pt] &= 1.024 \end{align}
You should confirm this value on your own.
Step 2: Next, compute the 95% confidence interval for the Fisher transform, \(\frac{1}{2}\log \frac{1+\rho_{12}}{1-\rho_{12}}\):
\begin{align} Z_l &= Z_{12}-Z_{0.025}/\sqrt{n-3} \\ &= 1.024 - \frac{1.96}{\sqrt{37-3}} \\ &= 0.6880 \end{align}
\begin{align} Z_U &= Z_{12}+Z_{0.025}/\sqrt{n-3} \\&= 1.024 + \frac{1.96}{\sqrt{37-3}} \\&= 1.3602 \end{align}
In other words, the value 1.024 plus or minus the critical value from the normal table, at \(α/2 = 0.025\), which in this case is 1.96. Divide by the square root of n minus 3. Subtracting the result from 1.024 yields a lower bound of 0.6880. Adding the result to 1.024 yields the upper bound of 1.3602.
Step 3: Carry out the back-transform to obtain the 95% confidence interval for \(\rho_{12}\). This is shown in the expression below:
\(\left(\dfrac{\exp\{2Z_l\}-1}{\exp\{2Z_l\}+1},\dfrac{\exp\{2Z_U\}-1}{\exp\{2Z_U\}+1}\right)\)
\(\left(\dfrac{\exp\{2 \times 0.6880\}-1}{\exp\{2 \times 0.6880\}+1},\dfrac{\exp\{2\times 1.3602\}-1}{\exp\{2\times 1.3602\}+1}\right)\)
\((0.5967,0.8764)\)
This yields the interval from 0.5967 to 0.8764.
Conclusion: In this case, we can conclude that we are 95% confident that the interval (0.5967, 0.8764) contains the correlation between Information and Similarities scores.
5.4  Summary
In this lesson we learned about:
 The statistical properties of the sample mean vector, including its variance-covariance matrix and its distribution;
 The multivariate central limit theorem, which states that the sample mean vector will be approximately normally distributed;
 Construction of confidence intervals one at a time and simultaneously;
 How to test the hypothesis that the population correlation between two variables is equal to zero, and how to draw conclusions regarding the results of that hypothesis test;
 How to compute confidence intervals for the correlation, and what conclusions can be drawn from such intervals.