18.1 - Pearson Correlation Coefficient

Correlation is a general method of analysis useful when studying possible association between two continuous or ordinal scale variables. Several measures of correlation exist. The appropriate type for a particular situation depends on the distribution and measurement scale of the data. Three measures of correlation are commonly applied in biostatistics and these will be discussed below.

Suppose that we have two variables of interest, denoted as X and Y, and suppose that we have a bivariate sample of size n:

\(\left(X_{1} , Y_{1} \right), \left(X_{2} , Y_{2} \right), \dots , \left(X_{n} , Y_{n} \right)\)

and we define the following statistics:

\(\bar{X}=\dfrac{1}{n}\sum_{i=1}^{n}X_i , S_{XX}=\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2\)

\(\bar{Y}=\dfrac{1}{n}\sum_{i=1}^{n}Y_i , S_{YY}=\dfrac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar{Y})^2\)

\(S_{XY}=\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})\)

These statistics are, respectively, the sample mean and sample variance of X, the sample mean and sample variance of Y, and the sample covariance between X and Y. They should be very familiar to you.

The sample Pearson correlation coefficient (also called the sample product-moment correlation coefficient) for measuring the association between variables X and Y is given by the following formula:

\(r_p=\dfrac{S_{XY}}{\sqrt{S_{XX}S_{YY}}}\)
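The definitions above translate almost line for line into code. The following is a minimal Python sketch, not a replacement for PROC CORR; the function name `pearson_r` and the small data set are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation r_p = S_XY / sqrt(S_XX * S_YY)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n                                   # sample mean of X
    y_bar = sum(y) / n                                   # sample mean of Y
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)  # sample variance of X
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)  # sample variance of Y
    s_xy = sum((xi - x_bar) * (yi - y_bar)
               for xi, yi in zip(x, y)) / (n - 1)        # sample covariance
    # Undefined (division by zero) if either variable is constant.
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical bivariate sample of size n = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r_p = pearson_r(x, y)
print(round(r_p, 4))  # 0.7746
```

Note that the factor \(1/(n-1)\) appears in the numerator and in the denominator, so it cancels; here \(r_p = 6/\sqrt{10 \times 6}\).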

The sample Pearson correlation coefficient, \(r_{p}\) , is the point estimate of the population Pearson correlation coefficient

\(\rho_p=\dfrac{\sigma_{XY}}{\sqrt{\sigma_{XX}\sigma_{YY}}}\)

The Pearson correlation coefficient measures the degree of linear relationship between X and Y, and \(-1 \le r_{p} \le +1\). It is also a "unitless" quantity, i.e., when you construct the correlation coefficient the units of measurement in the numerator and denominator cancel out. A value of +1 reflects perfect positive correlation and a value of -1 reflects perfect negative correlation.

For the Pearson correlation coefficient, we assume that both X and Y are measured on a continuous scale and that each is approximately normally distributed.

The Pearson correlation coefficient is invariant to location and scale transformations. This means that if every \(X_{i}\) is transformed to

\(X_{i}^{*} = aX_{i} + b\)

and every \(Y_{i}\) is transformed to

\(Y_{i}^{*} = cY_{i} + d\)

where \(a > 0\), \(b\), \(c > 0\), and \(d\) are constants, then the correlation between \(X\) and \(Y\) is the same as the correlation between \(X^{*}\) and \(Y^{*}\).
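The invariance property can be checked numerically. The sketch below uses hypothetical data (temperature in degrees Celsius and mass in kilograms) and a hypothetical helper `pearson_r`. It converts the units with positive multipliers and compares the two correlations. As an aside not covered in the text above, it also shows why the multipliers are required to be positive: a negative multiplier on one variable flips the sign of the correlation.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation; the 1/(n-1) factors cancel, so they are omitted."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements.
temp_c = [18.0, 21.5, 25.0, 30.0, 33.5, 36.0]   # degrees Celsius
mass_kg = [2.1, 2.4, 2.2, 3.0, 3.3, 3.1]        # kilograms

# Location and scale changes with a > 0 and c > 0.
temp_f = [1.8 * t + 32.0 for t in temp_c]       # a = 1.8, b = 32
mass_g = [1000.0 * m for m in mass_kg]          # c = 1000, d = 0

r_original = pearson_r(temp_c, mass_kg)
r_rescaled = pearson_r(temp_f, mass_g)
print(math.isclose(r_original, r_rescaled))     # same correlation up to rounding

# A negative multiplier (a = -1) reverses the sign of the correlation.
r_flipped = pearson_r([-t for t in temp_c], mass_kg)
print(math.isclose(r_flipped, -r_original))
```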

With SAS, PROC CORR is used to calculate \(r_{p}\). The output from PROC CORR includes summary statistics for both variables and the computed value of \(r_{p}\). The output also contains a p-value corresponding to the test of:

\(H_{0} : \rho_{p} = 0\) versus \(H_{1} : \rho_{p} \ne 0\)

This statistical test generally is not very useful: a small p-value says only that \(\rho_{p}\) differs from zero, not how strong the association is. The associated p-value, therefore, should not be emphasized. It is more important to construct a confidence interval for \(\rho_{p}\).

The sampling distribution of Pearson's \(r_{p}\) is not normal. To obtain confidence limits for \(\rho_{p}\) based on a standard normal distribution, we transform \(r_{p}\) using Fisher's Z transformation to get a quantity, \(z_{p}\), that has an approximately normal distribution, and then work with this value. Here is what is involved in the transformation.

Fisher's Z transformation is defined as

\(z_p=\dfrac{1}{2}\log_e\left( \dfrac{1+r_p}{1-r_p} \right) \sim N\left( \zeta_p , sd=\dfrac{1}{\sqrt{n-3}} \right)\)

where

\(\zeta_p=\dfrac{1}{2}\log_e\left( \dfrac{1+\rho_p}{1-\rho_p} \right)\)

From this we get the usual confidence interval: an approximate \(100(1 - \alpha)\%\) confidence interval for \(\zeta_{p}\) is given by \([z_{p, \alpha/2} , z_{p, 1-\alpha/2}]\), where

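\(z_{p , \alpha/2}=z_p-\left( q_{1-\alpha/2}/\sqrt{n-3} \right) , z_{p , 1-\alpha/2}=z_p+\left( q_{1-\alpha/2}/\sqrt{n-3} \right)\)

and \(q_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution (about 1.96 for a 95% interval).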

What we really want, however, is an approximate \(100(1 - \alpha)\%\) confidence interval for \(\rho_{p}\). Back-transforming the limits gives \([r_{p, \alpha/2} , r_{p, 1-\alpha/2}]\), where

\(r_{p , \alpha/2}=\dfrac{\exp(2z_{p , \alpha/2})-1}{\exp(2z_{p , \alpha/2})+1} , r_{p , 1-\alpha/2}=\dfrac{\exp(2z_{p , 1-\alpha/2})-1}{\exp(2z_{p , 1-\alpha/2})+1}\)

You do not have to do this by hand. PROC CORR in SAS will do it for you (via its FISHER option), but it is important to have an idea of what is going on.
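For readers who want to see the arithmetic, the three steps (transform, build the interval on the transformed scale, back-transform) can be sketched in Python with the standard library alone. The function name `fisher_ci` and the inputs \(r_p = 0.6\), \(n = 28\) are hypothetical. The sketch assumes the standard normal quantile, consistent with the normal approximation for \(z_p\); some texts substitute a t quantile with \(n-3\) degrees of freedom, which widens the interval slightly.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for rho_p via Fisher's Z."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Step 1: Fisher's Z transformation of r (equivalent to math.atanh(r)).
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))
    # Step 2: interval for zeta_p using sd = 1 / sqrt(n - 3).
    # Assumption: standard normal quantile, not a t quantile.
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = q / math.sqrt(n - 3)
    z_lower, z_upper = z - half_width, z + half_width
    # Step 3: back-transform each limit to the correlation scale.
    def back(v):
        return (math.exp(2.0 * v) - 1.0) / (math.exp(2.0 * v) + 1.0)
    return back(z_lower), back(z_upper)

# Hypothetical example: r_p = 0.6 observed in a sample of n = 28.
lower, upper = fisher_ci(0.6, 28)
print(f"95% CI for rho_p: ({lower:.3f}, {upper:.3f})")
```

The resulting interval is not symmetric about \(r_p\), because the back-transformation is nonlinear and always keeps both limits between -1 and +1.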