2.7 - Pairing and Correlation

Both the permutation test and t-test already discussed require independent samples. We could use either test with the colon cancer data if we use all the normal samples, but do not use the tumor samples from these patients. However, this would be wasteful of our data. We will discuss mixed samples later. For now, we will discuss how to use only the paired samples.

Suppose all the patients provided both normal and tumor samples. The pairs of samples from the same patient have correlated expression. Dependent samples (leading to correlated data) occur quite often in bioinformatics. Dependence may be induced by the biology or the technology. For example, we may take several tissue samples from the same individual, or from several biologically related individuals. We may house our plants or animals together, inducing environmental correlation. We might process our nucleic acid samples in batches, inducing technical correlation. Typically these dependencies induce positive correlation, so that the data are less variable than they would be if the samples were independent. In a few situations, such as competition studies, negative correlations are induced, so that the data are more variable than they would be if the samples were independent.

Suppose we are using only the paired samples, that is, the 22 patients who have both normal and tumor samples. We are still interested in the same hypothesis \(H_0: \mu_X-\mu_Y=0\), but the variability of the difference in sample means will be different than in independent samples. The reason for this is that the difference in sample means is the same as the sample mean of the expression differences. Let \(X_i\) and \(Y_i\) be the gene expression for the normal and tumor samples, respectively, in patient \(i\), and let \(D_i=X_i-Y_i\). Note that \(n_X=n_Y\). It is easy to see arithmetically that if all the samples are paired,

\[\bar{X}-\bar{Y}=\bar{D}.\]

Now back when we looked at correlation, we saw the formula:

\[Var(X-Y)=Var(X)+Var(Y)-2 SD(X)SD(Y)Corr(X,Y).\]

We can estimate this variance by computing the two sample variances and the sample correlation, or more simply by computing the sample variance of the \(D_i\)'s (and we should obtain the same answer).
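As a quick numerical check of the two identities above, here is a minimal sketch in Python. The expression values are made up for illustration and are not taken from the colon cancer data; the variable names are likewise only illustrative.

```python
import math
import statistics

# Hypothetical paired expression values for one gene (made-up numbers,
# NOT the colon cancer data): one normal and one tumor sample per patient.
normal = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
tumor = [5.9, 7.1, 5.2, 8.1, 6.3, 7.6]

# Within-patient differences D_i = X_i - Y_i
diffs = [x - y for x, y in zip(normal, tumor)]

# Difference of sample means versus sample mean of the differences
mean_gap = statistics.mean(normal) - statistics.mean(tumor)
mean_diff = statistics.mean(diffs)

# Sample variances, standard deviations, and sample correlation
n = len(diffs)
mean_x = statistics.mean(normal)
mean_y = statistics.mean(tumor)
var_x = statistics.variance(normal)
var_y = statistics.variance(tumor)
sd_x = math.sqrt(var_x)
sd_y = math.sqrt(var_y)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(normal, tumor)) / (n - 1)
corr_xy = cov_xy / (sd_x * sd_y)

# Var(X - Y) from the formula, and directly from the differences
var_from_formula = var_x + var_y - 2 * sd_x * sd_y * corr_xy
var_direct = statistics.variance(diffs)

print(mean_gap, mean_diff)
print(var_from_formula, var_direct)
```

Up to floating-point rounding, the two numbers on each printed line should agree.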

In general, expression of the same gene in two samples from the same patient will be positively correlated: if expression in the patient is higher than the population average, it should be higher than average in both samples, and if it is lower than average in the patient, it should be lower than average in both samples. So \(Var(D)<Var(X)+Var(Y)\). This means that the variance of the sampling distribution of \(\bar{X}-\bar{Y}\) is less than \(\frac{Var(X)}{n_X}+\frac{Var(Y)}{n_Y}\), the value it would have with independent samples.
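Written out as a worked equation, and assuming (as is implicit here) that the differences from different patients are independent of one another, the variance of the sampling distribution is

\[Var(\bar{D})=\frac{Var(D)}{n_X}=\frac{Var(X)}{n_X}+\frac{Var(Y)}{n_Y}-\frac{2\,SD(X)SD(Y)Corr(X,Y)}{n_X},\]

where the second equality uses \(n_X=n_Y\). The last term is subtracted, so whenever \(Corr(X,Y)>0\) the variance is smaller than it would be for independent samples.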

If we want to use the permutation method, we want to keep the pairing and swap sample labels within each pair. This can readily be done by the equivalent of tossing a fair coin for each pair: if H, keep the original sample labels; if T, switch the values for the normal and tumor samples. Switching the labels within a pair simply changes the sign of that pair's \(D_i\). Compute \(\bar{D}\) for each of these randomly relabeled data sets to obtain the sampling distribution under the null hypothesis.
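The coin-tossing scheme can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function name `paired_permutation_test`, its parameters, and the data are hypothetical, and the add-one adjustment to the p-value is one common convention rather than something stated in the text.

```python
import random
import statistics

def paired_permutation_test(x, y, n_perm=10000, seed=0):
    """Two-sided paired permutation test based on the mean difference.

    Hypothetical helper for illustration. Swapping the labels within a
    pair flips the sign of that pair's difference, so each random
    relabeling is generated by one fair coin toss per pair.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    observed = statistics.mean(diffs)

    extreme = 0
    for _ in range(n_perm):
        # Heads: keep the labels (d); tails: swap them (-d).
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point ties.
        if abs(total / n) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one adjustment (an assumption of this sketch) so that the
    # estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Made-up paired data, not the colon cancer data.
normal = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
tumor = [5.9, 7.1, 5.2, 8.1, 6.3, 7.6]
d_bar, p = paired_permutation_test(normal, tumor)
print(d_bar, p)
```

With only a few pairs, all \(2^n\) sign patterns could be enumerated exactly instead of sampled; random relabeling is used here because it also works when the number of pairs is large.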

To do a t-test, we need an estimate of the variance of the sampling distribution of \(\bar{X}-\bar{Y}\). Fortunately, we have a direct estimate based on the \(D_i\). All we have to do is reduce our data to \(D_i\) for each patient. Consider the underlying population to be the differences, with mean \(\mu_D\) and variance \(Var(D)\). Then we just test whether \(\mu_D=0\).

\(H_0: \mu_D=0\)
\(H_A: \mu_D \ne 0\)

We estimate \(\mu_D\) with \(\bar{D}\) and the SE of the sampling distribution of \(\bar{D}\) with \(\sqrt{\frac{S^2_D}{n_X}}\), where \(S^2_D\) is the sample variance of the \(D_i\). Finally, we do a one-sample t-test using the \(D_i\):

\[t*=\frac{\bar{D}}{\sqrt{\frac{S^2_D}{n_X}}}\]

with \(n_X-1\) d.f. This is sometimes called the paired t-test.
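The statistic and its degrees of freedom can be computed directly from the differences. The sketch below uses only the Python standard library; the function name `paired_t` and the data are hypothetical. It stops at the statistic and d.f., since obtaining the p-value requires the t-distribution, which the standard library does not provide.

```python
import math
import statistics

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for the paired t-test.

    Hypothetical helper for illustration: reduces the data to the
    within-pair differences and applies the one-sample formula.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    # Standard error of D-bar: sqrt(S_D^2 / n)
    se = math.sqrt(statistics.variance(diffs) / n)
    return d_bar / se, n - 1

# Made-up paired data, not the colon cancer data.
normal = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
tumor = [5.9, 7.1, 5.2, 8.1, 6.3, 7.6]
t_star, df = paired_t(normal, tumor)
print(t_star, df)
```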

With identical sample sizes \(n_X=n_Y\), the two-sample t-test will have d.f. close to \(2n_X\), compared with \(n_X-1\) for the paired test, so its t-distribution will be more concentrated around 0. However, when the pairs are positively correlated, for the same value of \(\bar{X}-\bar{Y}=\bar{D}\) the t* value will be larger in absolute value for the paired t-test, because of the smaller variance. Usually, but not always, this leads to a smaller p-value for tests using paired samples.
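To see the trade-off between degrees of freedom and standard error, the sketch below computes both statistics on the same made-up paired data. The Welch calculation is shown only to compare the arithmetic; it is not a valid analysis of paired samples. The function names and the data are hypothetical.

```python
import math
import statistics

def welch_t(x, y):
    """Welch two-sample t statistic and Satterthwaite d.f. (treats x and y as independent)."""
    vx = statistics.variance(x) / len(x)
    vy = statistics.variance(y) / len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

def paired_t(x, y):
    """Paired t statistic and d.f., computed from the within-pair differences."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    se = math.sqrt(statistics.variance(diffs) / n)
    return statistics.mean(diffs) / se, n - 1

# Made-up, positively correlated paired data (not the colon cancer data).
normal = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
tumor = [5.9, 7.1, 5.2, 8.1, 6.3, 7.6]

t_welch, df_welch = welch_t(normal, tumor)
t_paired, df_paired = paired_t(normal, tumor)
print("Welch :", t_welch, df_welch)
print("Paired:", t_paired, df_paired)
```

Both statistics share the same numerator, so any difference between them comes entirely from the standard error in the denominator.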

Below we have the actual outcomes for the two valid methods discussed so far - using the 22 normal samples and the 18 independent tumor samples or using the 22 paired samples.

	Independent samples (22 normal, 18 tumor)	Paired samples (22 pairs)
Test	Welch Two Sample t-test	One Sample t-test
t	-3.9224	-3.307
df	37.999	21
p-value	0.0003553	0.003354
95 percent confidence interval	(-1.3866216, -0.4425592)	(-1.2911223, -0.2941996)

Both tests reject the null hypothesis at the 0.01 level, and the 95% confidence intervals overlap substantially, so the conclusions are similar in this case. This will not be true for all the genes measured. These are both valid tests.

A test using all 62 samples as if all the samples were independent is not valid. A better test using all the data will be discussed when we discuss linear models.
