Printer-friendly versionPrinter-friendly version

Actually the colonCA samples were measured on Affymetrix@ microarrays, which are 1-channel arrays. So the analysis of the paired samples should be handled as if the subjects are random blocks. The blocking factor is handled using the intrablock correlation, which is explained in Section 7.3 of the course notes.

One way to express the similarity of samples within blocks is to assume that there are two components of variability - the within block variability and the between block variability. A sample may be paired, in which case it is in a block of size 2 or unpaired, in which case it is in a block of size 1. All samples therefore have within block variability. Samples in the same block vary only by within block variability while samples in different blocks have both within and between block variability.

Another way of expressing the same idea is the intrablock correlation, which is the ratio of the within block variance to the total variance. Instead of estimating the 2 variance components, you then estimate the intrablock correlation and one of the variance components - usually the within block variability.

LIMMA uses the intrablock correlation and enforces a very severe shrinkage. It assumes that all the genes have the same intrablock correlation. However, they each have their own within block variability, which is computed using the Empirical Bayes method outlined in Section 6.1.

Computing the intrablock correlation is done in the LIMMA function duplicateCorrelation. The function takes as input the data, the blocking variable and the design matrix. We need the design matrix because we want to compute the residual correlation, so we need to include both the predictors (which are stored in the design matrix and are usually indicator variables for the treatments) and the blocks.

The first step, as with other shrinkage methods, is to compute the correlation for each gene. As with shrinkage of the variances, it is helpful to think about the population of correlations from which these correlations could come. It turns out that when the data are Normally distributed and all the genes have the same intrablock correlation \(\rho\), then \(Z=.5 log [(1+r)/(1-r)]=arctanh(r)\) is \(Normal(arctanh(\rho),1/sqrt(N-3))\) where \(N\) is the sample size. Under the assumption that the genes are independent (which is not quite true) with the same true intrablock correlation, the observed correlations are a sample from this distribution. So the arctanh transformed gene intrablock correlations are a sample from \(Normal(arctanh(\rho),1/sqrt(N-3))\).

This makes our life simple, because the only unknown for the distribution of the Z's is the population mean, and we know that the best estimator is the sample mean - i.e. the mean of the Z's. duplicateCorrelation computes the gene by gene correlations, transforms to the Z's and computes the mean on the arctanh scale. (Actually, duplicateCorrelation uses the 15% trimmed mean to avoid outliers in case the actual data are not Normally distributed.)

For the colonCA data we see that the genewise correlations are skewed.

 correlation for colonCA data

The Z's look quite Normally distributed. The variance should be \(1/sqrt(2000-3)\), so the SD of the sampling distribution should be about 0.15. The actual SD of the 2000 sample correlations is 0.26. So, the trimming helps with the excess variability which shows up on the histogram as set of very short bins on both sides of the main peak. The estimated mean correlation is \(0.43=(tanh(mean(atanh(r)))\).

 Z